[Attempto] Attempto lexicons : more of them

Jean-Marc Vanel jeanmarc.vanel at gmail.com
Tue Jul 19 17:00:07 CEST 2011


Hi

I think that the most important missing features in ACE are related to
lexicons. There are several issues. The first is that there are
several formats for ACE lexicons: one for core ACE, one in ACEWiki, and
another within OWL files in ACEView.

The second is that the currently available lexicons are quite small.
I have two proposals to address that.

The first proposal starts from this word list available on Linux (Ubuntu):
/usr/share/dict/american-english
It's a plain word list with one word per line. For example, it contains
both "studies" and "study".
Then, for each word, I can use the WordNet lemmatizer in NLTK [1] to
lemmatize it (that is, map "studies" to its lemma "study"). Each such
pair will be added to an associative array. After processing the file
this way, it's easy to output the lexicon in the formats used by the ACE
tools. The WordNet API in NLTK will also be used to find out whether a
word is a noun or a verb.

Example of using NLTK for lemmatization:

import nltk
nltk.download('wordnet')   # fetch the WordNet data (needed only once)
from nltk.stem.wordnet import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
lmtzr.lemmatize('cars')    # returns 'car'
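
The whole first proposal (word list in, associative array of lemmas out)
could look roughly like the sketch below. This is only an illustration
under some assumptions: the function names build_lemma_map,
make_wordnet_lemmatizer and build_from_system_word_list are made up for
this mail, skipping possessives and capitalized entries is my own choice,
and testing "noun or verb" by checking whether WordNet has a synset for
the lemma is just one possible way to do it. Writing the result out in a
particular ACE lexicon format is deliberately left out.

```python
from collections import defaultdict


def build_lemma_map(words, lemmatize):
    """Map each lemma to the set of (surface form, part of speech) pairs seen.

    `lemmatize` is a hypothetical hook: any callable that takes a word and
    returns a list of (lemma, pos) pairs, so the lookup can be swapped out.
    """
    lexicon = defaultdict(set)
    for raw in words:
        word = raw.strip()
        # Assumption: skip blank lines, possessives such as "study's",
        # and capitalized entries (mostly proper names).
        if not word.isalpha() or not word.islower():
            continue
        for lemma, pos in lemmatize(word):
            lexicon[lemma].add((word, pos))
    return dict(lexicon)


def make_wordnet_lemmatizer():
    """Return a `lemmatize` callable backed by NLTK's WordNet data."""
    from nltk.corpus import wordnet
    from nltk.stem.wordnet import WordNetLemmatizer

    lmtzr = WordNetLemmatizer()

    def lemmatize(word):
        results = []
        for pos in (wordnet.NOUN, wordnet.VERB):
            lemma = lmtzr.lemmatize(word, pos)
            # Keep this reading only if WordNet knows the lemma with that
            # part of speech.
            if wordnet.synsets(lemma, pos=pos):
                results.append((lemma, pos))
        return results

    return lemmatize


def build_from_system_word_list(path="/usr/share/dict/american-english"):
    """Build the lemma map from the word list file (one word per line)."""
    with open(path, encoding="utf-8") as f:
        return build_lemma_map(f, make_wordnet_lemmatizer())
```

Once the WordNet data has been downloaded as shown above, calling
build_from_system_word_list() returns a dictionary whose keys are lemmas
and whose values are the word forms found for them, each tagged 'n' or
'v'. Passing the lemmatizer in as a parameter keeps the loop over the
word list independent of NLTK, so it can be tried with a small
hand-written table first.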


The 2nd proposal is to reuse OWL ontologies to obtain ACE lexicons.
That will be in another mail ...

[1] NLTK Natural Language Toolkit - http://www.nltk.org/

--
Jean-Marc Vanel
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
http://jmvanel.free.fr/ - EulerGUI, a turntable GUI for Semantic Web +
rules, XML, UML, eCore, Java bytecode
+33 (0)6 89 16 29 52 -- +33 (0)1 39 55 58 16
chat :  irc://irc.freenode.net#eulergui

