MontyTagger is no longer maintained separately. It is now distributed
as part of the MontyLingua NLP Toolkit.
Download!
NEW! Faster! Monty Tagger v1.2: A Brill-based Part-of-Speech Tagger
for Python/Java (2003)... free!
[view gpl license]
(for commercial licensing, please contact us for more information)
download version 1.2 now!
python api + interactive command line (1.70mb) [.zip]
java api + interactive command line (1.96mb) [.zip]
What's New in Version 1.2?
# released 8/11/2003
# - lexicon reimplemented; additional optimizations
# - 150% tagging speed improvement (2.5x)
# - python: (v1.0: 200words/s, v1.2: 500words/s)
# - java: (v1.0: 80words/s, v1.2: 200words/s)
#
# - 160%-400% memory usage improvement
# - python: (v1.0: 20mb, v1.2: 5mb)
# - java: (v1.0: 40mb, v1.2: 25mb)
#
# - 400%-1000% improvement in tagger loading time
# - python: (v1.0: 10secs, v1.2: 1sec)
# - java: (v1.0: 22secs, v1.2: 5secs)
#
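The before/after figures in these release notes can be turned into
improvement factors with a few lines of arithmetic. The sketch below
only reuses the numbers quoted above; the function name and the
"higher is better" flag are illustrative choices, not part of
MontyTagger.

```python
def improvement_factor(old, new, higher_is_better):
    """Return how many times better the new figure is than the old one."""
    return new / old if higher_is_better else old / new


# (label, v1.0 figure, v1.2 figure, higher_is_better)
# All figures are copied from the release notes above.
MEASUREMENTS = [
    ("python tagging speed (words/s)", 200, 500, True),
    ("java tagging speed (words/s)", 80, 200, True),
    ("python memory (mb)", 20, 5, False),
    ("java memory (mb)", 40, 25, False),
    ("python load time (secs)", 10, 1, False),
    ("java load time (secs)", 22, 5, False),
]

for label, old, new, higher in MEASUREMENTS:
    factor = improvement_factor(old, new, higher)
    print(f"{label}: {factor:.1f}x")
```

Read this way, tagging is 2.5 times faster in both versions, memory
use shrinks by a factor of 1.6 to 4, and loading is 4.4 to 10 times
quicker.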
Description:
Monty Tagger is a rule-based part-of-speech tagger based on Eric
Brill's 1994 transformation-based learning POS tagger, and it uses
Brill-compatible lexicon and rule files. (The distribution includes
Brill's original Penn Treebank-trained lexicon and rule files.) It
also includes a tokenizer for English and tools for performance
evaluation.
Because it is implemented in classic, portable Python, it will run on
virtually any system (e.g. Mac, Unix, Windows) for which a Python
implementation exists. The API is also available in Java via
montytagger.jar.
Version 1.2 of MontyTagger (running in classic Brill mode) has been
benchmarked at 500 words/sec, running in Python 2.2 on a 1 GHz
Pentium III Wintel box. Word-level tagging accuracy on typical US
English non-fiction is approximately 95% (comparable to Brill).
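To give a feel for what "Brill-style" tagging means, here is a
minimal sketch of the two stages: each word first receives its most
likely tag from a lexicon, and contextual transformation rules then
patch tags based on their neighbours. This is not MontyTagger's code.
The toy lexicon, the single rule, and the "unknown words default to
NN" shortcut are all assumptions made for illustration, and the real
lexicon and rule files are far larger.

```python
# Toy lexicon: each word maps to its single most likely tag
# (hypothetical entries, not taken from Brill's lexicon file).
LEXICON = {"the": "DT", "dog": "NN", "wants": "VBZ", "to": "TO", "run": "NN"}

# Toy contextual rules as (from_tag, to_tag, previous_tag), read as
# "change from_tag to to_tag when the previous word has previous_tag".
RULES = [("NN", "VB", "TO")]


def initial_tags(words):
    """Stage 1: look up each word; unknown words default to NN here."""
    return [LEXICON.get(word.lower(), "NN") for word in words]


def apply_rules(tags):
    """Stage 2: apply each transformation rule left to right."""
    tags = list(tags)
    for from_tag, to_tag, previous_tag in RULES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == previous_tag:
                tags[i] = to_tag
    return tags


def tag(sentence):
    """Return the sentence in word/TAG format."""
    words = sentence.split()
    tags = apply_rules(initial_tags(words))
    return " ".join(f"{word}/{t}" for word, t in zip(words, tags))


print(tag("the dog wants to run"))
# prints: the/DT dog/NN wants/VBZ to/TO run/VB
```

The lexicon alone would leave "run" as a noun. The contextual rule
corrects it because the preceding tag is TO.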
Footnotes
Author:
Hugo Liu <hugo at media dot mit dot edu>
Project Page (please link to:) <http://web.media.mit.edu/~hugo/montytagger>
Copyright (c) 2002, 2003 by Hugo Liu, MIT Media Lab
Original Brill POS Tagger and Data Files (c) Eric Brill, UPenn,
M.I.T.
Please send bug reports, feedback, comments, suggestions, and feature
requests via email to: hugo at media dot mit dot edu. Also, if you
found this useful in a system, in a project, or as a teaching
resource, I would love to hear from you!
FAQ:
1) What does MontyTagger do?
MontyTagger annotates English text with part-of-speech information,
e.g. "dog" as a noun or "dog" as a verb. You give MontyTagger a bit
of text, e.g. "Jack likes apples", and you get back the same text
with each word annotated with its part of speech, e.g. "Jack/NNP
likes/VBZ apples/NNS". Part-of-speech tagging is an indispensable
part of natural language processing systems.
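Downstream code usually wants the tagged output as (word, tag) pairs
rather than as one string. A minimal sketch, assuming only the
word/TAG format shown above; the function name is illustrative and
not part of the MontyTagger API.

```python
def parse_tagged(tagged_text):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in tagged_text.split():
        # Split on the last slash so a word that itself contains "/"
        # keeps its slash and only the final piece is read as the tag.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


print(parse_tagged("Jack/NNP likes/VBZ apples/NNS"))
# prints: [('Jack', 'NNP'), ('likes', 'VBZ'), ('apples', 'NNS')]
```

Note that a token with no slash at all comes back with an empty word
and the whole token in the tag position, so real input may need an
extra check.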
2) What do those part-of-speech tags mean?
NN = common, singular noun; JJ = adjective; VB = verb in base (root)
form; etc. MontyTagger uses the Penn Treebank tagset, so the meanings
of these tags are documented. Here is a quick table of tags and their
meanings.
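For the handful of tags that appear on this page, a lookup can be
sketched as follows. The table is deliberately tiny and is no
substitute for the full Penn Treebank tag list; the helper name is
illustrative.

```python
# A few Penn Treebank tags, limited to the ones used on this page.
TAG_MEANINGS = {
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBZ": "verb, 3rd person singular present",
}


def explain(tagged_text):
    """Return one 'word: TAG = meaning' line per tagged token."""
    lines = []
    for token in tagged_text.split():
        word, _, tag = token.rpartition("/")
        meaning = TAG_MEANINGS.get(tag, "not in this short table")
        lines.append(f"{word}: {tag} = {meaning}")
    return lines


for line in explain("Jack/NNP likes/VBZ apples/NNS"):
    print(line)
```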
3) I'm getting an error message when I compile and run my program.
By far the most common problem reported is caused by misplaced data
files. Any file whose name is in all capital letters is a data file,
and it must be in the directory from which your program calls
MontyTagger. In version 1.2, I've decided to create two separate
distributions, each with its own copy of the data files, just to
avoid confusion.
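A quick way to check for this problem is to list the all-capitals
files in the directory your program runs from and compare them with
the ones shipped in the distribution. The sketch below relies only on
the naming convention described above; it does not know which data
files MontyTagger actually requires, and the function name is made up
for this example.

```python
import os


def find_datafiles(directory):
    """Return sorted names of files in `directory` written in all capitals."""
    names = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # str.isupper() is true when the name has at least one letter
        # and none of its letters are lowercase.
        if os.path.isfile(path) and name.isupper():
            names.append(name)
    return names


# List candidate data files in the current working directory.
print(find_datafiles("."))
```

If the list printed from your program's directory is shorter than the
one printed from the unpacked distribution, copy the missing files
over.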
4) Are the Java and Python versions the same?
Yes and no. Yes, they run the same code. But no, they are not
equivalent in terms of features. The Java version was compiled from
the Python source using the tool Jython, so the Java API is quite
limited. The Python version offers far more options to the developer,
and it should be quite easy for even a programming novice to figure
out. Python is a great language for beginners and experts alike! If
you need to modify the Java API to include features more suited to
your needs, I recommend you twiddle with the Python version and then
build your own .jar file using Jython! Happy hunting!
Research Agenda:
One question I have recently wanted to answer is how the wealth of
commonsense knowledge we are gathering today can be used to improve
natural language processing tools. Ideally, commonsense will play a
large role in that final 10% of accuracy for word sense
disambiguation, parsing, and semantic interpretation. However,
generic commonsense knowledge bases are still quite brittle, and Open
Mind Commonsense (OMCS) itself suffers from massive ambiguity.
Nonetheless, we think that commonsense can be helpful to NLP tasks
even if used in a somewhat statistical and shallow manner. In this
project, we've built a part-of-speech tagger for English, based on
Eric Brill's 1994 transformation-based tagger. To add a kick, we've
added verb-argument selectional preferences from OMCS, in hopes of
improving performance.
For example, given the input:
* The/DT dog/NN bit/NN the/DT man/NN ./.
Monty Tagger identifies that a verb is probably missing and that the
most likely candidate is bit/NN. Open Mind tells us, through many
redundant entries, that biting is something dogs frequently do. From
this piece of knowledge, we retag bit/NN as bit/VBD.
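The retagging step in this example can be sketched roughly as below.
This is a toy reconstruction under stated assumptions, not the actual
MontyTagger logic: the alternative-tag table, the lemma table, and
the "typical actions" table standing in for Open Mind knowledge are
all hypothetical, and a real system would need a far more careful
test for a missing verb.

```python
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

# Hypothetical resources, invented for this example only.
ALTERNATIVE_TAGS = {"bit": ["NN", "VBD"]}   # other tags a word may take
LEMMAS = {"bit": "bite"}                      # inflected form -> base form
TYPICAL_ACTIONS = {"dog": {"bite", "bark"}}   # stand-in for commonsense facts


def retag_missing_verb(tagged):
    """Retag one noun as a verb if the sentence has no verb at all.

    `tagged` is a list of (word, tag) pairs. A word is retagged only
    when it directly follows a noun that is known to typically
    perform the action the word names.
    """
    if any(tag in VERB_TAGS for _, tag in tagged):
        return list(tagged)  # a verb is already present; change nothing
    result = list(tagged)
    for i in range(1, len(result)):
        word, _ = result[i]
        subject, subject_tag = result[i - 1]
        verb_options = [t for t in ALTERNATIVE_TAGS.get(word.lower(), [])
                        if t in VERB_TAGS]
        action = LEMMAS.get(word.lower(), word.lower())
        if (subject_tag == "NN" and verb_options
                and action in TYPICAL_ACTIONS.get(subject.lower(), set())):
            result[i] = (word, verb_options[0])
            break
    return result


sentence = [("The", "DT"), ("dog", "NN"), ("bit", "NN"),
            ("the", "DT"), ("man", "NN"), (".", ".")]
print(retag_missing_verb(sentence))
```

With these toy tables, only the pair for "bit" changes, from NN to
VBD, and every other tag is left as it was.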
A future version will hybridize Brill tagging with linguistically
motivated techniques aimed at improving POS tagging accuracy in
English, including handling of phrasal verbs and idiomatic phrases,
shallow parsing, handling of long-distance dependencies, verb
selectional preferences, and commonsense reasoning.
Additional Links and Downloads: