hugo :: research
"only as an æsthetic phenomenon is
existence and the world justified"

- nietzsche

 

             
 
       
               
 
   
::: Monty Tagger v1.2:::
Commonsense-Informed Part-of-Speech Tagging
 
 



MontyTagger is no longer maintained separately. It is now distributed as part of the MontyLingua NLP Toolkit.

Download!

NEW! Faster! Monty Tagger v1.2: A Brill-based
Part-of-Speech Tagger for Python/Java (2003)... free!

Released under the GPL.
(For commercial licensing, please contact us for more information.)

download version 1.2 now!

python api + interactive command line (1.70mb) [.zip]
java api + interactive command line (1.96mb) [.zip]

What's New in Version 1.2?
# released 8/11/2003
# - lexicon reimplemented; additional optimizations
# - 100% tagging speed improvement
#     - python: (v1.0: 200 words/s, v1.2: 500 words/s)
#     - java: (v1.0: 80 words/s, v1.2: 200 words/s)
# - 160%-400% memory usage improvement
#     - python: (v1.0: 20 MB, v1.2: 5 MB)
#     - java: (v1.0: 40 MB, v1.2: 25 MB)
# - 400%-1000% improvement in tagger loading time
#     - python: (v1.0: 10 secs, v1.2: 1 sec)
#     - java: (v1.0: 22 secs, v1.2: 5 secs)


Description:

Monty Tagger is a rule-based part-of-speech tagger modeled on Eric Brill's 1994 transformation-based learning POS tagger, and it uses Brill-compatible lexicon and rule files. (The distribution includes Brill's original Penn Treebank-trained lexicon and rule files.) It also includes a tokenizer for English and tools for performance evaluation.
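To make the Brill approach concrete, here is a minimal sketch in modern Python of its two stages: assign each word its most likely tag from a lexicon, then apply contextual transformation rules that correct tags based on their neighbors. The lexicon entries, the single rule, and the function names are invented for illustration; they are not MontyTagger's API and are not taken from Brill's data files.

```python
# Toy illustration of Brill-style tagging. The lexicon and rule below are
# made up for this example; they are not Brill's actual data files.

# Each known word maps to its possible tags, most likely tag first.
LEXICON = {
    "the": ["DT"],
    "can": ["MD", "NN", "VB"],
    "rusted": ["VBD", "VBN"],
}

# Contextual rules as (from_tag, to_tag, prev_tag):
# change from_tag to to_tag when the previous word is tagged prev_tag.
RULES = [
    ("MD", "NN", "DT"),
]


def initial_tags(words):
    """Give each word its most likely tag; unknown words default to NN."""
    return [LEXICON.get(w.lower(), ["NN"])[0] for w in words]


def apply_rules(words, tags):
    """Apply each transformation rule in order, scanning left to right."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag in RULES:
        for i in range(1, len(tags)):
            # A known word may only receive a tag its lexicon entry allows.
            allowed = LEXICON.get(words[i].lower(), [to_tag])
            if tags[i] == from_tag and tags[i - 1] == prev_tag and to_tag in allowed:
                tags[i] = to_tag
    return tags


def tag(sentence):
    """Tag a whitespace-tokenized sentence and return word/TAG text."""
    words = sentence.split()
    tags = apply_rules(words, initial_tags(words))
    return " ".join(f"{w}/{t}" for w, t in zip(words, tags))


print(tag("The can rusted"))  # The/DT can/NN rusted/VBD
```

In this sketch "can" starts out as a modal (MD) because that is its most likely tag, and the rule then retags it as a noun because it follows a determiner.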

Because it is implemented in classic, portable Python, it will run on virtually any system (e.g. Mac, Unix, Windows) for which a Python implementation exists. The API is also available in Java via montytagger.jar.

Version 1.2 of MontyTagger (running in classic Brill mode) has been benchmarked at 500 words/sec under Python 2.2 on a 1 GHz Pentium III Windows machine. Word-level tagging accuracy on typical US English non-fiction is approximately 95% (comparable to Brill).

Footnotes

Author: Hugo Liu <hugo at media dot mit dot edu>
Project Page (please link to): <http://web.media.mit.edu/~hugo/montytagger>

Copyright (c) 2002, 2003 by Hugo Liu, MIT Media Lab
Original Brill POS Tagger and Data Files (c) Eric Brill, UPenn, M.I.T.

Please send bug reports, feedback, comments, suggestions, and feature requests by email to: hugo at media dot mit dot edu. Also, if you found this useful in a system, in a project, or as a teaching resource, I would love to hear from you!


FAQ:

1) What does MontyTagger do?

MontyTagger annotates English text with part-of-speech information, e.g. marking "dog" as a noun or as a verb depending on context. You give MontyTagger a bit of text, e.g. "Jack likes apples", and you get back the same text with each word annotated with its part of speech, e.g. "Jack/NNP likes/VBZ apples/NNS". Part-of-speech tagging is an indispensable part of natural language processing systems.
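If you want to work with the tagged output in your own program, the word/TAG format is easy to take apart. The helper below is a hypothetical sketch in modern Python, not part of MontyTagger; it assumes every whitespace-separated token contains at least one slash.

```python
def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs.

    rsplit is used so that a word containing a slash keeps it:
    the tag is always the part after the last slash.
    """
    pairs = []
    for token in text.split():
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged("Jack/NNP likes/VBZ apples/NNS")
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(pairs)  # [('Jack', 'NNP'), ('likes', 'VBZ'), ('apples', 'NNS')]
print(nouns)  # ['Jack', 'apples']
```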

2) What do those part-of-speech tags mean?

NN = common, singular noun; JJ = adjective; VB = base-form verb; etc. MontyTagger uses the Penn Treebank tagset, and the documentation for that tagset explains the meaning of every tag.
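As a quick reference, a few of the most common Penn Treebank tags can be kept in a lookup table. This is only a small subset of the tagset, and the `describe` helper is a hypothetical convenience written in modern Python, not something MontyTagger provides.

```python
# A handful of Penn Treebank tags; the full tagset has several dozen entries.
PENN_TAGS = {
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBZ": "verb, 3rd person singular present",
}


def describe(tagged_text):
    """Return one 'word: meaning' line per word/TAG token."""
    lines = []
    for token in tagged_text.split():
        word, tag = token.rsplit("/", 1)
        lines.append(f"{word}: {PENN_TAGS.get(tag, 'unknown tag')}")
    return lines


for line in describe("Jack/NNP likes/VBZ apples/NNS"):
    print(line)
```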

3) I'm getting an error message when I compile and run my program.

By far the most commonly reported problem is caused by misplaced datafiles. Any file whose name is in all capital letters is a datafile, and it must be in the directory from which your program calls MontyTagger. In version 1.2, I've decided to create two separate distributions, each with its own copy of the datafiles, just to avoid confusion.
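One way to check for this problem before running your program is to compare the datafiles in the distribution directory against the directory you run from. The sketch below is in modern Python and relies only on the all-capitals naming convention described above; both function names are hypothetical, and you supply the two directory paths yourself.

```python
import os


def uppercase_files(directory):
    """List files whose names are in all capitals, i.e. likely datafiles."""
    return sorted(
        name for name in os.listdir(directory)
        if name.isupper() and os.path.isfile(os.path.join(directory, name))
    )


def missing_datafiles(required_names, directory="."):
    """Return the names from required_names that are not files in directory."""
    return [
        name for name in required_names
        if not os.path.isfile(os.path.join(directory, name))
    ]
```

Calling `missing_datafiles(uppercase_files(dist_dir), run_dir)` returns the datafiles that still need to be copied, where `dist_dir` is wherever you unpacked the distribution and `run_dir` is the directory your program runs in.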

4) Are the Java and Python versions the same?

Yes and no. Yes, they run the same code. But no, they are not equivalent in terms of features. The Java version was compiled using the tool Jython, so the Java API is quite limited. The Python version offers far more options to the developer, and it should be quite easy for even a programming novice to figure out. Python is a great language for beginners and experts alike! If you need to modify the Java API to include features better suited to your needs, I recommend you tinker with the Python version and then build your own .jar file using Jython. Happy hunting!


Research Agenda :

One question I have recently wanted to answer is how the wealth of commonsense knowledge we are gathering today can be used to improve natural language processing tools. Ideally, commonsense will play a large role in that final 10% of accuracy for word sense disambiguation, parsing, and semantic interpretation. However, generic commonsense knowledge bases are still quite brittle, and Open Mind Common Sense (OMCS) itself suffers from pervasive ambiguity. Nonetheless, we think that commonsense can be helpful to NLP tasks even if used in a somewhat statistical and shallow manner. In this project, we've built a part-of-speech tagger for English, based on Eric Brill's 1994 transformation-based tagger. To add a kick, we've added verb-argument selectional preferences from OMCS, in hopes of improving performance.

For example, given the input:
* The/DT dog/NN bit/NN the/DT man/NN ./.

Monty Tagger detects that the sentence is probably missing a verb and that bit/NN is the most likely candidate. Open Mind tells us, through many redundant entries, that biting is something dogs frequently do. From this piece of knowledge, we retag bit/NN as bit/VBD.
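The retagging step in this example can be sketched roughly as follows. This is a toy illustration in modern Python under stated assumptions, not the actual implementation: the two small dictionaries stand in for OMCS knowledge and for a lexicon of verb readings, and every name in the code is invented for this example.

```python
# Toy data invented for illustration; this is not the OMCS corpus or its API.
# Maps a subject noun to actions commonly associated with it (base-form verbs).
COMMONSENSE_ACTIONS = {
    "dog": {"bite", "bark"},
}

# Maps an inflected form to (base form, verb tag) when the word can be a verb.
VERB_READINGS = {
    "bit": ("bite", "VBD"),
}


def retag_missing_verb(tagged):
    """If no token carries a verb tag, promote a noun to a verb when the
    preceding noun is known to perform that action."""
    pairs = [token.rsplit("/", 1) for token in tagged.split()]
    if any(tag.startswith("VB") for _, tag in pairs):
        return tagged  # a verb is already present; nothing to fix
    subject = None
    for pair in pairs:
        word, tag = pair
        if tag.startswith("NN"):
            reading = VERB_READINGS.get(word.lower())
            known = COMMONSENSE_ACTIONS.get(subject, ())
            if subject and reading and reading[0] in known:
                pair[1] = reading[1]  # e.g. NN becomes VBD
                break
            subject = word.lower()  # remember the most recent noun
    return " ".join("/".join(pair) for pair in pairs)


print(retag_missing_verb("The/DT dog/NN bit/NN the/DT man/NN ./."))
# The/DT dog/NN bit/VBD the/DT man/NN ./.
```

A sentence whose subject has no matching commonsense entry is left unchanged, which keeps the heuristic from overriding the Brill tags without evidence.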

A future version will hybridize Brill tagging with linguistically motivated techniques aimed at improving POS tagging accuracy in English, including handling of phrasal verbs and idiomatic phrases, shallow parsing, handling of long-distance dependencies, verb selectional preferences, and commonsense (reasoning and selectional preferences).



H U G O . . L I U ...
POSTDOCTORAL ASSOCIATE

program in comparative media studies, mit

the media laboratory, mit