Tagging text with Stanford POS Tagger in Java Applications

I was looking for a way to extract “nouns” from a set of strings in Java, and through Google I found the amazing Stanford NLP (Natural Language Processing) Group POS (part-of-speech) Tagger.

The library lets you “tag” the words in your string. That is, for each word, the tagger determines whether it is a noun, a verb, etc., and then assigns that result to the word. For example:

“This is a sample sentence”

will be output as


This/DT is/VBZ a/DT sample/NN sentence/NN

To do this, the tagger has to load a “trained” file that contains the information it needs to tag the string. This “trained” file is called a model and has the extension “.tagger”. The Stanford NLP Group provides several trained models for different languages.

In this post I will show you how to use this library in your Java application using the Eclipse IDE.

  1. Create a new project.
  2. Create a new folder called “taggers” in the project.
  3. Download the zip file provided by the Stanford NLP Group.
  4. Extract the zip file and open the extracted folder.
  5. You will find a folder called “models”. Open it and copy the model you want, together with its corresponding “.props” file (same name), to the “taggers” folder we created earlier.
  6. Now we need to add the library to our project so that Eclipse does not complain when we use it in our code. Right-click your project > Build Path > Configure Build Path.
    In the new window, open the Libraries tab (at the top) and click the Add External JARs button.
    Locate the “stanford-postagger.jar” file in the extracted folder.

  7. Now, enough with the configuration; let’s start coding. In your project, create a new class and in its main method write:
    // Initialize the tagger
    MaxentTagger tagger = new MaxentTagger(
            "taggers/left3words-distsim-wsj-0-18.tagger");

    The MaxentTagger constructor takes the path to the model (trained file) as a parameter:

    “NAME_OF_FOLDER/NAME_OF_MODEL.tagger”.

    Once you write the code, Eclipse will prompt you to import MaxentTagger and tell you that the constructor throws some exceptions. Use Eclipse’s quick fixes to add the import and the throws clause to the code.

    Finally, we tag the string we want:

    
    // The sample string
    String sample = "This is a sample text";

    // The tagged string
    String tagged = tagger.tagString(sample);

    // Output the result
    System.out.println(tagged);

    This will print the tagged string in the same word/TAG format as the example at the beginning of the post.

    Here’s my entire class:

    import java.io.IOException;
    
    import edu.stanford.nlp.tagger.maxent.MaxentTagger;
    
    public class TagText {
    	public static void main(String[] args) throws IOException,
    			ClassNotFoundException {
    
    		// Initialize the tagger
    		MaxentTagger tagger = new MaxentTagger(
    				"taggers/left3words-distsim-wsj-0-18.tagger");
    
    		// The sample string
    		String sample = "This is a sample text";
    
    		// The tagged string
    		String tagged = tagger.tagString(sample);
    
    		// Output the result
    		System.out.println(tagged);
    	}
    }
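
The model path passed to the constructor above is relative, so it is resolved against the directory the program is started from (for an Eclipse launch this is normally the project folder). Here is a minimal sketch, using only the standard library, for checking that the file is where the code expects it before loading the tagger. The class and method names (ModelPathCheck, modelExists, listModels) are made up for this illustration and are not part of the Stanford library.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: not part of the Stanford POS Tagger API.
class ModelPathCheck {

    // True if the given path points to an existing regular file.
    static boolean modelExists(String modelPath) {
        return Files.isRegularFile(Paths.get(modelPath));
    }

    // Names of all *.tagger files directly inside the folder,
    // or an empty list if the folder does not exist.
    static List<String> listModels(String folder) {
        List<String> names = new ArrayList<>();
        Path dir = Paths.get(folder);
        if (!Files.isDirectory(dir)) {
            return names;
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.tagger")) {
            for (Path p : stream) {
                names.add(p.getFileName().toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // Same relative path as in the tutorial code.
        String modelPath = "taggers/left3words-distsim-wsj-0-18.tagger";

        // Relative paths are resolved against this directory.
        System.out.println("Working directory: " + Paths.get("").toAbsolutePath());
        System.out.println("Model found: " + modelExists(modelPath));
        System.out.println("Models in taggers/: " + listModels("taggers"));
    }
}
```

If “Model found” is false, compare the file names printed on the last line with the name in your code; model file names can differ between releases of the tagger.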
    

Finally, we need to know what these “abbreviations” mean. For example, in this output:


This/DT is/VBZ a/DT sample/NN sentence/NN

What do “NN” and “DT” mean? The tagger uses the Penn Treebank tag set for English, as stated on the library’s homepage. In that tag set, NN marks a singular or mass noun, DT a determiner, and VBZ a verb in the third person singular present. For a list of the abbreviations click here. See the README-Models.txt file included in the models directory for more information about the tag sets for the other languages.
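
Coming back to the original goal of extracting nouns: once you have the tagged string, you can pick out the words whose tag is one of the Penn Treebank noun tags (NN, NNS, NNP, NNPS, which all start with “NN”). The following is only a sketch under two assumptions: the tagged string uses “/” between word and tag, as in the output shown in this post (adjust the separator if your release prints a different one), and the class name NounExtractor is made up for the example. It works on the plain string, so it does not need the tagger library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: works on the tagger's string output only.
class NounExtractor {

    // Assumed separator between word and tag, as in "sample/NN".
    static final char SEPARATOR = '/';

    // Returns the words whose tag starts with "NN" (NN, NNS, NNP, NNPS).
    static List<String> extractNouns(String tagged) {
        List<String> nouns = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            // Split on the last separator so a word containing '/' stays intact.
            int pos = token.lastIndexOf(SEPARATOR);
            if (pos <= 0) {
                continue; // not a word/TAG token, skip it
            }
            String word = token.substring(0, pos);
            String tag = token.substring(pos + 1);
            if (tag.startsWith("NN")) {
                nouns.add(word);
            }
        }
        return nouns;
    }

    public static void main(String[] args) {
        String tagged = "This/DT is/VBZ a/DT sample/NN sentence/NN";
        System.out.println(extractNouns(tagged)); // prints [sample, sentence]
    }
}
```

In a real program you would pass the result of tagger.tagString(...) to extractNouns instead of the hard-coded string.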

For memory problems (quoting Akash’s comment below):

It turns out that the problem is that Eclipse allocates only 256MB of memory by default. Right-click on the project -> Run As -> Run Configurations -> go to the Arguments tab -> under VM arguments type -Xmx2048m. This will set the allocated memory to 2GB and all the tagger files should run now.
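
To confirm that the -Xmx setting is actually being picked up by your run configuration, you can print the maximum heap size the JVM reports. This is a small standard-library sketch (the class name HeapCheck is made up); the reported number can be slightly lower than the value you passed, depending on the JVM.

```java
// Hypothetical helper for checking the effective heap limit.
class HeapCheck {

    // Maximum amount of memory the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Run it with the same run configuration (same VM arguments) as your tagger class; if the number does not change after adding -Xmx2048m, the argument was probably added to a different configuration.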

Updated:
Click here to download a sample project (for usage with Eclipse). It contains a tagger and a GUI example.

References
http://nlp.stanford.edu/software/tagger.shtml
http://www.englishclub.com/grammar/parts-of-speech_1.htm

111 thoughts on “Tagging text with Stanford POS Tagger in Java Applications”

  1. Can you guide me on how to initialize the tagger? When I run the Java application, it is unable to find or open the file when I pass the folder path to the .tagger file.

    I am passing “models/left3words-distsim-wsj-0-18.tagger”, as the file is under the models folder.

    Should this folder be in the same workspace as my project?

  2. Hey, I’m getting the following error. Please help ASAP. :)

    “Loading default properties from trained tagger taggers/left3words-distsim-wsj-0-18.tagger
    Error: No such trained tagger config file found.
    java.io.IOException: Unable to resolve “taggers/left3words-distsim-wsj-0-18.tagger” as either class path, filename or URL
    at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:331)
    at edu.stanford.nlp.tagger.maxent.TaggerConfig.getTaggerDataInputStream(TaggerConfig.java:724)
    at edu.stanford.nlp.tagger.maxent.TaggerConfig.<init>(TaggerConfig.java:186)
    at edu.stanford.nlp.tagger.maxent.TaggerConfig.<init>(TaggerConfig.java:131)
    at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:240)
    at TagText.main(TagText.java:11)
    java.io.IOException: Unable to resolve “taggers/left3words-distsim-wsj-0-18.tagger” as either class path, filename or URL”

  3. Hey, I got an error while using this code. :(
    The error is: “usage: Relation treebank numberRanges”
    Can you please guide me on what I should do?

  4. Whenever I run this program, it just gives this output:

    dep: []
    pred: [Root (S|SINV <# VP=target )]
    aux: [Root (VP < VP < /^(?:TO|MD|VB.*|AUXG?|POS)$/=target ), Root (SQ|SINV < (/^(?:VB|MD|AUX)/=target $++ /^(?:VP|ADJP)/ )), Root (CONJP < TO=target < VB ), Root (SINV < (VP=target < (/^(?:VB|AUX|POS)/ < /^(?i:am|is|are|be|being|'s|'re|'m|was|were|been|s|ai)$/ )$– (VP < VBG )))]
    auxpass: [Root (VP < (/^(?:VB|AUX|POS)/=target < /^(?i:am|is|are|be|being|'s|'re|'m|was|were|been|s|ai|seem|seems|seemed|seeming|appear|appears|appeared|become|becomes|became|becoming|get|got|getting|gets|gotten|remains|remained|remain)$/ )< (VP|ADJP [< VBN|VBD |< (VP|ADJP < VBN|VBD )< CC ])), Root (SQ|SINV < (/^(?:VB|AUX|POS)/=target < /^(?i:am|is|are|be|being|'s|'re|'m|was|were|been|s|ai)$/ $++ (VP < /^VB[DN]$/ ))), Root (SINV < (VP=target < (/^(?:VB|AUX|POS)/ < /^(?i:am|is|are|be|being|'s|'re|'m|was|were|been|s|ai)$/ )$– (VP < /^VB[DN]$/ ))), Root (SINV < (VP=target < (VP < (/^(?:VB|AUX|POS)/ < /^(?i:am|is|are|be|being|'s|'re|'m|was|were|been|s|ai)$/ ))$– (VP < /^VB[DN]$/ )))]
    cop: [Root (VP < (/^(?:VB|AUX)/=target < /^(?i:am|is|are|be|being|'s|'re|'m|was|were|been|s|ai|seem|seems|seemed|seeming|appear|appears|appeared|stay|stays|stayed|remain|remains|remained|resemble|resembles|resembled|resembling|become|becomes|became|becoming)$/ [$++ (/^(?:ADJP|NP$|WHNP$)/ !< VBN|VBD ) |$++ (S <: (ADJP < JJ ))])), Root (SQ|SINV < (/^(?:VB|AUX)/=target < /^(?i:am|is|are|be|being|'s|'re|'m|was|were|been|s|ai|seem|seems|seemed|seeming|appear|appears|appeared|stay|stays|stayed|remain|remains|remained|resemble|resembles|resembled|resembling|become|becomes|became|becoming)$/ [$++ (ADJP !< VBN|VBD ) |$++ (NP $++ NP ) |$++ (S <: (ADJP < JJ ))]))]
    conj: [Root (VP|S|SBAR|SBARQ|SINV|SQ < (CC|CONJP $– !/^(?:“|-LRB-|PRN|PP|ADVP|RB)/ $+ !/^(?:PRN|“|''|-[LR]RB-|,|:|.)$/=target )), Root (VP|S|SBAR|SBARQ|SINV|SQ < (CC|CONJP $– !/^(?:“|-LRB-|PRN|PP|ADVP|RB)/ $+ (ADVP $+ !/^(?:PRN|“|''|-[LR]RB-|,|:|.)$/=target ))), Root (VP|S|SBAR|SBARQ|SINV|SQ < (CC|CONJP $– !/^(?:“|-LRB-|PRN|PP|ADVP|RB)/ )< (/^(?:PRN|“|''|-[LR]RB-|,|:|.)$/ $+ /^S$|^(?:A|N|V|PP|PRP|J|W|R)/=target )), Root (/^(?:ADJP|JJP|PP|QP|(?:WH)?NP(?:-TMP|-ADV)?|ADVP|UCP|NX|NML)$/ < (CC|CONJP $– !/^(?:“|-LRB-|PRN)$/ $+ !/^(?:PRN|“|''|-[LR]RB-|,|:|.)$/=target )), Root (/^(?:ADJP|PP|(?:WH)?NP(?:-TMP|-ADV)?|ADVP|UCP|NX|NML)$/ < (CC|CONJP $– !/^(?:“|-LRB-|PRN)$/ $+ (ADVP $+ !/^(?:PRN|“|''|-[LR]RB-|,|:|.)$/=target ))), Root (/^(?:ADJP|PP|(?:WH)?NP(?:-TMP|-ADV)?|ADVP|UCP|NX|NML)$/ < (CC|CONJP $– !/^(?:“|-LRB-|PRN)$/ )< (/^(?:PRN|“|''|-[LR]RB-|,|:|.)$/ $+ /^S$|^(?:A|N|V|PP|PRP|J|W|R)/=target )), Root (NX|NML < (CC|CONJP $- __ )< (/^,$/ $- /^(?:A|N|V|PP|PRP|J|W|R|S)/=target )), Root (/^(?:VP|S|SBAR|SBARQ|ADJP|PP|QP|(?:WH)?NP(?:-TMP|-ADV)?|ADVP|UCP|NX|NML)$/ < (CC $++ (CC|CONJP $+ !/^(?:PRN|“|''|-[LR]RB-|,|:|.)$/=target )))]
    cc: [Root (/^(?:S|VP|(?:WH)?NP(?:-TMP|-ADV)?|QP|ADJP|PP|ADVP|UCP|NX|SBAR|SBARQ|SINV|SQ|JJP|NML|CONJP)/ [< (CC=target !< /^(?i:either|neither|both)$/ ) |< (CONJP=target !< (RB < /^(?i:not)$/ $+ (RB|JJ < /^(?i:only|just|merely)$/ )))])]
    punct: [Root (__ < /^(?:.|:|,|''|“|-LRB-|-RRB-)$/=target )]
    arg: []
    subj: [] ….
    more like this!!

    Can you please say where it is going wrong?

  5. Thanks! It works when I use it this way, but I get an error when I try to use this part of the code in a different applet program, where the input text to be tagged is taken from a text box in the applet. The error shown is as follows:

    Error: No such trained tagger config file found.

    An IOException is also shown.

    Kindly help me if you can.

  6. Hey,

    Thanks for the tutorial, really helped me get started. Just wondering, does the POS Tagger help with identifying phrases (not just words alone)?

  7. Hi Galal,

    Thanks for your nice presentation.
    Could you please let me know how I can use the POS tagger with NetBeans? It would be very helpful for me. Waiting for your reply.
    Thanks again

    Shuvo

  8. We have a project to convert English sentences to first-order logic form. Your code for POS-tagging text helped us. Can you please help us with the Java code to convert these tagged sentences into FOL (first-order logic) form? A little guidance would matter a lot.
    :)

  9. We are working on our final-year project on opinion mining, so we need a POS tagger, but we don’t know how to use one. Your instructions are simple and understandable, but we still got some errors when implementing your code.

    Here they are:

    /////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
    Loading default properties from trained tagger taggers/left3words-distsim-wsj-0-18.tagger
    Error: No such trained tagger config file found.
    java.io.FileNotFoundException: taggersleft3words-distsim-wsj-0-18.tagger (The system cannot find the file specified)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(Unknown Source)
    at java.io.FileInputStream.<init>(Unknown Source)
    at edu.stanford.nlp.tagger.maxent.TaggerConfig.getTaggerDataInputStream(TaggerConfig.java:736)
    at edu.stanford.nlp.tagger.maxent.TaggerConfig.<init>(TaggerConfig.java:184)
    at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:240)
    at TagText.main(TagText.java:9)
    Exception in thread “main” java.io.FileNotFoundException: taggersleft3words-distsim-wsj-0-18.tagger (The system cannot find the file specified)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(Unknown Source)
    at java.io.FileInputStream.<init>(Unknown Source)
    at edu.stanford.nlp.tagger.maxent.TaggerConfig.getTaggerDataInputStream(TaggerConfig.java:736)
    at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:667)
    at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:280)
    at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:240)
    at TagText.main(TagText.java:9)

      • Galal,

        This tutorial is great! Thanks for taking the time to put it together. Unfortunately, I’m receiving the same error as the people above. Here is a link to a screenshot of my folder structure. I used your example, but it doesn’t seem to be working.

        http://i.imgur.com/BE20l.png

        Thanks again,
        Adam

        • Galal,

          I got your example to work, using the 2011-04-20 release. In addition, you need to change:

          MaxentTagger tagger = new MaxentTagger(
          "taggers/left3words-distsim-wsj-0-18.tagger");

          to:

          MaxentTagger tagger = new MaxentTagger("taggers/left3words-wsj-0-18.tagger");

          It also required refreshing the project after setting up the libraries.

          Any ideas about how to make it work with the latest release of the POS tagger?

          Thanks,
          Adam

  10. I wanted to try an Arabic example. The model left3words-wsj-0-18.tagger cannot handle Arabic, so I tried the Arabic models, but the same errors were generated:
    Loading default properties from trained tagger sources/arabic-fast.tagger
    Reading POS tagger model from sources/arabic-fast.tagger … Exception in thread “main” java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:215)
    at java.io.DataInputStream.readUTF(DataInputStream.java:644)
    at java.io.DataInputStream.readUTF(DataInputStream.java:547)
    at edu.stanford.nlp.tagger.maxent.FeatureKey.read(FeatureKey.java:79)
    at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:758)
    at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:702)
    at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:286)
    at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:244)
    at taging.Main.main(Main.java:45)
    Java Result: 1
    BUILD SUCCESSFUL (total time: 8 seconds)

    • I had the same problem with almost all the tagger types. It turns out that the problem is that Eclipse allocates only 256MB of memory by default. Right-click on the project -> Run As -> Run Configurations -> go to the Arguments tab -> under VM arguments type -Xmx2048m. This will set the allocated memory to 2GB and all the tagger files should run now.

  11. Dear friends,
    I have a question about using the POS Tagger. I’m doing my PhD research and I need to extract N+N combinations from texts. I’ve got a general idea of how to use the program, but what I need to know is how many nouns are in the program’s dictionary. I need my project data to be as accurate as possible. Should it be able to find 99% of them in the text?

  12. Thanks for your guide :)

    Did you get this runtime error before:
    usage: Relation treebank numberRanges

    I get it as soon as I run the project.
    Thanks.