Center for Language Engineering







  Statistical Part of Speech Tagger for Urdu v1.0  
  Release Notes  

A part of speech tagging system consists of two main phases: tagset design and implementation of a disambiguation technique. Urdu shares a large part of its vocabulary with Arabic and Persian, and its morphology and syntactic structure with Hindi. However, there are standard tagging guidelines that aim to standardize the tagsets of all the languages of the world. The training corpus was manually checked to ensure that words are separated by spaces. The corpus was prepared by applying normalization and by removing diacritics and non-Urdu words. The tagger achieved an accuracy of 97.2% when tested on data of 10,000 words.
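The corpus preparation steps described above (normalization, diacritic removal, and removal of non-Urdu words) can be illustrated with a short Python sketch. This is not the tagger's actual preprocessing code, which is not described in these notes. The function names are hypothetical, Unicode NFC is assumed as the normalization form, and a token is assumed to be Urdu when all of its characters fall in the Arabic Unicode block (U+0600 to U+06FF).

```python
import unicodedata
from typing import List


def strip_diacritics(text: str) -> str:
    """Drop combining marks (Unicode category Mn), which includes
    Arabic-script diacritics such as zabar, zer, and pesh."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def is_urdu_token(token: str) -> bool:
    """Assumption: a token counts as Urdu if every character lies in
    the Arabic Unicode block, U+0600 to U+06FF."""
    return bool(token) and all("\u0600" <= ch <= "\u06FF" for ch in token)


def prepare_line(line: str) -> List[str]:
    """Normalize a line, remove diacritics, split on whitespace,
    and keep only the tokens that look like Urdu words."""
    normalized = unicodedata.normalize("NFC", line)  # assumed normalization form
    cleaned = strip_diacritics(normalized)
    return [tok for tok in cleaned.split() if is_urdu_token(tok)]


if __name__ == "__main__":
    # A word carrying a diacritic, a Latin-script word, and a plain Urdu word.
    print(prepare_line("\u06A9\u0650\u062A\u0627\u0628 book \u0627\u0686\u06BE\u06CC"))
```

Splitting on whitespace mirrors the note that words in the corpus are separated by spaces. A real pipeline would likely need Urdu-specific normalization rules beyond NFC.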

The Part of Speech Tagger tags the given text using the Urdu Part of Speech Tagset. The tagger reads its input from the file "input.txt". A file named "Tags.txt", containing the open class tags, supplies the candidate tags for unknown words. The output is saved in a text file named "results.txt".
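The role of "Tags.txt" can be sketched as follows: a word seen in training keeps the tags observed for it, while an unknown word receives every open class tag as a candidate, leaving the choice to the disambiguation step. This Python sketch rests on assumptions. The layout of "Tags.txt" is taken to be one tag per line, the `lexicon` mapping and both function names are hypothetical, and the tag names in the usage example are placeholders rather than entries confirmed to be in the Urdu tagset. The statistical disambiguation itself is not shown.

```python
from pathlib import Path
from typing import Dict, List, Set


def load_open_class_tags(path: str = "Tags.txt") -> List[str]:
    """Read the open class tags, assuming one tag per line
    and ignoring blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [ln.strip() for ln in lines if ln.strip()]


def candidate_tags(
    word: str,
    lexicon: Dict[str, Set[str]],
    open_class_tags: List[str],
) -> List[str]:
    """Return the tags a disambiguator would choose among.

    Known words: the tags recorded in the (hypothetical) lexicon.
    Unknown words: every open class tag.
    """
    if word in lexicon:
        return sorted(lexicon[word])
    return list(open_class_tags)


if __name__ == "__main__":
    open_tags = ["NN", "VB", "JJ"]            # placeholder tag names
    lexicon = {"known": {"NN"}}               # hypothetical training lexicon
    print(candidate_tags("known", lexicon, open_tags))
    print(candidate_tags("unseen", lexicon, open_tags))
```

Restricting unknown words to open class tags is a common design choice, because closed classes such as pronouns and postpositions are usually fully covered by the training data.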

The Statistical POS Tagger requires Microsoft .NET Framework 2.0.

  Download (access count recorded since 23 August 2011)  

Statistical Part of Speech Tagger for Urdu