Essential Urdu Linguistic Resources

This project offers a unique opportunity to deepen collaboration between Pakistan and Germany, to build linguistic research capacity in Pakistan, and at the same time to develop much-needed linguistic resources.

The collaboration covers three aspects:

The teams will organize joint workshops that bring researchers from both sides together to develop a common understanding of the issues and solutions in core areas relevant to grammar and semantic resources. These areas include the part-of-speech (POS) tagset, WordNet and VerbNet structures, and semantic and multiword issues, particularly those related to nouns.

The teams will work to develop multiple layers of annotation on a common corpus of Urdu, including POS tags and semantic senses (i.e., the range of meanings available for a given word).

The common understanding of the issues and problems involved, together with the annotated corpus to be developed, will be used to derive additional information and perform further analyses. These will in turn be used to develop algorithms for automatically annotating a corpus with POS tags as well as word senses. The resulting multi-layered, automatically annotated corpus can then be used to identify and extract further information about the different word classes in order to feed thesauri and databases like WordNet and VerbNet. These resources can then be used to develop reliable strategies for word sense disambiguation (WSD), that is, strategies that reliably distinguish between the different senses of a word. A classic example illustrating WSD is the English word "bank", which as a noun can mean either the bank of a river or a financial institution.
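To make the "bank" example concrete, the following is a minimal sketch of gloss-overlap disambiguation (a simplified variant of the Lesk approach). It is not the project's method: the sense labels, the glosses, the stopword list, and the function names are all hypothetical and chosen only for illustration.

```python
# Hypothetical mini sense inventory for the noun "bank". The glosses are
# illustrative and are not taken from WordNet or from the project data.
SENSES = {
    "bank.river": "sloping land beside a river or lake",
    "bank.finance": "financial institution that accepts deposits and lends money",
}

# Small, assumed stopword list so that function words do not count as overlap.
STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "to", "on", "in", "she", "he", "we"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def disambiguate(context, senses=SENSES):
    """Return the sense whose gloss shares the most words with the context."""
    context_words = tokenize(context)
    overlaps = {
        sense: len(context_words & tokenize(gloss))
        for sense, gloss in senses.items()
    }
    # On a tie, max() keeps the first sense in insertion order.
    return max(overlaps, key=overlaps.get)


river_sense = disambiguate("We sat on the bank of the river")
money_sense = disambiguate("She went to the bank to deposit money")
print(river_sense, money_sense)
```

A real system would need a much larger sense inventory, such as the WordNet and VerbNet resources described here, and would also have to handle inflected forms, which this sketch ignores.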

In detail, the work will be structured in five steps:

POS tagset and tagging
1. Analysis of issues
2. POS tagset revision
3. Manual tagging of 100,000 words
4. Automatic tagging of 5 million words (to extract words, senses, frames, etc.)
Urdu WordNet
1. Analysis of issues
2. Add 2000 senses for a total of 5000 senses (3000 senses have already been derived from the transliterated Hindi WordNet)
3. Revise and add to existing hierarchical relationships between the senses
Urdu VerbNet
1. Analysis of issues
2. Automatic Acquisition of Sub-categorization Frames
3. Identifying Sets of Verb Classes
Sense-tagged Corpus
1. Analysis of issues
2. Manual tagging of a 100,000-word corpus (for words in WordNet and VerbNet)
3. Algorithm for identifying nouns vs. names
4. Understanding and identifying the semantics underlying N-V combinations
5. Identification of multiwords
6. Automated tagging of words covered by WordNet and VerbNet and names, multiwords and N-V combinations identified in parts
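
The outline above moves from a manually tagged corpus to automatic tagging of a much larger one. As a rough illustration of that idea only, here is a most-frequent-tag baseline tagger trained on hand-tagged sentences. The romanized example words, the tag names (PR, NN, VB), and the function names are placeholders assumed for this sketch; they are not the project's tagset, data, or algorithm.

```python
from collections import Counter, defaultdict

# Toy hand-tagged sentences. Words and tag names are illustrative placeholders.
TAGGED = [
    [("yeh", "PR"), ("kitab", "NN"), ("hai", "VB")],
    [("woh", "PR"), ("kitab", "NN"), ("parhta", "VB"), ("hai", "VB")],
]


def train(tagged_sentences):
    """Learn the most frequent tag per word, plus an overall fallback tag."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
            all_tags[tag] += 1
    lexicon = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    default_tag = all_tags.most_common(1)[0][0]
    return lexicon, default_tag


def tag_words(words, lexicon, default_tag):
    """Tag known words from the lexicon; unknown words get the fallback tag."""
    return [(w, lexicon.get(w, default_tag)) for w in words]


lexicon, default_tag = train(TAGGED)
tagged = tag_words(["woh", "kitab", "hai"], lexicon, default_tag)
print(tagged)
```

A baseline of this kind ignores context entirely, which is one reason the analysis of tagging issues and the tagset revision come before any automatic tagging in the outline.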
Design and Specification of a New Linguistics Course

Prof. Hussain currently teaches computational linguistics at the University of Engineering and Technology (UET) in Lahore, Pakistan. He also teaches MPhil students at Kinnaird College in Lahore, where he is on the Board of Studies for Linguistics. The courses developed as part of this project will thus flow directly into existing curricula in Pakistan as well as in Germany, where Prof. Butt will integrate them into the existing MA programs in General Linguistics and in Speech and Language Processing.