Center for Language Engineering







  Urdu-Nepali-English Parallel Corpus  

The Center for Research in Urdu Language Processing (CRULP) is pleased to release Urdu and Nepali corpora parallel to 100,000 words of common English source text from the Penn Treebank corpus, which is available through the Linguistic Data Consortium (LDC). The text files used are listed in the README file provided with each corpus. The corpora are also tagged for part of speech.

The work has been supported by the Language Resource Association (GSK) of Japan and the International Development Research Centre (IDRC) of Canada, through the PAN Localization project.

  Download  
  Urdu Corpus | Read Me | License  
  Urdu Corpus Extended | Read Me | License  
  POS Tagged Urdu Corpus | Read Me | License  
  Nepali Corpus | Read Me | License  
  POS Tagged Nepali Corpus | Read Me | License