Center for Language Engineering







  Urdu Ligatures from Corpus  
  Release Notes  

The wordlist has been extracted from a corpus of 19.3 million words, gathered from a wide range of domains
with the end-user perspective in view. The domains and their sub domains are listed in the following table.



  Domains                        Sub domains

  C1. Sports/Games               C1.1. Sports (special events)
  C2. News                       C2.1. Local and international affairs
                                 C2.2. Editorials and opinions
  C3. Finance                    C3.1. Business, domestic and foreign market
  C4. Culture/Entertainment      C4.1. Music, theatre, exhibitions, review articles on literature
                                 C4.2. Travel / tourism
  C5. Consumer Information       C5.1. Health
                                 C5.2. Popular science
                                 C5.3. Consumer technology
  C6. Personal communications    C6.1. Emails, online, discussions, editorials, e-zines

The domain-wise distribution of corpus size is given in the following table.


  Domains                        Raw Corpora    Distinct words

  C1. Sports/Games                   1666304             23118
  C2. News                           8957259             67365
  C3. Finance                        1162019             17024
  C4. Culture/Entertainment          3845117             59214
  C5. Consumer Information           1980723             34151
  C6. Personal communications        1685424             30469
  Total                             19296846            104341

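
The totals in the table above can be cross-checked with a short script. This is only a sketch using the figures as published; the variable names are illustrative. The raw corpus sizes add up to the reported total, while the reported distinct-word total is smaller than the sum of the per-domain counts, presumably because many words occur in more than one domain and are counted once overall.

```python
# Figures copied from the table above: (raw corpus size, distinct words).
domains = {
    "C1. Sports/Games": (1666304, 23118),
    "C2. News": (8957259, 67365),
    "C3. Finance": (1162019, 17024),
    "C4. Culture/Entertainment": (3845117, 59214),
    "C5. Consumer Information": (1980723, 34151),
    "C6. Personal communications": (1685424, 30469),
}

# Totals as reported in the last row of the table.
reported_total_raw = 19296846
reported_total_distinct = 104341

raw_sum = sum(raw for raw, _ in domains.values())
distinct_sum = sum(distinct for _, distinct in domains.values())

# The raw sizes are additive across domains.
print("raw sizes match reported total:", raw_sum == reported_total_raw)

# Distinct counts are not additive: a word shared by several domains
# is counted in each of them, but only once in the overall total.
print("sum of per-domain distinct words:", distinct_sum)
print("reported overall distinct words:", reported_total_distinct)
```
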
  The list has been cleaned of non-Urdu characters, but it has not been validated for other issues of the corpus, e.g. spelling mistakes or text from other languages (such as Arabic or Punjabi) quoted within Urdu text.  
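
The release notes do not describe how the cleaning was done. As a rough illustration only, the sketch below assumes that "Urdu characters" can be approximated by the Unicode Arabic block (U+0600 to U+06FF); the function name `strip_non_urdu` and this character range are assumptions, not the procedure actually used for this list.

```python
import re

# Assumption: Urdu characters are approximated by the Unicode Arabic block
# (U+0600-U+06FF). The actual character set used for the list is not stated.
NON_URDU = re.compile(r"[^\u0600-\u06FF\s]")


def strip_non_urdu(text: str) -> str:
    """Drop characters outside the assumed range and collapse whitespace."""
    cleaned = NON_URDU.sub(" ", text)
    return " ".join(cleaned.split())


if __name__ == "__main__":
    # Latin letters and ASCII digits are removed; Arabic-script words remain.
    print(strip_non_urdu("\u0627\u0631\u062f\u0648 abc 123 \u0632\u0628\u0627\u0646"))
```

A filter of this kind works on the script only, not the language, so Arabic or Punjabi (Shahmukhi) words written in the same script would pass through unchanged. That is consistent with the caveat above that quoted text from other languages has not been validated.
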
  Download  
