Center for Language Engineering







  Urdu Most Frequently Used Ligatures List  
  Release Notes  

The ligature list has been extracted from a corpus of 19.3 million words gathered from a wide range of domains,
listed in the following table, chosen with the end-user perspective in view.



Domains                        Sub domains

C1. Sports/Games               C1.1. Sports (special events)
C2. News                       C2.1. Local and international affairs
                               C2.2. Editorials and opinions
C3. Finance                    C3.1. Business, domestic and foreign market
C4. Culture/Entertainment      C4.1. Music, theatre, exhibitions, review articles on literature
                               C4.2. Travel / tourism
C5. Consumer Information       C5.1. Health
                               C5.2. Popular science
                               C5.3. Consumer technology
C6. Personal communications    C6.1. Emails, online discussions, editorials, e-zines

The domain-wise distribution of corpus size is given in the following table.


Domains                        Raw Corpora    Distinct words

C1. Sports/Games                   1666304             23118
C2. News                           8957259             67365
C3. Finance                        1162019             17024
C4. Culture/Entertainment          3845117             59214
C5. Consumer Information           1980723             34151
C6. Personal communications        1685424             30469
Total                             19296846            104341
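
The arithmetic behind the size table can be checked with a short Python sketch. The figures are copied from the table above; the names raw_corpora, distinct_words and domain_shares are hypothetical and used only for illustration. The script confirms that the six raw corpus sizes add up to the reported total and prints each domain's share of the corpus.

```python
# Figures copied from the corpus size table; all names here are illustrative.
raw_corpora = {
    "C1. Sports/Games": 1666304,
    "C2. News": 8957259,
    "C3. Finance": 1162019,
    "C4. Culture/Entertainment": 3845117,
    "C5. Consumer Information": 1980723,
    "C6. Personal communications": 1685424,
}
distinct_words = {
    "C1. Sports/Games": 23118,
    "C2. News": 67365,
    "C3. Finance": 17024,
    "C4. Culture/Entertainment": 59214,
    "C5. Consumer Information": 34151,
    "C6. Personal communications": 30469,
}
REPORTED_TOTAL = 19296846           # "Total" row, Raw Corpora column
REPORTED_DISTINCT_TOTAL = 104341    # "Total" row, Distinct words column


def domain_shares(sizes):
    """Return each domain's percentage share of the summed corpus size."""
    total = sum(sizes.values())
    return {name: 100.0 * size / total for name, size in sizes.items()}


if __name__ == "__main__":
    # The raw corpus sizes sum exactly to the reported total.
    assert sum(raw_corpora.values()) == REPORTED_TOTAL
    for name, share in domain_shares(raw_corpora).items():
        print(f"{name:<30}{share:6.1f}%")
    # Per-domain distinct counts sum to more than the reported overall figure.
    print(sum(distinct_words.values()), ">", REPORTED_DISTINCT_TOTAL)
```

The News domain accounts for a little under half of the corpus. The per-domain distinct word counts sum to 231341, more than the reported total of 104341, which is presumably because the same word is counted once in every domain where it occurs but only once in the overall figure.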
  Download (accesses to this file have been counted since 25 April 2012)  

Urdu Most Frequently Used Ligatures