Center for Language Engineering







  Phonetically Rich Urdu Speech Corpus  

The Urdu Phonetically Rich Speech Corpus contains 70 minutes of transcribed read speech: 708 sentences, constructed greedily to represent all phonemic and triphonemic combinations in Urdu (based on an 18-million-word corpus of Urdu news articles). It comprises 10,101 tokens and 5,656 unique words. In addition to providing phonetic coverage for Urdu, the corpus is phonemically balanced. It also provides triphonemic coverage, although it is not completely balanced for triphonemes. It contains 60 unique phones and 42,289 phone occurrences. All sentences in the corpus were written manually by trained linguists, who followed a greedy approach to accommodate the target words (selected using a set cover algorithm) while adding as few extra words as possible. As a result, the sentences are grammatically correct, but the choice of words in some of them is unusual.
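The word selection step mentioned above can be illustrated with a small sketch of greedy set cover over triphones. This is only an illustration under stated assumptions, not the procedure actually used to build the corpus: the function names (`triphones`, `greedy_cover`), the toy lexicon, its placeholder word labels, and its phone strings are all hypothetical, and the description above does not say how ties were broken or how pronunciations were obtained.

```python
from typing import Dict, List, Set, Tuple

Triphone = Tuple[str, ...]


def triphones(phones: List[str]) -> Set[Triphone]:
    """Return the set of consecutive phone triples in one pronunciation."""
    return {tuple(phones[i:i + 3]) for i in range(len(phones) - 2)}


def greedy_cover(lexicon: Dict[str, List[str]]) -> List[str]:
    """Greedily pick words until every triphone found in the lexicon is covered."""
    units = {word: triphones(phones) for word, phones in lexicon.items()}
    uncovered: Set[Triphone] = set()
    for covered_by_word in units.values():
        uncovered |= covered_by_word

    chosen: List[str] = []
    while uncovered:
        # Pick the word covering the most still-uncovered triphones.
        # Iterating in sorted order makes ties resolve alphabetically
        # (an assumption made here for determinism).
        best = max(sorted(units), key=lambda w: len(units[w] & uncovered))
        chosen.append(best)
        uncovered -= units[best]
    return chosen


# Hypothetical toy lexicon: placeholder word labels mapped to phone sequences.
TOY_LEXICON = {
    "w1": ["k", "a", "t", "a", "b"],
    "w2": ["k", "a", "t"],
    "w3": ["t", "a", "b"],
    "w4": ["a", "b", "a", "d"],
}

if __name__ == "__main__":
    print(greedy_cover(TOY_LEXICON))
```

In this toy case two words suffice to cover all five triphones, because "w2" and "w3" add nothing beyond what "w1" already covers. Greedy selection does not guarantee the smallest possible word set, but it is a common approximation for set cover problems of this kind.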

  Download  
  Phonetically Rich Urdu Speech Corpus License