----------------------------------------------------------------------------------
Unicode Urdu to ASCII Transliteration
----------------------------------------------------------------------------------
The transliteration utility maps Unicode Urdu text to ASCII encoding. Two options are available: diacritize the input text before transliteration (Transliterate.bat), or transliterate it without diacritization (TransliterateNoAerab.bat). Both utilities take two parameters.
Syntax:
----------------------------------------------------------------------------------
Transliterate
TransliterateNoAerab
File Formats:
----------------------------------------------------------------------------------
The input file should be a simple text file (.txt), containing the Urdu words to be transliterated. It should be in Unicode format.
The output file contains the equivalent transliterated text, one word per line. These utilities transliterate only the Urdu text.
The following files are used by these utilities to generate output.
1) NormalizeNFC.txt
This file contains the normalization rules for composition (NFC), one rule per line. The format of a rule is: replace:pattern (right to left), where replace may be empty.
2) Rules.txt
This file lists the Unicode to Urdu Zabta Takhti (UZT) conversion rules.
3) Transliteration.txt
This file contains the transliteration rules to be used by the Xerox Finite-State Tool (XFST).
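The replace:pattern rule format used by NormalizeNFC.txt can be illustrated with a short Python sketch. It is not the utility's own code. It assumes that "right to left" means the text to the right of the colon is the pattern to find and the text to the left is what replaces it, that the first colon on a line is the separator, and that patterns are matched as literal substrings. The function names and the two sample rules are hypothetical and are not taken from the real NormalizeNFC.txt.

```python
def parse_rule(line):
    """Split one 'replace:pattern' line into a (pattern, replacement) pair."""
    # Assumption: the first colon separates the replacement (left) from the pattern (right).
    replacement, sep, pattern = line.rstrip("\r\n").partition(":")
    if not sep or not pattern:
        raise ValueError(f"malformed rule: {line!r}")
    return pattern, replacement


def apply_rules(text, rules):
    """Apply each (pattern, replacement) pair in order as a literal substitution."""
    for pattern, replacement in rules:
        text = text.replace(pattern, replacement)
    return text


# Hypothetical sample rules for illustration only.
sample_lines = [
    "\u0622:\u0627\u0653",  # ALEF + MADDAH ABOVE -> ALEF WITH MADDA ABOVE
    ":\u0640",              # empty replacement: delete TATWEEL
]
rules = [parse_rule(line) for line in sample_lines]

# ALEF, MADDAH ABOVE, BEH, TATWEEL becomes ALEF WITH MADDA ABOVE, BEH.
normalized = apply_rules("\u0627\u0653\u0628\u0640", rules)
```

Rules are applied in file order here, so an earlier rule can change what a later rule sees. Whether the real utility behaves the same way is not stated in this document.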
These utilities also generate some interim files during conversion.
1) Normalized.txt
The input text is normalized, tokenized, and stored in Normalized.txt. Tokenization splits the input on white space and some punctuation marks.
2) Diacritized.txt
The normalized text is then diacritized by looking up each word in the Urdu lexicon (wordformshashtable.lex) developed at the Center for Research in Urdu Language Processing (CRULP). The lexicon contains 80 thousand diacritized Urdu words. If a word has multiple diacritized options, only the first one is selected.
3) uztconverted.txt
This file contains the UZT equivalent of the diacritized text.
4) XFSTOut.txt
This file contains the output generated by XFST.
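The tokenization and diacritization steps behind Normalized.txt and Diacritized.txt can be sketched in Python as follows. This is an illustration under stated assumptions, not the utility's implementation. The exact punctuation set is not given in this document, so the delimiter list below is a guess. The on-disk format of wordformshashtable.lex is also unknown, so the lexicon is modelled as a plain dictionary from an undiacritized word to a list of diacritized options. Leaving unknown words unchanged is an assumption as well. All names and sample entries are hypothetical.

```python
import re

# Assumed delimiters: whitespace plus a few ASCII and Arabic-script punctuation marks
# (ARABIC COMMA, ARABIC SEMICOLON, ARABIC QUESTION MARK, ARABIC FULL STOP).
TOKEN_SPLIT = re.compile(r"[\s.,!?\u060C\u061B\u061F\u06D4]+")


def tokenize(text):
    """Split on whitespace and the assumed punctuation, dropping empty tokens."""
    return [tok for tok in TOKEN_SPLIT.split(text) if tok]


def diacritize(tokens, lexicon):
    """Replace each token with its first lexicon option; keep unknown tokens as-is."""
    return [lexicon[tok][0] if lexicon.get(tok) else tok for tok in tokens]


# Hypothetical lexicon entries: undiacritized word -> list of diacritized options.
lexicon = {
    "\u0639\u0644\u0645": [                  # two options, only the first is used
        "\u0639\u0650\u0644\u0645",
        "\u0639\u064E\u0644\u064E\u0645",
    ],
    "\u06A9\u062A\u0627\u0628": ["\u06A9\u0650\u062A\u0627\u0628"],
}

# Three words separated by a space, an Arabic comma, and a final Arabic full stop.
text = "\u0639\u0644\u0645 \u06A9\u062A\u0627\u0628\u060C \u0642\u0644\u0645\u06D4"
tokens = tokenize(text)
result = diacritize(tokens, lexicon)

# One word per line, mirroring the output layout described under File Formats.
print("\n".join(result))
```

Taking the first option keeps the lookup deterministic, at the cost of ignoring context that might favour another reading of the word.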