Center for Language Engineering







  Urdu Normalization Utility v1.0  

Normalization is the process of converting multiple equivalent representations of data to a single, consistent normal form. Normalized text may take one of two forms: composed or decomposed. Composition combines character sequences into precomposed characters wherever possible; for example, ا (U+0627) followed by ٓ (U+0653) composes to آ (U+0622). Decomposition is the opposite process: it breaks precomposed characters back into their constituents.
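The composition example above can be checked with a minimal sketch. It uses Python's standard `unicodedata` module, not the Urdu Normalization Utility itself, and the constant names are illustrative assumptions.

```python
import unicodedata

# Illustrative names for the code points mentioned in the text.
ALEF = "\u0627"             # ARABIC LETTER ALEF
MADDA_ABOVE = "\u0653"      # ARABIC MADDAH ABOVE (combining mark)
ALEF_WITH_MADDA = "\u0622"  # ARABIC LETTER ALEF WITH MADDA ABOVE

# Composition (NFC): the two-character sequence becomes one precomposed character.
composed = unicodedata.normalize("NFC", ALEF + MADDA_ABOVE)

# Decomposition (NFD): the precomposed character is split into its constituents.
decomposed = unicodedata.normalize("NFD", ALEF_WITH_MADDA)

print(composed == ALEF_WITH_MADDA)        # True
print(decomposed == ALEF + MADDA_ABOVE)   # True
print(len(composed), len(decomposed))     # 1 2
```

Both strings look identical when rendered, which is why text should be normalized before it is compared or searched.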

The Unicode normalization standard defines two kinds of equivalence between characters: canonical equivalence and compatibility equivalence. Based on these, it defines four normalization forms:

1. Normalization form D (NFD) or canonical decomposition
2. Normalization form C (NFC) or canonical decomposition followed by canonical composition
3. Normalization form KD (NFKD) or compatibility decomposition
4. Normalization form KC (NFKC) or compatibility decomposition followed by canonical composition
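As a rough illustration of how the four forms differ, the sketch below applies each one to a single presentation-form character, U+FEF5 (ARABIC LIGATURE LAM WITH ALEF WITH MADDA ABOVE ISOLATED FORM). It relies on Python's standard `unicodedata` module rather than the utility described here; the helper names `code_points` and `all_forms` are hypothetical.

```python
import unicodedata

def code_points(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

def all_forms(text):
    """Return the text under each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFD", "NFC", "NFKD", "NFKC")}

# This ligature has only a compatibility decomposition (to lam + alef with madda).
sample = "\uFEF5"
forms = all_forms(sample)

for name, value in forms.items():
    print(name, code_points(value))
# NFD  U+FEF5                 (canonical forms leave the ligature untouched)
# NFC  U+FEF5
# NFKD U+0644 U+0627 U+0653   (compatibility, then canonical decomposition)
# NFKC U+0644 U+0622          (same, followed by canonical composition)
```

The canonical forms (NFD, NFC) keep the presentation form as it is, while the compatibility forms (NFKD, NFKC) replace it with ordinary letters.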

Urdu Normalization Utility v1.0 supports three normalization forms: Normalization form D (NFD), Normalization form C (NFC), and Normalization form KD (NFKD). The NFKD form provided by the utility is non-reversible: because compatibility decomposition discards distinctions such as presentation forms, the result cannot always be converted back to its original form.
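The non-reversibility of NFKD can be seen in a short sketch, again using Python's standard `unicodedata` module as a stand-in for the utility. The variable names are assumptions chosen for illustration.

```python
import unicodedata

ALEF = "\u0627"           # ARABIC LETTER ALEF
ALEF_ISOLATED = "\uFE8D"  # ARABIC LETTER ALEF ISOLATED FORM (presentation form)

# Two different inputs collapse to the same NFKD output.
from_plain = unicodedata.normalize("NFKD", ALEF)
from_isolated = unicodedata.normalize("NFKD", ALEF_ISOLATED)
print(from_plain == from_isolated)    # True

# Recomposing cannot restore the presentation form: that information is gone.
round_trip = unicodedata.normalize("NFC", from_isolated)
print(round_trip == ALEF_ISOLATED)    # False
```

Since both inputs yield the same output, no inverse operation can tell which one was the original. NFD does not have this problem for canonically equivalent text, because NFC recomposes it to an equivalent string.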

  Download (access count tracked since 23 August 2011)  
  Urdu Normalization API v1.0 (Release Notes, License)  
  Urdu Normalization Application v1.0 (Release Notes, License)  
  Urdu Normalization Source Code v1.0 (Release Notes, License)