Current Projects

Sr #	Project Title	Principal Investigator	Funding Organization	Status
32	Automatic Meeting Minutes and Analytics Generation for Urdu	PI Dr. Farah Adeeba Co-PI Ms. Sana Shams	TTSF-HEC	Ongoing

Project Details
Start Date:	1st October, 2022
Duration:	24 months
Funding Organization:	Technology Transfer Support Fund (TTSF), Higher Education Commission (HEC)
Principal Investigator(s):	Dr. Farah Adeeba
Co-Principal Investigator(s):	Ms. Sana Shams
Collaborations:	Virtual Force
Project Status:	Ongoing
Objectives:	To design and develop an intelligent meeting note-taker for Punjabi-accented Urdu and English speakers. To design and develop speech separation systems that detect and separate overlapping speech for use in further speech-based applications. To design and develop a meeting speech corpus consisting of 15 hours of speech from multiple speakers.
Scope:	The system will provide automatic methods to extract and analyze the information content of meetings in various ways, including automatic transcription, decisions (action items), targeted browsing, topic identification, meeting summaries and meeting minutes. The product will allow businesses and individuals to use the proposed meeting assistant as a browser extension with existing web-based video conferencing applications such as Zoom, MS Teams, Google Meet and Slack to get meeting insights. In addition to the business sector, the system would benefit the government by providing meeting insights without compromising data privacy. Moreover, processing large amounts of video data will accelerate the research and development of new Urdu-based AI systems.
Deliverables:	Project ramp-up. System functional specification & design. Overlapped-speech detection R&D. Meeting data design, collection & annotation. Prototype system. Overlapped-speech detection and separation. Topic identification. Meeting minutes. Meeting summarization with acoustic and lexical features. User acceptance testing.
Project URL(s):

31	Enhancement of Nafees Nastaleeq Font	PI	Forum for Language Initiatives (FLI)	Completed

Project Details
Start Date:	1st July, 2022
Duration:	6 months
Funding Organization:	Forum for Language Initiatives (FLI)
Principal Investigator(s):
Collaborations:
Project Status:	Completed
Objectives:	Enhancement of Nafees Nastaleeq font by adding support for rendering 43 additional characters.
Scope:	The project aims to enhance the Nafees Nastaleeq font which currently does not support some important characters of Northern languages of Pakistan. This effort will support the use of Nafees Nastaleeq font in different local languages.
Deliverables:	Enhancement of the Nafees Nastaleeq font by adding support for rendering 43 additional characters selected from different languages spoken in Northern Pakistan. Literature survey to understand the Nafees Nastaleeq font and the shapes of characters occurring in different contexts, i.e., initial, middle and final positions. Development and modification of different shapes of the 43 characters. Development of rules for writing and joining the shapes in different contexts. Implementation of the writing rules.
Project URL(s):

30	Development of Urdu Test Material for Hearing Impaired Community	PI Dr. Sarmad Hussain	Fogarty International Center of the National Institutes of Health, USA	Completed

Project Details
Start Date:	9th May, 2022
Duration:	10 months
Funding Organization:	Fogarty International Center of the National Institutes of Health, USA
Principal Investigator(s):	Dr. Sarmad Hussain
Collaborations:	Hamza Foundation Academy for the Deaf
Project Status:	Completed
Objectives:	To develop standardized test batteries for speech audiometric testing in Pakistan. To develop standardized Urdu word lists by incorporating linguistic and speech audiometry knowledge. To facilitate proper audiological and speech diagnosis by developing test materials for SRT and WRS. To facilitate excellent education systems for the deaf through the development of these test materials.
Scope:	100 native Urdu speakers/participants (50 boys & 50 girls) with normal hearing will be selected for this study (native speakers of Punjabi using Urdu at their homes and institutions can also be selected). The data will be collected from the children of age range 5 to 15. This age range will be divided into 3 groups (5-8, 9-12, and 13-15), and participants from each group will be selected.
Deliverables:	An interim list of 50 bi-syllabic words for SRT. An interim list of 50 mono-syllabic words for WRS. Progress report, financial report and invoice. Final list of 25 bi-syllabic words for SRT. Final list of 25 mono-syllabic words for WRS. Final financial report and project report.
Project URL(s):

29	Expanding E-Khidmat Center’s Services and Delivery to Broader Citizen Base Using Intelligent Urdu Language Technology	PI Dr. Sarmad Hussain	FCDO	Completed

Project Details
Start Date:	1st January, 2022
Duration:	15 months
Funding Organization:	FCDO
Principal Investigator(s):	Dr. Sarmad Hussain
Collaborations:	Punjab Information Technology Board (PITB)
Project Status:	Completed
Objectives:	Identify informational services required by the poor and marginalized community in Punjab through effective engagement and feedback. Design and develop an intelligent, local-language, speech-enabled service delivery platform to maximize access for the poor and marginalized community of Punjab through a single window of operation, more effectively and efficiently. Assess end-users’ perception of the speech-enabled service delivery platform and gather feedback on the quality and relevance of the information provided, to help improve the service.
Scope:	Conduct a needs assessment survey of 700 respondents from the Lahore, Faisalabad and Sargodha districts to determine end-users’ need for the various EKM services. A comprehensive list of EKM services is available in Appendix A. Develop a mobile-based, speech-enabled automated service delivery platform able to recognize Urdu speech from the population residing in Punjab (covering the Majhi and Shahpuri dialects in the current phase of work). Enable the 12 most-needed services, as determined by the needs assessment survey, through this platform. For each service, information regarding the service fee, location, timing, required documents and procedure will be communicated to the citizen in a female voice in Urdu. Conduct an evaluation and impact assessment survey to investigate the following: end-users’ perception of the new speech-based mobile platform, their level of satisfaction and any issues in using it; end-users’ feedback regarding the quality and relevance of the information provided by the system; and possible informational gaps related to the services provided through the system.
Deliverables:	Report on review of the literature and finalized data collection tool for the needs assessment survey. Report on the design of the speech-enabled, mobile-based e-government informational service delivery platform. Report on citizens’ informational needs regarding e-government services. Speech-enabled, mobile-based e-government informational service delivery platform for 12 services. Report on citizens’ perception and user experience of the speech-enabled e-government service delivery platform. Final project report.
Project URL(s):

28	Empowering Public Policy on Socio-Economic Issues through Citizens Sentiment Analysis	PI Dr. Kashif Javed Co-PI Dr. Sarmad Hussain	PIRCA-PHEC	Completed

Project Details
Start Date:	1st July, 2021
Duration:	2 years
Funding Organization:	Punjab Innovation and Research Challenge Award (PIRCA), Punjab Higher Education Commission (PHEC)
Principal Investigator(s):	Dr. Kashif Javed
Co-Principal Investigator(s):	Dr. Sarmad Hussain
Collaborations:	Desider (Pvt) Ltd
Project Status:	Completed
Objectives:	To develop a helpful analytical tool for the government, agencies and different departments to identify socio-economic issues and the affected areas. To develop comprehensive multilingual (English, Urdu and Roman Urdu) datasets with standard entity, aspect and sentiment level annotations. To create research opportunities for the graduate students working in the area of Urdu NLP. To contribute to the advancement of Urdu NLP.
Scope:	With the growing population of Pakistan, it has become difficult yet extremely important to address the issues of the general populace on an urgent basis. Despite the prominent advancement and reach of electronic media, many areas and issues go unattended due to the unavailability or lag of information. Our project will provide a spatiotemporal analysis of the areas in Pakistan where people have negative sentiments about the performance of the sitting government or respective departments in the domains of justice, social issues or economics, thereby identifying the areas where the government needs to work in order to meet the expectations of the people.
Deliverables:	Following data collection and the building of aspect-based sentiment classification models, the final outcome of this project is an interactive, user-friendly website. The interface will allow the user to explore a range of statistical reports on the semantic orientation of data posted by the general populace on social media in relation to selected domains and aspects of socio-economic issues. In addition, the website will provide various map visualizations, e.g. the origin of social media texts in general, the origin of social media texts bearing positive sentiment and, most importantly, the highlighted areas where the majority of people have negative sentiments about socio-economic issues and the performance of the government. It will be a complete and authentic data analytics tool for the government and will empower the authorities in policymaking and funds disbursement, thus improving the public happiness index. Moreover, it will also be a helpful source of information for media houses to highlight the important issues of the day.
Project URL(s):

27	Automated Urdu Broadcast Media Content Extraction and Analytics	PI Dr. Sarmad Hussain	TTSF-HEC	Completed

Project Details
Start Date:	1st July, 2021
Duration:	2 years
Funding Organization:	Technology Transfer Support Fund (TTSF), Higher Education Commission (HEC)
Principal Investigator(s):	Dr. Sarmad Hussain
Co-Principal Investigator(s):
Collaborations:	Desider (Pvt) Ltd, Pakistan Electronic Media Regulatory Authority (PEMRA)
Project Status:	Completed
Objectives:	To enable consumers of news media to effectively extract and manage spoken and written Urdu content in digital videos published online. To develop advanced content processing and intelligent analytics extraction techniques for Urdu. To provide broader user access to Urdu content in videos through an online search engine.
Scope:	This project aims to develop a content-based audio-visual analysis system that will help users browse and search for relevant videos among the enormous amount of video data online. The system will use audio and video processing techniques to transcribe spoken and written text with timestamps and generate various analytics and reports from the transcribed text.
Deliverables:	System design document. Enhanced Urdu speech recognition system including test report. Speaker identification system including test report. Video transcription using auditory stream. Text area detection and segmentation system including test report. Urdu OCR system including test report. Trending topic identification system. An automated Urdu Broadcast Media Content Extraction and Analytics system. Testing report.
Project URL(s):

26	A Digital Diachronic Urdu Corpus from the 19th and Early 20th Centuries	PI Dr. Sarmad Hussain Co-PI Dr. Miriam Butt	DAAD	Completed

Project Details
Start Date:	1st January, 2020
Duration:	2 years
Funding Organization:	DAAD
Principal Investigator(s):	Dr. Sarmad Hussain
Co-Principal Investigator(s):	Dr. Miriam Butt
Project Status:	Completed
Objectives:	Create a digital collection of historical Urdu texts with source references. Acquire and create a diachronic digital Urdu corpus annotated at the word and sentence levels. Develop an Urdu handwriting recognition system and digitize the text corpus to make it available to historical linguists and literary analysts. Extend the existing relationship between German and Pakistani universities to include more senior and junior researchers on both sides. Extend course curricula in historical linguistics and theoretical linguistics, specifically language change studies, reflecting the results of the project. Conduct workshops and seminars to train teachers and young researchers in the field of historical and corpus linguistics.
Scope:	The focus of this project is to collect and create a digital syntactically annotated corpus for Urdu, by capturing representative texts from the nineteenth and early twentieth centuries. This project aims to develop a diachronic digital corpus of Urdu so as to enable a comparative and contrastive analysis of Urdu orthography and syntax in light of contemporary Urdu, thereby opening up further opportunities for scientific research in historical linguistics, corpus linguistics and theoretical linguistics and Urdu historical and literary studies. A significant challenge of the project is also the digitization of classical text, in particular what kinds of information (variant texts, text notations, variant characters, quotations, writing styles, etc.) should be added in the digitization of these texts.
Deliverables:	Design of diachronic Urdu corpus in terms of coverage of genres, time periods based on publication date, writers, publishers and Geographical via online calls, email and Skype interactions between the UET and KN groups. Use of field research funds to conduct targeted studies and experiments investigating particular aspects of development and digitization of historical corpus. Face-to-face interaction in conjunction with workshops on the design and development of historical corpora, and findings of document image processing and recognition for handwritten Urdu. Face-to-face interaction in conjunction with the Conference on Language and Technology (CLT) in Pakistan to provide an opportunity to team members from KN and the students who have finished doctoral work from KN and other places in Germany to visit Pakistan and interact further with the team at UET and participate in CLT. Formal release of diachronic digital Urdu text corpus. Workshop on historical corpus development and analysis of Old Urdu text for the larger dissemination of knowledge for the Pakistani teachers and young researchers and the international scholarly community.
Project URL(s):

25	Urdu Voice Enabled Assistive Technologies for Print Disabled Community of Pakistan	PI Dr. Tania Habib Co-PI Dr. Sarmad Hussain	NRPU-HEC	Completed

Project Details
Start Date:	24th June, 2019
Duration:	2 years
Funding Organization:	National Research Program for Universities Higher Education Commission (HEC), Pakistan
Principal Investigator(s):	Dr. Tania Habib
Co-Principal Investigator(s):	Dr. Sarmad Hussain
Project Status:	Completed
Objectives:	To improve Urdu TTS in the areas of Natural Language Processing (NLP) and speech synthesis. Date formats containing English slashes (/), commas (,), dashes (-), dots (.) and English names of months (e.g., March, Mar) will be handled. Support for handling English text will be added. In addition to the 70,000-word vocabulary, the phonetic lexicon used by the TTS system will be enhanced by adding 10,000 transliterated English words. To enhance Urdu TTS to use a single voice with an Urdu accent to read aloud mixed content containing both Urdu and English. To improve the response time of the Urdu TTS and its speech quality in terms of naturalness and intelligibility. A deep learning based voice will be developed for Urdu TTS. A 4-hour Urdu speech corpus will be developed and annotated at intonation level to improve the naturalness and intelligibility of the existing Urdu TTS voice. The usability of Urdu TTS will be enhanced by integrating it with SAPI-5.
Scope:	The system will benefit the print-disabled community, including the following: the visually impaired community of Pakistan; illiterate populations of Pakistan who can speak and understand Urdu but cannot read or write it; Urdu second-language speakers and non-resident Pakistani citizens, especially the younger generation, interested in learning Urdu; and Pakistani citizens interested in listening to Urdu text available as digital documents, articles, books, etc.
Deliverables:	Project ramp-up and resource training for the improvement of the NLP module. Testing corpus development for the NLP module. Handling of English words. Fixing errors in date and time formats. Enhancement of the pronunciation lexicon. Annotation of Urdu speech corpus at intonation level. Manual annotation and testing of 4 hours of speech at intonation level. Comparison of source and reference files using testing utilities. Synthesizer enhancement and literature survey. Testing of utility to convert TextGrids to utterance files. Development of integration mechanism of voice with Festival. Development of test corpus. Deep learning based voice building with one hour of data for testing and integration of voice with TTS. Voice building with 5-hour speech corpus. Voice building with 10-hour speech corpus. HMM and unit selection voice quality improvement. Implementation of volume and rate adjustment functions in Urdu TTS. Integration of Urdu TTS with SAPI interfaces.
Project URL(s):

24	Language Engineering Lab	PI Dr. Sarmad Hussain Co-PI Dr. Amir Mehmood Co-PI Dr. Kashif Javed	Planning Commission, Govt. of Pakistan	Ongoing

Project Details
Start Date:	1st July, 2018
Duration:	4 years
Funding Organization:	Planning Commission, Govt. of Pakistan & Higher Education Commission (HEC), Pakistan
Principal Investigator(s):	Dr. Sarmad Hussain
Co-Principal Investigator(s):	Dr. Amir Mehmood, Dr. Kashif Javed, Dr. Tania Habib (2018-2020)
Project Status:	Ongoing
Objectives:	Providing effective access to online and additional Urdu content. Releasing an Urdu language processing toolkit to accelerate research and development of Urdu language technology. Delivering data analytics based on local users and local content to improve decision making for business and e-governance. Making available specific cloud-based services that use the Urdu data analytics to address current academic and public requirements.
Scope:	The Language Engineering Lab at Al-Khawarizmi Institute of Computer Science (KICS), University of Engineering and Technology (UET) Lahore aims to build an ecosystem for Urdu-based content and analytics services to empower Pakistani citizens for their socio-economic benefit and to enable business and government in strategic decision making.
Deliverables:	Comprehensive Urdu content repository. Web and social media content in processed and searchable format. Identification services for objectionable Urdu content. Basic level query and user analytics from search engine. “Search as a service” enterprise model. Language Processing Toolkit for Urdu. Urdu Plagiarism Detection Service. Journal indexing service for Urdu. Specialized Urdu OCR for low resolution images and PDF. Keyword-based Urdu Speech Recognition System. Health (Dengue, Polio, etc.) monitoring surveillance service. Hate speech (religious, national secular) detection service. Topic based analytics. Ad-exchange service.
Project URL(s):	http://web.lums.edu.pk/~ncbc/affiliated-labs/language-engineering-lab

23	Local Language Speech Interfaces for Banking Sector of Pakistan	PI Dr. Tania Habib Co-PI Dr. Sarmad Hussain	TDF-HEC	Completed

Project Details
Start Date:	1st October, 2018
Duration:	2 years
Funding Organization:	Technology Development Fund, Higher Education Commission, Pakistan
Principal Investigator(s):	Dr. Tania Habib
Co-Principal Investigator(s):	Dr. Sarmad Hussain
Collaborations:	Virtual Force
Project Status:	Completed
Objectives:	Design and develop a continuous speech corpus of 15 hours of read and spontaneous speech from multiple speakers. Develop an Urdu speech recognition system using state-of-the-art machine learning techniques. Design and develop an advanced telephony and dialog framework. Design and develop a speech interface for telephone banking services in Urdu, which would be extendable to other local languages of Pakistan. Design and develop Urdu speech recognition software as a service (SaaS) for the development of further similar speech-based banking services.
Scope:	Design and introduce new areas of computational linguistics, acoustics and phonetics, and speech and audio processing at graduate level. Sustain the trained resources in the area of human language technology that enable students to conduct advanced research. Expand spoken dialogue systems in the banking sector in light of the suggestions of PDM.
Deliverables:	Advanced speech processing systems for Urdu language. Urdu based dialogue system. Urdu based continuous speech recognition system. A large speech corpus covering Punjabi accent of Urdu. Urdu CSR API for the development of further similar speech-based applications.
Project URL(s):

22	Automatic Pakistani Postal Address Recognition and Parcel Routing	PI Dr. Sarmad Hussain Co-PI Qurat-ul-Ain Akram	TDF-HEC	Completed

Project Details
Start Date:	1st June, 2018
Duration:	2 years
Funding Organization:	Technology Development Fund, Higher Education Commission, Pakistan
Principal Investigator(s):	Dr. Sarmad Hussain
Co-Principal Investigator(s):	Qurat-ul-Ain Akram
Collaborations:	TCS Pvt. Ltd.
Project Status:	Completed
Objectives:	Develop an Android application to assist the user in capturing, enhancing and validating images of addresses written on envelopes. Develop an address text area detection system that resolves skew and horizontal and vertical perspective distortion in camera-captured images. Develop an address recognition system using deep learning algorithms. Develop an address field extraction system using NLP and deep learning algorithms. Develop a framework for automatic address recognition and parcel routing against barcode number.
Scope:	Reducing the time and human effort needed to manually enter recipient and sender addresses. Improving the growth of the postal and courier service industry in Pakistan. Capacity building in the areas of pattern recognition and natural language processing. Data development to aid researchers conducting R&D in the field of pattern recognition and NLP. Capacity building for the development of related projects, such as Sui gas meter reading and electricity meter reading, which will have great impact on the government and the Pakistani community by automating these major billing systems.
Deliverables:	An Android application to assist the user in capturing high-quality images of envelopes containing addresses. Image validation and enhancement system. Address text area detection system. Optical character recognition system for camera-captured Pakistani address images using deep learning. Address field extraction system using natural language processing and deep learning techniques.
Project URL(s):

21	Urdu Text to Speech: Integrating Prosody of Emotions	PI Dr. Sarmad Hussain Co-PI Dr. Miriam Butt	DAAD	Completed

Project Details
Start Date:	1st January, 2018
Duration:	3 years
Funding Organization:	University of Konstanz, Germany
Principal Investigator(s):	Dr. Sarmad Hussain, Dr. Miriam Butt
Project Status:	Completed
Objectives:	Develop an emotional speech corpus for Urdu that is prosodically and syntactically annotated. Develop linguistic resources to concretely represent such analyses, for further research and development in linguistics and computational work. Conduct joint research and produce joint, collaborative publications with former and current PhD students at UET and KN. Extend the existing relationship between German and Pakistani universities to include more senior and junior researchers on both sides. Extend course curricula in linguistics and computational linguistics, reflecting the results of the project. Conduct workshops and seminars to train teachers and young researchers in the field of prosody.
Scope:	Urdu TTS: An integrated automatic dialog system enables information access for the non-literate community for their socio-economic benefit. Urdu TTS with emotional understanding can be integrated into a screen reader to enable access to digitized Urdu for the visually impaired community. Urdu TTS with emotional expressiveness can also improve the user experience of newspaper and book reading.
Deliverables:	Designing and recording of the emotional speech corpus. Evaluation of recorded corpus in context of emotion. Revision of annotation scheme. Annotation of one hour of happy, sad, excited and angry speech corpus at syllable, word, break index and stress levels. Identification and categorization of the lexical items and phrases which contribute to the emotions in Urdu. Organization of the first workshop on emotional speech in Pakistan to discuss the progress and design of the annotation of emotional speech and to provide feedback. Field research and experiments on the expression of emotions in Urdu. Annotation of happy, sad, excited and angry speech corpus at stress levels (1 hour of speech). Analysis of acoustic parameters (i.e. F0, duration, energy and duration of pauses) with respect to the emotions of happiness, sadness, excitement and anger. Analyses of the syntactic patterns which are used for conveying emotions. Organization of a workshop on the emotional speech and syntax interface at Konstanz. Develop initial machine learning models for emotion modeling in synthesized speech. Emotions and canonical word order. Annotation of one hour of angry speech corpus at break index and stress levels. Finalize machine learning model for emotion modeling in synthesized speech. Workshop in Konstanz on prosody and machine learning. Testing and evaluation of machine learning models for emotion synthesis. Integration of finalized machine learning models in Urdu TTS for emotion synthesis. Final research writing and publishing. Design new linguistic course on emotional speech prosody. Organize workshop on emotional speech prosody in Pakistan.
Project URL(s):

20	Urdu Search Engine	PI Dr. Amir Mahmood Co-PI Dr. Sarmad Hussain	IGNITE	Completed

Project Details
Start Date:	8th August, 2016
Duration:	2 years 4 months
Funding Organization:	IGNITE
Principal Investigator(s):	Dr. Amir Mahmood
Co-Principal Investigator(s):	Dr. Sarmad Hussain
Project Status:	Completed
Objectives:	To provide access patterns of our communities to the commercial content development market. To strengthen industry-academia ties by providing solutions to Urdu language specific projects as well as projects that demand larger distributed storage and computation infrastructure. To initiate a content search industry in Pakistan and promote the Urdu content industry. To appropriately market our product by attracting traffic to the USE website, working with the local ISPs. To provide necessary incentives, such as “improved revenue by reducing the traffic load through the optimized search query of Urdu text”, to local ISPs and other stakeholders. To provide incentives to entities interested in filtered access to World Wide Web content.
Scope:	The project will focus on developing a search engine specialized for the Urdu language. The search engine will be available through two interfaces: a web interface and a mobile-based SMS service. As with any conventional search engine, a web interface will be provided for users to send queries to the system and obtain relevant results. The SMS service will help mobile users, especially those with low-end phones, to send queries to the system using a number and receive information on their mobile phone. The system will comprise the following components: Cloud Infrastructure (CI), Information Management (IM) and Search Management (SM).
Deliverables:	HR hiring for the three teams. Initial challenges and possible solutions/workarounds in setting up the three teams’ required hardware/software development/test infrastructure. The specialized training for the three teams. Reporting architectural details of filtering and query response system. Related code and test case report. Technical progress report explaining basic crawler development and testing. Related code and test case report. A technical report comparing other language based search engines with our search engine in terms of design and implementation. A conference paper on data analysis on regional web data. A report on local cluster development and crawling system. Reporting progress on language and age filter development. Research report on content summarization. This report will also explain our prototype regarding summary techniques for mobile, computer and SMS. Related code and test case report. A report explaining Urdu specific issues and solutions for query response system. Release of crunched data. Releasing age filter. Releasing a prototype system for indexing. Releasing a prototype system for query response. Releasing language filter. Releasing a design of query response system. Releasing a design of indexing system. Related code and test case report. Reporting on frequency of crawling and checkpoint restarting. Releasing size filter. Release of query processing and indexing system. Related code and test case report. A report explaining ‘content filter design’ and ‘testing and analysis of candidate generation and ranking systems’. Related code and test case report. A conference paper on natural language processing. Reporting on incremental IM and SM processes. Releasing query response system. Releasing content filter. Related code and test case report. A report on ‘design and development of text analytic system’. A report on ‘Design and development of Text Summarization system’. A report on ‘Implementing query response system’. Architectural report of Text Summarization system. Architectural report of Text Analytic system. Architectural report of Search Analytic system. Releasing integrated filtering system. Release of integrated Text Summarization system. Related code and test case report. Releasing analytics systems. Releasing Urdu Search Engine. Related code and test case report. A detailed report describing precision, efficiency and effectiveness of ‘Urdu Search Engine’. A journal paper on page ranking algorithm in local context. Training material/workshop presentation will be provided.
Project URL(s):	https://www.humkinar.pk/

19	Text to Speech for Urdu Understanding Intonation	Dr. Sarmad Hussain Dr. Miriam Butt Dr. Tafseer Ahmed	DAAD	Completed

Project Details
Start Date:	1st January, 2015
Duration:	3 years
Funding Organization:	University of Konstanz, Germany
Principal Investigator(s):	Dr. Sarmad Hussain, Dr. Miriam Butt, Dr. Tafseer Ahmed
Project Status:	Completed
Objectives:	Develop the capacity to do grammatical and semantic analyses of languages. Develop critical linguistic resources for Urdu to concretely represent such analyses, for further use in linguistic, psycholinguistic and computational work. Develop a formal relationship between German and Pakistani universities for collaborative research. Develop a course curriculum in linguistics based on work in Urdu.
Deliverables:	First version of prosodic annotation scheme. Create the parser and evaluate the results. Create tools for speeding up the process of manual annotation. Revise the syntactic annotation guidelines. Complete annotation of 10,000 sentences. Create and evaluate the parser and integrate the parser into the prosody analyzer.
Project URL(s):	http://ling.uni-konstanz.de/pages/home/pargram_urdu/DAADlex/

18	Translation of Websites	Dr. Sarmad Hussain	Texpo, Pakistan	Completed

Project Details
Start Date:	16 November, 2015
Duration:	1 year
Funding Organization:	TEXPO Pakistan Pvt. Limited
Principal Investigator(s):	Dr. Sarmad Hussain
Project Status:	Completed
Objectives:	Translation and grammatical correction of the text, news, advertisement banners, SMS messages, email messages and labels of the Emirates Post Group (EPG) website and mobile application from English to Urdu.
Scope:	Translate the source text in English into corresponding text in Urdu. Translate 25K-30K words and provide the translation in UTF-8 format. Annual maintenance of service for 1 year.
Deliverables:	Translation of 30K words in six weeks
Related URL(s):	https://www.epg.gov.ae/portal/_ur/index.xhtml;jis=59E1CA1D74A0589EDA594A32A7DF8CA6

17

Investigating the Impact of OER on Secondary and Tertiary Education in Pakistan

PI Dr. Sarmad Hussain
PI Dr. Naveed Malik
Co-PI Ms. Sana Shams
Co-PI Dr. Yasira Waqar

IDRC, Canada

Completed

Project Details
Start Date:	1st March, 2015
Duration:	1.5 years
Funding Organization:	International Development Research Centre (IDRC), Canada
Principal Investigator(s):	Dr. Sarmad Hussain, Dr. Naveed Malik
Co-Principal Investigator(s):	Ms. Sana Shams, Dr. Yasira Waqar
Project Status:	Completed
Objectives:	Ascertain the extent of OER use by secondary and tertiary students and teachers in Pakistan. Specific objectives of the project include: Deepen understanding of the ways in which OER engagement influences teacher educators’ epistemological and pedagogical stance, and examine the nature of any shifts in their practice. Identify contextual factors which support or constrain sustained engagement with OER by teacher educators. Communicate research outputs appropriately in multiple spaces to inform future policy and practice on the use of OER to improve teacher education.
Research Objectives:	What is the extent of OER use by secondary and tertiary students and teachers? What are the factors that enable and constrain OER use by secondary and tertiary students and teachers? What is the perception of secondary and tertiary students and teachers regarding the impact of OER use on student learning, as demonstrated in the development of critical thinking skills, class participation, improved grades, and collaboration among learners?
Project URL(s):	http://roer4d.org/sp10-5-impact-of-oer-on-teaching-and-learning-in-pakistan

16

Computerized Corpus of Persian texts along with their commentaries

Dr. Sarmad Hussain

University of Chicago

Completed

Project Details
Start Date:	August, 2013
Duration:	2 months
Funding Organization:	University of Chicago, USA
Principal Investigator(s):	Dr. Sarmad Hussain
Project Status:	Completed
Objectives:	Keying and proofreading of Persian texts and their annotations
Deliverables:	Double typing of text. Persian text verification and data management.
Project URL(s):

15

Digital Dictionaries of South Asia

Dr. Sarmad Hussain

University of Chicago

Completed

Project Details
Start Date:	1st May, 2014
Duration:	2 years
Funding Organization:	University of Chicago, USA
Principal Investigator(s):	Dr. Sarmad Hussain
Project Status:	Completed
Objectives:	Make available the highest quality electronic dictionaries for South Asian languages as free public services via the Internet. Encompasses South Asian languages of Pakistan: Kashmiri, Punjabi and Sindhi.
Scope:	Increased availability of scholarly resources in electronic form
Deliverables:	Digitization, transcription and quality control of the following dictionaries: Kashmiri, Punjabi and Sindhi. Conversion of text from printed sources using double keying and verification techniques, with Unicode as the default character set and following the Text Encoding Initiative (TEI). Conversion of lexical data.
Project URL(s):	http://otd.cle.org.pk/ http://dsal.uchicago.edu/dictionaries/
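
The double keying and verification technique named in the deliverables above can be illustrated with a minimal sketch. It assumes that two operators have typed the same printed passage independently and that both versions are available as Unicode strings. The function name `double_key_discrepancies` and the NFC normalization step are illustrative assumptions, not the project's actual tooling.

```python
import unicodedata
from difflib import SequenceMatcher


def double_key_discrepancies(first_keying, second_keying):
    """Return (operation, text_in_first, text_in_second) for each span
    where two independent keyings of the same passage disagree."""
    # Assumption: normalize both inputs to NFC so that visually identical
    # sequences typed with different code points do not count as errors.
    a = unicodedata.normalize("NFC", first_keying)
    b = unicodedata.normalize("NFC", second_keying)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Every span the function reports would go to a human verifier, who checks it against the printed source. Spans on which both keyings agree are accepted without review.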

14

NDA Text Classification

Dr. Sarmad Hussain

Industrial Partner

Completed

Project Details
Start Date:	1st June, 2014
Duration:	9 months
Funding Organization:	Industrial Partner
Principal Investigator(s):	Dr. Sarmad Hussain
Project Status:	Completed
Scope:	Create a solution to apply natural language processing to NDAs. Create an application that allows users to create annotated NDAs based on NDA golden rules.
Project URL(s):

13

Enabling Information Access through Mobile Based Dialog Systems and Screen Readers for Urdu (ASR)

PI Dr. Sarmad Hussain
Co-PI Dr. Tania Habib

IGNITE

Completed

Project Details
Start Date:	1st July, 2013
Duration:	3 years
Funding Organization:	IGNITE
Principal Investigator(s):	Dr. Sarmad Hussain
Co-Principal Investigator(s):	Dr. Tania Habib
Project Status:	Completed
Objectives:	The Enabling Information Access through Mobile Based Dialog Systems and Screen Readers for Urdu project aims to provide equitable information access for the marginalized community in Pakistan, especially the non-literate, semi-literate and print-impaired population, for their socio-economic benefit. Specific objectives of the project are: Develop a framework for Urdu dialog systems, extensible to other domains and Pakistani languages. Develop a domain-specific Urdu automatic speech recognition system for weather and location services. Develop an Urdu text to speech system based on earlier work, including a SAPI interface. Develop an Urdu screen reader compatible with a web browser, in order to enable visually impaired people to access the web using the Urdu TTS developed using the SAPI interface. Develop a spoken language corpus including (i) district names of Pakistan in six accents of Urdu, (ii) location names of Lahore, and (iii) a speech corpus of 10 hours from a single professional speaker of Urdu for TTS training. Develop a practical speech-based user interface in Urdu for the visually impaired community.
Scope:	Speech Recognition System for Urdu Language. Text to Speech System for Urdu Language. Mobile Based Dialog System in Urdu. Urdu Screen Reader.
Deliverables:	Project initiation and training reports on TTS NLP, USR, ASR and weather domain. Prototype and demonstration of TTS NLP module, weather dialog system, ASR system. Domain finalization for screen readers. Baseline speech recognition system for weather domain. Speech recognition system for weather domain. Dialog manager for weather domain. Localized screen reader compatible with a web browser. Baseline TTS system for 5000 words. ASR: speech recognition system for location finder domain. Dialog system: dialog manager for location finder domain. Screen reader: field tested screen reader (intermediate version). Spoken dialog system for weather domain. Improved TTS system for 10,000 words and test report. Screen reader integrated with TTS system and test report. Demonstration of developed modules. Final ASR system for weather and location finder domain and test report. Spoken dialog system for location finder domain. General framework for dialog system. User Guides for dialog system, TTS and Screen Readers. Demonstration of developed modules.
Project URL(s):	http://cle.org.pk/dialog/ https://tech.cle.org.pk/asr https://tech.cle.org.pk/services/speech/tts http://www.cle.org.pk/usr/

12

Language Resources Production (Lexicon)

Dr. Sarmad Hussain

ELDA

Completed

Project Details
Start Date:	2013
Duration:	1 year
Funding Organization:	Evaluations and Language resources Distribution Agency (ELDA), France
Principal Investigator(s):	Dr. Sarmad Hussain
Project Status:	Completed
Objectives:	Production of a phonetic lexicon of Pashto
Deliverables:	Pashto Phonetic Lexicon of 20,000 words
Related URL(s):	Pashto Phonetic Lexicon

11

Urdu Nastalique Optical Character Recognition System

Dr. Sarmad Hussain

IGNITE

Completed

Project Details
Start Date:	31st August, 2012
Duration:	3 years
Funding Organization:	IGNITE
Principal Investigator(s):	Dr. Sarmad Hussain
Project Status:	Completed
Objectives:	To develop and mature algorithms for analyzing and recognizing Urdu text images based on segmentation-based and ligature-based methods. To develop automatic scaling algorithms for Urdu ligatures to make the system independent of font size. To develop Urdu OCR for the Nastalique style of writing. To develop post-processing algorithms in computational linguistics for output generation and error correction of Urdu OCR. To identify future research directions for graduate research in this area. To develop capacity in the area of Human Language Technology. To create and release an Urdu text image corpus with an open license for further development and testing of OCR for Urdu and other Pakistani languages by other interested research organizations and universities. To provide access to textual information to print-disabled communities.
Scope:	The following character set will be recognized by the Urdu OCR: Urdu alphabet given in Figure 1 below; Latin digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9); Urdu digits ( ۰ , ۱ , ۲ , ۳ , ۴ , ۵ , ۶ , ۷ , ۸ , ۹ ); Urdu aerab (ً َ ُ ِ ّٰ ٓ ٔ ); other symbols of Urdu, as follows: ( ؁ ؀ ؂ ؃ ؎ ؏ ؐ ؑ ؒ ؓ ؔ ؟ () ' " ۔ ؛ : ، ). Text written in the Noori Nastalique font style with a font size between 14 and 44 will be recognized. Smaller or larger font sizes will not be processed at this time. Newspapers, normally published in smaller font sizes, will not be recognized by the OCR. The application will process plain text and will not process advanced formatting, e.g. italic, bold and underline. The system will handle up to 2 columns of text. The Urdu OCR will identify Latin script written in the Times New Roman, Arial and Courier font styles, within the font size range proposed for Urdu. The Latin script will be identified by the current project and forwarded to the existing Tesseract (open source) OCR system. The speed and accuracy of Latin script recognition will depend on Tesseract. Beyond this, Latin script recognition will not be processed in the current project. The system will handle salt and pepper noise. It will also detect the page frame (the text written in the page is called the page frame). The system will also detect skew in the page. An image will be rejected if skew is present in the page. The system will output plain Urdu text in Unicode format.
Deliverables:	Training Text Image corpus at 14 & 16 point size. Design Document of Core Urdu OCR Framework using Tesseract. Text Corpus Collection, Cleaning and Tagging tools. Training Text image corpus at 20-30 font size and 30-44 font size. Research Report of Prototype ligature-based OCR for 14 point size. Prototype system of ligature-based OCR for 14 point size. Research Report on Corpus Collection, Design and Release. Release of Overall Image Corpus. Research Report on Binarization System. Binarization System. Research Report of Text Area, Figure Identification. Urdu Word Segmentation System. Text Area and Figure Identification System. Research Report of Page Frame Detection System. Page Frame Detection System. Research Report on Segmentation-based OCR for 14 Point Size. Segmentation-based OCR for 14 point size. Research Report of Ligature to Word Mapping System. Ligature to Word Mapping System. Research Report of 14 point size ligature-based OCR. Ligature-based OCR for 14 point size. Cleaned and Tagged Urdu Text Corpus. Research Report of Noise Removal System. Noise Removal System. Research Report on segmentation-based OCR for 16 point size. Segmentation-based OCR for 16 point size. Research Report of ligature-based OCR for 16 point size. Ligature-based OCR System for 16 point size. Research Report of Font size independent OCR for 16-24 point sizes. Font size independent OCR system for 16-24 point sizes. Research Report on Skew Detection System. Skew Detection System. Research Report on Nastalique and Latin Script Detection. Nastalique and Latin Script Detection system. Research Report on Run Marking System. Run Marking System. Research Report on Text Image Segmentation. Text Image Segmentation System. Research Report on segmentation-based OCR for 24 point size. Segmentation-based OCR for 24 point size. Research Report on segmentation-based OCR for 36 point size. Segmentation-based OCR for 36 point size. Research Report on font size independent OCR for 24-36 font sizes. Font size independent OCR System for 24-36 font sizes. Research Report on font size independent OCR for 36-44 font sizes. Font size independent OCR for 36-44 font sizes. Complete Urdu OCR System.
Project URL(s):	http://202.142.159.36:8080/ocr/ http://www.cle.org.pk/ocr/
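
The input constraints stated in the scope of this project (font size between 14 and 44, at most 2 columns, rejection of skewed pages, and forwarding of Latin script to Tesseract) can be summarized as a small pre-check. This is only a sketch under stated assumptions: the names `PageInfo`, `scope_violations` and `route_text_run`, the `skew_tolerance` parameter and the string labels are hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass
from typing import List

MIN_FONT_SIZE = 14  # lower bound stated in the scope
MAX_FONT_SIZE = 44  # upper bound stated in the scope
MAX_COLUMNS = 2     # the system handles up to 2 columns of text


@dataclass
class PageInfo:
    """Hypothetical summary of a scanned page after layout analysis."""
    font_size: int
    columns: int
    skew_degrees: float


def scope_violations(page: PageInfo, skew_tolerance: float = 0.0) -> List[str]:
    """List the reasons a page falls outside the stated scope.

    skew_tolerance is an assumption: the scope only says that an image
    is rejected if skew is present, without giving a threshold.
    """
    reasons = []
    if not MIN_FONT_SIZE <= page.font_size <= MAX_FONT_SIZE:
        reasons.append("font size outside 14-44")
    if page.columns > MAX_COLUMNS:
        reasons.append("more than 2 columns")
    if abs(page.skew_degrees) > skew_tolerance:
        reasons.append("page is skewed")
    return reasons


def route_text_run(script: str) -> str:
    """Pick the recognizer for a text run by its detected script."""
    if script == "latin":
        return "tesseract"  # Latin runs are forwarded to Tesseract
    if script == "nastalique":
        return "urdu_ocr"   # hypothetical label for the Urdu recognizer
    raise ValueError(f"unsupported script: {script}")
```

A page with an empty violation list would proceed to recognition. Any other page would be rejected together with the listed reasons.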

10

Investigating the Long Term Residual Impact of ICT Integration across Gender for a Sustainable Project Design

Ms. Sana Shams

Global Development Network

Completed

Project Details
Start Date:	12th September, 2012
Duration:	1 year and 3 months
Funding Organization:	Global Development Network (GDN)
Principal Investigator(s):	Ms. Sana Shams
Project Status:	Completed
Objectives:	To improve literacy levels and overall education. To investigate the impact of ICT at formal and non-formal educational institutes, paying special attention to gender.
Deliverables:	Recommend a bottom-up feedback system for devising a sustainable project design based on ICT user, facilitator and community feedback. Gather socio-demographic characteristics of users accessing the rural ICT centers. Examine their socialization experiences. Identify the factors inhibiting their access to and use of ICTs. Determine the environmental factors facilitating their use. To conduct this study, 16 schools from Pakistan will be examined.
Project URL(s):

9

Essential Linguistic Research Capacity and Resource Development for Urdu

PI Dr. Sarmad Hussain
Co-PI Dr. Miriam Butt

University of Konstanz, Germany

Completed

Project Details
Start Date:	1st January, 2012
Duration:	3 years
Funding Organization:	DAAD, Germany
Principal Investigator(s):	Dr. Sarmad Hussain
Co-Principal Investigator(s):	Dr. Miriam Butt
Project Status:	Completed
Objectives:	Develop the capacity to do grammatical and semantic analyses of languages. Develop critical linguistic resources for Urdu to concretely represent such analyses, for further use in linguistic, psycholinguistic and computational work. Develop a formal relationship between German and Pakistani universities for collaborative research. Develop a course curriculum in linguistics based on work in Urdu.
Scope:	Common understanding of the issues and solutions in core areas relevant for grammar and semantic resources, including POS tag set, WordNet and VerbNet structures, and semantic and multiword issues related to nouns in particular. Multiple layers of annotation on a common corpus of Urdu, including POS tags and semantic senses (i.e., the range of meanings available for a given word). Algorithms for automatically annotating a corpus with POS tags as well as word senses, thus resulting in a multi-layered automatically annotated corpus, which can then be used to identify and extract further information about the different word classes in order to feed thesauri and databases like WordNet and VerbNet.
Deliverables:	Urdu Text Corpus: CLE Urdu Digest Corpus 100K, CLE Urdu Digest Corpus 500K, CLE Urdu Digest Corpus 1M. Urdu WordNet: Urdu WordNet 1.0 Wordlist. Urdu POS Tagged Corpus: Urdu POS Tagset, CLE Urdu Digest POS Tagged Corpus 100K. Urdu VerbNet. Sense Tagged Corpus Based on Urdu
Project URL(s):	http://www.cle.org.pk/eulr/

8

Enabling Information Access for Rural Population through Urdu Dialog System

Dr. Sarmad Hussain

Asia Pacific Tele Community, Thailand

Completed

Project Details
Start Date:	1st February, 2011
Duration:	11 months
Funding Organization:	Asia Pacific Tele Community, Thailand
Principal Investigator(s):	Dr. Sarmad Hussain
Project Status:	Completed
Objectives:	Undertake research and development of applications to provide access to relevant online content by Pakistani citizens using an Urdu dialogue system with mobile phones, addressing the current literacy, language and connectivity barriers. Develop research capacity in advanced speech and language technology at CLE through collaboration, and develop curriculum for MS and PhD programs at UET, for sustainable research and development in this area in Pakistan.
Scope:	Speech technology. Trained speech technology resources. Urdu TTS system (2 hours of speech). Urdu ASR system (80 speakers, 40 hours of speech). Advanced speech technology curriculum developed for MS and PhD programs.
Deliverables:	Training of researchers in the use of automatic speech recognition (ASR) and text to speech synthesis (TTS) tools for the development of Urdu systems. Training sessions for researchers on advanced tools used for speech recognition and on fine tuning of the statistical parameters required for improving recognition. Text to speech synthesis and tagging of speech data used for unit selection based systems. Development of TTS and ASR systems based on these techniques. Improved ASR and TTS models through a series of tests and improvements made under guidance. Finalized ASR and TTS systems.
Project URL(s):

7

Pashto Translation Project

Dr. Sarmad Hussain

ELDA

Completed

Project Details
Start Date:	21st December, 2011
Duration:	1 year
Funding Organization:	Evaluations and Language resources Distribution Agency (ELDA), France
Principal Investigator(s):	Dr. Sarmad Hussain
Project Status:	Completed
Objectives:	Develop a speech database. Develop a monolingual corpus for the Pashto language.
Deliverables:	Monolingual corpus of 100 million words. Transcribed speech corpus of 100 hours of broadcast news data.
Related URL(s):	TRAD Pashto Broadcast News Speech Corpus; TRAD Pashto Monolingual Text Corpus

6

IDRC Research Chair in Multi Lingual Computing

Dr. Sarmad Hussain

IDRC, Canada

Completed

Project Details
Start Date:	1st January, 2011
Duration:	2 years
Funding Organization:	International Development Research Centre (IDRC), Canada
Principal Investigator(s):	Dr. Sarmad Hussain
Collaborating Organizations:	Afghan Computer Science Association, Afghanistan; Center for Research on Bangla Language Processing, BRAC University, Bangladesh; D.Net, Bangladesh; Department of Information Technology, Bhutan; Institute of Technology, Cambodia; Ministry of Education, Youth, and Sports, Cambodia; National Information Communications Technology Development Authority (NIDA), Cambodia; Institute of Science and Technology, China; Tibet Academy of Agricultural and Animal Husbandry Sciences, China; Tibet University, China; University of Indonesia, Indonesia; Agency for the Assessment and Application of Technology (BPPT), Indonesia; National Authority for Science and Technology, Indonesia; InfoCon Co. Ltd., Mongolia; Mongolian University of Science and Technology, Mongolia; National University of Mongolia, Mongolia; E-Network Research and Development, Nepal; Madan Puraskar Pustakalaya, Nepal; Language Technology Research Center, University of Colombo School of Computing, Sri Lanka
Project Status:	Completed
Objectives:	To support, sustain and grow the online PAN Localization network for multilingual computing. To formalize the PAN Localization network for sustained research collaboration through self-funded or externally funded projects. To explore commercial models for research and development at national and global level to support the network. To provide national and regional support to policy makers through the PAN Localization network.
Deliverables:	Set up virtual network website including content management system. Draft network constitution. MOU signed by partners, formally setting up the network. Annual Technology Review for PAN L10n network. Poll results on technology, policy and use needs. Annual Policy and Use Review for PAN L10n network. 1st evaluative report on the self-financing Research Chair model to sustain the network in research capacity building and research outcomes. Legal formalities completed for network registration; network bank account opened. Annual Technology Review for PAN L10n network. 2nd evaluative report on the self-financing Research Chair model to sustain the network in research capacity building and research outcomes.
Project URL(s):	http://panl10n.cle.org.pk/

5

Online Torwali Dictionary

Mr. Inamullah

National Geographic
IDRC, Canada

Completed

Project Details
Start Date:	2010
Duration:	1 year
Funding Organization:	National Geographic; IDRC, Canada
Principal Investigator(s):	Mr. Inamullah
Project Status:	Completed
Objectives:	To resolve issues of orthography, documentation, and language preservation for future generations, and to promote the community's identity and culture
Scope:	Familiarize people with the Torwali language, which is unwritten and has previously been little studied
Deliverables:	The Online Torwali Dictionary contains 7,988 entries, comprising 7,387 words, 402 idioms and 199 proverbs.
Project URL(s):	http://otd.cle.org.pk/

4

Subh-e-nau

Ms. Huda Sarfraz
Ms. Sana Shams

Bytes for All, Pakistan

Completed

Project Details
Start Date:	1st March, 2011
Duration:	1 year
Funding Organization:	Bytes for All, Pakistan
Principal Investigator(s):	Ms. Sana Shams, Ms. Huda Sarfraz
Project Status:	Completed
Objectives:	Design and organize an intense training specifically customized for empowering women survivors of violence by equipping them with the necessary information and communication technology (ICT) tools to voice their stories and for self-healing.
Scope:	Creating strong awareness of violence against women and a rising urge for safety
Deliverables:	General computer usage (log in, log off, and general word processing). Digital story development. Conceptualizing and narrating stories. Storyboarding. Image creation using multimedia tools (digital cameras, picture editors, etc.). Audio production (using sound recorders, audio editing software, etc.). Movie editing and production. Developing blog posts for self-expression. Creating online communication networks for consolidating their agency.
Project URL(s):	http://cle.org.pk/subhenau/

3

Asian Language Support on Mobile Platform

Dr. Sarmad Hussain

Completed

Project Details
Start Date:	August 2010
Duration:	1 year
Funding Organization:	IDRC, Canada
Principal Investigator(s):	Dr. Sarmad Hussain
Project Status:	Completed
Objectives:	To enable Asian languages on mobile platforms. To enable complex Asian writing systems and input methods.
Scope:	Provide a scalable and universal solution, instead of developing script-based rules for all the different writing systems and deploying them separately as applications with bitmap fonts
Deliverables:	Challenges of deploying an open source rendering engine onto a mobile platform
Project URL(s):	http://www.cle.org.pk/research/projects/PANL10n_mobile/

2

PAN Localization Project

Dr. Sarmad Hussain

IDRC, Canada

Completed

Project Details
Start Date:	1st May, 2010
Duration:	1.5 years
Funding Organization:	International Development Research Centre (IDRC), Canada
Principal Investigator(s):	Dr. Sarmad Hussain
Project Status:	Completed
Objectives:	To examine effective means to develop digital literacy through the use of local language computing and content. To explore development of sustainable human resource capacity for R&D in local language computing as a means to raise current levels of technological support for Asian languages. To advance policy for development and use of local language computing and content. To study and develop coherent instruments to gauge the effectiveness of multi-disciplinary research concerning the adoption of local language technology by rural communities.
Scope:	A wide variety of activities need to be orchestrated across various countries for different languages and cultures to carry out the objectives of the second phase of the PAN Localization project. As content is an important ingredient to catalyze the use of technology, the scope of Phase II of PAN Localization includes looking at challenges in publishing online content in local languages. Technology, even in a local language, is not easily usable without adequate training. Therefore, the project will look at effective strategies to develop and conduct technology training, not only to access content but also to generate it. The project aims at the development of local language computing technology. The project will continue the necessary research to consolidate the earlier work on standards to enable local language computing in Asian languages, looking at fonts, collation, locale, keyboards, encodings, etc. The project will also look into emerging standards for access, including keypads, Internationalized Domain Names (IDNs), etc. The project also aims to work with policy-makers, scientists and end users and to develop and report on the partnerships developed across these groups.
Deliverables:	Language Resources e.g. Tagged and Parallel Corpora and Lexica, Open Source Software Localization, Speech Recognition Systems, Text to Speech Systems, Linux Distributions, Application for Word Segmentation, Morphological and Syntactic Parsers, Local Language Mobile Interface, Local Language Content Creation, End User Training and Outreach
Project URL(s):	http://panl10n.cle.org.pk/

1

Punjab Government's Flood Relief Website-Urdu Version

Dr. Sarmad Hussain

Completed

Project Details
Start Date:	10th August, 2010
Duration:	6 months
Collaborating Organization:	Punjab Information Technology Board
Principal Investigator(s):	Dr. Sarmad Hussain
Project Status:	Completed
Objectives:	The Punjab Flood Relief and Rehabilitation portal (http://floodrelief.punjab.gov.pk) has been set up in the wake of the recent floods by the Punjab Provincial Disaster Management Authority (PDMA). The portal has been developed by the Punjab Information Technology Board (PITB) in English. A variety of information, including financial information, damage assessment estimates, loss statistics and details of relief activities, is being updated on a daily basis and publicized through this portal. In order to make this information accessible on a wider scale, in particular to the large portion of the population that is not literate in English, the portal is also being localized and made available in Urdu (http://floodrelief.punjab.gov.pk/urdu) by the Center for Language Engineering, Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology. This is a vital measure to ensure that this critical information is made available in a language medium that is accessible to most of the population.
Scope:	Translation of English website text to Urdu. Translation of English text in images to Urdu. Reversal of image layout from left-to-right to right-to-left. Complete development and maintenance of the website.
Deliverables:	Development and maintenance of Urdu version of the Flood Relief website in real time
Project URL(s):	http://floodrelief.punjab.gov.pk/urdu