Center for Language Engineering

 
 



 

 



 
 


 
 


 
   
 

CLE makes these linguistic resources available free of charge to support academic, non-commercial research. Any processing fees charged are used to maintain these resources. Please contact CLE directly about discounts (available only to select public organizations in Pakistan) or commercial licensing options.

 
     
  CLE Urdu Broadcast Speaker Identification Corpus
   
 
CLE Catalog #: CLE23S011
Release Date: 26 September 2023
First Language of Speakers: Urdu
Duration: 276 Hours
Distribution: Free Download
Processing Fee (Pakistan): 0 PKR
Processing Fee (International): 0 USD
License: Yes
   
  Introduction
  This package is a compilation of speaker identification data sourced from major Urdu broadcast news channels in Pakistan, primarily from their YouTube channels. The data is drawn from a diverse range of programs, including talk shows, interviews, press conferences, and addresses from national assemblies. The corpus covers 1184 speakers, with a total audio duration of 276 hours.
   
  Data Source
  Data is collected from the YouTube channels of Geo News, ARY News, Samaa TV, Dunya News, BOL Network, PTV News, Aaj News, Express News, 92 News, Hum News, GNN, Dawn Media Group, Abb Takk News, and 24 Digital.
   
  Dataset
  The dataset package comprises three essential CSV files named "speaker_mapping," "Channel_tags," and "UBCSpk_detail_sheet," along with a folder labeled "Tdfs," all of which collectively provide comprehensive information about the dataset.
  1. Speaker Mapping (speaker_mapping.csv): It contains detailed information about individual speakers, including their names and unique identification numbers. Moreover, it provides a count of how many times each speaker appears across different selected audio sources.
  2. Channel Tags (Channel_tags.csv): It specifies the channels from which the speaker data was sourced. Each entry in this file includes the name of the channel and the tags associated with it.
  3. UBCSpk Detail Sheet (UBCSpk_detail_sheet.csv): This file offers a systematic overview of the dataset's audio and video content. It lists the file names alongside the corresponding YouTube URLs for reference, together with the duration of each video.
  4. Tdfs Folder: The "Tdfs" folder houses a collection of TDF files. These files contain metadata about the segments, from which the audio segments can be generated.
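
  As a rough illustration of how the package described above might be inspected, the following Python sketch reads the three CSV files and lists the contents of the "Tdfs" folder. It is not part of the CLE distribution. The function name load_package, the assumption that all items sit directly under one package directory, and the UTF-8 encoding are hypothetical choices made for this example. Column names are not assumed: each row is keyed by whatever header row the file actually contains.

```python
import csv
from pathlib import Path

# File names as given in the dataset description.
CSV_NAMES = ("speaker_mapping.csv", "Channel_tags.csv", "UBCSpk_detail_sheet.csv")


def load_package(root):
    """Read the three CSV files and list the files in the Tdfs folder.

    Hypothetical helper: assumes the CSV files and the Tdfs folder sit
    directly under `root` and that the CSV files are UTF-8 encoded.
    """
    root = Path(root)
    tables = {}
    for name in CSV_NAMES:
        with open(root / name, newline="", encoding="utf-8") as handle:
            # DictReader takes the keys from the file's own header row,
            # so no column names are hard-coded here.
            tables[name] = list(csv.DictReader(handle))
    # No file extension is assumed for the TDF files; every regular file is listed.
    tdf_files = sorted(p for p in (root / "Tdfs").iterdir() if p.is_file())
    return tables, tdf_files
```

  Calling load_package on the extracted package directory returns a dictionary of row lists, one per CSV file, and a sorted list of paths in the "Tdfs" folder. Printing the keys of the first row of each table is a quick way to see the actual column headers before working with them.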
   
  Download
  CLE Broadcast Speech SID
   
 
 
 

webmaster@cle.org.pk