Smart little artificial intelligence-based assistants can be found everywhere these days. For us Pakistanis, however, the biggest hurdle to using these systems in our everyday lives remains the language barrier.
Most assistants are programmed to recognize speech in English, and even today few programs can recognize and translate Urdu speech. That may be about to change, thanks to a group of Pakistani researchers at ITU’s Center for Speech and Language Technologies (CSaLT) laboratory.
For any language to be recognizable to a computer, there needs to be a corpus, the most basic ingredient for teaching a machine that language. Here, the corpus is a database of recorded speech that covers all the basic distinct sounds (phonemes) used in everyday speech in a specific language.
Dr. Agha Ali Raza, an assistant professor at Information Technology University, Lahore, who holds a PhD in Language Technologies, has, along with his team, publicly released a corpus of Urdu sentences that covers all possible distinct sounds.
Called the “CSaLT Phonetically Rich Urdu Speech Corpus”, it comprises 70 minutes of transcribed read speech: 708 sentences covering all 63 possible phonemes. In total, it contains 5,656 unique words and is available for download at the research center’s website.
“Speech recognition is a two-step process. The corpus will give the computer application access to all possible phonemes used in the formation of meaningful Urdu words from everyday speech,” said Dr. Raza.
He further explained that although there are 63 distinct phonemes in Urdu, these don’t correspond to just 63 distinct sounds in everyday speech. The sound made for a phoneme may vary from one utterance to another depending on the phonemes that come before and after it in a word. As a result, any phoneme x can occur in 63 × 63 = 3,969 possible (tri-phoneme) contexts, one for each combination of preceding and following phoneme. The corpus he is releasing covers all these possible sounds.
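The tri-phoneme arithmetic can be illustrated with a short Python sketch. It is only an illustration, not the CSaLT team's code: the function names and the toy phoneme symbols are hypothetical, and only the inventory size of 63 comes from the article. It counts the contexts a single phoneme can appear in and lists the tri-phonemes found in a phoneme sequence.

```python
NUM_PHONEMES = 63  # inventory size reported in the article


def contexts_per_phoneme(inventory_size: int) -> int:
    """Number of (preceding, following) phoneme pairs around one phoneme."""
    return inventory_size * inventory_size


def total_triphones(inventory_size: int) -> int:
    """Number of (preceding, centre, following) triples over the inventory."""
    return inventory_size ** 3


def triphones(phonemes: list[str]) -> list[tuple[str, ...]]:
    """Slide a window of three over a phoneme sequence.

    Assumption: the sequence is a single word, so no special handling of
    word boundaries or silence is attempted.
    """
    return [tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]


if __name__ == "__main__":
    print(contexts_per_phoneme(NUM_PHONEMES))  # 63 * 63
    print(total_triphones(NUM_PHONEMES))       # 63 * 63 * 63
    # Toy, hypothetical phoneme symbols for a five-phoneme word.
    print(triphones(["k", "i", "t", "a", "b"]))
```

Collecting the triples from every transcribed sentence into a set would give a rough measure of how many distinct tri-phonemes a set of sentences covers.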
Dr. Raza’s work on this corpus started under the supervision of Dr. Sarmad Hussain as part of his master’s thesis at the National University of Computer and Emerging Sciences (FAST), Lahore. Later on, he and Dr. Hussain were also helped by Huda Sarfraz, Inaam Ullah and Zahid Sarfaraz.
Thanks to this corpus, the process of making a speech recognition program for the Urdu language has just gotten a lot easier. All that is needed is a repository of the words used every day in the Urdu language.
“We hope that release of this corpus will also prove beneficial for regional languages in the country and languages lacking ample linguistic resources all over the world. Those interested in working on those languages can follow our technique to develop similar corpora of sentences in those languages,” he says.
“The technique used in development of this corpus will work for any language for which written material is available.”