In a world that is rapidly digitizing and generating troves of data as a result, the need for modern techniques to extract useful insights from data is greater than ever. However, while languages like English benefit from advanced Natural Language Processing (NLP) tools that extract insights from unstructured data, not much work has been done for our local languages in Pakistan.
Urdu, the national language of Pakistan, has seen very little work done in NLP. While there are several reasons for this, a major one is the lack of basic tools and a framework for processing the Urdu language.
A Pakistani duo, Ikram Ali and Mujadad Rao, plans to change that by developing an open source Python library for Urdu called UrduHack. It is the first, and so far the only, Natural Language Processing library for Urdu that we have come across.
Ikram, the lead on this project, has over 7 years of experience in the software industry and is an avid Machine Learning practitioner. He holds a Bachelor's degree in Computer Science from Virtual University, Pakistan, and currently works as the Principal Software Engineer at Arbisoft, Lahore.
Mujadad, the second member of the project, holds a Bachelor's degree in Computer Science from the University of Central Punjab, where he was a Gold Medalist. A freelancer and developer, he currently works as a Machine Learning Developer at Arbisoft.
After seeing some of their work on the Pakistan.AI Forum, a budding community of Pakistani AI enthusiasts, I got in touch with Mujadad to get some more information about the project.
What is UrduHack?
UrduHack is the vision of Ikram Ali, a passionate reader with a love for books. Back in 2015, he was reading articles and research papers on the theme of Urdu as a dying language when he decided to look into the work being done on Urdu in the field of Natural Language Processing. However, while he did find a few organizations building applications for Urdu, most of the work was commercial, which left him disappointed. As a result, he set out to change that by building a full-fledged Urdu library.
“Our plan is to achieve the maximum possible heights with UrduHack. We want to make it a full-fledged Urdu NLP library which people can use to make thousands of interesting applications for desktop, mobile or web,” shared Mujadad.
“We want to help Urdu get back on its feet and work just as well as the English language with the modern technology stack.”
As of now, they have developed two core modules of the library: Normalization and Tokenization. Both are essential for cleaning data and converting it from a cluttered form to the standard form they have established for UrduHack. The library is still very much a work in progress, but they plan to use TensorFlow v2 in upcoming modules later this month. After that, here is the roadmap of modules they plan to implement next:
- Sentiment Analysis
- Sentence Classification
- Document Classification
- Named Entity Recognition
- Image to Text
- Speech to Text
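To give a rough sense of what a tokenization step involves, here is a minimal sketch using only the Python standard library. It is not UrduHack's implementation or API; the function names `split_sentences` and `split_words` are hypothetical, and the rules are deliberately simple: sentences end at the Urdu full stop (U+06D4), the Arabic question mark (U+061F), or an exclamation mark, and words are separated by whitespace.

```python
import re

# Assumed sentence terminators: Urdu full stop, Arabic question mark, "!".
TERMINATORS = "\u06D4\u061F!"
# Arabic comma (U+060C) is also dropped when splitting words.
PUNCTUATION = TERMINATORS + "\u060C"

def split_sentences(text: str) -> list[str]:
    """Split text into sentences, keeping the terminator on each one."""
    pattern = f"[^{TERMINATORS}]+[{TERMINATORS}]?"
    return [s.strip() for s in re.findall(pattern, text) if s.strip()]

def split_words(sentence: str) -> list[str]:
    """Split a sentence on whitespace and drop the punctuation marks above."""
    return re.findall(f"[^\\s{PUNCTUATION}]+", sentence)

# Two short sentences; the terminators are written as escapes for clarity.
sample = "یہ اردو ہے\u06D4 کیا حال ہے\u061F"
for sentence in split_sentences(sample):
    print(split_words(sentence))
```

Whitespace is not always a reliable word boundary in written Urdu, so a real tokenizer needs more than this; the sketch only illustrates the kind of step the module performs.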
They have faced a number of technical problems while developing UrduHack, starting with how the Urdu script is represented in Unicode. As the Urdu script is derived from the Arabic script, Urdu letters are encoded inside the Arabic Unicode blocks rather than in a block of their own, and some letters can be written with more than one code point. For example, the letter alif ‘ا’ can appear as either U+0627, the standard letter, or U+FE8D, its isolated presentation form. Many applications use the two codes interchangeably because both look the same to the human eye, but to a computer they are different characters.
“This redundancy of characters was the first challenge for us. So we contacted the Unicode Consortium and demanded a separate fixed Unicode block for Urdu, and they provided us that in no time.”
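The alif example can be checked with Python's standard `unicodedata` module. The sketch below is an illustration, not UrduHack's normalization module: it uses Unicode NFKC normalization, which folds compatibility presentation forms such as U+FE8D into their standard letters. The helper name `normalize_urdu` is hypothetical.

```python
import unicodedata

ALIF = "\u0627"           # ARABIC LETTER ALEF, the standard letter
ALIF_ISOLATED = "\uFE8D"  # ARABIC LETTER ALEF ISOLATED FORM, a presentation form

def normalize_urdu(text: str) -> str:
    """Fold compatibility presentation forms into their standard letters."""
    return unicodedata.normalize("NFKC", text)

print(ALIF == ALIF_ISOLATED)                  # False: the code points differ
print(unicodedata.name(ALIF_ISOLATED))        # ARABIC LETTER ALEF ISOLATED FORM
print(normalize_urdu(ALIF_ISOLATED) == ALIF)  # True: both are now U+0627
```

NFKC is a general-purpose normalization and also rewrites other compatibility characters, such as ligatures, so a library built specifically for Urdu may prefer a narrower, hand-picked mapping.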
The second big problem is one that they are still tackling: how to get Urdu data. To build a good corpus, they scraped some websites for books and articles written in Urdu, but most of that data is not in a proper form. The inconsistency of the data harms the performance of UrduHack. As a result, a lot of their time goes into cleaning the data and improving its quality.
The UrduHack team is actively looking for Urdu data available in digital form. If any of our readers would like to help them out, they can contact Mujadad at [email protected].
Contribute to UrduHack
I asked Mujadad what advice he would give young programmers looking to make a name for themselves in the field of Machine Learning and NLP. He said that the best way to learn is to contribute to the community.
“Most young programmers (including me) have the urge to get popular quickly by developing something unique. There hasn’t been much progress for the Urdu Language in NLP so you have the chance to be the first to the moon by developing interesting applications for Urdu and you will also help Urdu gain popularity again by doing so,” he added.