Microsoft helps lesser-spoken languages survive and thrive in the digital world

Boa Senior was the last link to the 65,000-year-old pre-Neolithic culture of the Andaman Islands in the Indian Ocean. When she died, her language died with her. And this is not an isolated case.

Every two weeks a language is lost somewhere in the world. The same fate worries the Munda, a community of about one million people spread across the eastern Indian states of Jharkhand, Orissa and West Bengal.

“I learned Mundari very late because my parents lived in another state for work, so we didn’t speak the language at home,” explains Dr. Meenakshi Munda, a member of the Munda community and assistant professor in the department of anthropology at a university in Ranchi, Jharkhand. “I understand how important identity is to a community, and the younger generation is losing its identity because they don’t know their own language.”

Kalika Bali, a researcher at MSR India, is an expert in natural language processing and leads the Ellora project. Photography by Praveen Pillai for Microsoft.

The Munda community is concerned about the survival of its language, as only more dominant languages such as Bengali, Hindi and Oriya are taught in schools. Although Mundari has a written alphabet, it has almost no digital content or presence on the Internet, which gives people even less incentive to invest in learning the language.

In response, researchers at Microsoft Research (MSR) India have worked to create a digital ecosystem for languages like Mundari that do not have enough presence in the digital world. “The purpose of my work is that no one in this world is excluded from the use of technology because they speak a different language,” says Kalika Bali of MSR India.

Bali is an expert in natural language processing, the subfield of linguistics and artificial intelligence (AI) that focuses on training computer systems to understand spoken and written languages. She and her team work with communities and native speakers, using local resources to create the datasets that serve as the foundation for building AI technology. By involving the community in the process, they create datasets that are accurate and culturally relevant.

Monojit Chowdhary, Principal Data and Applied Scientist at Turing India, Microsoft, started research on the Ellora Project with Kalika Bali.

English has been the language of the Internet since its inception. With improved access to and demand for content in native languages, seven other widely spoken languages, such as Chinese and Spanish, are now comparable to English in terms of technology support. But these represent only eight of the world’s approximately 6,000 languages. In all, 88% of languages do not have a sufficient presence on the Internet, which means that 1.2 billion people, 20% of the world’s population, cannot use their own language to navigate the digital world.

“As a result, the gap between the haves and have-nots is huge,” explains Monojit Chowdhary, Principal Data and Applied Scientist for Turing India at Microsoft and a colleague of Bali.

Ellora Project

Within the framework of the Ellora Project (Enabling Low Resource Languages), the creation of digital resources has a dual purpose: preserving a language for posterity and ensuring that its speakers can participate and interact in the digital world.

Launched in 2015, the Ellora Project started with the basics. The first step was to determine what resources were already available for a language, such as printed literature, and the extent of its digital presence. In a 2020 paper, the researchers created a six-tiered classification, with the top tier representing resource-rich languages (such as English and Spanish) and the lower tiers reflecting those with little or no resources.

Their task is to mobilize the necessary resources for these languages and to create language models that meet the digital needs of their speakers. To achieve this, the researchers work with communities. “No language technology can be separated from the people who are going to use it,” says Bali.

In the case of Mundari, the researchers worked with the Indian Institute of Technology (IIT) Kharagpur in 2018 to sponsor a study of what the community needed to keep the language alive. What began as a simple vocabulary game to help students learn the language soon grew into sophisticated technical projects.

MSR researchers are currently working on a Hindi-to-Mundari text translation model, as well as a speech recognition model, that will give the community access to more content in their language. Work on a text-to-speech model is also underway, funded by the German Ministry for Economic Cooperation and Development under the Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) initiative “Forward – Artificial Intelligence for All”.

The team, led by faculty from IIT Kharagpur, initially worked with community members to manually translate phrases from Hindi to Mundari. To speed up the process, specialists at Microsoft developed a new technique called “Interactive Neural Machine Translation” (INMT), which predicts the next word as someone translates from Hindi to Mundari. “INMT makes it possible to translate from one language to another more efficiently. If I am translating from Hindi to Mundari, when I start typing in Mundari, it gives me predictive suggestions in the language itself. It’s like predictive text on a smartphone keyboard, but in two languages,” says Bali.
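The predictive behaviour Bali describes can be illustrated with a small sketch. This is not Microsoft's INMT implementation. In a real system a neural model would generate and score continuations from the Hindi source sentence. Here a hypothetical, hand-written list of scored candidate translations stands in for that model, and the tokens `t1` to `t4` are placeholders rather than real Mundari words. The function name `suggest_next_words` is likewise an assumption made for illustration.

```python
from collections import Counter


def suggest_next_words(candidates, typed_prefix, k=3):
    """Return up to k next-word suggestions consistent with the typed text.

    candidates: list of (translation, score) pairs standing in for the
        output of a translation model (hypothetical data, not INMT output).
    typed_prefix: the target-language words the translator has typed so far.
    """
    prefix = typed_prefix.split()
    votes = Counter()
    for translation, score in candidates:
        words = translation.split()
        # Keep only candidates that agree with the typed prefix and
        # still have at least one more word to offer.
        if words[:len(prefix)] == prefix and len(words) > len(prefix):
            votes[words[len(prefix)]] += score
    # Highest accumulated score first.
    return [word for word, _ in votes.most_common(k)]


# Placeholder target-language tokens; a real system would emit Mundari words.
CANDIDATES = [
    ("t1 t2 t3", 0.5),
    ("t1 t2 t4", 0.3),
    ("t1 t4 t3", 0.2),
]

print(suggest_next_words(CANDIDATES, ""))       # ['t1']
print(suggest_next_words(CANDIDATES, "t1"))     # ['t2', 't4']
print(suggest_next_words(CANDIDATES, "t1 t2"))  # ['t3', 't4']
```

Each word the translator accepts or types narrows the set of candidate translations, so the suggestions adapt to the translator's own choices instead of forcing a single machine output.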

To create the text-to-speech dataset, they collaborated with Karya, a digital work platform that enables data capture, labeling and annotation for building machine learning and artificial intelligence models. The team identified a male Mundari speaker, with Dr. Munda as the female speaker, and both recorded the translated phrases through the Karya app on an Android phone. The recordings, along with the associated text, are securely uploaded to the cloud and can be used by researchers to train text-to-speech models.

“The idea is that between Microsoft Research, Karya and the Indian Institute of Technology Kharagpur, we have data for machine translation, speech recognition and text-to-speech synthesis, so that these three technologies can be built for Mundari,” says Bali.

Members of the Idu Mishmi community collaborate with MSR India Research Fellow Pamir Gogoi (second from right) in Hunli, Arunachal Pradesh. Photo by Niyaldeep Borua for Microsoft.

The relationship between language and technology is important because, over time, it could enable sophisticated translation systems across all languages on government websites or online streaming platforms, among others. For the language in which you are reading this material, these systems are already a reality.

In addition to its work with the Munda community, the Ellora Project is undertaking other initiatives:

  • Helping Gondi speakers, very few of whom understand other languages, to obtain information. The Ellora Project, along with partners CGNETSwara and IIIT Naya Raipur, has created Adivasi Radio, a hub for news, videos and books. The team created 60,000 parallel phrases between Gondi and Hindi, leading to the development of a machine translation service.
  • Working with the Idu Mishmi community of Arunachal Pradesh, northeast India, to create a digital dictionary of their language, which now has fewer than 12,000 speakers. The digital dictionary will be used to teach children in schools.

“We want to reduce the time it would otherwise take for these languages to get enough data to take advantage of the technology,” says Bali. “If AI can do amazing things for English speakers, it should be able to do the same for any other human who doesn’t speak that language.”
