AIhub.org
 

Natural Language Processing for low-resource languages


by Lucy Smith
18 January 2023




A black keyboard at the bottom of the picture has an open book on it, with red words in labels floating on top, with a letter A balanced on top of them. The perspective makes the composition form a kind of triangle from the keyboard to the capital A. The AI filter makes it look messy, with a kind of cartoon style. Teresa Berndtsson / Better Images of AI / Letter Word Text Taxonomy / Licenced by CC-BY 4.0.

The majority of natural language processing (NLP) datasets and research at present focus on a small number of high-resource languages, with studies on English dominating the field. Clearly, such an imbalance is undesirable, putting those who do not use English at a disadvantage.

In this article, we highlight some of the work and initiatives being carried out on low-resource languages.

Lanfrica

Africa is one of the most linguistically diverse regions in the world. Despite this, African languages are barely represented in technology and research. Lanfrica aims to mitigate the difficulty encountered in the discovery of African language resources by creating a centralised hub. The team at Lanfrica have built a language-focused search engine that makes it fast and easy to find information on the internet about resources relating to African languages. Now with more than 1000 resources, their aim is to catalogue and connect all African language resources, one record at a time.

As well as this platform, Lanfrica also hosts regular online talks where you can hear from researchers in the field. This talk series provides a platform for anyone to share/showcase their efforts (research, projects, software, applications, datasets, models, initiatives, etc.) in NLP.

Masakhane

Masakhane is a grassroots organisation whose mission is to strengthen and spur NLP research in African languages. The organisation is currently engaged in a number of projects.

Urdu

In this paper, Maaz Amjad, Sabur Butt, Hamza Imam Amjad, Alisa Zhila, Grigori Sidorov and Alexander Gelbukh outline their approach when taking part in the shared task UrduFake@FIRE2021, which centred on fake news detection in Urdu. This shared task aimed to attract and encourage researchers working in different NLP domains to address the automatic fake news detection task and help to mitigate the proliferation of fake content on the web.

The team have also looked into tweets in Urdu, in their paper Threatening Language Detection and Target Identification in Urdu Tweets.

Indian regional languages

B. S. Harish and R. Kasturi Rangan provide a comprehensive survey on Indian regional language processing, looking at tasks such as machine translation, named entity recognition, sentiment analysis and parts-of-speech tagging.

Bengali

Md. Rajib Hossain and Mohammed Moshiul Hoque study Bengali word embedding in their paper Towards Bengali Word Embedding: Corpus Creation, Intrinsic and Extrinsic Evaluations. They present three embedding techniques with different hyperparameters, implemented on a Bengali corpus which consists of 180 million words.

Indigenous languages of the Americas

Introducing QuBERT: A Large Monolingual Corpus and BERT Model for Southern Quechua, by Rodolfo Zevallos et al., introduces a large combined corpus for deep learning of Quechua. The authors also provide a public, pre-trained BERT model called QuBERT. They have tested their corpus and its corresponding BERT model on two major tasks: (1) named-entity recognition (NER) and (2) part-of-speech (POS) tagging.

In this paper you can read about the AmericasNLP 2021 shared task on open machine translation for indigenous languages of the Americas. Manuel Mager et al. report on the 214 submissions from eight teams, which focussed on 10 different languages: Asháninka, Aymara, Bribri, Guarani, Nahuatl, Otomí, Quechua, Rarámuri, Shipibo-Konibo, and Wixarika.

Axolotl: a Web Accessible Parallel Corpus for Spanish-Nahuatl, by Ximena Gutierrez-Vasques, Gerardo Sierra and Isaac Hernandez Pompa, presents a project which comprises a Spanish-Nahuatl parallel corpus and its search interface.

Gina Bustamante, Arturo Oncevay and Roberto Zariquiey introduce monolingual corpora for four indigenous and endangered languages from Peru (Shipibo-Konibo, Ashaninka, Yanesha and Yine) in their paper No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru.

Dysarthric speech recognition

Karima Kadaoui is researching how to help speech-impaired people communicate. Part of her project is to build an application to “translate” speech which may be unclear. She talks about the inspiration behind her work, and what she plans to achieve, in this video.

Sign language

Steven Kolawole created a dataset for Nigerian sign language with the help of a TV sign language broadcaster and two schools. Using this dataset, he built a sign-to-speech model for the language. You can find out more in this interview.

In their position paper, Including Signed Languages in Natural Language Processing, Kayo Yin, Amit Moryossef, Julie Hochgesang, Yoav Goldberg, and Malihe Alikhani call on the NLP community to include signed languages as a research area with high social and scientific impact. They discuss the linguistic properties of signed languages, review the limitations of current sign language processing models, and identify the open challenges to extend NLP to signed languages.

In her paper Approaches to the Anonymisation of Sign Language Corpora, Amy Isard considers the state of the art in the anonymisation of sign language corpora. She explores the motivations behind anonymisation, and details the processes which can be used to anonymise both the video and the annotations belonging to a corpus.



The AI Around the World series is supported through a donation from the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). AIhub retains editorial freedom in selecting and preparing the content.





Lucy Smith is Senior Managing Editor for AIhub.









