Natural Language Processing for low-resource languages

18 January 2023


A black keyboard at the bottom of the picture has an open book on it, with red words in labels floating on top, and a letter A balanced on top of them. The perspective makes the composition form a kind of triangle from the keyboard to the capital A. The AI filter gives the image a messy, cartoon-like style. Teresa Berndtsson / Better Images of AI / Letter Word Text Taxonomy / Licenced by CC-BY 4.0.

The majority of natural language processing (NLP) datasets and research at present focus on a small number of high-resource languages, with studies on English dominating the field. Clearly, such an imbalance is undesirable, putting those who do not use English at a disadvantage.

In this article, we highlight some of the work and initiatives being carried out on low-resource languages.


Africa is one of the most linguistically diverse regions in the world. Despite this, African languages are barely represented in technology and research. Lanfrica aims to mitigate the difficulty encountered in the discovery of African language resources by creating a centralised hub. The team at Lanfrica have built a language-focused search engine that makes it fast and easy to find information on the internet about resources relating to African languages. Now with more than 1000 resources, their aim is to catalogue and connect all African language resources, one record at a time.

As well as this platform, Lanfrica also hosts regular online talks where you can hear from researchers in the field. This talk series provides a platform for anyone to share/showcase their efforts (research, projects, software, applications, datasets, models, initiatives, etc.) in NLP.


Masakhane is a grassroots organisation whose mission is to strengthen and spur NLP research in African languages. The organisation is currently engaged in a number of projects.


In this paper, Maaz Amjad, Sabur Butt, Hamza Imam Amjad, Alisa Zhila, Grigori Sidorov and Alexander Gelbukh outline their approach when taking part in the shared task UrduFake@FIRE2021, which centred on fake news detection in Urdu. This shared task aimed to attract and encourage researchers working in different NLP domains to address the automatic fake news detection task and help to mitigate the proliferation of fake content on the web.

The team have also looked into tweets in Urdu, in their paper Threatening Language Detection and Target Identification in Urdu Tweets.

Indian regional languages

B. S. Harish and R. Kasturi Rangan provide a comprehensive survey on Indian regional language processing, looking at tasks such as machine translation, named entity recognition, sentiment analysis and part-of-speech tagging.


Md. Rajib Hossain and Mohammed Moshiul Hoque study Bengali word embedding in their paper Towards Bengali Word Embedding: Corpus Creation, Intrinsic and Extrinsic Evaluations. They present three embedding techniques with different hyperparameters, implemented on a Bengali corpus which consists of 180 million words.

Indigenous languages of the Americas

Introducing QuBERT: A Large Monolingual Corpus and BERT Model for Southern Quechua, by Rodolfo Zevallos et al., introduces a large combined corpus for deep learning of Quechua. The authors also provide a public, pre-trained, BERT model called QuBERT. They have tested their corpus and its corresponding BERT model on two major tasks: (1) named-entity recognition (NER) and (2) part-of-speech (POS) tagging.

In this paper you can read about the AmericasNLP 2021 shared task on open machine translation for indigenous languages of the Americas. Manuel Mager et al. report on the 214 submissions from eight teams, which focussed on 10 different languages: Asháninka, Aymara, Bribri, Guarani, Nahuatl, Otomí, Quechua, Rarámuri, Shipibo-Konibo, and Wixarika.

Axolotl: a Web Accessible Parallel Corpus for Spanish-Nahuatl, by Ximena Gutierrez-Vasques, Gerardo Sierra and Isaac Hernandez Pompa, presents a project which comprises a Spanish-Nahuatl parallel corpus and its search interface.

Gina Bustamante, Arturo Oncevay and Roberto Zariquiey introduce monolingual corpora for four indigenous and endangered languages from Peru (Shipibo-konibo, Ashaninka, Yanesha and Yine) in their paper No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru.

Dysarthric speech recognition

Karima Kadaoui is researching how to help speech-impaired people communicate. Part of her project is to build an application to “translate” speech which may be unclear. She talks about the inspiration behind her work, and what she plans to achieve, in this video.

Sign language

Steven Kolawole created a dataset for Nigerian sign language with the help of a TV sign language broadcaster and two schools. Using this dataset, he built a sign-to-speech model for the language. You can find out more in this interview.

In their position paper, Including Signed Languages in Natural Language Processing, Kayo Yin, Amit Moryossef, Julie Hochgesang, Yoav Goldberg, and Malihe Alikhani call on the NLP community to include signed languages as a research area with high social and scientific impact. They discuss the linguistic properties of signed languages, review the limitations of current sign language processing models, and identify the open challenges to extend NLP to signed languages.

In her paper Approaches to the Anonymisation of Sign Language Corpora, Amy Isard considers the state of the art in the anonymisation of sign language corpora. She explores the motivations behind anonymisation, and details the processes which can be used to anonymise both the video and the annotations belonging to a corpus.

Further reading

The AI Around the World series is supported through a donation from the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). AIhub retains editorial freedom in selecting and preparing the content.


Lucy Smith, Managing Editor for AIhub.

©2021 - Association for the Understanding of Artificial Intelligence
