ΑΙhub.org
 

Natural Language Processing for low-resource languages


by
18 January 2023



share this:

A black keyboard at the bottom of the picture has an open book on it, with red words in labels floating on top, with a letter A balanced on top of them. The perspective makes the composition form a kind of triangle from the keyboard to the capital A. The AI filter makes it look like a messy, with a kind of cartoon style.Teresa Berndtsson / Better Images of AI / Letter Word Text Taxonomy / Licenced by CC-BY 4.0.

The majority of natural language processing (NLP) datasets and research at present focus on a small number of high-resource languages, with studies on English dominating the field. Clearly, such an imbalance is undesirable, putting those who do not use English at a disadvantage.

In this article, we highlight some of the work and initiatives being carried out on low-resource languages.

Lanfrica

Africa is one of the most linguistically diverse regions in the world. Despite this, African languages are barely represented in technology and research. Lanfrica aims to mitigate the difficulty encountered in the discovery of African language resources by creating a centralised hub. The team at Lanfrica have built a language-focused search engine that makes it fast and easy to find information on the internet about resources relating to African languages. Now with more than 1000 resources, their aim is to catalogue and connect all African language resources, one record at a time.

As well as this platform, Lanfrica also hosts regular online talks where you can hear from researchers in the field. This talk series provides a platform for anyone to share/showcase their efforts (research, projects, software, applications, datasets, models, initiatives, etc.) in NLP.

Masakane

Masakhane is a grassroots organisation whose mission is to strengthen and spur NLP research in African languages. The organisation is currently engaged in a number of projects, including:

Urdu

In this paper, Maaz Amjad, Sabur Butt, Hamza Imam Amjad, Alisa Zhila, Grigori Sidorov and Alexander Gelbukh outline their approach when taking part in the shared task UrduFake@FIRE2021, which centred on fake news detection in Urdu. This shared task aimed to attract and encourage researchers working in different NLP domains to address the automatic fake news detection task and help to mitigate the proliferation of fake content on the web.

The team have also looked into tweets in Urdu, in their paper Threatening Language Detection and Target Identification in Urdu Tweets.

Indian regional languages

B. S. Harish and R. Kasturi Rangan provide a comprehensive survey on Indian regional language processing, looking at tasks such as machine translation, named entity recognition, sentiment analysis and parts-of-speech tagging.

Bengali

Md. Rajib Hossain and Mohammed Moshiul Hoque study Bengali word embedding in their paper Towards Bengali Word Embedding: Corpus Creation, Intrinsic and Extrinsic Evaluations. They presents three embedding techniques with different hyperparameters implemented on a Bengali corpus with consists of 180 million words.

Indigenous languages of the Americas

Introducing QuBERT: A Large Monolingual Corpus and BERT Model for Southern Quechua, by Rodolfo Zevallos et al., introduces a large combined corpus for deep learning of Quechua. The authors also provide a public, pre-trained, BERT model called QuBERT. They have tested their corpus and its corresponding BERT model on two major tasks: (1) named-entity recognition (NER) and (2) part-of-speech (POS) tagging.

In this paper you can read about the AmericasNLP 2021 shared task on open machine translation for indigenous languages of the Americas. Manuel Mager et al. report on the 214 submissions from eight teams, which focussed on 10 different languages: Asháninka, Aymara, Bribri, Guarani, Nahuatl, Otomí, Quechua, Rarámuri, Shipibo-Konibo, and Wixarika.

Axolotl: a Web Accessible Parallel Corpus for Spanish-Nahuatl, by Ximena Gutierrez-Vasques, Gerardo Sierra and Isaac Hernandez Pompa, presents a project which comprises a Spanish-Nahuatl parallel corpus and its search interface.

Gina Bustamante, Arturo Oncevay, Roberto Zariquiey introduce monolingual corpora for four indigenous and endangered languages from Peru (Shipibo-konibo, Ashaninka, Yanesha and Yine) in their paper No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru.

Dysarthric speech recognition

Karima Kadaoui is researching how to help speech-impaired people communicate. Part of her project is to build an application to “translate” speech which may by unclear. She talks about the inspiration behind her work, and what she plans to achieve, in this video.

Sign language

Steven Kolawole created a dataset for Nigerian sign language with the help of a TV sign language broadcaster and two schools. Using this dataset, he built a sign-to-speech model for the language. You can find out more in this interview.

In their position paper, Including Signed Languages in Natural Language Processing, Kayo Yin, Amit Moryossef, Julie Hochgesang, Yoav Goldberg, and Malihe Alikhani call on the NLP community to include signed languages as a research area with high social and scientific impact. They discuss the linguistic properties of signed languages, review the limitations of current sign language processing models, and identify the open challenges to extend NLP to signed languages.

In her paper Approaches to the Anonymisation of Sign Language Corpora, Amy Isard considers the state-of-the-art for the anonymisation of sign language corpora. She explores the motivations behind anonymisation, and details the processes which can be used to anonymise both the video and the annotations belonging to a corpus.

Further reading


The AI Around the World series is supported through a donation from the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). AIhub retains editorial freedom in selecting and preparing the content.



tags:


Lucy Smith is Senior Managing Editor for AIhub.
Lucy Smith is Senior Managing Editor for AIhub.




            AIhub is supported by:



Related posts :



Identifying patterns in insect scents using machine learning

  19 Dec 2025
Scientists will use machine learning to predict what types of molecules interact with insect olfactory receptors.

2025 AAAI / ACM SIGAI Doctoral Consortium interviews compilation

  18 Dec 2025
We collate our interviews with the 2025 cohort of doctoral consortium participants.

A backlash against AI imagery in ads may have begun as brands promote ‘human-made’

  17 Dec 2025
In a wave of new ads, brands like Heineken, Polaroid and Cadbury have started celebrating their work as “human-made”.

AIhub blog post highlights 2025

  16 Dec 2025
As the year draws to a close, we take a look back at some of our favourite blog posts.

Using machine learning to track greenhouse gas emissions

  15 Dec 2025
PhD candidate Julia Wąsala searches for greenhouse gas emissions in satellite data.

AAAI 2025 presidential panel on the future of AI research – video discussion on AGI

  12 Dec 2025
Watch the first in a series of video discussions from AAAI.

The Machine Ethics podcast: the AI bubble with Tim El-Sheikh

Ben chats to Tim about AI use cases, whether GenAI is even safe, the AI bubble, replacing human workers, data oligarchies and more.

Australia’s vast savannas are changing, and AI is showing us how

Improving decision-making for dynamic and rapidly changing environments.



 

AIhub is supported by:






 












©2025.05 - Association for the Understanding of Artificial Intelligence