AIhub.org
 

Natural Language Processing for low-resource languages


by Lucy Smith
18 January 2023



A black keyboard at the bottom of the picture has an open book on it, with red words in labels floating on top, and a letter A balanced on top of them. The perspective makes the composition form a kind of triangle from the keyboard to the capital A. The AI filter makes it look messy, with a kind of cartoon style. Teresa Berndtsson / Better Images of AI / Letter Word Text Taxonomy / Licenced by CC-BY 4.0.

The majority of natural language processing (NLP) datasets and research at present focus on a small number of high-resource languages, with studies on English dominating the field. Clearly, such an imbalance is undesirable, putting those who do not use English at a disadvantage.

In this article, we highlight some of the work and initiatives being carried out on low-resource languages.

Lanfrica

Africa is one of the most linguistically diverse regions in the world. Despite this, African languages are barely represented in technology and research. Lanfrica aims to mitigate the difficulty encountered in the discovery of African language resources by creating a centralised hub. The team at Lanfrica have built a language-focused search engine that makes it fast and easy to find information on the internet about resources relating to African languages. Now with more than 1000 resources, their aim is to catalogue and connect all African language resources, one record at a time.

As well as this platform, Lanfrica also hosts regular online talks where you can hear from researchers in the field. This talk series provides a platform for anyone to share/showcase their efforts (research, projects, software, applications, datasets, models, initiatives, etc.) in NLP.

Masakhane

Masakhane is a grassroots organisation whose mission is to strengthen and spur NLP research in African languages. The organisation is currently engaged in a number of projects.

Urdu

In this paper, Maaz Amjad, Sabur Butt, Hamza Imam Amjad, Alisa Zhila, Grigori Sidorov and Alexander Gelbukh outline their approach when taking part in the shared task UrduFake@FIRE2021, which centred on fake news detection in Urdu. This shared task aimed to attract and encourage researchers working in different NLP domains to address the automatic fake news detection task and help to mitigate the proliferation of fake content on the web.

The team have also looked into tweets in Urdu, in their paper Threatening Language Detection and Target Identification in Urdu Tweets.

Indian regional languages

B. S. Harish and R. Kasturi Rangan provide a comprehensive survey on Indian regional language processing, looking at tasks such as machine translation, named entity recognition, sentiment analysis and part-of-speech tagging.

Bengali

Md. Rajib Hossain and Mohammed Moshiul Hoque study Bengali word embedding in their paper Towards Bengali Word Embedding: Corpus Creation, Intrinsic and Extrinsic Evaluations. They present three embedding techniques with different hyperparameters, implemented on a Bengali corpus which consists of 180 million words.

Indigenous languages of the Americas

Introducing QuBERT: A Large Monolingual Corpus and BERT Model for Southern Quechua, by Rodolfo Zevallos et al., introduces a large combined corpus for deep learning of Quechua. The authors also provide a public, pre-trained BERT model called QuBERT. They have tested their corpus and its corresponding BERT model on two major tasks: (1) named-entity recognition (NER) and (2) part-of-speech (POS) tagging.

In this paper you can read about the AmericasNLP 2021 shared task on open machine translation for indigenous languages of the Americas. Manuel Mager et al. report on the 214 submissions from eight teams, which focussed on 10 different languages: Asháninka, Aymara, Bribri, Guarani, Nahuatl, Otomí, Quechua, Rarámuri, Shipibo-Konibo, and Wixarika.

Axolotl: a Web Accessible Parallel Corpus for Spanish-Nahuatl, by Ximena Gutierrez-Vasques, Gerardo Sierra and Isaac Hernandez Pompa, presents a project which comprises a Spanish-Nahuatl parallel corpus and its search interface.

Gina Bustamante, Arturo Oncevay and Roberto Zariquiey introduce monolingual corpora for four indigenous and endangered languages from Peru (Shipibo-Konibo, Ashaninka, Yanesha and Yine) in their paper No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru.

Dysarthric speech recognition

Karima Kadaoui is researching how to help speech-impaired people communicate. Part of her project is to build an application to “translate” speech which may be unclear. She talks about the inspiration behind her work, and what she plans to achieve, in this video.

Sign language

Steven Kolawole created a dataset for Nigerian sign language with the help of a TV sign language broadcaster and two schools. Using this dataset, he built a sign-to-speech model for the language. You can find out more in this interview.

In their position paper, Including Signed Languages in Natural Language Processing, Kayo Yin, Amit Moryossef, Julie Hochgesang, Yoav Goldberg, and Malihe Alikhani call on the NLP community to include signed languages as a research area with high social and scientific impact. They discuss the linguistic properties of signed languages, review the limitations of current sign language processing models, and identify the open challenges to extend NLP to signed languages.

In her paper Approaches to the Anonymisation of Sign Language Corpora, Amy Isard considers the state of the art in the anonymisation of sign language corpora. She explores the motivations behind anonymisation, and details the processes which can be used to anonymise both the video and the annotations belonging to a corpus.


The AI Around the World series is supported through a donation from the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). AIhub retains editorial freedom in selecting and preparing the content.



Lucy Smith is Senior Managing Editor for AIhub.




            AIhub is supported by:



©2025.05 - Association for the Understanding of Artificial Intelligence