ΑΙhub.org
 

Natural language processing model for African languages


by
17 November 2021



share this:

africa
Researchers have developed an AI model to help computers work more efficiently with a wider variety of languages.

African languages have received relatively little attention from computer scientists, so few natural language processing capabilities have been available to large swaths of the continent. A new language model, developed by researchers at the University of Waterloo’s David R. Cheriton School of Computer Science, begins to fill that gap by enabling computers to analyze text in African languages for many useful tasks.

The new neural network model, which the researchers have dubbed AfriBERTa, uses deep-learning techniques to achieve state-of-the-art results for low-resource languages.

The neural network language model works specifically with 11 African languages, such as Amharic, Hausa, and Swahili, spoken collectively by more than 400 million people. It achieves competitive output quality despite learning from just one gigabyte of text, while other models require thousands of times more data.

“Pretrained language models have transformed the way computers process and analyze textual data for tasks ranging from machine translation to question answering,” said Kelechi Ogueji, a master’s student in computer science at Waterloo. “Sadly, African languages have received little attention from the research community.”

“One of the challenges is that neural networks are bewilderingly text- and computer-intensive to build. And unlike English, which has enormous quantities of available text, most of the 7,000 or so languages spoken worldwide can be characterized as low-resource, in that there is a lack of data available to feed data-hungry neural networks.”

Most of these models work using a technique known as pretraining. To accomplish this, the researcher presents the model with text where some of the words have been covered up or masked. The model then has to guess the masked words. By repeating this process, many billions of times, the model learns the statistical associations between words.

“Being able to pretrain models that are just as accurate for certain downstream tasks, but using vastly smaller amounts of data has many advantages,” said Jimmy Lin, the Cheriton Chair in Computer Science and Ogueji’s advisor. “Needing less data to train the language model means that less computation is required and consequently lower carbon emissions associated with operating massive data centres. Smaller datasets also make data curation more practical, which is one approach to reduce the biases present in the models.”

“This work takes a small but important step to bringing natural language processing capabilities to more than 1.3 billion people on the African continent.”

Assisting Ogueji and Lin in this research is Yuxin Zhu, who recently completed an undergraduate degree in computer science at Waterloo. Together, they presented their research paper, Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resource languages, at the Multilingual Representation Learning Workshop at the 2021 Conference on Empirical Methods in Natural Language Processing.



tags: ,


University of Waterloo




            AIhub is supported by:


Related posts :



Introducing the NASA Onboard Artificial Intelligence Research (OnAIR) platform: an interview with Evana Gizzi

  03 Jul 2025
Find out about the OnAIR platform, some of the particular challenges of deploying AI-based solutions in space, and how the tool has been used so far.

An interview with Nicolai Ommer: the RoboCupSoccer Small Size League

  01 Jul 2025
We caught up with Nicolai to find out more about the Small Size League, how the auto referees work, and how teams use AI.

Forthcoming machine learning and AI seminars: July 2025 edition

  30 Jun 2025
A list of free-to-attend AI-related seminars that are scheduled to take place between 1 July and 31 August 2025.
monthly digest

AIhub monthly digest: June 2025 – gearing up for RoboCup 2025, privacy-preserving models, and mitigating biases in LLMs

  26 Jun 2025
Welcome to our monthly digest, where you can catch up with AI research, events and news from the month past.

RoboCupRescue: an interview with Adam Jacoff

  25 Jun 2025
Find out what's new in the RoboCupRescue League this year.

Making optimal decisions without having all the cards in hand

Read about research which won an outstanding paper award at AAAI 2025.

Exploring counterfactuals in continuous-action reinforcement learning

  20 Jun 2025
Shuyang Dong writes about her work that will be presented at IJCAI 2025.

What is vibe coding? A computer scientist explains what it means to have AI write computer code − and what risks that can entail

  19 Jun 2025
Until recently, most computer code was written, at least originally, by human beings. But with the advent of GenAI, that has begun to change.



 

AIhub is supported by:






©2025.05 - Association for the Understanding of Artificial Intelligence


 












©2025.05 - Association for the Understanding of Artificial Intelligence