AIhub.org

Interview with Henok Biadglign Ademtew: Creating an Amharic, Ge’ez and English parallel dataset


04 June 2024




African languages are not well-represented in natural language processing (NLP). This is in large part due to a lack of resources for training models. Henok Biadglign Ademtew and Mikiyas Girma Birbo have created an Amharic, Ge’ez, and English parallel dataset to help advance research into low-resource languages. We spoke to Henok about this project, the creation of the dataset, and some of the challenges faced.

Could you tell us a bit about Ge’ez and give us some background into why it was important to create this dataset?

Most of the languages in Africa are very low-resourced, and not much text data is available. The language of Ge’ez in particular is low-resource in terms of digital availability, but it is one of the most well-prepared in terms of text data. The language is very strongly linked to the church and there are a lot of Ge’ez texts in churches here in Ethiopia. Every document that you find is related in one way or another to the church (Ethiopian Orthodox Tewahedo Church). You may well have seen it in some countries in Europe as well. There are BS/MS-level Ge’ez courses in universities like the University of Hamburg, within Ethiopian and Eritrean Studies.

There were two main motivating factors for creating the dataset. Personally, I want to learn Ge’ez but digital resources don’t exist beyond simple dictionaries. If somebody wanted to make an app (like Duolingo) for Ge’ez they wouldn’t find resources to be able to create it. We wanted to bridge that gap. Secondly, we wanted to create a dataset that was outside of the church context. Why not translate news from BBC or CNN to Ge’ez?

Something else we noticed was that, when we came across research papers by different groups in Ethiopia, most of them don’t share their datasets. We wanted to provide an open-source dataset that other researchers could use. Our dataset can be used as a foundation that people can build on, explore, and expand.

How did you go about creating the dataset?

The first part involved collecting all the online resources that we could get. We extracted text in Ge’ez, Amharic, and English from different websites to create a parallel sequence. As I mentioned, this text was mostly related to the church and the Bible. We didn’t want the dataset to be only Ge’ez-English or Ge’ez-Amharic. We wanted to bring the three together so that the dataset is Ge’ez-centric, but also contains English and Amharic translations.

The second part was to hire three translators who could focus on the news part – i.e. creating sentences that could be used to translate news articles into Ge’ez. Working with the translators we created a new dataset with 1,000 fresh sentences. We also collected around 17,000-18,000 language pairs from the Bible.
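A Ge’ez-centric trilingual dataset of this kind can be pictured as one record per sentence holding all three versions, from which pairwise training data is derived. The sketch below is only an illustration: the `ParallelEntry` class, its field names, the `source` labels, and the placeholder strings are assumptions made here, not the authors' actual schema or data.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class ParallelEntry:
    """One aligned sentence in all three languages (hypothetical schema)."""
    geez: str
    amharic: str
    english: str
    source: str  # e.g. "bible" or "news" (labels assumed for illustration)

    def text(self, lang: str) -> str:
        # Keys are ISO 639-3 codes for Ge'ez, Amharic and English.
        return {"gez": self.geez, "amh": self.amharic, "eng": self.english}[lang]


def translation_pairs(entries, geez_centric=True):
    """Yield (src_lang, tgt_lang, src_text, tgt_text) for every direction.

    With geez_centric=True only directions that involve Ge'ez are kept.
    """
    for entry in entries:
        for src, tgt in permutations(("gez", "amh", "eng"), 2):
            if geez_centric and "gez" not in (src, tgt):
                continue
            yield src, tgt, entry.text(src), entry.text(tgt)


# Placeholder strings stand in for real sentences.
entries = [
    ParallelEntry("<Ge'ez sentence>", "<Amharic sentence>", "<English sentence>", "news"),
]
pairs = list(translation_pairs(entries))
for src, tgt, src_text, tgt_text in pairs:
    print(f"{src} -> {tgt}: {src_text} => {tgt_text}")
```

Keeping all three languages in one record means each new translated sentence yields four Ge’ez-related translation directions at once, rather than having to be aligned separately for each language pair.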

What were the challenges that you faced when working on the dataset?

One of the challenges is that the language is mainly customised for the church. So, for example, when you want to use swear words, you don’t find them in Ge’ez text!

The second challenge is that when there are international words, like “economics”, in Amharic we say “economics” so it’s fine, but, in Ge’ez, you can’t directly use that. The language is very flexible so you can create “economics” in Ge’ez by combining two words. However, doing that requires a higher level of Ge’ez expertise than we had. We approached this by using the international word. We’re happy for a linguistics expert, or someone with a high level of Ge’ez, to build on our approach.

The third challenge is related to pronunciation and written text. In English, for example, there are cases where two similarly written words have the same pronunciation. In Ge’ez, there are words that are the same in terms of pronunciation but that are written differently. These two words will have completely different meanings. For example, ሰዓሊ and ሰአሊ have the same sound but different meanings: one means “draw a picture” and the other “beg for us”. You need to be an expert in Ge’ez writing to recognise that in the written language. One challenge is that those people who are experts in Ge’ez writing are not very good when it comes to working digitally.
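The difference between the two spellings quoted above is visible at the level of Unicode code points, which matters for anyone processing this text digitally. The sketch below compares the two words character by character. The one-entry `HOMOPHONE_MAP` is a hypothetical table introduced here, loosely modelled on the kind of same-sound character normalisation that some Amharic text-processing pipelines apply; it is not something the interview describes.

```python
import unicodedata

# The two words quoted in the interview.
WORD_A = "ሰዓሊ"
WORD_B = "ሰአሊ"

# Hypothetical one-entry normalisation table, shown only to illustrate the risk.
HOMOPHONE_MAP = {"ዓ": "አ"}


def differing_positions(a: str, b: str) -> list[int]:
    """Return the indices at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def normalize(text: str) -> str:
    """Replace each character found in HOMOPHONE_MAP with its mapped form."""
    return "".join(HOMOPHONE_MAP.get(ch, ch) for ch in text)


# Show the code point and Unicode name of each character that differs.
for i in differing_positions(WORD_A, WORD_B):
    for word in (WORD_A, WORD_B):
        ch = word[i]
        print(f"position {i}: {ch} U+{ord(ch):04X} {unicodedata.name(ch)}")

# After normalisation the two spellings can no longer be told apart.
print(normalize(WORD_A) == normalize(WORD_B))
```

If a cleaning step like `normalize` were applied to Ge’ez text, the two words would collapse into one string and the difference in meaning would be lost, so any such normalisation would need to be checked by someone who reads Ge’ez.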

Another challenge was that the translators were not as fast as we expected them to be, and we had a paper deadline approaching!

Your paper has two parts, the dataset and a model. Could you talk a bit about the model?

When making models you can either make them from scratch, using your datasets, or you can use a pre-trained model and fine-tune that model for your needs. For Ge’ez, making a model from scratch would not move us forward, so we decided to fine-tune an existing model – NLLB (from the Facebook research group). We got pretty interesting results. This model in its full form is very big, around 54 billion parameters. Running that requires huge computing power, a minimum of four 32GB GPUs just for inference – I’m not sure we have that in our country, let alone available to us. There is a distilled version, which has the same performance, but it has 600 million parameters. We used that version and fine-tuned it. We got BLEU scores of 12.29 and 30.66 for English to Ge’ez and Ge’ez to English, respectively, and 9.39 and 12.29 for Amharic to Ge’ez and Ge’ez to Amharic, respectively.
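The hardware figures quoted here can be sanity-checked with rough arithmetic. The sketch below assumes the weights are stored as 16-bit floats (2 bytes per parameter) and counts only the weights, ignoring activations and other runtime overhead, so it gives a lower bound rather than a real deployment estimate. The function names are made up for this illustration.

```python
import math

BYTES_PER_PARAM_FP16 = 2  # assumption: weights stored as 16-bit floats


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Memory needed for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus(num_params: float, gpu_memory_gb: float) -> int:
    """Smallest number of GPUs whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(num_params) / gpu_memory_gb)


full_model = 54e9        # "around 54 billion parameters"
distilled_model = 600e6  # "600 million parameters"

print(weight_memory_gb(full_model))       # 108.0
print(min_gpus(full_model, 32))           # 4
print(weight_memory_gb(distilled_model))  # 1.2
print(min_gpus(distilled_model, 32))      # 1
```

Under these assumptions the full model needs about 108 GB for its weights, which is consistent with the stated minimum of four 32GB GPUs, while the 600-million-parameter version needs roughly 1.2 GB and fits on a single modest GPU.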

What are your future plans for this work?

We are planning to expand the dataset a bit more and also to address the challenge of translating words that are common across the world but do not exist in Ge’ez. Our primary goal is to expand Ge’ez into other areas besides religious use cases. This will be a case of creating new sentences, taking existing Amharic and English news data or educational content, and translating. We also hope to digitize different Ge’ez books and prayers from the Ethiopian Orthodox Tewahedo Church. To get everything ready for our translators, we’ve created a platform where they can enter their translations and where a reviewer can check that these translations are correct.

About Henok

Henok Biadglign Ademtew is a Machine Learning Engineer based in Ethiopia. He works at the Ethiopian AI Institute. He is a member of different research communities like EthioNLP, Cohere For AI, Deep Learning Indaba, and others.

Read the work in full

AGE: Amharic, Ge’ez and English Parallel Dataset, Henok Biadglign Ademtew, Mikiyas Girma Birbo.


The AI Around the World series is supported through a donation from the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). AIhub retains editorial freedom in selecting and preparing the content.





Lucy Smith is Senior Managing Editor for AIhub.