AIhub.org
 

Interview with Henok Biadglign Ademtew: Creating an Amharic, Ge’ez and English parallel dataset

04 June 2024




African languages are not well represented in natural language processing (NLP), in large part because of a lack of resources for training models. Henok Biadglign Ademtew and Mikiyas Girma Birbo have created an Amharic, Ge’ez, and English parallel dataset to help advance research into low-resource languages. We spoke to Henok about this project, the creation of the dataset, and some of the challenges faced.

Could you tell us a bit about Ge’ez and give us some background into why it was important to create this dataset?

Most of the languages in Africa are very low-resourced, and not much text data is available. Ge’ez in particular is low-resource in terms of digital availability, but it is one of the best-prepared languages in terms of existing written text. The language is very strongly linked to the church, and there are a lot of Ge’ez texts in churches here in Ethiopia. Every document that you find is related in one way or another to the Ethiopian Orthodox Tewahedo Church. You may have come across Ge’ez in some European countries as well. There are BS/MS-level Ge’ez courses at universities such as the University of Hamburg, under Ethiopian and Eritrean Studies.

There were two main motivating factors for creating the dataset. First, I personally want to learn Ge’ez, but digital resources don’t exist beyond simple dictionaries. If somebody wanted to make an app (like Duolingo) for Ge’ez, they wouldn’t find the resources to create it. We wanted to bridge that gap. Second, we wanted to create a dataset that was outside of the church context. Why not translate news from BBC or CNN to Ge’ez?

Something else we noticed was that, when we came across research papers by different groups in Ethiopia, most of them don’t share their datasets. We wanted to provide an open-source dataset that other researchers could use. Our dataset can be used as a foundation that people can build on, explore, and expand.

How did you go about creating the dataset?

The first part involved collecting all the online resources that we could get. We extracted text in Ge’ez, Amharic, and English from different websites to create parallel sentences. As I mentioned, this text was mostly related to the church and the Bible. We didn’t want the dataset to be only Ge’ez-English or Ge’ez-Amharic. We wanted to bring the three together in a dataset that is Ge’ez-centric but also includes English and Amharic translations.

The second part was to hire three translators who could focus on the news part – i.e. creating sentences that could be used to translate news articles into Ge’ez. Working with the translators, we created a new dataset with 1,000 fresh sentences. We also collected around 17,000-18,000 language pairs from the Bible.
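To make the Ge’ez-centric, three-way alignment described above more concrete, here is a minimal sketch of how one record and its translation directions could be represented. The class name, field names, and the `domain` label are hypothetical and are not taken from the published dataset; the language codes are the ISO 639-3 codes for Ge’ez (`gez`), Amharic (`amh`), and English (`eng`), and the placeholder strings stand in for real sentences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelEntry:
    """One Ge'ez-centric record with its Amharic and English translations."""
    geez: str
    amharic: str
    english: str
    domain: str  # hypothetical label, e.g. "bible" or "news"


def to_directed_pairs(entry: ParallelEntry) -> list[tuple[str, str, str, str]]:
    """Expand one triple into (src_lang, tgt_lang, src_text, tgt_text) rows.

    Only pairs that include Ge'ez are produced, mirroring a Ge'ez-centric design.
    """
    others = {"amh": entry.amharic, "eng": entry.english}
    pairs = []
    for lang, text in others.items():
        pairs.append(("gez", lang, entry.geez, text))  # Ge'ez -> other language
        pairs.append((lang, "gez", text, entry.geez))  # other language -> Ge'ez
    return pairs


# Placeholder strings only; no real sentences are invented here.
example = ParallelEntry(
    geez="<Ge'ez sentence>",
    amharic="<Amharic sentence>",
    english="<English sentence>",
    domain="news",
)
pairs = to_directed_pairs(example)
for src_lang, tgt_lang, _, _ in pairs:
    print(f"{src_lang} -> {tgt_lang}")
```

Storing each sentence once as a triple and deriving the directed pairs on demand keeps the three languages aligned and avoids duplicating text across separate bilingual files.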

What were the challenges that you faced when working on the dataset?

One of the challenges is that the language is mainly customised for the church. So, for example, when you want to use swear words, you don’t find them in Ge’ez text!

The second challenge concerns international words like “economics”. In Amharic we simply say “economics”, so it’s fine, but in Ge’ez you can’t directly use that. The language is very flexible, so you can create “economics” in Ge’ez by combining two words. However, doing that requires a higher level of Ge’ez expertise than we had. We approached this by using the international word. We’re happy for a linguistics expert, or someone with a high level of Ge’ez, to build on our approach.

The third challenge is related to pronunciation and written text. In English, for example, there are instances where two similarly written words have the same pronunciation. In Ge’ez, there are words that are pronounced the same but written differently, and the two words have completely different meanings. For example, ሰዓሊ and ሰአሊ have the same sound but different meanings, one meaning “draw a picture” and the other “beg for us”. You need to be an expert in Ge’ez writing to recognise that in the written language. A further difficulty is that the people who are experts in Ge’ez writing are not very good when it comes to working digitally.
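The point about same-sounding words that are written differently can be checked at the character level. The short sketch below (an illustration, not part of the authors' pipeline) compares the two words quoted above and reports where they differ; the helper name `differing_positions` is hypothetical. It also checks that standard Unicode NFC normalization leaves the two spellings distinct, so any cleaning step that deliberately merged same-sounding letters would erase the difference in meaning.

```python
import unicodedata

# The two words quoted in the interview, copied as written.
word_a = "ሰዓሊ"
word_b = "ሰአሊ"


def differing_positions(a: str, b: str) -> list[tuple[int, str, str]]:
    """Return (index, char_from_a, char_from_b) wherever the strings differ."""
    return [(i, ca, cb) for i, (ca, cb) in enumerate(zip(a, b)) if ca != cb]


diffs = differing_positions(word_a, word_b)
for index, ca, cb in diffs:
    print(
        f"position {index}: "
        f"U+{ord(ca):04X} {unicodedata.name(ca, 'unknown')} vs "
        f"U+{ord(cb):04X} {unicodedata.name(cb, 'unknown')}"
    )

# Standard normalization does not merge the two spellings.
same_after_nfc = (
    unicodedata.normalize("NFC", word_a) == unicodedata.normalize("NFC", word_b)
)
print("identical after NFC:", same_after_nfc)
```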

Another challenge was that the translators were not as fast as we expected them to be, and we had a paper deadline approaching!

Your paper has two parts, the dataset and a model. Could you talk a bit about the model?

When making models, you can either build them from scratch using your own datasets, or you can take a pre-trained model and fine-tune it for your needs. For Ge’ez, making a model from scratch would not move us forward, so we decided to fine-tune an existing model – NLLB (from the Facebook research group). We got pretty interesting results. This model in its full form is very big, around 54 billion parameters. Running that requires huge computing power, a minimum of four 32GB GPUs just for inference – I’m not sure we have that in our country, let alone available to us. There is a distilled version, which has the same performance but only 600 million parameters. We used that version and fine-tuned it. We got BLEU scores of 12.29 and 30.66 for English to Ge’ez and Ge’ez to English, respectively, and 9.39 and 12.29 for Amharic to Ge’ez and Ge’ez to Amharic, respectively.
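The hardware figure mentioned above can be sanity-checked with a rough back-of-the-envelope calculation. The sketch below assumes 16-bit weights (2 bytes per parameter) and counts only the memory needed to hold the weights, ignoring activations and other overhead; those assumptions are ours and are not stated in the interview. The function names are hypothetical.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GB (10**9 bytes). Assumes 16-bit weights by default."""
    return num_params * bytes_per_param / 1e9


def min_gpus(num_params: float, gpu_memory_gb: float, bytes_per_param: int = 2) -> int:
    """Smallest number of GPUs whose combined memory can hold the weights."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)


full_model_params = 54e9   # "around 54 billion parameters"
distilled_params = 600e6   # "600 million parameters"

full_gb = weight_memory_gb(full_model_params)
distilled_gb = weight_memory_gb(distilled_params)
gpus_needed = min_gpus(full_model_params, gpu_memory_gb=32)

print(f"full model weights: {full_gb:.1f} GB")
print(f"distilled model weights: {distilled_gb:.1f} GB")
print(f"32 GB GPUs needed for the full model: {gpus_needed}")
```

Under these assumptions the full model needs about 108 GB for its weights, which is consistent with a minimum of four 32GB GPUs, while the distilled model needs roughly 1.2 GB.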

What are your future plans for this work?

We are planning to expand the dataset a bit more and also to address the challenge of translating words that are common across the world but do not exist in Ge’ez. Our primary goal is to extend Ge’ez into areas beyond religious use cases. This will be a case of creating new sentences, taking existing Amharic and English news data or educational content, and translating it. We also hope to digitize different Ge’ez books and prayers from the Ethiopian Orthodox Tewahedo Church. To get everything ready for our translators, we’ve created a platform where they can enter their translations and where a reviewer can check that these translations are correct.

About Henok

Henok Biadglign Ademtew is a Machine Learning Engineer based in Ethiopia. He works at the Ethiopian AI Institute. He is a member of different research communities like EthioNLP, Cohere For AI, Deep Learning Indaba, and others.

Read the work in full

AGE: Amharic, Ge’ez and English Parallel Dataset, Henok Biadglign Ademtew, Mikiyas Girma Birbo.


The AI Around the World series is supported through a donation from the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). AIhub retains editorial freedom in selecting and preparing the content.





Lucy Smith, Managing Editor for AIhub.