AIhub.org
 

Interview with Henok Biadglign Ademtew: Creating an Amharic, Ge’ez and English parallel dataset

04 June 2024




African languages are not well-represented in natural language processing (NLP). This is in large part due to a lack of resources for training models. Henok Biadglign Ademtew and Mikiyas Girma Birbo have created an Amharic, Ge’ez, and English parallel dataset to help advance research into low-resource languages. We spoke to Henok about this project, the creation of the dataset, and some of the challenges faced.

Could you tell us a bit about Ge’ez and give us some background into why it was important to create this dataset?

Most of the languages in Africa are very low-resourced, and not much text data is available. Ge’ez in particular is low-resource in terms of digital availability, but it is one of the richest in terms of existing written text. The language is very strongly linked to the church, and there are a lot of Ge’ez texts in churches here in Ethiopia. Every document that you find is related in one way or another to the church (the Ethiopian Orthodox Tewahedo Church). You may also have come across it in some European countries: there are BS/MS-level Ge’ez courses at universities such as the University of Hamburg, under Ethiopian and Eritrean Studies.

There were two main motivating factors for creating the dataset. Personally, I want to learn Ge’ez but digital resources don’t exist beyond simple dictionaries. If somebody wanted to make an app (like Duolingo) for Ge’ez they wouldn’t find resources to be able to create it. We wanted to bridge that gap. Secondly, we wanted to create a dataset that was outside of the church context. Why not translate news from BBC or CNN to Ge’ez?

Something else we noticed was that, when we came across research papers by different groups in Ethiopia, most of them don’t share their datasets. We wanted to provide an open-source dataset that other researchers could use. Our dataset can be used as a foundation that people can build on, explore, and expand.

How did you go about creating the dataset?

The first part involved collecting all the online resources that we could get. We extracted text in Ge’ez, Amharic, and English from different websites and aligned it to create parallel sentences. As I mentioned, this text was mostly related to the church and the Bible. We didn’t want the dataset to be only Ge’ez-English or Ge’ez-Amharic. We wanted to bring the three together so that it is Ge’ez-centric, but with both English and Amharic translations.

The second part was to hire three translators who could focus on the news part – i.e. creating sentences that could be used to translate news articles into Ge’ez. Working with the translators, we created a new dataset with 1,000 fresh sentences. We also collected around 17,000-18,000 language pairs from the Bible.
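As an illustration of the Ge’ez-centric, three-way layout described above, here is a minimal Python sketch. The class name, field names, `domain` label, and placeholder strings are all hypothetical and are not taken from the published dataset; the sketch only shows how one trilingual record can yield the four translation directions discussed in the interview.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass(frozen=True)
class ParallelRecord:
    """One aligned sentence in all three languages (hypothetical schema)."""
    geez: str
    amharic: str
    english: str
    domain: str  # assumed label, e.g. "religious" or "news"


# The four Ge'ez-centric directions: every pair has Ge'ez on one side.
DIRECTIONS = [
    ("english", "geez"),
    ("geez", "english"),
    ("amharic", "geez"),
    ("geez", "amharic"),
]


def to_pairs(records: Iterable[ParallelRecord],
             src: str, tgt: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs for one translation direction."""
    for rec in records:
        yield getattr(rec, src), getattr(rec, tgt)


# Placeholder content only; real records would hold actual sentences.
records = [
    ParallelRecord("<Ge'ez 1>", "<Amharic 1>", "<English 1>", "religious"),
    ParallelRecord("<Ge'ez 2>", "<Amharic 2>", "<English 2>", "news"),
]

for src, tgt in DIRECTIONS:
    print(src, "->", tgt, list(to_pairs(records, src, tgt)))
```

Storing one record per sentence, rather than separate bilingual files, keeps the three languages aligned and lets any direction be derived on demand.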

What were the challenges that you faced when working on the dataset?

One of the challenges is that the language is mainly customised for the church. So, for example, when you want to use swear words, you don’t find them in Ge’ez text!

The second challenge concerns international words like “economics”. In Amharic we simply say “economics”, so it’s fine, but in Ge’ez you can’t use the word directly. The language is very flexible, so you can create “economics” in Ge’ez by combining two words. However, doing that requires a higher level of Ge’ez expertise than we had, so we used the international word instead. We’re happy for a linguistics expert, or someone with a high level of Ge’ez, to build on our approach.

The third challenge is related to pronunciation and written text. In English, for example, there are instances where two similarly written words have the same pronunciation. In Ge’ez, there are words that are the same in terms of pronunciation but that are written differently, and these words have completely different meanings. For example, ሰዓሊ and ሰአሊ have the same sound but different meanings: one means “draw a picture” and the other “beg for us”. You need to be an expert in Ge’ez writing to recognise that in the written language. One difficulty is that the people who are experts in Ge’ez writing are often not very comfortable working digitally.
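The spelling distinction matters for text preprocessing too. The Python sketch below uses the two words quoted above to show that they differ in exactly one character, and that a homophone-folding step of the kind sometimes applied to Amharic text would erase the difference. The folding table here is a hypothetical, deliberately tiny example, not something described in the interview or the paper.

```python
import unicodedata

# The two spellings quoted in the interview: same sound, different meaning.
WORD_A = "ሰዓሊ"
WORD_B = "ሰአሊ"

# Hypothetical folding table (an assumption for illustration): it maps two
# characters onto a third that is pronounced the same in modern usage.
HOMOPHONE_FOLD = str.maketrans({"ዐ": "አ", "ዓ": "አ"})


def fold_homophones(text: str) -> str:
    """Collapse the characters listed in HOMOPHONE_FOLD onto one form."""
    return text.translate(HOMOPHONE_FOLD)


def differing_characters(a: str, b: str):
    """Return the position-wise character pairs where a and b differ."""
    return [(x, y) for x, y in zip(a, b) if x != y]


# Show which characters distinguish the two words.
for x, y in differing_characters(WORD_A, WORD_B):
    print(unicodedata.name(x), "vs", unicodedata.name(y))

# After folding, the two distinct words become indistinguishable.
print(fold_homophones(WORD_A) == fold_homophones(WORD_B))
```

A pipeline that reuses Amharic-style normalisation on Ge’ez text would therefore merge words that a Ge’ez reader treats as different, which is one reason to keep the original spelling intact.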

Another challenge was that the translators were not as fast as we expected them to be, and we had a paper deadline approaching!

Your paper has two parts, the dataset and a model. Could you talk a bit about the model?

When making models you can either build them from scratch, using your own datasets, or you can take a pre-trained model and fine-tune it for your needs. For Ge’ez, making a model from scratch would not move us forward, so we decided to fine-tune an existing model – NLLB (from the Facebook research group). We got pretty interesting results. This model in its full form is very big, around 54 billion parameters. Running that requires huge computing power, a minimum of four 32GB GPUs just for inference – I’m not sure we have that in our country, let alone available to us. There is a distilled version, which has the same performance, but it has 600 million parameters. We used that version and fine-tuned it. We got BLEU scores of 12.29 and 30.66 for English to Ge’ez and Ge’ez to English, respectively, and 9.39 and 12.29 for Amharic to Ge’ez and Ge’ez to Amharic, respectively.
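For readers unfamiliar with BLEU, the following is a simplified Python sketch of how a corpus-level score on the 0-100 scale is computed: clipped n-gram precision for n = 1 to 4, combined by geometric mean and multiplied by a brevity penalty. It assumes whitespace tokenisation, one reference per sentence, and no smoothing. The interview does not say which scorer or settings the authors used, so this will not reproduce the reported numbers exactly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU (0-100), one reference per hypothesis."""
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalise output shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)


reference = ["the cat sat on the mat"]
print(corpus_bleu(["the cat sat on the mat"], reference))
print(corpus_bleu(["the cat sat on a mat"], reference))
```

Because the score depends on exact word overlap, results can be sensitive to tokenisation, which is worth keeping in mind when comparing scores across languages with different writing systems.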

What are your future plans for this work?

We are planning to expand the dataset a bit more and also to address the challenge of translating words that are common across the world but do not exist in Ge’ez. Our primary goal is to extend the use of Ge’ez to areas beyond religious use cases. This will be a case of creating new sentences by taking existing Amharic and English news data or educational content and translating it. We also hope to digitize different Ge’ez books and prayers from the Ethiopian Orthodox Tewahedo Church. To get everything ready for our translators, we’ve created a platform where they can enter their translations and where a reviewer can check that these translations are correct.

About Henok

Henok Biadglign Ademtew is a Machine Learning Engineer based in Ethiopia. He works at the Ethiopian AI Institute. He is a member of different research communities like EthioNLP, Cohere For AI, Deep Learning Indaba, and others.

Read the work in full

AGE: Amharic, Ge’ez and English Parallel Dataset, Henok Biadglign Ademtew, Mikiyas Girma Birbo.


The AI Around the World series is supported through a donation from the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). AIhub retains editorial freedom in selecting and preparing the content.





Lucy Smith, Managing Editor for AIhub.










©2024 - Association for the Understanding of Artificial Intelligence


 












©2021 - ROBOTS Association