African languages are not well-represented in natural language processing (NLP). This is in large part due to a lack of resources for training models. Henok Biadglign Ademtew and Mikiyas Girma Birbo have created an Amharic, Ge’ez, and English parallel dataset to help advance research into low-resource languages. We spoke to Henok about this project, the creation of the dataset, and some of the challenges faced.
Most of the languages in Africa are very low-resourced, and not much text data is available. Ge’ez in particular is low-resource in terms of digital availability, but it is comparatively rich in written material. The language is very strongly linked to the church, and there are a lot of Ge’ez texts in churches here in Ethiopia. Almost every document you find is related in one way or another to the Ethiopian Orthodox Tewahedo Church. You may well have come across the language in some countries in Europe as well: there are BS- and MS-level Ge’ez courses at universities such as the University of Hamburg, under its Ethiopian and Eritrean Studies programme.
There were two main motivating factors for creating the dataset. Personally, I want to learn Ge’ez, but digital resources don’t exist beyond simple dictionaries. If somebody wanted to make an app (like Duolingo) for Ge’ez, they wouldn’t find the resources to create it. We wanted to bridge that gap. Secondly, we wanted to create a dataset that was outside of the church context. Why not translate news from the BBC or CNN into Ge’ez?
Something else we noticed, when we came across research papers by different groups in Ethiopia, was that most of them don’t share their datasets. We wanted to provide an open-source dataset that other researchers could use. Our dataset can serve as a foundation that people can build on, explore, and expand.
The first part involved collecting all the online resources that we could get. We extracted text in Ge’ez, Amharic, and English from different websites to create a parallel corpus. As I mentioned, this text was mostly related to the church and the Bible. We didn’t want the dataset to be just Ge’ez-English or Ge’ez-Amharic; we wanted to create all three together, so that it is Ge’ez-centric but also includes English and Amharic translations.
The second part was to hire three translators who could focus on the news domain, i.e. creating sentences that could be used to translate news articles into Ge’ez. Working with the translators, we created a new dataset of 1,000 fresh sentences. We also collected around 17,000-18,000 sentence pairs from the Bible.
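To make the structure concrete, here is a minimal sketch of what one Ge’ez-centric record in such a three-way parallel corpus could look like. The field names and the JSONL layout are our own illustration, not necessarily the exact format used in the AGE release.

```python
import json

# Hypothetical record layout for one Ge'ez-centric entry; the field names
# ("geez", "amharic", "english", "source") are illustrative only.
record = {
    "geez": "<Ge'ez sentence>",
    "amharic": "<Amharic translation>",
    "english": "<English translation>",
    "source": "bible",  # or "news" for the freshly translated sentences
}

# One record per line (JSONL) keeps the three translations aligned and makes
# it easy to derive any pair: Ge'ez-English, Ge'ez-Amharic, Amharic-English.
with open("age_parallel.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```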
One of the challenges is that the language is mainly tailored to the church. So, for example, if you want to use swear words, you won’t find them in Ge’ez texts!
The second challenge concerns international words like “economics”. In Amharic we just say “economics”, so that’s fine, but in Ge’ez you can’t use the word directly. The language is very flexible, so you could construct “economics” in Ge’ez by combining two words, but doing that requires a higher level of Ge’ez expertise than we had. We approached this by using the international word as-is. We’d be happy for a linguistics expert, or someone with a high level of Ge’ez, to build on our approach.
The third challenge relates to pronunciation and written text. English, for example, has words that are written differently but pronounced the same. Ge’ez has the same phenomenon: words that are identical in pronunciation but written differently, with completely different meanings. For example, ሰዓሊ and ሰአሊ sound the same, but one means “draw a picture” and the other “beg for us”. You need to be an expert in Ge’ez writing to distinguish them in text. A further complication is that the people who are experts in Ge’ez writing are often not very comfortable working digitally.
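To see why this matters for digital text processing, here is a tiny Python check (our own illustration) showing that the two words above, though homophonous, are distinct Unicode sequences, so any text pipeline treats them as different words:

```python
# The two homophones are different character sequences in digital text.
a, b = "ሰዓሊ", "ሰአሊ"
print(a == b)                        # False: the strings differ
print([hex(ord(ch)) for ch in a])    # ['0x1230', '0x12d3', '0x120a']
print([hex(ord(ch)) for ch in b])    # ['0x1230', '0x12a0', '0x120a']
# The middle characters, ዓ (U+12D3) and አ (U+12A0), carry the difference.
```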
Another challenge was that the translators were not as fast as we expected them to be, and we had a paper deadline approaching!
When making models you can either train them from scratch, using your own datasets, or you can take a pre-trained model and fine-tune it for your needs. For Ge’ez, training a model from scratch would not move us forward, so we decided to fine-tune an existing model, NLLB (No Language Left Behind, from Meta’s research group), and we got pretty interesting results. The full model is very big, around 54 billion parameters. Running it requires huge computing power, a minimum of four 32GB GPUs just for inference; I’m not sure we have that in our country, let alone available to us. There is a distilled version with comparable performance but only 600 million parameters. We fine-tuned that version and got BLEU scores of 12.29 and 30.66 for English-to-Ge’ez and Ge’ez-to-English, respectively, and 9.39 and 12.29 for Amharic-to-Ge’ez and Ge’ez-to-Amharic, respectively.
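For readers who want a concrete picture of this setup, below is a minimal sketch using the Hugging Face transformers library and the facebook/nllb-200-distilled-600M checkpoint. It is our illustration, not the authors’ exact training code: Ge’ez is not among NLLB-200’s original languages, so the sketch registers a new language-code token (the tag "gez_Ethi" is our own label, not an official NLLB code), and the toy data and hyperparameters are placeholders.

```python
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Ge'ez is not in NLLB-200's language list, so register a new language tag
# ("gez_Ethi" is our own label) and grow the embedding matrix to match.
# The new token's embedding starts out random and is learned during tuning.
tokenizer.add_special_tokens({"additional_special_tokens": ["gez_Ethi"]})
model.resize_token_embeddings(len(tokenizer))

# Toy stand-in for the AGE Ge'ez-English sentence pairs.
pairs = Dataset.from_dict({
    "gez": ["<Ge'ez sentence>"],
    "eng": ["<English translation>"],
})

def preprocess(batch):
    tokenizer.src_lang = "gez_Ethi"   # our newly added source tag
    tokenizer.tgt_lang = "eng_Latn"   # English, in NLLB's FLORES-200 codes
    return tokenizer(batch["gez"], text_target=batch["eng"],
                     truncation=True, max_length=128)

train_dataset = pairs.map(preprocess, batched=True,
                          remove_columns=["gez", "eng"])

args = Seq2SeqTrainingArguments(
    output_dir="nllb-geez-eng",
    per_device_train_batch_size=8,   # assumed hyperparameters
    num_train_epochs=3,
    learning_rate=2e-5,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

Generated translations could then be scored with a BLEU implementation such as sacrebleu to obtain figures comparable to those quoted above.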
We are planning to expand the dataset a bit more, and also to address the challenge of translating words that are common across the world but do not exist in Ge’ez. Our primary goal is to expand Ge’ez into areas beyond religious use cases. This will involve creating new sentences by taking existing Amharic and English news or educational content and translating it. We also hope to digitize different Ge’ez books and prayers from the Ethiopian Orthodox Tewahedo Church. To get everything ready for our translators, we’ve created a platform where they can enter their translations and where a reviewer can check that these translations are correct.
Henok Biadglign Ademtew is a Machine Learning Engineer based in Ethiopia. He works at the Ethiopian AI Institute, and is a member of several research communities, including EthioNLP, Cohere For AI, and Deep Learning Indaba.
AGE: Amharic, Ge’ez and English Parallel Dataset, Henok Biadglign Ademtew and Mikiyas Girma Birbo.
The AI Around the World series is supported through a donation from the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). AIhub retains editorial freedom in selecting and preparing the content.