Lanfrica is an online resource centre that catalogues, archives and links African language resources. These resources include research papers, datasets, projects, software and models that have to do with one or more African languages.
The team behind Lanfrica is Chris Emezue, Handel Emezue, and Bonaventure Dossou, with contributions from Daria Yasafova. We caught up with them to find out more about the project, what inspired them to begin, and the potential that Lanfrica offers the AI community and beyond.
Chris: The inspiration came while Bona and I were working as undergraduate students. We joined the Africa NLP [natural language processing] community, and we discovered that most NLP research focuses on English and other high-resource European languages. There isn’t as much research on African languages, and that was one inspiration. The second is that people are making efforts to work on African languages, but their work isn’t discoverable and it’s very hard to find this research. For example, if you want to work on a language like Fon, it’s hard to find what research or datasets already exist for that language. Just typing into Google will not give you much, because a lot of this work is buried somewhere on GitHub and you will never find it easily. We thought that if we could create a network connecting these African languages with all the research and datasets, it would be easier for researchers and other people to find out what’s happening for these languages, and that would spur progress in African languages.
Chris: The idea was born in 2020, and we teamed up with Mr Handel in October 2021. He helped lay the groundwork for the backend and frontend, Daria Yasafova helped integrate some open-source datasets into Lanfrica, while Bona and I focused on trying to put together this idea of linking.

What do you mean when you say you are linking, and how does it work? How do you make it better?
Chris: So, there are organisations that are dedicated to sharing works in African languages. The first thing we did was to look at the repositories of these organisations. First, we needed a way to get the metadata of these resources. The key fact is: we’re not hosting, we are linking. So the actual data, the actual resources, remain on the original site; we are just getting the metadata. We then wrote some code that can crawl and identify, to some level of accuracy, the African language that a resource is in. The challenging part of this is that there are more than 2,000 African languages, and we had the ambitious goal of covering all of them – currently we have 2,189 African languages, including all the different variations of their names. This is very challenging. Using a simple text-based search you get a lot of false positives, so we had to write more sophisticated algorithms that can identify the correct African language. Once we’ve identified the language, we take the metadata, put it on Lanfrica, and connect the link to the original repository where the resource is hosted.
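To make the false-positive problem concrete, here is a minimal sketch of word-boundary matching against a small, hypothetical list of language names and aliases. It is not Lanfrica’s actual algorithm – the real catalogue covers 2,189 languages and their name variants, and the matching is more sophisticated – but it shows why a plain substring search is not enough.

```python
import re

# Hypothetical subset of a language catalogue: each canonical name maps to
# known spelling variants. The real Lanfrica catalogue covers 2,189 languages
# and their name variations.
LANGUAGE_ALIASES = {
    "Fon": ["Fon", "Fongbe", "Fon-Gbe"],
    "Kiswahili": ["Kiswahili", "Swahili"],
    "Yoruba": ["Yoruba", "Yorùbá"],
}

def detect_languages(metadata_text: str) -> list[str]:
    """Return canonical languages whose names appear in a resource's metadata."""
    found = set()
    for canonical, aliases in LANGUAGE_ALIASES.items():
        for alias in aliases:
            # Word boundaries stop short names matching inside longer words,
            # e.g. 'Fon' inside 'font' -- a classic substring false positive.
            if re.search(rf"\b{re.escape(alias)}\b", metadata_text, re.IGNORECASE):
                found.add(canonical)
                break
    return sorted(found)

print(detect_languages("A parallel text dataset for Fongbe and Swahili"))
# ['Fon', 'Kiswahili']
print(detect_languages("Choosing a font for conference posters"))
# []
```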
The second thing we needed to consider is this idea of connecting. Imagine you have a social network of things connected to each other; in this case it’s resources connected to African languages. Now, African languages are interesting because, while there are 2,000-plus of them, they also share some properties and connections. We’re not there yet, but one interesting part of the project was to design a good foundation for different types of connections, so that in the future, when new connections come up, we can easily add them. For example, just yesterday we talked with someone who wanted to share some language assessment resources. This is completely different from research papers and datasets, but this company is dedicated to providing language assessment tests for some African languages. We now have a new type of record (a language assessment record) that we didn’t plan for before. By designing a good foundation for these connections it’s easy to add this record, even though it wasn’t in the initial scope. These were the two main things that we had to work on.
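As a rough illustration of that kind of extensible foundation (the class names and fields below are hypothetical, not Lanfrica’s actual schema), a record can carry an open-ended type, so a kind of resource that was never planned for, such as a language assessment, fits into the same language-record structure:

```python
from dataclasses import dataclass, field

# Hypothetical sketch, not Lanfrica's actual schema: a record carries an
# open-ended type string, so new kinds of resources can be connected to a
# language without redesigning the model.

@dataclass
class Record:
    title: str
    record_type: str   # "dataset", "paper", "software", "language assessment", ...
    source_url: str    # the original host -- Lanfrica links to resources, it does not host them

@dataclass
class Language:
    name: str
    records: list[Record] = field(default_factory=list)

fon = Language("Fon")
fon.records.append(Record("Example Fon text dataset", "dataset",
                          "https://example.org/fon-dataset"))
# A record type that was never planned for still fits the same structure:
fon.records.append(Record("Example Fon proficiency test", "language assessment",
                          "https://example.org/fon-assessment"))
```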
Handel: To add a bit to what Chris said: usually, as researchers, we intend to solve a problem. The problem we saw was that there was limited access to African datasets. What we’re trying to do is related to what the internet does. The internet links different computers, different systems, and through it you can access a wide range of information. So we are trying to have a kind of internet system for African languages. The way we can achieve that, since we have to link all these datasets, is to partner with the people who provide them – GitHub, companies, individuals. We aim to form a wide range of partnerships; through partnerships we can discover people, datasets and resources, and link them. The next stage is having some kind of search engine. We feel that if we have a search engine, a kind of internet for African languages, it would rapidly promote research in AfricaNLP. We think it could be a breakthrough if we’re able to get sponsorship and get more people to know what we do.
Handel: It’s been great, very supportive.
Chris: It’s been surreal for me, actually, seeing the NLP community be so welcoming of it; it really showed that this is something that was needed. When we first launched the website and posted it on Twitter, we had lots of people liking, following and retweeting. We also have a Slack community, and we had people joining from different parts of the world.
One feature of Lanfrica is that people can add a resource. Some people actually tried it out and then reached out to me. We got some feedback from researchers, and one of them said that the way to add a resource was really simple. That’s one thing we purposefully did – make it very easy to link a resource. We also got a lot of feedback from the Masakhane community. I reached out personally to some specific organisations, and some replied really praising the initiative and trying to map out ways that we could collaborate.
Chris: Personally, it has helped me. Now, when I need to do some research about a language and I’m looking for, let’s say, text-to-speech data on Kiswahili, I know I can go to Lanfrica to get an overview of the available resources. That’s how it has helped me personally.
Handel: And for me, I didn’t know there were that many languages in Africa, even though I’m in it (Africa). It’s made me look at the possibilities for doing research, and at Lanfrica as a way to potentially grow a business. If I had a business in the US and could extend it to Africa by being able to deal with different African languages, that would be really beneficial. So, research-wise and company-wise, I think the language datasets from this system, which we ourselves have made, could have a big impact. It has been very exciting.
Bonaventure: My experience is a mix of both of the above. I learned about some languages that I didn’t know existed. For example, Chris and I were working on machine translation, and I discovered languages that were completely new to me. Also, it often happens that people will ask “do you have a dataset for this or this?”, and I can say “just go and check on Lanfrica”. Recently, some people were looking for a speech dataset and someone put them in contact with us, and we were able to link them to the right platform and dataset. So, as well as helping me to learn about new languages, it’s a time saver, as you don’t have to go through the whole internet – there’s just one place you need to try.
Chris: For the first stage, we focused on datasets. The next stage is to focus on papers. Then we need to connect these things. So, a paper that uses or introduces a dataset – how do we connect those? I’m trying to work on the connections involved, and generally on making the whole connection algorithm stronger: for example, the connections between papers and datasets, and deduplication. Because anyone can add datasets, we need to work on deduplication. We are also working on tests to verify that the links people have added actually work. These are the things that I am working on on the database side.
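As a rough sketch of what those two maintenance checks could look like (the function names and the exact rules are hypothetical, not Lanfrica’s implementation), user-submitted URLs can be normalised so duplicates collapse to one key, and each stored link can be probed to confirm it still resolves:

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request, urlopen

def normalise_url(url: str) -> str:
    """Crude normalisation so user-submitted duplicates collapse to one key."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/"), parts.query, ""))

def is_duplicate(url: str, existing: set[str]) -> bool:
    """True if a newly submitted URL is already in the catalogue."""
    return normalise_url(url) in existing

def link_is_alive(url: str, timeout: float = 10.0) -> bool:
    """Probe a stored link to confirm it still resolves (a HEAD request is enough)."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status < 400
    except OSError:   # covers URLError, HTTPError and timeouts
        return False

existing = {normalise_url("https://example.org/fon-dataset/")}
print(is_duplicate("HTTPS://example.org/fon-dataset", existing))  # True
```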
Handel: When we released version 1.0, we received a lot of comments and reviews. There are a lot of bugs that we need to fix, so we have bug fixing, and then we also have some enhancements.
We have a backend system that follows links, but we want to give users the opportunity to add datasets themselves. We plan to have something where a person who wants to add a dataset would sign up for an account, upload an entry, and be able to take ownership of that entry. This is a feature we want to release. We also want to make it very easy for people to look for a resource – we want to make it like a search engine – so we need to optimise to make it fast. We plan to move to a cloud server for speed, and we’re looking at expanding the database to handle the multiple connections.
There are two things we need. We need the public to know about our domain. We also need funding and collaboration. Up to now all the funding for this has come out of our own pockets, because we are driven and this project means so much to us that we wanted to put up the finances for it ourselves. As we expand and release version 2.0 we’ll need people to come and help. We’re looking at version 3.0, version 4.0, version 5.0, version 6.0, etc. We’ll keep pushing, as much as we can. We need people to know about us.
Bonaventure: We need to focus on finding papers, or compilations of papers, and extracting them – for example, from the ACL Anthology – so we can add those records to Lanfrica.
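As a rough sketch of that kind of extraction (it assumes a locally downloaded BibTeX export of the Anthology; the filename and the short language list are hypothetical), one could scan entry titles for African language names to produce candidate records for review:

```python
import re

# Hypothetical sketch: scan a locally downloaded BibTeX export of the ACL
# Anthology (the filename "anthology.bib" is an assumption) for paper titles
# that mention African language names, producing candidate records to review
# before adding them to Lanfrica. The parsing is deliberately crude and
# ignores nested braces inside titles.
LANGUAGE_NAMES = ["Fon", "Kiswahili", "Swahili", "Yoruba", "Igbo", "Hausa"]
TITLE_RE = re.compile(r'title\s*=\s*[{"]([^}"]+)[}"]', re.IGNORECASE)

def candidate_titles(bib_path: str) -> list[tuple[str, str]]:
    """Return (language, title) pairs for titles that mention a known language."""
    hits = []
    with open(bib_path, encoding="utf-8") as bib_file:
        for title in TITLE_RE.findall(bib_file.read()):
            for lang in LANGUAGE_NAMES:
                if re.search(rf"\b{re.escape(lang)}\b", title, re.IGNORECASE):
                    hits.append((lang, title))
    return hits

# for lang, title in candidate_titles("anthology.bib"):
#     print(f"{lang}: {title}")
```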
As we had to pay for everything from our own pockets, it would have been helpful if there had been some grants available, so we didn’t have to worry about the financial side, or about how expensive or crazy it was going to be. Funding will be a very important part of putting Lanfrica on the bigger scale that we are all aspiring to.
Chris: We’d definitely like to issue a call to the community to help spread the word. Also, if you have a resource – a paper, a dataset, a model, software – that focuses on one or more African languages, please consider linking it to Lanfrica. Lanfrica actually uses a community approach to linking African language resources. That’s why we’re trying to work on the user interface so that users can sign up, log in, and add records easily.
Chris Emezue is the Founder of Lanfrica. He is a Master’s student at the Technical University of Munich, studying Mathematics in Data Science. He is dedicated to (and has worked extensively on) natural language processing for African languages (for example MMTAfrica, OkwuGbe). He has worked as a natural language processing (NLP) researcher at Siemens AI Lab, LMU, Mila-Quebec AI Institute and HuggingFace.
Handel Emezue is a co-founder of Lanfrica. He is a Software Engineer with over eight years of professional software development experience and a researcher at the Department of Electrical/Electronics Engineering, Alex Ekwueme Federal University, Ndufu-Alike, Nigeria. His research interests include embedded systems, natural language processing (NLP), machine learning, data science and the Internet of Things (smart devices).
Bonaventure Dossou is a co-founder of Lanfrica. He is a second-year Master’s student at Jacobs University Bremen, studying Data Engineering. He has worked (and continues to work) on NLP technologies for African languages (e.g. FFRTranslate, MMTAfrica, OkwuGbe). He is an NLP researcher at Mila-Quebec AI Institute, Roche Canada, and Google AI.