New voices in AI: NLP for low resource languages with David Adelani

Masakhane

by Joe Daly

26 January 2022

Welcome to the first episode of New voices in AI!

In this episode Joe Daly interviews David Adelani about his work on natural language processing with low resource languages, his work with Masakhane, and his journey to working in AI.

You can find David on Twitter @davlanade and find out more about Masakhane here.

The music used is ‘Wholesome’ by Kevin MacLeod, licensed under Creative Commons.

transcript

Daly: Hello and welcome to New voices in AI, this is a new series from AIhub where we celebrate the voices of PhD students, early career researchers, and those with a new perspective on AI. I’m Joe Daly, engagement manager for AIhub. And without further ado, let’s begin.
First up, a big welcome to our very first guest on “New voices in AI” and if you could introduce yourself, who are you? Where are you from?

Adelani: Thank you very much for having me. I’m David Adelani or David Ifeoluwa Adelani. I’m originally from Nigeria. And I’m a PhD student at the Department of Language Science and Technology at Saarland University, Saarbrücken, in Germany. So that’s where I’m from.
I’m also an active member of Masakhane. So, Masakhane is this grassroots organization whose mission is to strengthen and spur NLP research in African languages, by Africans for Africans. Currently the organization is mostly operating on Slack, and we already have over 1000 members. Of course, not everyone is active, but we have more than 100 or close to 100 active members as well, yeah.

Daly: That’s great. So how did, how did you get into AI?

Adelani: Oh yeah, it’s a long story, because initially I was interested in image processing when I was coming to the grad school. But after taking the NLP course at my university, natural language processing, and also taking in parallel a course in social media analysis, I wanted to do like NLP for computational social science.
But while working on that, I think that was in 2018, I got attracted to one peculiar problem in Yoruba language, which is about their diacritics.
So Yoruba is a tonal language and the tones are marked on top of the words, so they have diacritics. And that got me interested in how we can automate the process of adding diacritics to the text. And why is this important? Because if you do like speech recognition for this language, if the tones are not captured properly, then it doesn’t work properly, and also for machine translation, when you want to translate into these languages, there are a lot of ambiguities. So this is what really got me interested and I was like OK, this looks like a very simple problem, you have a sentence without diacritics, another sentence with diacritics, this is like a machine translation problem.
So we could use a sequence to sequence model. Although someone was faster than me (haha) and had already published this idea at Interspeech, and we later collaborated to make it better and we submitted this to the AfricaNLP workshop in 2020. So this was really what got me interested in it. And I forgot to mention, the Yoruba language is spoken in the South Western part of Nigeria and is native to the Benin Republic, which is also a border country with Nigeria, but it’s also widely spoken in West Africa and also the diaspora. We have lots of people in the diaspora, especially in the US and UK, that probably speak this language. Yeah, it was a long story (haha).
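
[Editor’s sketch: the framing described here, where a sentence without diacritics is the input and the same sentence with diacritics is the target, can be illustrated with a few lines of Python. This is only a minimal sketch of how such training pairs might be built, not the method from the Interspeech or AfricaNLP papers. The function names `strip_diacritics` and `make_training_pairs` are hypothetical, and the sketch assumes that removing every Unicode combining mark (tone marks and underdots alike) is an acceptable stand-in for undiacritized text.]

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove all combining marks (tone marks and underdots) from text."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents and dots.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


def make_training_pairs(sentences):
    """Build (source, target) pairs for a sequence to sequence model.

    The source is the sentence with diacritics removed and the target is
    the original sentence, mirroring a translation-style setup.
    """
    return [(strip_diacritics(s), s) for s in sentences]


if __name__ == "__main__":
    # Hypothetical one-item corpus, used purely for illustration.
    for source, target in make_training_pairs(["èdè Yorùbá"]):
        print(source, "->", target)
```

[A model trained on pairs like these would learn the reverse mapping, from plain text back to diacritized text.]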

Daly: (haha) Everyone, everyone has their own way to get into AI, so it’s never a straightforward story, is it? So what about beyond sort of AI? Do you have any interesting non AI related facts about you?

Adelani: Yeah definitely yeah. I love watching animal documentaries. I think one of my favorites is beavers, that animal that constructs dams. Ah, I think I’m from Africa, and most of the animals that we have there, like lions or crocodiles, are just powerful animals. But when I came across this documentary on beavers, it was quite interesting, we don’t have it there, but this impressive animal, the way they construct dams and their intelligence really got me thinking, yeah, this is a very interesting animal. Well, I mean probably like this documentary.

Daly: Yeah, and so a lot of engineering skill goes into a beaver’s dam.

Adelani: Yeah, exactly. Right?

Daly: So if you could make an AI to do anything at all, what would you make and why?

Adelani: Uhm, if I can make AI to do anything, I would say it should try to understand all world languages. OK, and this is really difficult. Yeah, but even if it can understand a language with few examples.
So there are a lot of languages that the only corpora they have probably is the Bible. Just because this is probably the most translated piece of text that we have. And even with 1000 sentences, if an AI can understand, so I’m talking more in the context of NLP because this is my background. And if AI can understand more languages with few examples this will be super amazing. We can solve a lot of problems here.

Daly: That would be, it would be a very cool thing to make for sure. And what’s something that really excites you about AI?

Adelani: Ah so in recent times, I think the ability for you to solve complex problems, you know. There are some AI models that can solve some math problems, just like a person. And also when you think about large pre-trained models, like GPT-3, they just need a few examples, even something like one to three examples, and they could solve sentiment analysis tasks. It really shows the power of artificial intelligence. Maybe they don’t understand everything, yeah, but, I think there’s an ongoing debate to really understand what they are doing, but at least there is a clear trend of improvement in how they can solve downstream NLP tasks. It’s really interesting, it’s one of the best things to work on now.

Daly: Yeah, there’s so many kind of possibilities with the technology that’s coming up. And kind of speaking of challenges, what do you think will be some of the biggest changes in AI in the next 5 to 10 years?

Adelani: Yeah, I think this is a difficult question, but I would say I would just talk in the context of NLP. I think one of the challenges nowadays is how do we make these AI models to be more multilingual?
They are already multilingual but we just want them to be able to support more languages. Or maybe we have a simpler technique for them to quickly adapt to a new language. And this would be one challenge because the problems we have in English also exist in other languages, right? So we’re not just trying to solve problems for English or other high resourced languages like Chinese and Spanish. And we have some difficult tasks, not even difficult, or simple tasks. OK, maybe difficult or simple is not the right word to use, but we have some tasks that are important to the speakers of the language.
For example, if you want to do hate speech detection for Tigrinya, maybe you could do it quickly for the English language because you have a good language model and you could fine-tune it on that task, but for Tigrinya, you do not have these resources. So the question is: how can we make AI models able to adapt quickly to a new language with a different script from a different language family?
Apart from hate speech detection, there are also other important AI applications like machine translation. We want to translate from Wolof to French. OK French is also spoken in Senegal, where Wolof is spoken. But another question is like can you do this with few examples. And because maybe you don’t have a lot of examples from Wolof or parallel texts. And the more challenging question is if you could translate in between low resource languages.
So can you translate from Wolof to Swahili, from Yoruba to Swahili, from Yoruba to Zulu and most of these models are more English centric, you’re always translating to and from English. Now we cannot really or we are not really sure of the capabilities of these AI models on other languages until you test them or try to work on the languages. So I think these are more challenges just to make this work for more people so that we have like NLP or a multilingual NLP for everyone or that covers more languages. I think this should be definitely challenging.

Daly: Yeah, it’s definitely. There’s so so many languages out there which is like a beautiful thing and it would be really great to actually be able to do stuff with all of them.

Adelani: Yeah, it’s a big challenge.

Daly: For sure! And so what kind of questions come when you say to people that you work in AI or are working on NLP? What kind of questions do you usually get from people not involved in those areas?

Adelani: I think one of the questions that people ask me, maybe because of where I am coming from, there are a lot of economic challenges for example in Nigeria. And people ask, you do a lot of stuff like this and how does this help us? So if I say I trained a machine translation model on my system and I’m not able to deploy it or get some economic benefits from this, to people it’s less important. So one of the questions I get asked is “oh what you’re working on, can it help to create jobs?”
And the answer is yes. If one can create a start-up around this research, it could actually help a lot of developments for these regions where these languages are spoken. Uh, maybe that’s not my focus for now, because I’m still doing research, but I think it’s something that’s important to many people, especially people that do not find it fun to just train models on the command line. They want to see how it benefits.

Daly: It kind of makes sense that people want to know it’s like what is, how is this going to help?

Adelani: Yeah.

Daly: So what do you think are some of the biggest challenges and/or opportunities in AI?

Adelani: Oh, so again, I’m just talking more about NLP, so this is not general, but I think a challenge is multilingual NLP. It is a challenge and also an opportunity because it has helped communities to work on this problem. And when I was introducing myself, I told you about being a member of Masakhane. It became an opportunity for them: OK, this has not covered their language, can we make AI models to really adapt to African languages or more languages? And people are passionate about this, so eventually it looks like an opportunity for people to really work and also develop skill and capacity to do things just by working on these problems, so it’s a challenge and an opportunity at the same time, definitely.

Daly: Absolutely, and so who have been your kind of biggest inspirations in the field?

Adelani: Ah yeah, there are many people (ha). I think one of my biggest inspirations, I would say, is my supervisor (Dietrich Klakow), because I got interested in NLP through his NLP course. And of course, I also did a seminar with the same group and I read this paper on word2vec by Tomáš Mikolov, and it’s about word representations. And I really found that idea very novel and interesting, and I tried to understand it, which took me some time. But after that I think I’m super interested in this kind of representation learning and other NLP tasks, and of course generally I do respect like a few researchers. For example, Graham Neubig and Sebastian Ruder, whose blogs help to simplify some of these technical terms and papers, and other people in the field.

Daly: There’s like so many, so many great people out there to learn and be inspired by. So we’ve kind of talked a little bit about kind of your research already, and this is like, uh, as a brief kind of summary, what are you working on currently?

Adelani: Well, yeah, we’re working on more transfer learning for named entity recognition for African languages. So just trying to understand why does it work? When does it work? How to use it? Also adapt it at the same time, it’s a very popular concept. Now I’m like we really understand how to use it very well. Other things I’m working on, which is also related to few-shot learning for named entity recognition, and I’m also thinking of how to adapt this sort of task like machine translation.
Machine translation initially didn’t interest me. Why? Because it requires a lot of examples. And we spend hours and hours training this model, even days or weeks. But now I think with this new model and pretrained models, what I’m working on is can we do this with a few examples? If so, how many examples do we need? Yeah, so this is more, just kind of analysis work.
Oh yeah, and of course, the other thing I’m pretty interested in is the creation of NLP corpora that we can use for African languages. I’m working more on African languages at the moment. So, we don’t have part-of-speech data sets for many African languages; can we create this? Some of these efforts have been funded by organizations like Lacuna Fund, where we are creating parallel texts for machine translation and we are annotating data sets for parts of speech and named entity recognition. And this is for around 20 languages, and if we can have this, this will be super impressive. Yeah, of course this is only possible because we could collaborate with organizations like Masakhane where we have speakers of each language or connections to people that can help us to develop this.

Daly: Yeah, definitely it sounds like having that community is sort of really helpful for your work as well with that Masakhane.
Again, we’ve kind of touched on this a little bit, could you tell us a little bit about the implications of your research and why it’s an interesting area for study?

Adelani: So, I think one reason why I like this area of research is because it’s very practical, right? So I guess this is the kind of problem we face in industry and business enterprise where we would want to save costs and to do things with few examples because annotation is costly. Right, so this is why it’s interesting, and especially when annotators are not even accessible. So if you are a company in Europe and you only have a few examples from a language, maybe in Cameroon, and you need access to the speakers, how would you do this? Maybe there’s some conflict or war happening in that region, how would you do something? You cannot go back and say, now I want to get annotators. The question is with the few examples you have, what can you do with it? And yeah, this is a question about how a model can adapt to several tasks in a low resource setting. It could be to a new language, a new task or a new domain of interest. It could be, you have a review domain and you want to adapt to a Twitter domain and there is the difference, so this kind of problem is really interesting to me.

Daly: And what kind of excites you most about your work?

Adelani: I think what excites me the most is that I could work on my mother tongue and alongside I could also work on other African languages that have been neglected for so long. And they have received less research. And of course there are many factors for this, partially due to the colonization in the past on the continent, also partially due to the speakers of the languages, who have not done a lot for their language, and also partially due to maybe less government funding to support research.
So the problem is not one, it’s just multifaceted. And depending how we look at it, there are many factors and this is what excites me working on African languages and on my mother tongue. And also, our recent work on creating benchmark data sets. Uh, I think I’m very happy if I see a publication that uses our data set. Because usually most publications, they tend to use well established, baseline corpora. And now if they evaluate on African languages, this really makes me happy.

Daly: Yeah, it’s yeah, it’s definitely super important that there’s like, no, no language left behind you know.

Adelani: Exactly.

Daly: So what would you like to work on next?

Adelani: Ah yeah, I think I want to spend more time on speech. I’ve done a lot of work on text, so like named entity recognition and machine translation, but speech has also been neglected a lot, and one big advantage of speech is that it could reach people of different educational status; some people are not educated and the only way they could communicate is using their voice. And you cannot build AI applications for them if you don’t have tools and models that understand voice in those languages. I think this is an interesting area and one I will definitely want to work on. And of course, just continue my research on few-shot learning. Thank you.

Daly: You can imagine that’s probably quite a challenge going from sort of text based to speech based language. So how do you hope your research could be used in the future?

Adelani: So I think my research, because it also involves some corpora collection, I think the corpora we create with Masakhane will spur a lot of NLP research in African languages in many tasks. And I talked about named entity recognition, parts of speech, machine translation and other areas.
And of course, we are also developing techniques alongside that work better in the low resource setting. Because we evaluate a lot in this setting, all these techniques can also be applied in different scenarios and on different languages. If we work on African languages, these techniques can also be used on other languages like South East Asian languages. So these techniques are a little bit more general, so maybe they do not really depend on the language. And I think it would be great if other people can use these techniques.

Daly: Absolutely! So, what has been your most memorable research experience so far?

Adelani: I think the most memorable one would be our last paper on Masakhane, which is this creation of the first large scale named entity recognition data set for about 10 African languages, and this is the first time that we have it. And it involves a lot of collaboration and I’m really happy with what came out of this collaboration. Now some people from the African continent and people around the world in Europe and America and Asia, and it looks like almost all regions are represented. So yeah, I think it’s a very wonderful experience.

Daly: It’s a big group effort going into making that.
And what has been one of your most unusual sources of inspiration for your work?

Adelani: Uh, yeah, this is a bit tricky. I would say God because there’s this desire to just have positive impacts on people. And also a desire to use the time one has to make an impact, right? One of my favourite quotes would be in the Bible: “Whatever you find to do with your hands, do it with all your might”, because you just have a short time to live here so you better make use of it and try to do it well, and that’s with all your might. And it really encourages me, even when I’m tired and I have to do a lot of things, just to get going and take it out.

Daly: That’s a, that’s a really beautiful quote, but definitely I think words to live by.
So how has, uh, how has your previous or your general kind of life experience influenced your work?

Adelani: So previous life experience, how did it influence my work. Well, one thing is where I grew up, it’s in a multilingual society. For example, people speak Yoruba and English together. For example, whether in the church or mosque or wherever you go to. So there’s always a need for interpretation, because in Nigeria, it is a multilingual society, so if someone is preaching or there’s a broadcast by governments, it has to be translated to other languages that people can understand. So, suddenly there’s this importance of being able to understand other languages and being able to have translation, whether speech translation or text translation, to get things done in a multilingual society, and also you have people that are less educated that you have to communicate to. So I used to have someone that came from another region of Nigeria and came to live with us. He doesn’t speak Yoruba, he only speaks English, his mother tongue.
So the question is how can he communicate, so he had to learn Yoruba, and sometimes you cannot communicate with some people and so there’s a need to translate. And of course, this also really influenced a lot of my work, in NLP and especially in machine translation. And also providing tools for low resource languages.
And of course, some of my experience at Saarland University has helped because I was working on a different project before, more on privacy, and this is where I got introduced to some other tasks like named entity recognition. And when I was working on African languages I just suddenly discovered we don’t even have this kind of corpora for African languages, and this is one of the most used corpora for evaluation. Of course, this was one of the motivations for creation of this data set.

Daly: And very finally, where can people find out more? Where can we find you online?

Adelani: Yeah, I’m not sure. Yeah, so on Twitter you can find me, @davlanade.

Daly: We will also have transcriptions and links to social media on the site as well.
Well, the final question would be also because you are our very first interview we don’t have a question from a previous person, but I do have a question from one of my colleagues which actually links weirdly nicely to you know your work, which is if you could wake up tomorrow and speak another language, what would you like to be able to speak?

Adelani: I think if I can wake up tomorrow and speak another language, I want to speak German because I’m in Germany. Communication can be difficult. If I can just wake up and start speaking German, that would be super cool, I can go anywhere, and display my talent of speaking German.

Daly: Make day-to-day a bit easier.

Adelani: Yeah, yeah.

Daly: And do you have a potential question for future, the next person to be interviewed?

Adelani: Hmmm yeah, this is a bit of a tricky one. Maybe what are the challenges they faced in their work? Trying to do this and whatever they need to do and what are the challenges they face, and maybe how did they overcome, maybe to resolve some of these challenges before they’re able to have this final result would be an interesting question.

Daly: Nice, that’s a really great question. Thank you so much. And yeah, thank you. Thank you again so much, uh David, for your time, time with us today. That was absolutely brilliant.

And finally I would like to thank you for listening to us today, join us for the next episode where we talk to Isabel Cachola about her research. If you would like to find out more about the series, do check us out at AIhub.org and goodbye for now.
