ΑΙhub.org
 

Natural Language Processing for low-resource languages


by Lucy Smith
18 January 2023




A black keyboard at the bottom of the picture has an open book on it, with red words in labels floating on top, and a letter A balanced on top of them. The perspective makes the composition form a kind of triangle from the keyboard to the capital A. The AI filter makes it look messy, with a kind of cartoon style.
Teresa Berndtsson / Better Images of AI / Letter Word Text Taxonomy / Licenced by CC-BY 4.0.

The majority of natural language processing (NLP) datasets and research at present focus on a small number of high-resource languages, with studies on English dominating the field. Clearly, such an imbalance is undesirable, putting those who do not use English at a disadvantage.

In this article, we highlight some of the work and initiatives being carried out on low-resource languages.

Lanfrica

Africa is one of the most linguistically diverse regions in the world. Despite this, African languages are barely represented in technology and research. Lanfrica aims to mitigate the difficulty encountered in the discovery of African language resources by creating a centralised hub. The team at Lanfrica have built a language-focused search engine that makes it fast and easy to find information on the internet about resources relating to African languages. Now with more than 1000 resources, their aim is to catalogue and connect all African language resources, one record at a time.

As well as this platform, Lanfrica also hosts regular online talks where you can hear from researchers in the field. This talk series provides a platform for anyone to share/showcase their efforts (research, projects, software, applications, datasets, models, initiatives, etc.) in NLP.

Masakhane

Masakhane is a grassroots organisation whose mission is to strengthen and spur NLP research in African languages. The organisation is currently engaged in a number of projects.

Urdu

In this paper, Maaz Amjad, Sabur Butt, Hamza Imam Amjad, Alisa Zhila, Grigori Sidorov and Alexander Gelbukh outline their approach when taking part in the shared task UrduFake@FIRE2021, which centred on fake news detection in Urdu. This shared task aimed to attract and encourage researchers working in different NLP domains to address the automatic fake news detection task and help to mitigate the proliferation of fake content on the web.

The team have also looked into tweets in Urdu, in their paper Threatening Language Detection and Target Identification in Urdu Tweets.

Indian regional languages

B. S. Harish and R. Kasturi Rangan provide a comprehensive survey on Indian regional language processing, looking at tasks such as machine translation, named entity recognition, sentiment analysis and parts-of-speech tagging.

Bengali

Md. Rajib Hossain and Mohammed Moshiul Hoque study Bengali word embedding in their paper Towards Bengali Word Embedding: Corpus Creation, Intrinsic and Extrinsic Evaluations. They present three embedding techniques with different hyperparameters, implemented on a Bengali corpus which consists of 180 million words.

Indigenous languages of the Americas

Introducing QuBERT: A Large Monolingual Corpus and BERT Model for Southern Quechua, by Rodolfo Zevallos et al., introduces a large combined corpus for deep learning of Quechua. The authors also provide a public, pre-trained BERT model called QuBERT. They have tested their corpus and its corresponding BERT model on two major tasks: (1) named-entity recognition (NER) and (2) part-of-speech (POS) tagging.

In this paper you can read about the AmericasNLP 2021 shared task on open machine translation for indigenous languages of the Americas. Manuel Mager et al. report on the 214 submissions from eight teams, which focussed on 10 different languages: Asháninka, Aymara, Bribri, Guarani, Nahuatl, Otomí, Quechua, Rarámuri, Shipibo-Konibo, and Wixarika.

Axolotl: a Web Accessible Parallel Corpus for Spanish-Nahuatl, by Ximena Gutierrez-Vasques, Gerardo Sierra and Isaac Hernandez Pompa, presents a project which comprises a Spanish-Nahuatl parallel corpus and its search interface.

Gina Bustamante, Arturo Oncevay and Roberto Zariquiey introduce monolingual corpora for four indigenous and endangered languages from Peru (Shipibo-Konibo, Ashaninka, Yanesha and Yine) in their paper No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru.

Dysarthric speech recognition

Karima Kadaoui is researching how to help speech-impaired people communicate. Part of her project is to build an application to “translate” speech which may be unclear. She talks about the inspiration behind her work, and what she plans to achieve, in this video.

Sign language

Steven Kolawole created a dataset for Nigerian sign language with the help of a TV sign language broadcaster and two schools. Using this dataset, he built a sign-to-speech model for the language. You can find out more in this interview.

In their position paper, Including Signed Languages in Natural Language Processing, Kayo Yin, Amit Moryossef, Julie Hochgesang, Yoav Goldberg, and Malihe Alikhani call on the NLP community to include signed languages as a research area with high social and scientific impact. They discuss the linguistic properties of signed languages, review the limitations of current sign language processing models, and identify the open challenges to extend NLP to signed languages.

In her paper Approaches to the Anonymisation of Sign Language Corpora, Amy Isard considers the state of the art in the anonymisation of sign language corpora. She explores the motivations behind anonymisation, and details the processes which can be used to anonymise both the video and the annotations belonging to a corpus.



The AI Around the World series is supported through a donation from the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). AIhub retains editorial freedom in selecting and preparing the content.





Lucy Smith is Senior Managing Editor for AIhub.










©2024 - Association for the Understanding of Artificial Intelligence


 











