AIhub.org
 

Interview with Leanne Nortje: Visually-grounded few-shot word learning


05 July 2023




In their work Visually grounded few-shot word learning in low-resource settings, Leanne Nortje, Dan Oneata and Herman Kamper propose a visually-grounded speech model that learns new words and their visual depictions. In this interview, Leanne tells us more about their methodology and how it could be beneficial for low-resource languages.

What is the topic of the research in your paper?

We look into using vision as a way of weakly transcribing audio. This is particularly helpful for low-resource languages, some of which, in extreme cases, have no written form. We specifically consider the task of retrieving relevant images for a given spoken word by learning from only a few image-word pairs, i.e. multimodal few-shot word learning. The aim is to give a model a set of spoken word examples where each word is paired with a corresponding image. Each paired word-image example represents a novel (new) class. After using only these examples to learn the few-shot word classes, the model is given another set of images: a matching set containing one image for each class. When we query the model with a spoken instance of one of these novel classes, the model should identify which image in the matching set depicts the word. For instance, imagine showing a robot images of different objects (zebra, kite, sheep, etc.) while saying the word for each picture. After seeing this small set of examples, we ask the robot to find a new image corresponding to the word “zebra”.
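To make the matching step concrete, here is a minimal sketch of the retrieval task described above. It is an illustration only, not the authors' model: it assumes the spoken query and each image have already been mapped to embedding vectors by some trained encoders, and the function names (`cosine`, `match_query`) and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_query(query_embedding, matching_set):
    """Return the label of the matching-set image most similar to the spoken query.

    matching_set maps each novel class label to the embedding of its one image.
    """
    return max(matching_set, key=lambda label: cosine(query_embedding, matching_set[label]))

# Made-up embeddings standing in for the outputs of trained audio and vision encoders.
matching_set = {
    "zebra": [0.9, 0.1, 0.0],
    "kite": [0.1, 0.8, 0.2],
    "sheep": [0.0, 0.2, 0.9],
}
spoken_zebra = [0.8, 0.2, 0.1]  # hypothetical embedding of a spoken "zebra"

print(match_query(spoken_zebra, matching_set))
```

With these toy vectors the spoken query lies closest to the zebra image, so that image is the one retrieved.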

Could you tell us about the implications of your research and why it is an interesting area for study?

Our research has two main impacts. The first is the development of speech systems that cater to low-resource languages. Current speech systems are trained on large corpora of transcribed speech, which are expensive and time-consuming to collect. This research aims to develop techniques that enable researchers to train speech systems from very few labelled examples. Second, these models are inspired by how children learn languages, so we can probe them to gain insight into the cognition and learning dynamics of children.

[Image: one audio wave above five images]

Could you explain your methodology?

Intuitively, only a few examples per word class, e.g. five, will not be sufficient to train a model capable of identifying the visual depictions of a spoken word. Therefore, we use the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images. In terms of architecture, we use a vision branch and an audio branch, which are connected by a word-to-image attention mechanism that determines the similarity between a spoken word and an image.
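The two ideas in this answer, pair mining and attention-based similarity, can be sketched roughly as follows. This is one plausible reading under stated assumptions, not the authors' implementation: the interview does not say how candidates are ranked or how attention scores are pooled, so the dot-product ranking, the softmax pooling, and all names (`top_n_similar`, `mine_pairs`, `attention_similarity`) and toy vectors are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_n_similar(example, pool, n):
    """Ids of the n pool items whose embeddings score highest against the example."""
    ranked = sorted(pool, key=lambda item_id: dot(example, pool[item_id]), reverse=True)
    return ranked[:n]

def mine_pairs(support_audio, support_image, unlabelled_audio, unlabelled_images, n):
    """Assumed mining scheme: pair the n utterances closest to the support word
    with the n images closest to the support image."""
    audio_ids = top_n_similar(support_audio, unlabelled_audio, n)
    image_ids = top_n_similar(support_image, unlabelled_images, n)
    return list(zip(audio_ids, image_ids))

def attention_similarity(word_embedding, region_embeddings):
    """Score a word against an image given as a list of region embeddings.

    Each region gets a dot-product score; softmax weights over the regions
    then pool those scores into a single similarity (an assumed pooling choice).
    """
    scores = [dot(word_embedding, region) for region in region_embeddings]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return sum((e / total) * s for e, s in zip(exps, scores))

# Made-up embeddings for illustration only.
unlabelled_audio = {"utt1": [1.0, 0.0], "utt2": [0.0, 1.0], "utt3": [0.9, 0.1]}
unlabelled_images = {"img1": [0.1, 0.9], "img2": [0.8, 0.2], "img3": [1.0, 0.1]}
pairs = mine_pairs([1.0, 0.0], [1.0, 0.0], unlabelled_audio, unlabelled_images, n=2)
print(pairs)

word = [1.0, 0.0]
image_with_object = [[0.9, 0.1], [0.0, 1.0]]     # one region resembles the word
image_without_object = [[0.1, 0.9], [0.0, 1.0]]  # no region resembles the word
print(attention_similarity(word, image_with_object) > attention_similarity(word, image_without_object))
```

The mined pairs would then serve as extra training data alongside the few labelled examples, while the attention score lets a single matching region dominate the word-image similarity.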

What were your main findings?

For the scenario with fewer shots, where we have only a small number of examples per class to learn from, we outperform any existing approach. We see that the mined word-image pairs are essential to our performance boost. For retrieving multiple images containing the visual depiction of a spoken word, we get consistent scores across varying numbers of examples per class.

What further work are you planning in this area?

For future work, we are planning to extend the number of novel classes we can learn using this approach. We also plan on applying this model to an actual low-resource language: Yoruba.

About Leanne Nortje


I am currently doing a PhD which combines speech processing and computer vision in weakly supervised settings by using small amounts of labelled data. The inspiration behind my models is how efficiently children learn language from very few examples. If systems can learn as rapidly, we could develop less data-dependent systems.

In 2018 I received my BEng degree in Electrical and Electronic Engineering cum laude from Stellenbosch University. Thereafter, I did my MEng degree in Electronic Engineering from 2019 to 2020. I passed my master’s cum laude and received the Rector’s Award for top master’s student in Engineering.





Lucy Smith is Senior Managing Editor for AIhub.










 







 












©2025.05 - Association for the Understanding of Artificial Intelligence