
Interview with Xiang Fang: Multi-modal learning and embodied intelligence


20 January 2026




Each year, a small group of PhD students are chosen to participate in the AAAI/SIGAI Doctoral Consortium. This initiative provides an opportunity for the students to discuss and explore their research interests and career objectives in an interdisciplinary workshop together with a panel of established researchers. For the past couple of years, we’ve been meeting with some of the students to find out more about their work. In the first of our interviews with the 2026 cohort, we caught up with Xiang Fang.

Tell us a bit about your PhD – where are you studying, and what is the topic of your research?

I have been conducting my PhD research at Nanyang Technological University (NTU) in Singapore. Broadly speaking, my research focuses on multi-modal learning and embodied intelligence. I am trying to bridge the gap between how AI ‘sees’ the world (computer vision) and how it ‘understands’ language. Specifically, my PhD thesis work has centered on two critical challenges:

  • Video understanding: Enabling models to locate specific moments in video using natural language (temporal sentence grounding).
  • Robustness: Ensuring these models don’t fail when they encounter data they haven’t seen before (out-of-distribution detection).

Ultimately, I want to build AI agents that can not only watch videos but actually understand and navigate the physical world.

Could you give us an overview of the research you’ve carried out during your PhD?

I’ve structured my research into three main phases, resulting in over 40 publications in venues like CVPR, NeurIPS, and AAAI.

  • Phase 1: Efficient video understanding (the foundation)
    I started by addressing the efficiency of video analysis. For example, my work on temporal sentence grounding (published in CVPR 2023 and AAAI 2025) focused on how to quickly locate a specific event in a long video using a language query. I developed methods to align text and video features more effectively using optimal transport and graph reasoning.
  • Phase 2: Trustworthy AI (the safety layer)
    I realized that high accuracy isn’t enough if the model is fragile. I expanded my research to out-of-distribution (OOD) detection—teaching models to know when they don’t know something. My work in ICML 2025 on adaptive multi-prompt contrastive networks addressed this by helping models detect unknown classes in few-shot scenarios.
  • Phase 3: Embodied intelligence (the application)
    Most recently, I moved from passive video watching to active agents. My NeurIPS 2025 paper on vision-language navigation explores how agents can reason about their environment to navigate complex spaces, moving us closer to functional robots/agents.

Is there an aspect of your research that has been particularly interesting?

Yes, I found my work on “Turing Patterns for Multimedia: Reaction-Diffusion Multi-Modal Fusion for Language-Guided Video Moment Retrieval” (published in ACM MM 2025) particularly fascinating because it was highly interdisciplinary.

In that project, I drew inspiration from biological reaction-diffusion systems—essentially the math behind how zebras get their stripes—and applied it to AI. I used these patterns to model how different modalities (video and text) should ‘diffuse’ and fuse together.

It was intellectually satisfying because it allowed me to leverage my strong mathematical background to solve a modern computer vision problem in a completely novel, non-traditional way. It showed that we can look outside of standard deep learning paradigms to find efficient solutions.
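For context, the classic two-species reaction-diffusion system behind Turing patterns takes the following textbook form (the symbols u, v, D_u, D_v, f and g here are the standard generic ones, not necessarily the exact formulation used in the paper):

\[
\frac{\partial u}{\partial t} = D_u \nabla^{2} u + f(u, v),
\qquad
\frac{\partial v}{\partial t} = D_v \nabla^{2} v + g(u, v)
\]

Patterns such as stripes emerge when the two quantities react locally while diffusing across space at different rates; loosely speaking, the video and text features can be thought of as two interacting ‘species’ whose fusion is governed by this kind of dynamics.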

What are your plans for building on your research so far during the PhD – what aspects will you be investigating next?

Building on my PhD research, my immediate focus is on unified vision-language-action models. In my recent AAAI 2026 papers, I started exploring how to handle ‘incomplete’ inputs—where a robot or agent might lose a camera feed or audio stream but still needs to function.

Moving forward, I plan to:

  • Close the loop: Move fully from ‘perception’ (seeing) to ‘action’ (doing). I want to refine how large language models (LLMs) control robotic agents.
  • Robustness in the wild: Apply my OOD detection research to these agents. If a robot sees an object it wasn’t trained on, it needs to recognize that uncertainty rather than hallucinating an action.
  • Efficiency: As models get larger, they become harder to deploy. I want to continue my work on efficient clip trimming and sparse activation to make these massive models usable in real-time scenarios.

What made you want to study AI?

It was actually a journey of realizing where my skills could have the most impact. I originally studied Geological Engineering alongside Computer Science. I was very strong in mathematics—I competed in and won several national math modeling competitions. While I loved the rigour of geology, I realized that the mathematical models I was building had applications far beyond just the earth sciences. I became fascinated by the idea that the same underlying logic used to analyze physical data could be used to teach a machine to ‘see’ and ‘read.’ AI was the perfect intersection of my competitive math background and my desire to build systems that solve dynamic, open-ended problems, rather than static ones.

Could you tell us an interesting (non-AI related) fact about you?

A fun fact is that my first degree was actually in Geological Engineering, where I ranked 1st out of 60 students. Because of this, I spent a semester as a fully funded exchange student at Lomonosov Moscow State University in Russia. It was an intense experience — adapting to the Russian winter and a completely different academic culture gave me a lot of resilience. I think surviving a winter in Moscow makes debugging code seem a lot less stressful by comparison!

About Xiang Fang

Xiang Fang is a PhD candidate at Nanyang Technological University (NTU), Singapore. He holds an M.Eng. from Huazhong University of Science and Technology and a dual background in Computer Science and Geological Engineering. His research focuses on multi-modal learning, specifically advancing large vision-language models, embodied intelligence, and out-of-distribution detection. Xiang has published over 40 papers in top-tier venues, including CVPR, NeurIPS, ICML, AAAI, and ACM MM. He is the recipient of multiple awards, including the NTU Research Excellence Award and Best Student Paper at MIPR 2024, and serves as a reviewer for major AI conferences.





Lucy Smith is Senior Managing Editor for AIhub.



