Each year, a small group of PhD students is chosen to participate in the AAAI/SIGAI Doctoral Consortium. This initiative provides an opportunity for the students to discuss and explore their research interests and career objectives in an interdisciplinary workshop together with a panel of established researchers. For the past couple of years, we’ve been meeting with some of the students to find out more about their work. In the first of our interviews with the 2026 cohort, we caught up with Xiang Fang.
I have been conducting my PhD research at Nanyang Technological University (NTU) in Singapore. Broadly speaking, my research focuses on multi-modal learning and embodied intelligence. I am trying to bridge the gap between how AI ‘sees’ the world (computer vision) and how it ‘understands’ language. Specifically, my PhD thesis work has centered on two critical challenges:
Ultimately, I want to build AI agents that can not only watch videos but actually understand and navigate the physical world.
I’ve structured my research into three main phases, resulting in over 40 publications in venues like CVPR, NeurIPS, and AAAI.
Yes, I found my work on “Turing Patterns for Multimedia: Reaction-Diffusion Multi-Modal Fusion for Language-Guided Video Moment Retrieval” (published in ACM MM 2025) particularly fascinating because it was highly interdisciplinary.
In that project, I drew inspiration from biological reaction-diffusion systems (essentially the math behind how zebras get their stripes) and applied that idea to AI. I used these patterns to model how the different modalities, video and text, should ‘diffuse’ and fuse together.
It was intellectually satisfying because it allowed me to leverage my strong mathematical background to solve a modern computer vision problem in a completely novel, non-traditional way. It showed that we can look outside of standard deep learning paradigms to find efficient solutions.
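To make the analogy concrete, here is a minimal, hypothetical sketch of how a Gray-Scott-style reaction-diffusion update can couple two feature ‘fields’ along a shared temporal axis. This is not the fusion module from the ACM MM 2025 paper (those details are not given here); all names and parameter values are illustrative assumptions.

```python
# Toy reaction-diffusion-style fusion sketch (illustrative only; not the
# method from the paper). Two 1D "fields" stand in for video and text
# activations over a shared temporal axis; each step diffuses them and
# couples them through a Gray-Scott-like reaction term.
import numpy as np

def laplacian_1d(x):
    # Discrete Laplacian with periodic boundaries: measures how much each
    # position differs from its neighbours, which drives diffusion.
    return np.roll(x, 1) + np.roll(x, -1) - 2.0 * x

def fuse(video, text, steps=200, d_v=0.16, d_t=0.08, feed=0.035, kill=0.06):
    # u: video-like field, v: text-like field (hypothetical names).
    u, v = video.copy(), text.copy()
    for _ in range(steps):
        reaction = u * v * v                       # cross-modal coupling term
        u += d_v * laplacian_1d(u) - reaction + feed * (1.0 - u)
        v += d_t * laplacian_1d(v) + reaction - (feed + kill) * v
    return u, v

rng = np.random.default_rng(0)
video_field = np.ones(64)                # hypothetical per-frame video activation
text_field = 0.25 * rng.random(64)       # hypothetical per-frame text relevance
fused_video, fused_text = fuse(video_field, text_field)
print(fused_video.round(2))
```

The appeal of this kind of formulation is that fusion emerges from a handful of interpretable parameters (diffusion rates and reaction strengths) rather than from a large learned attention block, though in practice the coefficients would be learned or tuned rather than fixed as above.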
Building on my PhD research, my immediate focus is on unified vision-language-action models. In my recent AAAI 2026 papers, I started exploring how to handle ‘incomplete’ inputs—where a robot or agent might lose a camera feed or audio stream but still needs to function.
Moving forward, I plan to:
It was actually a journey of realizing where my skills could have the most impact. I originally studied Geological Engineering alongside Computer Science. I was very strong in mathematics—I competed in and won several national math modeling competitions. While I loved the rigour of geology, I realized that the mathematical models I was building had applications far beyond just the earth sciences. I became fascinated by the idea that the same underlying logic used to analyze physical data could be used to teach a machine to ‘see’ and ‘read.’ AI was the perfect intersection of my competitive math background and my desire to build systems that solve dynamic, open-ended problems, rather than static ones.
A fun fact is that my first degree was actually in Geological Engineering, where I ranked 1st out of 60 students. Because of this, I spent a semester as a fully funded exchange student at Lomonosov Moscow State University in Russia. It was an intense experience — adapting to the Russian winter and a completely different academic culture gave me a lot of resilience. I think surviving a winter in Moscow makes debugging code seem a lot less stressful by comparison!
Xiang Fang is a PhD candidate at Nanyang Technological University (NTU), Singapore. He holds an M.Eng. from Huazhong University of Science and Technology and a dual background in Computer Science and Geological Engineering. His research focuses on multi-modal learning, specifically advancing large vision-language models, embodied intelligence, and out-of-distribution detection. Xiang has published over 40 papers in top-tier venues, including CVPR, NeurIPS, ICML, AAAI, and ACM MM. He is the recipient of multiple awards, including the NTU Research Excellence Award and Best Student Paper at MIPR 2024, and serves as a reviewer for major AI conferences.