
Interview with Yuki Mitsufuji: Text-to-sound generation


29 July 2025




Earlier this year, we spoke to Yuki Mitsufuji, Lead Research Scientist at Sony AI, about his work on different aspects of image generation. Yuki and his team have since extended their research to sound generation, presenting a paper at ICLR 2025 entitled SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation. We caught up with Yuki to find out more.

In our previous interview, you mentioned that real-time sound generation was one of the projects you were working on. What were the problems with the existing text-to-sound generators that you were trying to solve with your work?

Creating sounds for different types of multimedia, such as video games and movies, takes a lot of experimenting, as artists try to match sounds to their evolving creative ideas. New high-quality diffusion-based Text-to-Sound (T2S) generative models can help with this process, but they are often slow, which makes it harder for creators to experiment quickly. Existing T2S distillation models address this limitation through 1-step generation, but often the quality isn’t good enough for professional use. Additionally, while multi-step sampling in these distillation models improves sample quality, the semantic content changes, because the models don’t produce consistent results each time.

Could you tell us about the model that you’ve introduced – what are the main contributions of this work?

We proposed Sound Consistency Trajectory Models (SoundCTM), which allow flexible transitions between high-quality 1-step sound generation and superior sound quality through multi-step deterministic sampling. SoundCTM combines score-based diffusion and consistency models into a single architecture that supports both fast one-step sampling and high-fidelity multi-step generation for audio. This can empower creators to try out ideas quickly, match the sound to what they have in mind, and then improve the sound quality without changing its meaning.
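To make the idea of switching between 1-step and multi-step deterministic sampling concrete, here is a minimal sketch of the control flow. It is not the SoundCTM implementation: `toy_jump` is a hypothetical stand-in for a trained student network, and it assumes toy trajectories of the form x_t = mu + t * eps, chosen only because the jump then has a closed form. The time grids and the value of `mu` are likewise made up for illustration.

```python
from typing import Callable, List

Vector = List[float]
Jump = Callable[[Vector, float, float], Vector]


def toy_jump(x: Vector, t: float, s: float, mu: float = 0.5) -> Vector:
    """Hypothetical student: move x from time t to an earlier time s.

    Assumes toy trajectories x_t = mu + t * eps. A real model would be a
    trained network that also takes the text prompt as input.
    """
    return [mu + (s / t) * (xi - mu) for xi in x]


def sample(x_start: Vector, times: List[float], jump: Jump = toy_jump) -> Vector:
    """Deterministic sampling: chain jumps along decreasing times.

    No fresh noise is injected between jumps, so every step stays on the
    trajectory that the starting point defines.
    """
    x = x_start
    for t, s in zip(times[:-1], times[1:]):
        x = jump(x, t, s)
    return x


x_start = [1.3, -0.7, 2.1]                                   # made-up noisy start
one_step = sample(x_start, [80.0, 0.0])                      # single jump to time 0
four_step = sample(x_start, [80.0, 20.0, 5.0, 1.0, 0.0])     # four chained jumps

# Jumping directly to an intermediate time agrees with going there in stages.
direct = sample(x_start, [80.0, 1.0])
staged = sample(x_start, [80.0, 20.0, 5.0, 1.0])
```

In this toy setting the 1-step and 4-step results coincide because the jump function is exact. With a learned model the extra steps are what improve audio quality, while the absence of re-injected noise is what keeps the content tied to the same trajectory.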

How did you go about developing the model – what was the methodology?

SoundCTM builds directly on our previous computer vision CTM (Consistency Trajectory Models) research, which reimagined how diffusion models can learn from the trajectory of data as it transforms over time. By extending CTM into the audio domain, SoundCTM makes it possible to generate complex, full-band sound with speed, clarity, and control, while avoiding the training bottlenecks that slow down other models.

To develop SoundCTM, we addressed the limitations of the CTM framework by proposing a novel feature distance for distillation loss, a strategy for distilling CFG trajectories, and a ν-sampling that combines text-conditional and unconditional student jumps.
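As a rough illustration of how a text-conditional and an unconditional student jump might be combined, the sketch below blends the two outputs with a weight ν. The interview does not spell out the formula, so the linear interpolation, the assumed range of ν in [0, 1], and the function name `nu_blend` are all assumptions made for this example; the input vectors are placeholders rather than model outputs.

```python
from typing import List


def nu_blend(cond_jump: List[float], uncond_jump: List[float], nu: float) -> List[float]:
    """Element-wise blend of two student outputs (assumed linear in nu).

    nu = 1.0 keeps only the text-conditional jump,
    nu = 0.0 keeps only the unconditional jump.
    """
    if not 0.0 <= nu <= 1.0:
        raise ValueError("nu is assumed to lie in [0, 1]")
    if len(cond_jump) != len(uncond_jump):
        raise ValueError("both jumps must have the same length")
    return [nu * c + (1.0 - nu) * u for c, u in zip(cond_jump, uncond_jump)]


# Placeholder outputs of the two jumps for the same noisy input.
cond = [0.8, -0.2, 0.4]
uncond = [0.2, 0.0, -0.4]

blended = nu_blend(cond, uncond, nu=0.5)
only_cond = nu_blend(cond, uncond, nu=1.0)
```

Under this assumption ν acts as a dial at sampling time: it can be changed without retraining, since both jumps come from the same student model.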

How did you evaluate the model, and what were the results?

Through our research, we demonstrate that SoundCTM-DiT-1B is the first large-scale distillation model to achieve notable 1-step and multi-step full-band text-to-sound generation.

When evaluating the model, in addition to standard objective metrics such as Fréchet Distance (FD), Kullback–Leibler divergence (KL), and CLAP score evaluated in full-band settings, we conducted subjective listening tests. A unique aspect of our evaluation was the use of sample-wise reconstruction error in the CLAP audio encoder’s feature space to compare outputs from 1-step and 16-step generations.

This approach allowed us to objectively verify whether semantic content remained consistent between 1-step and multi-step generations. Our findings revealed that only our unique multi-step deterministic sampling preserved semantic consistency when compared to 1-step generation. This is a significant result that, to our knowledge, has not yet been achieved by any other distillation-based sound generator.

While this outcome is theoretically expected, our empirical validation adds strong support, especially in the context of content creation, where semantic fidelity is crucial.
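The sample-wise comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: the embeddings are made-up four-dimensional placeholders standing in for CLAP audio-encoder features, the error is assumed to be a mean squared error, and `samplewise_error` is a hypothetical helper name. The numbers are not results from the paper.

```python
from typing import List


def samplewise_error(emb_a: List[List[float]], emb_b: List[List[float]]) -> List[float]:
    """Mean squared error between paired embeddings, one value per sample.

    emb_a[i] and emb_b[i] are assumed to come from the same text prompt,
    e.g. a 1-step output and a multi-step output.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("both lists must contain the same number of samples")
    errors = []
    for a, b in zip(emb_a, emb_b):
        if len(a) != len(b):
            raise ValueError("paired embeddings must have the same dimension")
        errors.append(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))
    return errors


# Made-up placeholder embeddings for two prompts (not real CLAP features).
emb_one_step = [[0.1, 0.9, 0.0, 0.3], [0.5, 0.5, 0.2, 0.1]]
emb_multi_a = [[0.1, 0.8, 0.0, 0.3], [0.5, 0.5, 0.2, 0.2]]   # stays close to 1-step
emb_multi_b = [[0.9, 0.1, 0.4, 0.0], [0.0, 0.2, 0.9, 0.6]]   # drifts away

err_a = samplewise_error(emb_one_step, emb_multi_a)
err_b = samplewise_error(emb_one_step, emb_multi_b)
```

A low per-sample error means the multi-step output sits near its 1-step counterpart in feature space, which is the sense in which semantic content is said to be preserved.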

Audio samples are available here.

Read the work in full

SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation, Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji.

About Yuki Mitsufuji

Yuki Mitsufuji is a Lead Research Scientist at Sony AI. In addition to his role at Sony AI, he is a Distinguished Engineer for Sony Group Corporation and the Head of Creative AI Lab for Sony R&D. Yuki holds a PhD in Information Science & Technology from the University of Tokyo. His groundbreaking work has made him a pioneer in foundational music and sound work, such as sound separation and other generative models that can be applied to music, sound, and other modalities.





Lucy Smith is Senior Managing Editor for AIhub.



