AIhub.org
 

An open-source training framework to advance multimodal AI


22 January 2025




Trying to model physical reality by assembling various modalities: the image shows a couple of oranges seen through the lens of multiple modalities, with each slice showing a different way one might perceive and understand this scene. The modalities from left to right represent surface normals (the color represents surface orientation), depth (distance to the camera, red=near, blue=far), RGB (the original image), segmentation (distinct objects and image regions), and edges (object or texture boundaries).
2025 EPFL/Visual Intelligence and Learning Laboratory – CC-BY-SA 4.0

By Tanya Petersen

Large Language Models such as OpenAI’s ChatGPT have already transformed the way many of us go about some of our daily tasks. These generative artificial intelligence chatbots are trained on language: hundreds of terabytes of text ‘scraped’ from across the Internet, feeding models with billions of parameters.

Looking ahead, many believe the ‘engines’ that drive generative artificial intelligence will be multimodal models that are not just trained on text but also can process various other modalities of information, including images, video, sound, and modalities from other domains such as biological or atmospheric data.

Yet, until recently, training a single model to handle a wide range of modalities (inputs) and tasks (outputs) faced significant challenges. For example, the training often led to a reduction in performance compared to single-task models, and typically required careful strategies to limit quality losses and maximize accuracy. In addition, training one network on modalities as different as language, images, or video presented further complexities, and essential information in certain modalities was often incorrectly ignored by the model.

Multimodal Modeling

In a multi-year project undertaken with support from Apple in California, EPFL researchers from the Visual Intelligence and Learning Laboratory (VILAB) in the School of Computer and Communication Sciences (IC) have developed 4M, for Massively Masked Multimodal Modeling, one of the world’s most advanced single neural networks to handle a wide and varied range of tasks and modalities.

In their latest research paper on 4M, presented in December at NeurIPS 2024, the Annual Conference on Neural Information Processing Systems, the researchers describe how it expands the capabilities of existing models in multiple ways.
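The “Massively Masked” part of the name points at the training recipe: each modality is converted into discrete tokens, and the model learns by predicting one randomly sampled subset of tokens from another, across all modalities at once. As a rough, hypothetical sketch of that masking step (a toy illustration, not the authors’ implementation; the token values and budgets below are invented):

```python
import random

def sample_multimodal_mask(tokens_per_modality, input_budget, target_budget, seed=0):
    """Toy multimodal masking: pool the tokens of all modalities, then draw one
    subset to serve as model inputs and a disjoint subset as prediction targets."""
    rng = random.Random(seed)
    # Pool tokens as (modality, position) pairs so modalities are interchangeable.
    pool = [(m, i) for m, toks in tokens_per_modality.items() for i in range(len(toks))]
    rng.shuffle(pool)
    inputs = pool[:input_budget]
    targets = pool[input_budget:input_budget + target_budget]
    return inputs, targets

# Hypothetical tokenized scene: each modality reduced to a few discrete tokens.
scene = {
    "rgb":   [101, 17, 56, 9],
    "depth": [3, 44, 2],
    "edges": [12, 12, 7],
}
inp, tgt = sample_multimodal_mask(scene, input_budget=4, target_budget=3)
```

In the real system the targets would be predicted by a network conditioned on the input tokens; this snippet only shows how a per-iteration input/target split can be drawn across modalities rather than within a single one.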

“With 4M, we now have a rich model that can interpret more than just language. But why does this matter? One common criticism of LLMs is that their knowledge is not grounded because the training data is limited to only language,” explained Assistant Professor Amir Zamir, Head of VILAB.

“When we advance to multimodal modeling, we don’t have to limit ourselves to language. We bring in other modalities, including sensors. For example, we can communicate an orange through the word ‘orange,’ just like in language models, but also through a collection of pixels, meaning how the orange looks, or through the sense of touch, capturing how touching an orange feels. If you assemble various modalities, you have a more complete encapsulation of the physical reality that we are trying to model,” he continued.

Towards an open-source, generic model for wide use

Despite these advances, Zamir says the development of 4M has presented some intriguing challenges, including the model not developing a truly unified representation across the modalities, and he has his own theory as to why.

“We think that secretly, under the hood, the models cheat and create a little ensemble of independent models. One set of parameters solves one problem, another set of parameters solves another, and collectively, they appear to solve the overall problem. But they’re not truly unifying their knowledge in a way that enables a compact joint representation of the environment that would be a good portal to the world.”

The VILAB team is continuing to work on building more structure and unification into 4M, with the goal of developing an open-source, generic architecture that experts in other domains can adapt to their specific needs, such as climate modeling or biomedical research. The team is also working on other important aspects, such as further boosting scalability and developing methods for specializing models to their deployment contexts.

“The whole point of open sourcing is that people can tailor the model for themselves with their own data and their own specifications. 4M is coming at the right moment in time, and we are especially enthusiastic about other domains adopting this line of modeling for their specific use cases. We are excited to see where that leads. But there are still a lot of challenges, and there is still a lot to do,” said Oguzhan Fatih Kar and Roman Bachmann, Doctoral Assistants in VILAB and co-authors of the paper.

Based on the team’s experience developing 4M and the intriguing problems that they continue to work on, Zamir believes there are some interesting questions around the future development of foundation models.

“As humans, we have five key senses, and on top of that, we efficiently learn language, which adds labels and structure to the knowledge that was already grounded in these other senses. It’s the opposite with the current AI – we have language models without sensory access to the world but that are trained using colossal data and compute resources. Our goal is to study the role of multimodality and efficiently develop a grounded world model that can be effectively utilized for downstream uses.”





©2026.02 - Association for the Understanding of Artificial Intelligence