AIhub.org

An open-source training framework to advance multimodal AI


22 January 2025




Modelling physical reality by assembling various modalities: the image shows a couple of oranges seen through the lens of multiple modalities, with each slice showing a different way one might perceive and understand the scene. From left to right, the modalities are surface normals (colour encodes surface orientation), depth (distance to the camera; red = near, blue = far), RGB (the original image), segmentation (distinct objects and image regions), and edges (object or texture boundaries).
2025 EPFL/Visual Intelligence and Learning Laboratory – CC-BY-SA 4.0

By Tanya Petersen

Large Language Models (LLMs) such as OpenAI’s ChatGPT have already transformed the way many of us go about our daily tasks. These generative artificial intelligence chatbots are trained on language: hundreds of terabytes of text scraped from across the Internet, with models comprising billions of parameters.

Looking ahead, many believe the ‘engines’ that drive generative artificial intelligence will be multimodal models: models trained not just on text but also able to process other modalities of information, including images, video, sound, and data from other domains such as biology or atmospheric science.

Yet, until recently, training a single model to handle a wide range of modalities (inputs) and tasks (outputs) faced significant challenges. Such training often led to reduced performance compared with single-task models and typically required careful strategies to limit quality losses and maximize accuracy. Moreover, training one network on modalities as different as language, images, and video introduced further complexity, and essential information in some modalities was often incorrectly ignored by the model.

Multimodal Modeling

In a multi-year project undertaken with support from Apple in California, EPFL researchers from the Visual Intelligence and Learning Laboratory (VILAB) in the School of Computer and Communication Sciences (IC) have developed 4M, short for Massively Masked Multimodal Modeling, one of the world’s most advanced single neural networks for handling a wide and varied range of tasks and modalities.

In their latest research paper on 4M, presented in December at NeurIPS 2024, the Annual Conference on Neural Information Processing Systems, the researchers describe how it expands the capabilities of existing models in multiple ways.
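The name "Massively Masked Multimodal Modeling" points at the core training idea: every modality is converted into a sequence of discrete tokens, a random subset of tokens across all modalities is shown to the model, and the model learns to predict a subset of the masked-out remainder. The sketch below illustrates only this input/target sampling scheme in toy form; the modality names, token values, and function are illustrative assumptions, not the actual 4M implementation.

```python
# Toy sketch of masked multimodal input/target sampling:
# all modalities are pooled into one set of tokens, some are
# kept visible as inputs, and some of the rest become targets.
import random

def sample_training_pair(token_seqs, n_input, n_target, rng=random):
    """token_seqs: dict mapping modality name -> list of token ids.
    Returns (inputs, targets), each a list of (modality, position, token)."""
    # Flatten all modalities into one pool of (modality, position, token).
    pool = [(m, i, t)
            for m, seq in token_seqs.items()
            for i, t in enumerate(seq)]
    rng.shuffle(pool)
    inputs = pool[:n_input]                      # visible tokens fed to the model
    targets = pool[n_input:n_input + n_target]   # masked-out tokens to predict
    return inputs, targets

# Hypothetical scene with three modalities, already tokenized.
scene = {
    "rgb":     [12, 7, 42, 3],
    "depth":   [5, 5, 9],
    "caption": [101, 17],
}
inputs, targets = sample_training_pair(scene, n_input=4, n_target=3)
```

Because inputs and targets are drawn from the same cross-modality pool, the same network is pushed to translate between any combination of modalities, rather than learning one fixed input-to-output mapping.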

“With 4M, we now have a rich model that can interpret more than just language. But why does this matter? One common criticism of LLMs is that their knowledge is not grounded because the training data is limited to only language,” explained Assistant Professor Amir Zamir, Head of VILAB.

“When we advance to multimodal modeling, we don’t have to limit ourselves to language. We bring in other modalities, including sensors. For example, we can communicate an orange through the word ‘orange,’ just like in language models, but also through a collection of pixels, meaning how the orange looks, or through the sense of touch, capturing how touching an orange feels. If you assemble various modalities, you have a more complete encapsulation of the physical reality that we are trying to model,” he continued.
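Zamir’s point, that several modalities together encapsulate an object more completely than language alone, can be sketched as building one joint feature vector from several per-modality encodings. The encoders below are toy stand-ins (hypothetical), not real models; only the idea of concatenating modality-specific features into a richer joint description comes from the text above.

```python
# Toy sketch: combining modality-specific encodings of the same
# object ("orange") into a single joint representation.
import numpy as np

def encode_text(word):
    # Toy text encoder: character codes of the first four letters.
    return np.array([ord(c) for c in word[:4]], dtype=float)

def encode_pixels(patch):
    # Toy vision encoder: per-channel mean colour of an RGB patch.
    return np.asarray(patch, dtype=float).mean(axis=(0, 1))

def joint_representation(parts):
    """Concatenate per-modality encodings into one vector."""
    return np.concatenate(parts)

orange_text = encode_text("orange")                # 4-dim text features
orange_pixels = encode_pixels([[[255, 165, 0]]])   # 3-dim RGB mean
z = joint_representation([orange_text, orange_pixels])
```

Each added modality contributes dimensions the others cannot supply, which is why the joint vector is a fuller description of the object than either encoding alone.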

Towards an open-source, generic model for wide use

Despite these advances, Zamir says the development of 4M has raised some intriguing challenges, including that the model does not develop a truly unified representation across the modalities, and he has his own theory as to why.

“We think that secretly, under the hood, the models cheat and create a little ensemble of independent models. One set of parameters solves one problem, another set of parameters solves another, and collectively, they appear to solve the overall problem. But they’re not truly unifying their knowledge in a way that enables a compact joint representation of the environment that would be a good portal to the world.”

The VILAB team is continuing to build more structure and unification into 4M, with the goal of developing an open-source, generic architecture that experts in other domains can adapt to their specific needs, such as climate modeling or biomedical research. The team is also addressing other important aspects, such as further boosting scalability and developing methods for specializing models to their deployment contexts.

“The whole point of open sourcing is that people can tailor the model for themselves with their own data and their own specifications. 4M is coming at the right moment in time, and we are especially enthusiastic about other domains adopting this line of modeling for their specific use cases. We are excited to see where that leads. But there are still a lot of challenges, and there is still a lot to do,” said Oguzhan Fatih Kar and Roman Bachmann, Doctoral Assistants in VILAB and co-authors of the paper.

Based on the team’s experience developing 4M and the intriguing problems that they continue to work on, Zamir believes there are some interesting questions around the future development of foundation models.

“As humans, we have five key senses, and on top of that, we efficiently learn language, which adds labels and structure to the knowledge that was already grounded in these other senses. It’s the opposite with the current AI – we have language models without sensory access to the world but that are trained using colossal data and compute resources. Our goal is to study the role of multimodality and efficiently develop a grounded world model that can be effectively utilized for downstream uses.”





