An open-source training framework to advance multimodal AI


22 January 2025




Modeling physical reality by assembling various modalities: the image shows a couple of oranges seen through the lens of multiple modalities, with each slice showing a different way one might perceive and understand the scene. From left to right, the modalities are surface normals (color encodes surface orientation), depth (distance to the camera; red = near, blue = far), RGB (the original image), segmentation (distinct objects and image regions), and edges (object or texture boundaries).
2025 EPFL/Visual Intelligence and Learning Laboratory – CC-BY-SA 4.0

By Tanya Petersen

Large Language Models such as OpenAI’s ChatGPT have already transformed the way many of us go about some of our daily tasks. These generative artificial intelligence chatbots, which have billions of parameters, are trained on language: hundreds of terabytes of text ‘scraped’ from across the Internet.

Looking ahead, many believe the ‘engines’ that drive generative artificial intelligence will be multimodal models that are not just trained on text but can also process various other modalities of information, including images, video, sound, and data from other domains such as biology or atmospheric science.

Yet, until recently, training a single model to handle a wide range of modalities (inputs) and tasks (outputs) faced significant challenges. Such training often led to a drop in performance compared with single-task models, and typically required careful strategies to limit quality losses and maximize accuracy. Training one network on modalities as different as language, images, and video added further complexity, and essential information in certain modalities was often wrongly ignored by the model.

Multimodal Modeling

In a multi-year project undertaken with support from Apple in California, EPFL researchers from the Visual Intelligence and Learning Laboratory (VILAB) in the School of Computer and Communication Sciences (IC) have developed 4M, short for Massively Multimodal Masked Modeling, one of the world’s most advanced single neural networks for handling a wide and varied range of tasks and modalities.

In their latest research paper on 4M, presented in December at NeurIPS 2024, the Annual Conference on Neural Information Processing Systems, the researchers describe how it expands the capabilities of existing models in multiple ways.
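At a high level, the ‘masked’ part of the name refers to a training objective in which every modality is converted into discrete tokens and a single network learns to predict hidden tokens from visible ones, so that any subset of modalities can be inferred from any other. The snippet below is a minimal, hypothetical PyTorch sketch of that idea, not the actual 4M code or API: the model class, vocabulary size, and masking routine (ToyMultimodalMaskedModel, VOCAB, masked_training_step) are illustrative assumptions.

```python
# Minimal, hypothetical sketch of masked multimodal modeling (not the 4M code base).
# Every modality is assumed to be pre-tokenized into discrete token IDs; one shared
# Transformer is trained to predict a randomly hidden subset of tokens from the
# visible ones, across all modalities at once.

import torch
import torch.nn as nn

VOCAB = 1024   # shared discrete vocabulary size (illustrative)
DIM = 256      # model width (illustrative)
MODALITIES = ["rgb", "depth", "normals", "segmentation", "edges"]

class ToyMultimodalMaskedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB, DIM)
        self.modality_emb = nn.Embedding(len(MODALITIES), DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens, modality_ids):
        # tokens, modality_ids: (batch, sequence) integer tensors
        x = self.token_emb(tokens) + self.modality_emb(modality_ids)
        return self.head(self.encoder(x))

def masked_training_step(model, tokens, modality_ids, mask_ratio=0.5):
    """Hide a random subset of tokens and compute the loss for predicting them."""
    mask = torch.rand(tokens.shape) < mask_ratio
    inputs = tokens.masked_fill(mask, 0)   # 0 serves as a [MASK] id in this toy setup
    logits = model(inputs, modality_ids)
    # The loss is computed only on the hidden positions.
    return nn.functional.cross_entropy(logits[mask], tokens[mask])

# Toy usage: two samples, five modalities, 16 tokens per modality.
model = ToyMultimodalMaskedModel()
tokens = torch.randint(1, VOCAB, (2, len(MODALITIES) * 16))
modality_ids = torch.arange(len(MODALITIES)).repeat_interleave(16).expand(2, -1)
loss = masked_training_step(model, tokens, modality_ids)
loss.backward()
```

In 4M itself, each modality has its own tokenizer and the input and target tokens are sampled separately, but the underlying predict-what-is-hidden objective is the same.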

“With 4M, we now have a rich model that can interpret more than just language. But why does this matter? One common criticism of LLMs is that their knowledge is not grounded because the training data is limited to only language,” explained Assistant Professor Amir Zamir, Head of VILAB.

“When we advance to multimodal modeling, we don’t have to limit ourselves to language. We bring in other modalities, including sensors. For example, we can communicate an orange through the word ‘orange,’ just like in language models, but also through a collection of pixels, meaning how the orange looks, or through the sense of touch, capturing how touching an orange feels. If you assemble various modalities, you have a more complete encapsulation of the physical reality that we are trying to model,” he continued.

Towards an open-source, generic model for wide use

Despite these advances, Zamir says that developing 4M has presented some intriguing challenges, including the fact that the model does not form a truly unified representation across modalities, and he has his own theory as to why.

“We think that secretly, under the hood, the models cheat and create a little ensemble of independent models. One set of parameters solves one problem, another set of parameters solves another, and collectively, they appear to solve the overall problem. But they’re not truly unifying their knowledge in a way that enables a compact joint representation of the environment that would be a good portal to the world.”

The VILAB team is continuing to work on building more structure and unification into 4M, with the goal of developing an open-source, generic architecture that experts in other domains can adapt to their specific needs, such as climate modeling or biomedical research. The team is also addressing other important aspects, such as further boosting scalability and developing methods for specializing models to particular deployment contexts.

“The whole point of open sourcing is that people can tailor the model for themselves with their own data and their own specifications. 4M is coming at the right moment in time, and we are especially enthusiastic about other domains adopting this line of modeling for their specific use cases. We are excited to see where that leads. But there are still a lot of challenges, and there is still a lot to do,” said Oguzhan Fatih Kar and Roman Bachmann, Doctoral Assistants in VILAB and co-authors of the paper.

Based on the team’s experience developing 4M and the intriguing problems that they continue to work on, Zamir believes there are some interesting questions around the future development of foundation models.

“As humans, we have five key senses, and on top of that, we efficiently learn language, which adds labels and structure to the knowledge that was already grounded in these other senses. It’s the opposite with the current AI – we have language models without sensory access to the world but that are trained using colossal data and compute resources. Our goal is to study the role of multimodality and efficiently develop a grounded world model that can be effectively utilized for downstream uses.”
