An open-source training framework to advance multimodal AI


22 January 2025

Modeling physical reality by assembling various modalities: the image shows a couple of oranges seen through the lens of multiple modalities, with each slice showing a different way one might perceive and understand this scene. From left to right, the modalities are surface normals (where color encodes surface orientation), depth (distance to the camera, red=near, blue=far), RGB (the original image), segmentation (distinct objects and image regions), and edges (object or texture boundaries).
2025 EPFL/Visual Intelligence and Learning Laboratory – CC-BY-SA 4.0

By Tanya Petersen

Large Language Models such as OpenAI’s ChatGPT have already transformed the way many of us go about some of our daily tasks. These generative artificial intelligence chatbots are trained on language (hundreds of terabytes of text ‘scraped’ from across the Internet) and have billions of parameters.

Looking ahead, many believe the ‘engines’ that drive generative artificial intelligence will be multimodal models, trained not just on text but also able to process various other modalities of information, including images, video, sound, and modalities from other domains, such as biological or atmospheric data.

Yet, until recently, training a single model to handle a wide range of modalities (inputs) and tasks (outputs) faced significant challenges. For example, joint training often reduced performance compared with single-task models, and typically required careful strategies to limit quality losses and maximize accuracy. Training one network on modalities as different as language, images, and video added further complexity, and essential information in certain modalities was often incorrectly ignored by the model.

Multimodal Modeling

In a multi-year project undertaken with support from Apple in California, EPFL researchers from the Visual Intelligence and Learning Laboratory (VILAB) in the School of Computer and Communication Sciences (IC) have developed 4M, for Massively Multimodal Masked Modeling, one of the world’s most advanced single neural networks for handling a wide and varied range of tasks and modalities.
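
To make the name concrete: in massively multimodal masked modeling, every modality is first converted into discrete tokens by its own tokenizer, and a single transformer is then trained to predict randomly hidden tokens from randomly chosen visible ones, across all modalities at once. The sketch below illustrates that idea only; it is not the released 4M code (which uses modality-specific tokenizers and an encoder-decoder rather than the simple BERT-style encoder used here), and all names, sizes, and the shared vocabulary are illustrative assumptions.

```python
# Minimal sketch of the masked multimodal modeling idea (NOT the 4M implementation).
# It assumes every modality (RGB, depth, captions, ...) has already been mapped to
# discrete tokens by its own tokenizer; here we just use random token ids.
import torch
import torch.nn as nn

VOCAB = 1024                                   # assumed shared token vocabulary
DIM = 128                                      # assumed embedding width
MODALITIES = ["rgb", "depth", "normals", "segmentation", "caption"]
MASK_ID = 0                                    # token id standing in for "[MASK]"

class TinyMaskedMultimodalModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, DIM)
        self.mod_emb = nn.Embedding(len(MODALITIES), DIM)  # which modality a token belongs to
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)                  # predicts token ids

    def forward(self, tokens, modality_ids):
        x = self.tok_emb(tokens) + self.mod_emb(modality_ids)
        return self.head(self.encoder(x))

# Toy batch: 2 samples, 20 tokens each, each token tagged with its modality.
tokens = torch.randint(1, VOCAB, (2, 20))
modality_ids = torch.randint(0, len(MODALITIES), (2, 20))

# Randomly choose which tokens the model may see; the rest become prediction targets.
visible = torch.rand(tokens.shape) < 0.5
inputs = torch.where(visible, tokens, torch.full_like(tokens, MASK_ID))

model = TinyMaskedMultimodalModel()
logits = model(inputs, modality_ids)

# The loss is computed only on the hidden tokens, jointly across all modalities.
loss = nn.functional.cross_entropy(logits[~visible], tokens[~visible])
loss.backward()
print(f"toy masked multimodal loss: {loss.item():.3f}")
```

Because the visible and hidden tokens are sampled across modalities, the same network learns to map from any subset of modalities to any other, which is what allows one model to cover many different tasks.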

In their latest research paper on 4M, presented in December at NeurIPS 2024, the Annual Conference on Neural Information Processing Systems, the researchers describe how it expands the capabilities of existing models in multiple ways.

“With 4M, we now have a rich model that can interpret more than just language. But why does this matter? One common criticism of LLMs is that their knowledge is not grounded because the training data is limited to only language,” explained Assistant Professor Amir Zamir, Head of VILAB.

“When we advance to multimodal modeling, we don’t have to limit ourselves to language. We bring in other modalities, including sensors. For example, we can communicate an orange through the word ‘orange,’ just like in language models, but also through a collection of pixels, meaning how the orange looks, or through the sense of touch, capturing how touching an orange feels. If you assemble various modalities, you have a more complete encapsulation of the physical reality that we are trying to model,” he continued.
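
As a purely conceptual illustration of this point, and not code from the paper, the snippet below describes a single scene through several aligned modalities: the ones shown in the image above plus a caption. Each entry answers a different question about the same oranges; the arrays are random placeholders standing in for real sensor data or model estimates.

```python
# Conceptual illustration only: one scene, several aligned modalities.
import numpy as np

H, W = 4, 4                                               # a tiny "image" for illustration

scene = {
    "caption": "a couple of oranges on a table",          # language
    "rgb": np.random.rand(H, W, 3),                       # how the scene looks
    "depth": np.random.rand(H, W),                        # distance to the camera
    "surface_normals": np.random.rand(H, W, 3),           # surface orientation
    "segmentation": np.random.randint(0, 3, (H, W)),      # which pixels belong to which object
    "edges": np.random.randint(0, 2, (H, W)),             # object or texture boundaries
}

# Each modality answers a different question about the same physical scene;
# a multimodal model learns to translate between any subset of them.
for name, value in scene.items():
    shape = value.shape if hasattr(value, "shape") else "free-form text"
    print(f"{name:>16}: {shape}")
```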

Towards an open-source, generic model for wide use

Despite these advances, Zamir says the development of 4M has presented some intriguing challenges, including the fact that the model does not develop a truly unified representation across the modalities, and he has his own theory as to why.

“We think that secretly, under the hood, the models cheat and create a little ensemble of independent models. One set of parameters solves one problem, another set of parameters solves another, and collectively, they appear to solve the overall problem. But they’re not truly unifying their knowledge in a way that enables a compact joint representation of the environment that would be a good portal to the world.”

The VILAB team is continuing to work on building more structure and unification into 4M, with the goal of developing an open-source, generic architecture that experts in other domains, such as climate modeling or biomedical research, can adapt to their specific needs. The team is also working on other important aspects, such as further improving scalability and developing methods to specialize models for particular deployment contexts.

“The whole point of open sourcing is that people can tailor the model for themselves with their own data and their own specifications. 4M is coming at the right moment in time, and we are especially enthusiastic about other domains adopting this line of modeling for their specific use cases. We are excited to see where that leads. But there are still a lot of challenges, and there is still a lot to do,” said Oguzhan Fatih Kar and Roman Bachmann, Doctoral Assistants in VILAB and co-authors of the paper.

Based on the team’s experience developing 4M and the intriguing problems that they continue to work on, Zamir believes there are some interesting questions around the future development of foundation models.

“As humans, we have five key senses, and on top of that, we efficiently learn language, which adds labels and structure to the knowledge that was already grounded in these other senses. It’s the opposite with the current AI – we have language models without sensory access to the world but that are trained using colossal data and compute resources. Our goal is to study the role of multimodality and efficiently develop a grounded world model that can be effectively utilized for downstream uses.”
