AIhub coffee corner: World models

by AIhub

22 May 2026

The AIhub coffee corner captures the musings of AI experts over a short conversation. This month we delve into world models. What are they, and what potential do they have? Joining the conversation this time are: Sanmay Das (Virginia Tech), Rina Dechter (University of California, Irvine), Tom Dietterich (Oregon State University), Sabine Hauert (University of Bristol), Michael Littman (Brown University), and Marija Slavkovik (University of Bergen).

Sabine Hauert: World models have been prominent lately and there are huge amounts of money being thrown at this. Recently, NVIDIA came to do a one-day training at the Bristol Robotics Laboratory and presented us with all their latest tools, including world models as a way to generate simulated environments that could be useful to train robot policies on. And I was quite dazzled by the potential of generating lots of examples, because I think that is a bottleneck in robotics. We can’t feed off a lot of the data that’s already present on the web for some of the stuff that we do. I thought it looked really useful, so I’m trying it out. I can’t quite figure out if it’s just extending what we do with language to video, and then pretending it has physical properties, or whether it does have this holy grail where we can generate all these environments that will be directly useful for robotics. At this training, someone raised their hand and asked what a world model was, and the reply was that it is just a video generator. Therefore, I think a good starting point for our discussion today would be to define what a world model is, because I think people are spinning it in different ways.

Michael Littman: Sure, I think it’s worth pointing out that the phrase world model was pretty well used in the reinforcement learning literature for many years. It is sometimes also called a transition model. It says, “if this is the state of the world and if you were to execute this action on it, then this is the probability distribution over the next state of the world”. So we’re trying to simulate a world, and model how the world moves from point to point to point. If you have a good world model, you can potentially use it in place of the world for doing decision-making.

As an example, I have a chocolate here, and if I take an action with that chocolate, what’s going to happen next? Maybe I’m trying to accomplish something, like blocking out the camera on my computer by throwing the chocolate at it. I can reason about that in advance, and not have to do it in the world, to decide that was actually a terrible idea – even if I had hit the camera, the chocolate wouldn’t have stuck. So I can use my model for planning and decision-making. To me, that’s what “world model” meant. But then again, I thought I knew what the word “agent” meant, and now it means something new and different and exciting…
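Editorial aside: the reinforcement learning sense of “world model” described above can be sketched in a few lines of Python. This is only an illustration of the idea, not anyone’s actual system. The state names, the actions (including `cover_with_hand`), the probabilities, and the reward values are all invented for the example.

```python
# Hypothetical toy world: every state, action, probability and reward below is
# an invented illustration, not something taken from the conversation.

# Transition model: (state, action) -> probability distribution over next states.
TRANSITION_MODEL = {
    # Throwing the chocolate never leaves the camera covered (it would not stick).
    ("camera_uncovered", "throw_chocolate"): {"camera_uncovered": 1.0},
    # Holding a hand over the lens almost always works.
    ("camera_uncovered", "cover_with_hand"): {
        "camera_covered": 0.95,
        "camera_uncovered": 0.05,
    },
}

# How much we value ending up in each state.
REWARD = {"camera_covered": 1.0, "camera_uncovered": 0.0}


def next_state_distribution(state, action):
    """Return the model's distribution over next states for (state, action)."""
    return TRANSITION_MODEL[(state, action)]


def expected_reward(state, action):
    """Average the reward over the predicted next states, without acting."""
    distribution = next_state_distribution(state, action)
    return sum(prob * REWARD[nxt] for nxt, prob in distribution.items())


def plan_one_step(state, actions):
    """Pick the action whose predicted outcome has the highest expected reward."""
    return max(actions, key=lambda action: expected_reward(state, action))


best = plan_one_step("camera_uncovered", ["throw_chocolate", "cover_with_hand"])
print(best)
```

The point of the sketch is that the decision is made entirely inside the model: no chocolate is thrown in order to find out that throwing it is a bad plan.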

So in what sense is a video generator a world model? If you are trying to figure out how to train a world model on the actual world around us, that’s a much harder problem than learning from some kind of simulator where you have direct access to the state and you can actually observe what the next state looks like. If you’re actually predicting things in the real world, you haven’t measured everything, so maybe you could learn something like a world model from video. You could learn that the current state is whatever the video is showing, imagine different possible interventions, and then say what the next state is, not by giving its internal parameters, but by visualizing it. And so if you get excited about that idea, and you train on lots of videos, and you forget that this was maybe for actions and decision-making, then you get something that looks a lot like a video generator. It’s something where you can give it a frame of video and it can spool out what the future might look like based on its internal representations of the dynamics. So that’s a thing.

Now when you get to Yann LeCun’s new company called AMI, they say that their central thing is going to be world models, but they’re not video generators at all. They’re referring back to the original concept a little bit, while not actually referencing it. In the sense that they’re taking snapshots of the state of, for example, a chemical plant and trying to say, “okay, if we increase the amount of carbon dioxide that we insert in this part of the pipeline, then this is how it’s going to change the way that the chemical plant plays out”. And again, if you can do that, you can use it for prediction and decision-making and ultimately do things more effectively.

Tom Dietterich: And in that use, it’s quite similar to digital twins, which are also supposed to be dynamical models that can predict the effect of action as well as just describing the time series of the system. Different community, also somewhat aspirational in many cases. But Yann has been working on his JEPA neural network architecture now for three or four years, which claims to learn the dynamics of the system in a lower-dimensional latent space, which is absolutely essential with vision. If you look at what the self-driving vehicle people are doing, like Waymo and Waabi and so on, they all build models like this because they can’t go out and enact fatal collisions on the highways just to collect data. But what’s unclear is how they validate those models. They can train them on all the normal states, but as far as I can tell, a lot of them are doing end-to-end stuff where they say, “given an input LiDAR point cloud and the following accelerations on the vehicle, this will be the resulting LiDAR point cloud”. And it’s not clear to me how well that generalizes without having a physics engine or something behind it.

Sanmay Das: So I guess one thing I’ve always wondered about world models (and I’m speaking as a card-carrying believer in models) is that, if you think about what happened with natural language, people thought that we would need to understand the properties of language to build language models, and that was not the way it worked out at all. And I’ve always got the sense that some of the ideas around world models are similar. This is related to the end-to-end stuff, where if you can build some kind of black box that is relatively predictive of the next step, whether it’s doing time series or non-time series, and in some way just add in the action as part of the input, then you can figure out what’s going to happen at the next time step. That’s a world model, even though it doesn’t in any way correspond to the George Box notion of a model. And so I wonder, is that what a lot of people are thinking in terms of world models, that they’re essentially another time series prediction task in the same way that language is a sequence prediction task? And that people are just going to go with that rather than actually thinking of it as a model that in some way might help our understanding or predictive capability or that has physics or dynamics built into it. That’s definitely the vibe I’ve been getting from a lot of people who talk about world models.
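Editorial aside: the “black box that predicts the next step, with the action added to the input” can be illustrated with a deliberately tiny example. The sketch below assumes a one-dimensional state and an invented linear system (the constants 0.9 and 0.5 are made up), and fits the predictor by ordinary least squares in place of a neural network. It is a toy stand-in for the idea, not a description of how any production world model is trained.

```python
import random


def fit_next_step(states, actions, next_states):
    """Least-squares fit of next_state ~ w_s * state + w_a * action.

    Solves the 2x2 normal equations directly, so no external library is needed.
    """
    sss = sum(s * s for s in states)
    saa = sum(a * a for a in actions)
    ssa = sum(s * a for s, a in zip(states, actions))
    ssy = sum(s * y for s, y in zip(states, next_states))
    say = sum(a * y for a, y in zip(actions, next_states))
    det = sss * saa - ssa * ssa
    if abs(det) < 1e-12:
        raise ValueError("data does not determine both weights")
    w_s = (ssy * saa - ssa * say) / det
    w_a = (sss * say - ssa * ssy) / det
    return w_s, w_a


def rollout(w_s, w_a, start_state, actions):
    """Chain one-step predictions, feeding each prediction back in as the state."""
    state = start_state
    trajectory = [state]
    for action in actions:
        state = w_s * state + w_a * action
        trajectory.append(state)
    return trajectory


# Collect experience from the "real" system (invented dynamics, assumed noiseless).
TRUE_W_S, TRUE_W_A = 0.9, 0.5
rng = random.Random(0)
states, actions, next_states = [], [], []
state = 1.0
for _ in range(50):
    action = rng.uniform(-1.0, 1.0)
    nxt = TRUE_W_S * state + TRUE_W_A * action
    states.append(state)
    actions.append(action)
    next_states.append(nxt)
    state = nxt

w_s, w_a = fit_next_step(states, actions, next_states)
predicted = rollout(w_s, w_a, 1.0, [1.0, 0.0, -1.0])
print(w_s, w_a, predicted)
```

The fitted weights predict well here without the learner being told anything about why the system behaves as it does, which is the distinction being drawn between a predictive black box and a model that aids understanding.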

Rina Dechter: So, another take on models (that are not transitional systems as we have in planning and reinforcement learning) is causal models, graphical models, probabilistic graphical models, anything where you represent knowledge. When we talk about knowledge representation, often we mean that we represent knowledge in some kind of a model language. And in particular in causality, the idea is that the structural causal model captures the world – what causes what, and what is the effect of doing stuff? The whole research area is to really see if we can answer cause and effect questions or counterfactual questions if we only have partial information about the model of the world. The argument is that you have to have some kind of model, otherwise you will not be able to say what the consequences of actions are. And also counterfactual reasoning – what would happen if I did something different? A general explanation without a world model would be impossible. So I think these types of models are different from the transitional model of the next state, i.e. if you are in a particular state, what will be the effect of an action on the next state? But these models can be combined somehow, so they are not completely different.
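Editorial aside: a small simulation can show why a causal model is needed to predict the consequences of actions. The structural causal model below is entirely hypothetical: a hidden common cause `z` influences both `x` and `y`, and `x` also influences `y`. The variable names, the probabilities, and the coefficients are assumptions chosen for illustration.

```python
import random


def sample(rng, do_x=None):
    """Draw one (z, x, y) from a hypothetical structural causal model.

    z -> x, z -> y and x -> y. Passing do_x overrides the mechanism for x,
    which is what an intervention do(X = do_x) means.
    """
    z = 1 if rng.random() < 0.5 else 0
    if do_x is None:
        # Without intervention, x copies z 90% of the time.
        x = z if rng.random() < 0.9 else 1 - z
    else:
        x = do_x
    y = 1.0 * x + 2.0 * z
    return z, x, y


def observational_mean_y(x_value, n=20000, seed=0):
    """Estimate E[Y | X = x_value] by filtering passively observed samples."""
    rng = random.Random(seed)
    ys = [y for _, x, y in (sample(rng) for _ in range(n)) if x == x_value]
    return sum(ys) / len(ys)


def interventional_mean_y(x_value, n=20000, seed=0):
    """Estimate E[Y | do(X = x_value)] by forcing x in every sample."""
    rng = random.Random(seed)
    ys = [sample(rng, do_x=x_value)[2] for _ in range(n)]
    return sum(ys) / len(ys)


seen = observational_mean_y(1)
done = interventional_mean_y(1)
print(seen, done)
```

In this toy model the observed average of `y` among cases with `x = 1` is about 2.8, while forcing `x = 1` gives about 2.0. The gap comes from the common cause `z`, and closing it requires knowing the structure of the model, not just having more observational data.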

Marija Slavkovik: I am also in the group of “there are world models”, but I’m sitting here wondering whether there is a difference between a world model and the model of the world. And I’m wondering how that connects to this thing that Niantic pulled where they got us all running around catching Pokémon in order to build a model of the world to then be used for delivery agents, which I can’t even be angry about. I mean, it was not part of their original business model. But now they are sitting on an extremely valuable dataset of a model of the world as it is for deliveries.

Sabine: The thing that impressed me at that NVIDIA day was their example, which was a video of a surgery. So you have an end effector in a body with soft tissue and blood, behaving in all these very complicated ways, and through that video and the world model training they could generate lots of instances of these very weird mechanics of an end effector interacting in the body. I think that is very close to what you were saying, Sanmay, of just predicting the next step in this video series based on what’s happening with this tissue. It was also very imperfect, so there were floating end effectors, which you would not want in surgery as your training proposition. But this is the kind of simulation that it would be very difficult to do via hand design. And I thought, that’s actually a sweet spot. It’s just this very narrow thing, which then makes me question, why do we call these things world models? Why do we go for such big terminology all the time? Why is this a world model instead of a really efficient way to do prediction on next-step video?

Sanmay: Exactly right. I mean, you’ve articulated what I was trying to get at in some ways. I guess my question is really: is there a path? So, I was surprised by the path to generating language essentially using recursive models that do one thing at a time. So maybe world models will get to being world models without giving us any insight into how the world works by becoming really good at one-step video prediction and then threading these things together with enough context to give the sequence of actions you need to take based on just looking at the video for performing this complex surgery or something. I mean, I don’t think it’ll happen, but I didn’t think language would happen the way it did either.

Tom: Language doesn’t have a lot of the partial observability issues. Watching a robot, you get absolutely no tactile information whatsoever. So even Yann is going to have to infer the physics of the wheels interacting with the surface, the friction and everything. And this is a very hard latent variable problem, basically rediscovering Newtonian physics.

Sabine: The gap with the physics is going to be challenging. And there are these papers coming out now about companies that are putting data capture systems on the wrists of workers so that they can then train robot arms. Just the sheer amount of data needed to do something similar to what was done in the language world is almost incompatible in some of these papers. I don’t know if this helps go in the right direction. I think it’s wide open, and the robotics community is wondering what will happen.

Rina: I’m always puzzled by world models, or large language models, because we don’t fully understand what they are doing, besides understanding that they predict the next word or whatever. It’s hard to know what those models are. And the essence of a model is something that you can interrogate and ask questions about. It looks to me that most of the research is looking at these large models as if they are natural phenomena and trying to explore them in the same way that people explore the real world with physics. So I wonder whether this is a fruitful approach to really try to understand this huge system as a black box and try to interrogate and explore what it is doing.

Sabine: I kind of like the delivery robot scenario (it’s something I work on), rather than going for surgery with these systems. I think there are more mundane applications that probably are more fruitful than going for hard things first. That’ll be interesting to see.

tags: coffee corner
