The T-maze, shown below, is a prototypical example of a task studied in the field of reinforcement learning. An artificial agent enters the maze from the left and immediately receives one of two possible observations: red or green. Red means that the agent will be rewarded for moving up at the right end of the corridor (at the question mark tile), while green means the opposite: the agent will be rewarded for moving down. While this seems like a trivial task, standard reinforcement learning algorithms (such as Q-learning) fail to learn the desired behavior. This is because these algorithms are designed to solve Markov Decision Processes (MDPs). In an MDP, optimal agents are reactive: the optimal action depends only on the current observation. However, in the T-maze, the blue question mark tile does not give enough information: the optimal action (going up or down) also depends on the first observation (red or green). Such an environment is called a Partially Observable Markov Decision Process (POMDP).
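To make the task concrete, here is a minimal code sketch of a T-maze environment. The corridor length, observation names, and reward values are illustrative simplifications, not the exact setup from the paper.

```python
import random

class TMaze:
    """Minimal T-maze sketch; the details are illustrative simplifications."""

    def __init__(self, corridor_length=10):
        self.corridor_length = corridor_length

    def reset(self):
        # The rewarded direction is drawn at random and signalled by the first observation.
        self.goal = random.choice(["up", "down"])
        self.position = 0
        return "red" if self.goal == "up" else "green"

    def step(self, action):
        if self.position < self.corridor_length:
            # Walking along the corridor: the observation carries no information here.
            self.position += 1
            obs = "junction" if self.position == self.corridor_length else "corridor"
            return obs, 0.0, False
        # At the junction, only the remembered first observation tells the agent where to go.
        reward = 1.0 if action == self.goal else -1.0
        return "junction", reward, True


env = TMaze(corridor_length=10)
first_obs = env.reset()            # "red" or "green": this must be remembered!
obs, done = first_obs, False
while not done:
    if obs == "junction":
        action = "up" if first_obs == "red" else "down"   # memory of the first observation
    else:
        action = "right"
    obs, reward, done = env.step(action)
print(reward)                      # 1.0, because the agent remembered the first observation
```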
In a POMDP, it is necessary for an agent to keep a memory of past observations. The most common type of memory is a sliding window of a fixed length $m$. If the complete history of observations up to time $t$ is $h_t = (o_1, o_2, \ldots, o_t)$, then the sliding window memory is the tuple of the $m$ most recent observations, $(o_{t-m+1}, \ldots, o_t)$. In the T-maze, since we have to remember the first observation until we reach the blue tile, the length $m$ of the window has to be at least equal to the corridor length. The problem with this approach is that learning with long windows is expensive! We can show [1] that learning with windows of length $m$ generally requires a number of samples that scales exponentially in $m$. Thus, learning in the T-maze with the naive sliding window memory is not tractable if the corridor is long.
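As a concrete illustration, a sliding window memory can be written in a few lines; the window length and the deque-based implementation below are illustrative choices.

```python
from collections import deque

class SlidingWindowMemory:
    """Keeps only the last m observations of the history (illustrative sketch)."""

    def __init__(self, m):
        self.window = deque(maxlen=m)   # older observations fall out automatically

    def update(self, observation):
        self.window.append(observation)
        return tuple(self.window)       # the agent acts based on the current window


memory = SlidingWindowMemory(m=3)
for obs in ["green", "corridor", "corridor", "corridor", "junction"]:
    state = memory.update(obs)
print(state)   # ('corridor', 'corridor', 'junction') -- the initial "green" is already gone
```

With a window shorter than the corridor, the decisive first observation is lost by the time the agent reaches the junction, which is exactly why the window has to grow with the corridor length.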
Our new work [1] introduces an alternative memory framework: memory traces. The memory trace is an exponential moving average of the history of observations. Formally, if each observation $o_t$ is encoded as a one-hot vector, the memory trace is updated as $z_t = \lambda\, z_{t-1} + (1 - \lambda)\, o_t$. The forgetting factor $\lambda \in [0, 1)$ controls how quickly the past is forgotten. This memory is illustrated in the T-maze above. There are 4 possible observations (colors), and thus memory traces take the form of 4-vectors. In this example, the initial observation is green. As the agent walks along the corridor, this initial observation slowly fades in the memory trace. Once the agent reaches the blue decision state, the information from the first observation is still accessible in the memory trace, making optimal behavior possible.
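Here is a minimal sketch of this memory in code, assuming the exponential-moving-average update above with one-hot encoded colors; the color indices and the value of $\lambda$ are illustrative.

```python
import numpy as np

# One-hot encoding of the 4 colours (illustrative ordering).
COLOURS = {"red": 0, "green": 1, "corridor": 2, "junction": 3}

def update_trace(z, obs, lam=0.9):
    """Exponential moving average of one-hot observations: z_t = lam * z_{t-1} + (1 - lam) * o_t."""
    one_hot = np.zeros(len(COLOURS))
    one_hot[COLOURS[obs]] = 1.0
    return lam * z + (1.0 - lam) * one_hot

z = np.zeros(len(COLOURS))
z = update_trace(z, "green")                 # initial observation
for _ in range(10):                          # walk along the corridor
    z = update_trace(z, "corridor")
z = update_trace(z, "junction")              # reach the decision tile

print(z.round(3))
# The "green" entry is small but nonzero: the first observation has faded,
# yet it can still be read off the trace, so the optimal action can depend on it.
```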
To understand whether memory traces provide any benefit over sliding windows, it is helpful to visualize the space of memory traces. Consider the case where there are three possible observations, encoded as the one-hot vectors $e_1 = (1, 0, 0)$, $e_2 = (0, 1, 0)$, and $e_3 = (0, 0, 1)$. Memory traces are convex combinations of these three vectors, so they all lie in a 2-dimensional plane (the triangle spanned by the three vectors), and we can easily visualize them. The picture below shows the set of all possible memory traces for different history lengths with the forgetting factor $\lambda = 1/2$. The set of memory traces forms a recursive Sierpiński triangle.
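The picture can be reproduced with a short script along the following lines, which enumerates all histories of a fixed length and plots the resulting traces (the history length and the starting trace are illustrative choices).

```python
import itertools
import numpy as np
import matplotlib.pyplot as plt

lam, T = 0.5, 7                      # forgetting factor and history length (illustrative)
vertices = np.eye(3)                 # one-hot encodings of the three observations

# Map barycentric coordinates on the simplex to a 2-d triangle for plotting.
corners_2d = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])

points = []
for history in itertools.product(range(3), repeat=T):
    z = np.full(3, 1 / 3)            # start the trace at the centre of the simplex
    for obs in history:
        z = lam * z + (1 - lam) * vertices[obs]
    points.append(z @ corners_2d)

points = np.array(points)
plt.scatter(points[:, 0], points[:, 1], s=1)
plt.gca().set_aspect("equal")
plt.show()                           # with lam = 0.5 the traces trace out a Sierpinski triangle
```

Re-running the script with other values of `lam` produces the pictures discussed next.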
The picture changes if we vary the forgetting factor $\lambda$, as shown below.
A surprising result is that, if $\lambda \le 1/2$, then memory traces preserve all information of the complete history of observations! In this case, we could theoretically decode all previous observations from a single memory trace vector. The reason for this property is that we can identify what happened in the past by zooming in on the space of memory traces.
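To make the "zooming in" concrete, here is a sketch of such a decoder for the update rule above: the most recent observation is the vertex of the simplex closest to the trace, and inverting the update then reveals the one before it, and so on. It assumes the initial trace and the history length are known, and that $\lambda \le 1/2$.

```python
import numpy as np

def decode_history(z, steps, lam=0.5, n_obs=3):
    """Recover the observations that produced trace z (possible whenever lam <= 1/2)."""
    vertices = np.eye(n_obs)
    history = []
    for _ in range(steps):
        # The latest observation dominates the trace, so it is the nearest vertex ...
        obs = int(np.argmin(np.linalg.norm(vertices - z, axis=1)))
        history.append(obs)
        # ... and inverting z_t = lam * z_{t-1} + (1 - lam) * o_t "zooms in" on the older part.
        z = (z - (1 - lam) * vertices[obs]) / lam
    return list(reversed(history))

# Encode a history, then decode it back from the single trace vector.
lam, true_history = 0.5, [2, 0, 1, 1, 0, 2, 2, 1]
z = np.full(3, 1 / 3)
for obs in true_history:
    z = lam * z + (1 - lam) * np.eye(3)[obs]

print(decode_history(z, steps=len(true_history), lam=lam))   # [2, 0, 1, 1, 0, 2, 2, 1]
```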
As nothing is truly forgotten, memory traces are equivalent to sliding windows of unbounded length. Since learning with long windows is intractable, so is learning with memory traces. To make learning possible, we can restrict the “resolution” of the functions that we learn, so that they cannot zoom arbitrarily. Mathematically, this “resolution” is given by the Lipschitz constant of a function. Our main results show that, if we bound the Lipschitz constant, then sliding windows are equivalent to memory traces with $\lambda \le 1/2$ (“fast forgetting”), while memory traces with $\lambda > 1/2$ (“slow forgetting”) can significantly outperform sliding windows in certain environments. In fact, the T-maze is such an environment. While the cost of learning with sliding windows scales exponentially with the corridor length, for memory traces this scaling is only polynomial!
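A rough numerical illustration of this intuition, using the update rule from above rather than the exact analysis of the paper [1]: at the decision tile, the traces following a red start and a green start differ only in how much weight the first observation still carries. With fast forgetting this gap shrinks exponentially with the corridor length, but with a forgetting factor chosen close to 1 (say $\lambda = 1 - 1/n$ for corridor length $n$) it shrinks only polynomially, so a function with a modest Lipschitz constant can still tell the two cases apart. The corridor lengths in the sketch below are illustrative.

```python
import numpy as np

def trace_gap(corridor_length, lam):
    """Distance between the traces of a red-start and a green-start episode at the decision tile."""
    # The two episodes differ only in the first observation, so the gap is the weight
    # that this observation still carries in the trace when the junction is reached.
    z_red = np.zeros(4)
    z_green = np.zeros(4)
    episode = [(0, 1)] + [(2, 2)] * (corridor_length - 1) + [(3, 3)]   # colour indices as above
    for r, g in episode:
        z_red = lam * z_red + (1 - lam) * np.eye(4)[r]
        z_green = lam * z_green + (1 - lam) * np.eye(4)[g]
    return np.linalg.norm(z_red - z_green)

for n in [5, 10, 20, 40]:
    fast = trace_gap(n, lam=0.5)        # fast forgetting: gap shrinks exponentially in n
    slow = trace_gap(n, lam=1 - 1 / n)  # slow forgetting: gap shrinks only polynomially in n
    print(f"corridor {n:>2}:  gap (lam=0.5) = {fast:.2e}   gap (lam=1-1/n) = {slow:.2e}")
```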
[1] Partially Observable Reinforcement Learning with Memory Traces, Onno Eberhard, Michael Muehlebach and Claire Vernade. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, 2025.