AIhub.org
 

Exploring counterfactuals in continuous-action reinforcement learning


20 June 2025




Reinforcement learning (RL) agents are capable of making complex decisions in dynamic environments, yet their behavior often remains opaque. When an agent executes a sequence of actions—such as administering insulin to a diabetic patient or controlling a spacecraft’s landing—it is rarely clear how outcomes might have changed under alternative choices. This challenge becomes particularly pronounced in settings involving continuous action spaces, where decisions are not confined to discrete options but span a spectrum of real-valued magnitudes. The framework introduced in recent work aims to generate counterfactual explanations in such settings, offering a structured approach to explore “what if” scenarios.

Why counterfactuals for RL?

The value of counterfactual reasoning in RL becomes apparent in scenarios with high-stakes, temporally extended consequences. The article's example figure illustrates the case of blood glucose control in type-1 diabetes. Here, an RL agent determines insulin dosages at regular intervals in response to physiological signals. In the observed trajectory, labeled $\tau$, the patient's blood glucose initially rises into a dangerous range before eventually declining, resulting in a moderate total reward. Three counterfactual alternatives ($\tau_1$, $\tau_2$, and $\tau_3$) show the potential outcomes of slightly different insulin dosing decisions. Among these, $\tau_1$ and $\tau_2$ yield higher cumulative rewards than $\tau$, while $\tau_3$ performs worse. Notably, $\tau_1$ achieves the best outcome with minimal deviation from the original actions and satisfies a clinically motivated constraint: administering a fixed insulin dose when glucose falls below a predefined threshold.
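The comparison described above can be sketched in a few lines of Python. This is only an illustration: the `Candidate` fields, the `pick_explanation` helper, and every number below are hypothetical placeholders, not values or code from the work discussed. The sketch keeps candidates that beat the observed reward and respect the constraint, then prefers the one that deviates least from the original actions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """One counterfactual trajectory, summarized (all fields are illustrative)."""
    name: str
    total_reward: float          # cumulative reward of the counterfactual
    deviation: float             # distance from the observed action sequence
    satisfies_constraint: bool   # e.g. fixed dose applied below the threshold


def pick_explanation(original_reward: float,
                     candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the admissible candidate with the smallest deviation.

    Admissible means: higher reward than the observed trajectory and
    constraint-compliant. Ties in deviation go to the higher reward.
    """
    admissible = [c for c in candidates
                  if c.total_reward > original_reward and c.satisfies_constraint]
    if not admissible:
        return None
    return min(admissible, key=lambda c: (c.deviation, -c.total_reward))


# Made-up numbers that only mirror the qualitative story in the text.
candidates = [
    Candidate("tau_1", total_reward=14.0, deviation=0.2, satisfies_constraint=True),
    Candidate("tau_2", total_reward=12.0, deviation=0.5, satisfies_constraint=True),
    Candidate("tau_3", total_reward=7.0, deviation=0.4, satisfies_constraint=True),
]
best = pick_explanation(original_reward=10.0, candidates=candidates)
print(best.name if best else "no positive counterfactual")
```

Filtering first and ranking by deviation second is one possible design choice; a weighted trade-off between reward gain and deviation would be an equally reasonable alternative.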

These examples suggest that counterfactual explanations may assist in diagnosing and refining learned behaviors. Rather than treating an RL policy as a black box, this perspective facilitates the identification of marginal adjustments with meaningful effects. It also offers a mechanism for domain experts—such as clinicians or engineers—to assess whether agent decisions align with established safety and performance criteria.

Counterfactual policies with minimal deviation

The method formulates counterfactual explanation as an optimization problem, seeking alternative trajectories that improve performance while remaining close to an observed sequence of actions. Proximity is quantified using a tailored distance metric over continuous action sequences. To solve this, the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm is adapted with a reward-shaping mechanism that penalizes large deviations. The resulting counterfactual policy is deterministic and designed to produce interpretable alternatives from a given initial state.
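To make the objective concrete, here is a minimal Python sketch of a deviation-penalized, trajectory-level score. It rests on assumptions not stated in the article: the distance is taken to be the mean per-step Euclidean distance between action vectors, and `penalty_weight` is a hypothetical trade-off parameter. The tailored metric and the exact shaping used in the actual method may differ.

```python
import math
from typing import Sequence

Action = Sequence[float]


def action_distance(original: Sequence[Action],
                    counterfactual: Sequence[Action]) -> float:
    """Mean per-step Euclidean distance between two action sequences.

    Assumption: both sequences have the same, non-zero length.
    """
    if len(original) != len(counterfactual) or not original:
        raise ValueError("sequences must be non-empty and of equal length")
    per_step = (math.dist(a, b) for a, b in zip(original, counterfactual))
    return sum(per_step) / len(original)


def shaped_return(cf_rewards: Sequence[float],
                  orig_rewards: Sequence[float],
                  cf_actions: Sequence[Action],
                  orig_actions: Sequence[Action],
                  penalty_weight: float = 1.0) -> float:
    """Reward improvement minus a penalty for straying from the observed actions."""
    improvement = sum(cf_rewards) - sum(orig_rewards)
    penalty = penalty_weight * action_distance(orig_actions, cf_actions)
    return improvement - penalty


# Toy usage with made-up one-dimensional actions (e.g. insulin doses).
observed_actions = [[1.0], [2.0], [0.0]]
alternative_actions = [[1.0], [2.5], [0.0]]
score = shaped_return(cf_rewards=[0.0, 1.0, 2.0],
                      orig_rewards=[0.0, 0.5, 1.0],
                      cf_actions=alternative_actions,
                      orig_actions=observed_actions,
                      penalty_weight=0.5)
print(f"shaped score: {score:.3f}")
```

A score of this kind could serve as the training signal for an off-policy learner such as TD3; a larger `penalty_weight` pushes the learned alternatives closer to the observed behavior.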

The formulation accommodates constrained action settings, where certain decisions—such as those taken in critical physiological states—must adhere to domain-specific policies. This is addressed by constructing an augmented Markov Decision Process (MDP) that isolates unconstrained portions of the state space while embedding fixed behaviors into the transition dynamics. Optimization is then applied selectively over the flexible parts of the trajectory.
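One way to picture the augmented MDP is as a wrapper that lets the agent act only in unconstrained states and silently applies the fixed rule everywhere else. The sketch below is an assumption-laden illustration, not the construction from the work discussed: `AugmentedEnv`, `toy_step`, the threshold of 70, and the toy dynamics are all hypothetical, rewards are summed without discounting, and the episode is assumed to start in an unconstrained state.

```python
from typing import Callable, Tuple

State = float
StepResult = Tuple[State, float, bool]  # (next state, reward, done)


class AugmentedEnv:
    """Folds a fixed rule for constrained states into the transition dynamics."""

    def __init__(self,
                 base_step: Callable[[State, float], StepResult],
                 is_constrained: Callable[[State], bool],
                 fixed_action: Callable[[State], float],
                 max_forced_steps: int = 100) -> None:
        self.base_step = base_step
        self.is_constrained = is_constrained
        self.fixed_action = fixed_action
        self.max_forced_steps = max_forced_steps  # guard against endless loops

    def step(self, state: State, action: float) -> StepResult:
        """Apply the agent's action, then roll the fixed rule forward until the
        state is unconstrained again, the episode ends, or the guard trips."""
        next_state, total_reward, done = self.base_step(state, action)
        forced = 0
        while (not done and self.is_constrained(next_state)
               and forced < self.max_forced_steps):
            rule_action = self.fixed_action(next_state)
            next_state, reward, done = self.base_step(next_state, rule_action)
            total_reward += reward
            forced += 1
        return next_state, total_reward, done


def toy_step(glucose: State, dose: float) -> StepResult:
    """Made-up dynamics: glucose drifts up by 10 and each dose unit lowers it by 20."""
    next_glucose = glucose + 10.0 - 20.0 * dose
    reward = -abs(next_glucose - 100.0) / 100.0
    return next_glucose, reward, False


env = AugmentedEnv(toy_step,
                   is_constrained=lambda g: g < 70.0,  # hypothetical low-glucose threshold
                   fixed_action=lambda g: 0.0)         # hypothetical fixed dose
state, reward, done = env.step(95.0, 2.0)
print(state, round(reward, 2), done)
```

From the agent's point of view, each call to `step` lands in a state where it is free to choose, so a standard learner can be applied without ever overriding the domain rule.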

Rather than constructing one-off explanations for individual examples, the approach learns a generalizable counterfactual policy. This enables consistent and scalable explanation generation across a distribution of observed behaviors.

Applications: Diabetes control and Lunar Lander

Empirical evaluation was carried out in two representative domains, each involving continuous control in temporally extended environments. The first task involved glucose regulation using the FDA-approved UVA/PADOVA simulator, which models the physiology of patients with type-1 diabetes. In this context, the agent is tasked with adjusting insulin dosages in real time based on glucose trends, carbohydrate intake, and other state variables. The goal is to keep blood glucose within a safe target range while avoiding hypoglycemic or hyperglycemic events. Counterfactual trajectories in this domain illustrate how small, policy-consistent changes to insulin administration can yield improved outcomes.

The second domain uses the Lunar Lander environment, a standard RL benchmark where a simulated spacecraft must land upright on a designated pad. The agent must regulate thrust from main and side engines to maintain balance and minimize velocity on landing. The environment is governed by gravity and momentum, making small control variations potentially impactful. Counterfactual explanations in this case provide insight into how modest control refinements might improve landing stability or energy use.

In both settings, the approach identified alternative trajectories that improved on a standard baseline, particularly in terms of interpretability and adherence to constraints. Positive counterfactuals (those with higher cumulative reward than the observed trajectory) were found in 50–80% of test cases. The learned policy also demonstrated generalization across both single- and multi-environment conditions.
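The "positive counterfactual" rate can be computed as the share of test cases where the counterfactual's cumulative reward strictly exceeds the original's. The helper below is a minimal sketch under that reading; the function name and the sample returns are hypothetical and do not reproduce the reported experiments.

```python
from typing import Sequence


def positive_counterfactual_rate(original_returns: Sequence[float],
                                 counterfactual_returns: Sequence[float]) -> float:
    """Fraction of paired test cases where the counterfactual return is strictly higher."""
    if len(original_returns) != len(counterfactual_returns):
        raise ValueError("each original needs exactly one counterfactual return")
    if not original_returns:
        raise ValueError("at least one test case is required")
    positives = sum(1 for orig, cf in zip(original_returns, counterfactual_returns)
                    if cf > orig)
    return positives / len(original_returns)


# Made-up returns for four test cases; three of the four counterfactuals improve.
rate = positive_counterfactual_rate([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 5.0])
print(f"{rate:.0%}")
```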

Limitations and broader implications

While the framework shows promise in interpretability and empirical performance, it relies on a trajectory-level reward signal with sparse shaping. This design may limit the resolution of feedback during training, particularly in long-horizon or fine-grained control settings. Nonetheless, the approach contributes to a broader effort toward interpretable reinforcement learning. In domains where transparency is essential—such as healthcare, finance, or autonomous systems—it is important to understand not only what the agent chose, but what alternatives could have yielded better results. Counterfactual reasoning offers one pathway to illuminate these possibilities in a structured and policy-aware manner.



Shuyang Dong is a PhD candidate in Computer Engineering at the University of Virginia.

©2026.05 - Association for the Understanding of Artificial Intelligence