MESA An Evaluation Framework for Compositional, Semantic, and Spatial Generalization in Robotics

1Georgia Institute of Technology, 2Georgia Tech Research Institute

Abstract

Despite growing interest in foundation models for robotic manipulation, systematic evaluation of their capabilities remains elusive. Existing benchmarks are either narrow and saturated, offering little signal on SOTA methods, or broadly scoped, providing diversity at the cost of controlled experimentation.

We introduce MESA, a dynamic evaluation framework designed for precise, reproducible measurement of language-conditioned policy generalization. MESA provides five evaluation suites that isolate in-distribution performance and four distinct axes of generalization: spatial configuration, object instance, object category, and subtask composition. This lets researchers diagnose where and why policies fail, not just whether they fail.

We further introduce a language-following metric that disentangles task understanding from low-level execution, revealing that state-of-the-art policies often understand commands but fail to execute them. To support rapid iteration, we develop MESA-Gen, a pipeline for scalable task and demonstration generation built by improving MimicGen.

We evaluate seven baseline methods spanning diffusion policies, VLM-based approaches, and pretrained VLAs, finding that even the strongest model achieves only 52% average success, providing meaningful signal while leaving ample room for progress. The results are correlated with rankings in RoboArena evaluations (ρ = 0.9), validating MESA as a reproducible testbed predictive of relative real-world performance.
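The reported rank correlation (ρ = 0.9) is a Spearman coefficient over policy rankings. As a quick illustration, here is a minimal tie-free Spearman computation; the per-policy scores below are made up for demonstration and are not the paper's numbers:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation between two score lists (assumes no ties)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical per-policy scores: benchmark success rate vs. arena score.
mesa_scores = [0.52, 0.41, 0.30, 0.23, 0.09]
arena_scores = [0.80, 0.77, 0.55, 0.40, 0.12]
print(spearman_rho(mesa_scores, arena_scores))  # identical rankings -> 1.0
```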

MESA-Bench: Overview

MESA benchmark overview

Our benchmark consists of a canonical training set MESA-70 and the five following evaluation suites:

  • A suite evaluating in-distribution performance
  • MESA-Spatial: evaluates seen tasks with unseen spatial configurations
  • MESA-Category: evaluates tasks with unseen object categories
  • MESA-Instance: evaluates seen tasks with unseen object instances
  • MESA-Compositional: evaluates unseen composite tasks built from in-distribution subtasks

Reset Distribution

Each task contains a number of distractor objects and corresponding distractor tasks, designed to probe the evaluated policy's language following. For example, for the target task “put the orange in the basket” with a bowl and an apple as distractor objects, the distractor task predicates would be In(apple, basket), In(apple, bowl), In(orange, bowl), and In(bowl, basket). After a failed rollout, if any of these predicates is satisfied, we log a distractor task completion.
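The logging rule above can be sketched in a few lines. The predicate encoding below (In(a, b) as a tuple) and the function name are illustrative, not MESA's actual implementation; the distractor list is copied from the example:

```python
# Distractor-task predicates for "put the orange in the basket",
# taken from the example above. (a, b) encodes In(a, b).
TARGET = ("orange", "basket")
DISTRACTORS = [
    ("apple", "basket"),
    ("apple", "bowl"),
    ("orange", "bowl"),
    ("bowl", "basket"),
]

def classify_rollout(satisfied_predicates):
    """Classify a rollout from the set of In(a, b) predicates that hold
    at the end: the target task, a distractor task, or a plain failure."""
    if TARGET in satisfied_predicates:
        return "success"
    if any(p in satisfied_predicates for p in DISTRACTORS):
        return "distractor"
    return "failure"

print(classify_rollout({("orange", "bowl")}))  # a distractor completion
```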

Below, we visualize reset distributions with distractor objects for several tasks.

MESA-Gen: Data Generation Pipeline

We introduce MESA-Gen, a pipeline for scalable task and demonstration generation built by improving MimicGen. To build your own experiments using our framework:

  1. If a suitable task skeleton does not already exist, define it in Python.
  2. Instantiate a task from the skeleton and generate BDDL files for the training and evaluation sets.
  3. Optionally, collect a small number of source demonstrations.
  4. Generate a large dataset for imitation learning using our modified MimicGen pipeline. As we describe in the paper, we introduce "subtask stitching", which enables data generation for tasks without human data.
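Steps 1 and 2 can be sketched as follows, assuming a hypothetical skeleton API with illustrative names (the real MESA-Gen interface will differ):

```python
from dataclasses import dataclass

@dataclass
class TaskSkeleton:
    """Hypothetical task skeleton: a name plus a BDDL-style goal template
    that is filled in when a concrete task is instantiated."""
    name: str
    goal_template: str

    def instantiate(self, **bindings):
        # Bind concrete objects to the template's placeholders.
        return self.goal_template.format(**bindings)

# Step 1: define a skeleton (only if a suitable one does not exist).
pick_place = TaskSkeleton("pick_place", "(:goal (In {obj} {container}))")

# Step 2: instantiate a concrete task; the resulting goal string would
# feed into BDDL file generation for the training and evaluation sets.
print(pick_place.instantiate(obj="orange", container="basket"))
# (:goal (In orange basket))
```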

Key Findings

Robot pretraining is a primary differentiator.

Pretraining on large-scale robot datasets provides policies with robust priors, significantly boosting performance across MESA evaluation suites. Policies without pretraining often fail to generalize beyond in-distribution tasks. Above, we compare PG-FM and $\pi_0$, which share the same architecture and differ only in $\pi_0$'s pretraining.

Action representation matters.

PG-Bin, which uses RT-2-style action binning, fails across a variety of tasks (0.9%) despite sharing its VLM backbone with PG-FM (22.8%). This indicates that directly discretizing continuous action dimensions into language tokens is a poor fit for MESA's fine-grained control tasks.
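RT-2-style binning quantizes each continuous action dimension into one of a fixed number of uniform bins, each mapped to a token. A minimal sketch (the bin count and action ranges below are illustrative, not PG-Bin's actual configuration); the round-trip quantization error is exactly what limits fine-grained control:

```python
N_BINS = 256  # illustrative; RT-2 uses 256 bins per dimension

def discretize(action, low, high):
    """Map each continuous action dimension to an integer bin index."""
    out = []
    for a in action:
        frac = (a - low) / (high - low)
        out.append(min(N_BINS - 1, max(0, int(frac * N_BINS))))
    return out

def undiscretize(bins, low, high):
    """Recover bin centers; the residual quantization error is the cost
    of representing actions as discrete tokens."""
    return [low + (b + 0.5) / N_BINS * (high - low) for b in bins]

action = [0.03, -0.51, 0.99]
bins = discretize(action, low=-1.0, high=1.0)
print(bins, undiscretize(bins, -1.0, 1.0))
```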

OOD object instances are challenging, even for frontier policies.

We see that across the board, policies struggle to identify unseen object instances. For example, all three models above have seen the red bell pepper during training. However, they have difficulty identifying the out-of-distribution green bell pepper, with some even mistaking the mango for the bell pepper.

OOD spatial configurations lead to a consistent performance drop.

Despite having seen the same object and the same task previously, policies struggle to identify the target object when it is placed in a different spatial configuration. Here, we see that the policies correctly identify the tomato over a similar-looking object when the tomato is in an in-distribution position (left). When the tomato is instead placed on the right, all policies except $\pi_{0.5}$ still select the distractor item on the left.

Vision-language grounding breaks down with unseen objects.

We show that for all trained baselines, performance drops significantly on MESA-Category, suggesting that open-vocabulary object generalization is not adequately addressed by current VLM pretraining. The first table shows success rates over each of our evaluation suites. The second table reports the language-following rate of these baselines, which we define precisely in the paper.

MESA-Bench success rates

Success rates on MESA-Bench.

MESA-bench language following rates

Language following rates on MESA-Bench.

BibTeX

@article{mesa2026,
  author    = {Albert Wilcox and Frank Chang and Aishani Chakraborty and Nhi Nguyen and Jeremy A. Collins and Vaibhav Saxena and Benjamin Joffe and Siddharth Karamcheti and Animesh Garg},
  title     = {MESA: An Evaluation Framework for Compositional, Semantic, and Spatial Generalization in Robotics},
  journal   = {arXiv preprint},
  year      = {2026},
}

Acknowledgements

The authors would like to acknowledge the State of Georgia and the Agricultural Technology Research Program at Georgia Tech for supporting the work described in this paper.