An Evaluation Framework for Compositional, Semantic, and Spatial Generalization in Robotics

Despite growing interest in foundation models for robotic manipulation, systematic evaluation of their capabilities remains elusive. Existing benchmarks are either narrow and saturated, offering little signal on state-of-the-art methods, or broadly scoped, providing diversity at the cost of controlled experimentation.
We introduce MESA, a dynamic evaluation framework designed for precise, reproducible measurement of language-conditioned policy generalization. MESA provides five evaluation suites that isolate in-distribution performance and four distinct axes of generalization: spatial configuration, object instance, object category, and subtask composition, enabling researchers to diagnose where and why policies fail, not just whether they fail.
We further introduce a language-following metric that disentangles task understanding from low-level execution, revealing that state-of-the-art policies often understand commands but fail to execute them. To support rapid iteration, we develop MESA-Gen, a pipeline for scalable task and demonstration generation built by improving MimicGen.
We evaluate seven baseline methods spanning diffusion policies, VLM-based approaches, and pretrained VLAs, finding that even the strongest model achieves only 52% average success, providing meaningful signal while leaving ample room for progress. The results are correlated with rankings in RoboArena evaluations (ρ = 0.9), validating MESA as a reproducible testbed predictive of relative real-world performance.
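The ρ reported above is a rank correlation between MESA success rates and RoboArena rankings. As a minimal sketch of how such a coefficient is computed, here is a pure-Python Spearman's ρ for lists of distinct scores (no tie handling); the example scores are hypothetical and are not the actual MESA or RoboArena numbers.

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation for equal-length lists of distinct values.

    Assumes no ties; real evaluations with tied scores would need the
    tie-corrected ranking used by e.g. scipy.stats.spearmanr.
    """
    def ranks(values):
        # 1-based rank of each entry in its list.
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical benchmark scores for two evaluations of the same methods.
bench_a = [52.0, 41.3, 35.0, 22.8, 10.1]
bench_b = [0.80, 0.65, 0.60, 0.40, 0.05]
print(spearman_rho(bench_a, bench_b))  # identical ordering → 1.0
```

A ρ near 1 indicates the two evaluations rank methods nearly identically, which is the sense in which MESA is predictive of relative real-world performance.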
Our benchmark consists of a canonical training set MESA-70 and the five following evaluation suites:
Each task includes several distractor objects and corresponding distractor tasks, designed to probe the language following of the evaluated policy.
For example, for a target task “put the orange in the basket” with a bowl and an apple as distractor objects, the distractor task predicates would be In(apple, basket), In(apple, bowl), In(orange, bowl), and In(bowl, basket). After a failed rollout, if any of these predicates were satisfied, we log a distractor task completion.
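The enumeration above can be sketched as follows. This is an illustrative reconstruction, not MESA's actual API: we represent each In(obj, recv) predicate as a pair and enumerate every movable-object/receptacle combination except the commanded one and trivial self-pairs.

```python
from itertools import product

def distractor_predicates(target, movables, receptacles):
    """All In(obj, recv) pairs except the commanded target pair.

    target:      the commanded (object, receptacle) pair
    movables:    objects the policy could place somewhere
    receptacles: containers something could be placed into
    """
    return [(o, r) for o, r in product(movables, receptacles)
            if o != r and (o, r) != target]

def completed_distractors(satisfied, target, movables, receptacles):
    """After a failed rollout, return the distractor tasks that were
    (incorrectly) completed, given the set of satisfied predicates."""
    return [p for p in distractor_predicates(target, movables, receptacles)
            if p in satisfied]

# The running example: "put the orange in the basket" with a bowl and
# an apple as distractors. The bowl is both movable and a receptacle.
preds = distractor_predicates(("orange", "basket"),
                              ["orange", "apple", "bowl"],
                              ["basket", "bowl"])
print(preds)
# → [('orange', 'bowl'), ('apple', 'basket'), ('apple', 'bowl'), ('bowl', 'basket')]
```

Note that the bowl appears on both sides: In(bowl, basket) is a valid distractor task, while In(bowl, bowl) is filtered out as a self-pair.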
Below, we visualize reset distributions with distractor objects for several tasks.
MESA-Gen is a pipeline for scalable task and demonstration generation, built by extending MimicGen. To build your own experiments using our framework:
Pretraining on large-scale robot datasets provides policies with robust priors, significantly boosting performance across MESA evaluation suites. Policies without pretraining often fail to generalize beyond in-distribution tasks. Above, we compare PG-FM and $\pi_0$, which share the same architecture and differ only in $\pi_0$'s pretraining.
PG-Bin, which uses RT-2-style action binning, fails across a variety of tasks (0.9% average success) despite sharing its VLM backbone with PG-FM (22.8%). This indicates that directly discretizing continuous action dimensions into language tokens is a poor fit for MESA's fine-grained control tasks.
We see that across the board, policies struggle to identify unseen object instances. For example, all three models above have seen the red bell pepper during training. However, they have difficulty identifying the out-of-distribution green bell pepper, with some even mistaking the mango for the bell pepper.
Despite having seen the same object and the same task during training, policies struggle to identify the target object when it is placed in a different spatial configuration. Here, the policies correctly pick out the tomato over a similar-looking object when the tomato is in an in-distribution position (left). When the tomato is instead moved to an out-of-distribution position on the right, all policies except $\pi_{0.5}$ still select the distractor item on the left.
We show that for all trained baselines, performance drops significantly when evaluated against MESA-Category, suggesting that open-vocabulary object generalization is not adequately addressed by current VLM pretraining. In the first table, we show the success rates over each of our evaluation suites. In the second table, we evaluate the language following rate, which we define precisely in the paper, of these baselines.
Success rates on MESA-Bench.
Language following rates on MESA-Bench.
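The paper gives the precise definition of the language following rate; as a loose illustration only, here is one simplified proxy consistent with the distractor-task logging described above. Under this (assumed, not the paper's) proxy, we look at rollouts that completed some task predicate, commanded or distractor, and measure the fraction that completed the commanded one.

```python
def language_following_rate(rollouts):
    """Simplified proxy for a language-following rate.

    NOT the paper's exact metric: among rollouts that completed *some*
    task (the commanded one or a distractor), return the fraction that
    completed the commanded one. Each rollout is a dict with boolean
    'success' and 'distractor_completed' fields, as logged during
    evaluation.
    """
    completed = [r for r in rollouts
                 if r["success"] or r["distractor_completed"]]
    if not completed:
        return 0.0
    return sum(r["success"] for r in completed) / len(completed)

# Hypothetical log: two successes, one distractor completion, one
# rollout that completed nothing.
rollouts = [
    {"success": True,  "distractor_completed": False},
    {"success": True,  "distractor_completed": False},
    {"success": False, "distractor_completed": True},
    {"success": False, "distractor_completed": False},
]
print(language_following_rate(rollouts))  # 2 of 3 completions → ~0.667
```

A metric of this shape separates "understood the command but failed to execute" (low success, few distractor completions) from "executed well but ignored the command" (distractor completions dominate), which is the distinction the language-following analysis targets.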
@article{mesa2026,
author = {Albert Wilcox and Frank Chang and Aishani Chakraborty and Nhi Nguyen and Jeremy A. Collins and Vaibhav Saxena and Benjamin Joffe and Siddharth Karamcheti and Animesh Garg},
title = {MESA: An Evaluation Framework for Compositional, Semantic, and Spatial Generalization in Robotics},
journal = {arXiv preprint},
year = {2026},
}
The authors would like to acknowledge the State of Georgia and the Agricultural Technology Research Program at Georgia Tech for supporting the work described in this paper.