The Perceptual Observatory

Characterizing robustness and grounding in multimodal large language models (MLLMs) through perception-first benchmarks and structured perturbations.

Robustness · Visual Grounding · Perceptual Illusions · Benchmark Suite
Arizona State University
*Equal contribution.
WACV 2026

Why it matters

MLLMs often scale language while reusing vision encoders. The Perceptual Observatory probes whether progress reflects true visual grounding or reliance on textual priors.

What it delivers

A benchmark framework that tests simple vision, local-to-global grounding, and robustness under pixel and diffusion-based perturbations.

Abstract

Recent advances in multimodal large language models (MLLMs) have yielded increasingly powerful models, yet their perceptual capacities remain poorly characterized. In practice, most model families scale language components while reusing nearly identical vision encoders (e.g., Qwen2.5-VL 3B/7B/72B), which raises the question of whether progress reflects genuine visual grounding or reliance on internet-scale textual world knowledge. Existing evaluation methods emphasize end-task accuracy, overlooking robustness, attribution fidelity, and reasoning under controlled perturbations. We present The PERCEPTUAL OBSERVATORY, a framework that characterizes MLLMs across complementary verticals: (i) simple vision tasks, such as face matching and text-in-vision comprehension; and (ii) local-to-global understanding, encompassing image matching, a grid pointing game, and attribute localization, which together test general visual grounding. Each vertical is instantiated with ground-truth datasets of faces and words, systematically perturbed through pixel-based augmentations and diffusion-based stylized illusions. The PERCEPTUAL OBSERVATORY moves beyond leaderboard accuracy to yield insights into how MLLMs preserve perceptual grounding and relational structure under perturbations, providing a principled foundation for analyzing the strengths and weaknesses of current and future models.

Introduction

MLLMs now anchor tasks from captioning and VQA to OCR-centric reasoning, yet strong leaderboard scores do not guarantee robust perception. Modern model families often scale language while reusing largely fixed vision encoders, raising questions about whether gains come from visual grounding or textual priors. The Perceptual Observatory targets this gap by stressing identity preservation, spatial invariance, and attribution fidelity under both in-distribution (ID) corruptions and diffusion-based out-of-distribution (OOD) stylized shifts.

Core Contributions

Property-Driven Evaluation

A principled framework that measures perceptual robustness and grounding beyond end-task accuracy, isolating visual behavior from language priors.

Unified Tasks + Metrics

Three tasks capture identity matching, spatial invariance, and attribute localization, paired with interpretable metrics for robustness, fairness, and reasoning effects.

Perturbation Pipeline

A scalable pipeline for ID augmentations and diffusion-based stylized illusions that preserve layout while changing appearance.

Perceptual Observatory

The Observatory is a property-driven evaluation suite spanning robustness, relational vision, in-context adaptation, and vision-language alignment. It probes whether models maintain identity, resist distractors, preserve spatial structure, and avoid over-reliance on language priors.

Figure 1: Framework overview of tasks, properties, and insights used to probe perceptual robustness and grounding.

Benchmark Datasets

Two canonical domains anchor the evaluation: CELEB (1,000 celebrity faces with boxes for eyes, nose, and mouth) and WORD (approximately 267K rendered words across 21 categories, yielding over 1M unique images with exact text boxes).

Each image is paired with 15 in-distribution augmentations (blur, jitter, noise, etc.) and 15 diffusion-based stylized illusions, plus the original, totaling 31K images per dataset.
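The counts above follow directly from the protocol: each base image yields 1 original + 15 ID + 15 OOD = 31 views. A quick sanity check, assuming roughly 1,000 evaluated base images per domain (the 31K-per-dataset figure implies this; the page states 1,000 CELEB faces explicitly, so for WORD this would be a sampled subset):

```python
# Per-image variants: original + in-distribution augs + stylized illusions.
ORIGINAL, ID_AUGS, OOD_ILLUSIONS = 1, 15, 15
variants_per_image = ORIGINAL + ID_AUGS + OOD_ILLUSIONS  # 31

# Assumed base-image count per domain (see lead-in; hypothetical for WORD).
base_images_per_domain = 1_000

per_dataset = base_images_per_domain * variants_per_image  # 31,000
total = 2 * per_dataset                                    # CELEB + WORD = 62,000

print(variants_per_image, per_dataset, total)  # 31 31000 62000
```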

CELEB: 1,000 faces
WORD: ~267K words
Variants: 15 ID + 15 OOD
Total: 62K images

Benchmark Tasks

All tasks are framed as in-context prediction. Each query pairs a support example with a prompted question, then tests identity preservation, spatial invariance, and attribute localization under ID augmentations and OOD stylized perturbations.

Task 1: Image Matching

The model sees a support image and chooses the correct match from four candidates. The query set includes a perturbed version of the same entity, near-neighbor distractors, and an out-of-context sample from the other domain (face vs. word).

Figure 2: Image Matching task.
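The four-way query described above can be sketched as a small construction routine. The helper names (`perturb`, the candidate mix) are illustrative assumptions, not the paper's exact sampling code:

```python
import random

def build_matching_query(support_id, perturb, near_neighbors, ooc_pool, rng=random):
    """Assemble a 4-candidate image-matching query.

    support_id     : identifier of the support image
    perturb        : callable producing a perturbed view of an image id
    near_neighbors : two visually similar distractors from the same domain
    ooc_pool       : ids from the other domain (face vs. word)
    Returns (candidates, answer_index).
    """
    candidates = [
        ("target", perturb(support_id)),  # perturbed version of the same entity
        ("near", near_neighbors[0]),      # near-neighbor distractor
        ("near", near_neighbors[1]),
        ("ooc", rng.choice(ooc_pool)),    # out-of-context sample
    ]
    rng.shuffle(candidates)
    answer = next(i for i, (tag, _) in enumerate(candidates) if tag == "target")
    return candidates, answer
```

Accuracy on this task is then simply the fraction of queries where the model's chosen index equals `answer`.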

Task 2: Grid Pointing Game

The support image is placed within a 2×2 grid that also contains distractors and out-of-context samples. The model must identify the correct grid location, probing spatial invariance across positions.

Figure 3: Grid Pointing Game.

Task 3: Attribute Localization

Given one-hint or full-hint attribute boxes on a support image, the model transfers those boxes to perturbed views, testing attribution fidelity and perceptual consistency.

Figure 4: Attribute Localization.
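One plausible way to score the box transfer described above is intersection-over-union (IoU) between predicted and ground-truth attribute boxes. The paper's exact retention metric may differ; treat this as an illustrative sketch:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) in pixel coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def attribution_fidelity(pred_boxes, gt_boxes, thresh=0.5):
    """Fraction of attribute boxes transferred with IoU >= thresh
    (a hypothetical instantiation of transfer retention)."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)
```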

Properties Measured

Each property corresponds to a behavioral goal and a simple quantitative metric. Robustness tracks identity preservation under ID and OOD shifts. Spatial invariance measures whether the correct grid location changes with position. Attribution fidelity evaluates how well bounding boxes transfer across perturbations. Fairness gaps compare subgroup performance (e.g., gender) under shifts. Scale consistency captures whether larger models improve monotonically within a family, while thinking superiority measures the impact of reasoning-enabled decoding.

Robustness

Accuracy drop from Org to ID/OOD perturbations for identity matching and grid pointing.

Spatial Invariance

Position gap across grid locations for Task 2 highlights layout sensitivity.

Attribution Fidelity

Transfer retention of attribute boxes under perturbations for Task 3.

Fairness Gap

Performance differences across subpopulations (e.g., gender) under shifts.

Scale Consistency

Trend of performance gains as parameter count increases within a model family.

Thinking Superiority

Delta between reasoning-enabled decoding and base decoding.
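Several of these properties reduce to simple deltas over per-condition accuracies. A minimal sketch, with hypothetical function names (the paper's exact formulas may differ):

```python
def robustness_drop(acc_org, acc_shift):
    """Accuracy drop from original inputs to ID/OOD perturbed inputs."""
    return acc_org - acc_shift

def position_gap(acc_by_cell):
    """Spatial invariance: spread of accuracy across grid positions,
    e.g. {"TL": 0.9, "TR": 0.8, "BL": 0.85, "BR": 0.7} -> 0.2."""
    return max(acc_by_cell.values()) - min(acc_by_cell.values())

def fairness_gap(acc_by_group):
    """Largest subgroup performance difference (e.g., across gender)."""
    return max(acc_by_group.values()) - min(acc_by_group.values())

def thinking_delta(acc_thinking, acc_base):
    """Gain (or loss) from reasoning-enabled decoding vs. base decoding."""
    return acc_thinking - acc_base
```

Lower drops and gaps indicate a model whose perception is stable across perturbations, positions, and subpopulations.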

Experiments & Setup

The evaluation spans three model families: Qwen2.5-VL (3B/7B/72B), Gemma-3 (4B/12B/27B), and InternVL3.5 (8B/14B, with instruct and thinking variants). Experiments use consistent decoding settings (temperature 0.2, top_p 0.95, top_k 32) to compare scale, reasoning mode, and vision-language alignment.
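The decoding settings above (temperature 0.2, top_p 0.95, top_k 32) correspond to standard temperature-scaled nucleus + top-k sampling. A minimal NumPy sketch of how those three knobs filter a next-token distribution (an illustration, not the authors' evaluation harness):

```python
import numpy as np

def sample_filter(logits, temperature=0.2, top_p=0.95, top_k=32):
    """Return the renormalized next-token distribution after applying
    temperature scaling, top-k truncation, and nucleus (top-p) filtering."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]      # tokens by descending probability
    keep = order[:top_k]                 # top-k truncation
    cum = np.cumsum(probs[keep])
    # Smallest prefix of the kept tokens whose cumulative mass >= top_p.
    nucleus = keep[: np.searchsorted(cum, top_p) + 1]

    out = np.zeros_like(probs)
    out[nucleus] = probs[nucleus]
    return out / out.sum()
```

With temperature 0.2 the distribution is sharply peaked, so in practice the nucleus usually contains only a handful of tokens even before the top-k cut.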

Perturbation Protocol

Each sample is evaluated on original, ID-augmented, and diffusion-based illusion variants. Metrics capture robustness deltas, spatial invariance gaps, and transfer retention.

ID Augmentations · OOD Illusions · Transfer Retention

Results & Discussion

Four consistent themes emerge from the evaluation. First, WORD tasks are near-saturated in distribution and retain accuracy under shifts, while CELEB tasks degrade sharply under stylized perturbations. Second, scaling mainly benefits OCR, pointing, and guided localization, but does not guarantee robustness. Third, language-side capacity drives most gains when vision encoders are fixed. Fourth, thinking-mode decoding improves clean performance while hurting transfer retention on faces.

Implications

The Observatory surfaces where MLLMs rely on textual priors and where perceptual grounding breaks under distribution shifts, providing a diagnostic lens for future model design.

Robustness Gaps · Grounding Failures · Scale Tradeoffs
Figure 5: Multidimensional insights for Task 3(a) across attributes, ID vs OOD robustness, and gender gap.
Table 1: Robustness of MLLMs for ID vs OOD across tasks and datasets.
Figure 6: Multidimensional insights for Task 2 across datasets (CELEB T_ood, WORD T_id/T_ood).
Figure 7: Celeb chain length vs. outcome. Histogram (log-y) of token length for cases where reasoning fixes vs. fails. Top: Org fixes vs. fails. Bottom: T_ood fixes vs. fails.

BibTeX

@misc{anvekar2025perceptualobservatorycharacterizingrobustness,
      title={The Perceptual Observatory: Characterizing Robustness and Grounding in MLLMs}, 
      author={Tejas Anvekar and Fenil Bardoliya and Pavan K. Turaga and Chitta Baral and Vivek Gupta},
      year={2025},
      eprint={2512.15949},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.15949}, 
}