TabReX‑Bench stress‑tests metric robustness and generalization. It spans multiple domains and applies planner‑driven perturbations at three difficulty tiers, so that metrics can be compared under controlled, progressively harder conditions.
We define two complementary perturbation groups: Data‑Preserving (Group 0) alters layout or presentation (e.g., row/header reordering, unit conversion, paraphrasing) without changing factual content; Data‑Altering (Group 1) introduces semantic modifications such as adding or deleting rows/columns, swapping numeric values, or injecting noise and misspellings. Each group is further stratified into three difficulty tiers (Easy, Medium, Hard), supporting controlled analyses of metric robustness as perturbation severity increases.
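To make the two-group, three-tier layout concrete, here is a minimal Python sketch. It is not the benchmark's implementation: the names `Group`, `Tier`, `reorder_rows`, `delete_row`, and `same_facts` are hypothetical, and tables are assumed to be lists of string rows with the header first. It shows one Group 0 perturbation (row reordering) that leaves the facts intact and one Group 1 perturbation (row deletion) that does not.

```python
from enum import Enum
from itertools import product
from typing import List

# Hypothetical representation: a table is a list of rows, header first.
Table = List[List[str]]


class Group(Enum):
    DATA_PRESERVING = 0  # layout or presentation changes only
    DATA_ALTERING = 1    # semantic changes to the content


class Tier(Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"


def reorder_rows(table: Table) -> Table:
    """Group 0 example: reverse the body rows; header and cells are unchanged."""
    header, body = table[0], table[1:]
    return [header] + body[::-1]


def delete_row(table: Table, index: int) -> Table:
    """Group 1 example: drop one body row (index counts body rows from 0)."""
    header, body = table[0], table[1:]
    return [header] + body[:index] + body[index + 1:]


def same_facts(a: Table, b: Table) -> bool:
    """Order-insensitive comparison of body rows under identical headers.

    This only detects equivalence up to row order; it would not recognize
    unit conversions or paraphrases as data-preserving.
    """
    return a[0] == b[0] and sorted(a[1:]) == sorted(b[1:])


# Every (group, tier) combination forms one cell of the evaluation grid.
GRID = list(product(Group, Tier))

table = [["city", "population"], ["Oslo", "700000"], ["Bergen", "290000"]]
preserved = reorder_rows(table)
altered = delete_row(table, 0)
```

Under these assumptions, a robust metric would score `preserved` close to the original table and penalize `altered`, with the gap expected to be harder to maintain as the tier moves from Easy to Hard.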
Benchmark overview: domains, perturbations, and difficulty tiers for robust evaluation.