TabReX: Tabular Referenceless eXplainable Evaluation

Accepted to ACL 2026 (Main Conference) 🎉
Arizona State University, Adobe Research

A referenceless, graph-grounded metric for evaluating generated tables with interpretable error traces and robust human alignment.

Why It Matters

Referenceless table evaluation

TabReX scores generated tables directly against source text, avoiding brittle one-reference matching.

How It Works

Graph reasoning with traces

It canonicalizes text and tables into KGs, aligns them, and exposes cell- and table-level error evidence.

What We Built

A benchmark for stress-testing metrics

TabReX-Bench spans 9,120 perturbed tables across 6 domains and 3 difficulty tiers.

Highlights

  • Human alignment: TabReX reaches ρS = 74.51 and τK = 64.24 on TabReX-Bench, outperforming other referenceless baselines.
  • Interpretability: The metric returns explicit cell- and table-level traces instead of a single opaque score.
  • Robustness: The benchmark probes schema edits, entity swaps, numeric errors, and harder perturbation tiers.
  • Actionable analysis: TabReX supports model and prompt comparisons with a tunable sensitivity-specificity trade-off.
9,120 Benchmark Tables
6 Domains
12 Perturbation Types
3 Difficulty Tiers

System Overview

From source text to scored tables

TabReX turns table evaluation into a structured reasoning problem instead of surface-form matching.

  • Canonicalization: Convert source text and generated tables into comparable KG structures.
  • Alignment: Use LLM-guided matching to connect entities, headers, and values across views.
  • Rubric-aware scoring: Produce interpretable structure and factuality scores with traceable failures.
TabReX pipeline overview

Canonical KG conversion, LLM-guided alignment, and rubric-based scoring together yield explainable evaluation traces.

Why is Evaluating Generated Tables Hard?

Evaluating tables generated by large language models is hard: text-only metrics (like ROUGE, BERTScore) miss the critical structural information of a table.

On the other hand, reference-based metrics (like Exact Match) are too rigid. They fail to generalize across different tasks or schemas, penalizing tables that are factually correct but structured differently from a single "gold" reference.

We need a metric that is referenceless, understands tabular structure, and is interpretable.

How TabReX Works: Evaluation as Graph Reasoning

TabReX approaches evaluation as graph reasoning. Instead of comparing to a reference table, we compare the generated table directly against the source text.

We convert the source text and generated table into canonical KGs, align them with an LLM‑guided matcher, and score them with rubric rules for structure and factual fidelity. The result is interpretable scores, a tunable sensitivity–specificity trade‑off, and cell‑ or table‑level error traces for diagnosis.

Step 01

Canonicalize

Convert source text and generated tables into aligned KG views so structure and semantics can be compared consistently.

Step 02

Align

Use LLM-guided matching to resolve entity, row, column, and value correspondences across the two graph representations.

Step 03

Score and Trace

Apply rubric rules to produce structure and fidelity scores, then surface explicit cell- and table-level failures for diagnosis.
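The three steps can be pictured as a short loop. The sketch below is a toy illustration, not the paper's implementation: exact string matching stands in for the LLM-guided aligner, and the score is just the fraction of aligned triplets, with the unmatched ones surfacing as cell-level error traces.

```python
# Hypothetical sketch of the Canonicalize -> Align -> Score-and-Trace loop.
# Real TabReX uses an LLM-guided matcher; exact match stands in here.

def align(source_triplets, table_triplets):
    """Pair each generated-table triplet with a source triplet, if any."""
    unmatched_source = list(source_triplets)
    pairs = []
    for t in table_triplets:
        match = next((s for s in unmatched_source if s == t), None)
        if match is not None:
            unmatched_source.remove(match)
        pairs.append((t, match))  # match is None -> traceable error
    return pairs

def score_and_trace(pairs):
    """Fidelity score plus the cell-level triplets that failed to align."""
    traces = [t for t, s in pairs if s is None]
    fidelity = 1 - len(traces) / max(len(pairs), 1)
    return fidelity, traces

source = [("Paris", "Country", "France"), ("Paris", "Population", "2.1M")]
table  = [("Paris", "Country", "France"), ("Paris", "Population", "3.5M")]
fidelity, traces = score_and_trace(align(source, table))
print(fidelity)  # 0.5
print(traces)    # [('Paris', 'Population', '3.5M')]
```

The key design point this mirrors is that the score and the explanation come from the same alignment: every point of lost fidelity is attached to a concrete triplet a user can inspect.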

Finding Errors: What TabReX Sees

TabReX produces cell-level and table-level error traces that reveal where and why a generated table deviates from the source. These traces arise directly from graph alignment conflicts and rubric checks, helping users diagnose specific issues.

We convert both the Source Text (g1) and the Generated Table (g2) into Knowledge Graphs, then align them to find mismatches. The graph triplets are in [Row Header, Column Header, Cell Value] format.
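As a minimal sketch of the triplet format (assuming the first column holds row headers; helper names are hypothetical, not from the paper):

```python
# Toy conversion of a table into [Row Header, Column Header, Cell Value]
# triplets, the format TabReX's graph views use.

def table_to_triplets(headers, rows):
    """headers: column names, first entry labels the row-header column;
    rows: lists whose first element is the row header."""
    triplets = []
    for row in rows:
        row_header, values = row[0], row[1:]
        for col_header, value in zip(headers[1:], values):
            triplets.append([row_header, col_header, value])
    return triplets

headers = ["Team", "Wins", "Losses"]
rows = [["Lakers", 45, 20], ["Celtics", 50, 15]]
print(table_to_triplets(headers, rows))
# -> [['Lakers', 'Wins', 45], ['Lakers', 'Losses', 20],
#     ['Celtics', 'Wins', 50], ['Celtics', 'Losses', 15]]
```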

Example 1

Schema Mismatch


How Can We Reliably Test a Metric?

TabReX‑Bench stress‑tests metric robustness and generalization. It spans multiple domains and planner‑driven perturbations across three difficulty tiers, enabling controlled, harder‑case analyses.

Domains

6

Perturbation Types

12

Difficulty Tiers

3

We define two complementary perturbation groups: Data‑Preserving (Group 0) alters layout or presentation (e.g., row/header reordering, unit conversion, paraphrasing) without changing factual content; Data‑Altering (Group 1) introduces semantic modifications such as adding or deleting rows/columns, swapping numeric values, or injecting noise and misspellings. Each group is further stratified into three difficulty tiers (Easy, Medium, Hard), supporting controlled analyses of metric robustness as perturbation severity increases.
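To make the two groups concrete, here is a toy illustration (not the benchmark's planner) of one perturbation from each: a data-preserving row reorder, which changes presentation but no facts, and a data-altering numeric swap, which a faithful metric should penalize.

```python
import random

def reorder_rows(rows, seed=0):
    """Group 0 (data-preserving): shuffle row order; facts unchanged."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled

def swap_numeric_values(rows, col, i, j):
    """Group 1 (data-altering): swap two cells within a numeric column."""
    altered = [list(r) for r in rows]
    altered[i][col], altered[j][col] = altered[j][col], altered[i][col]
    return altered

rows = [["Lakers", 45], ["Celtics", 50], ["Bulls", 38]]
print(sorted(reorder_rows(rows)) == sorted(rows))  # True: same facts
print(swap_numeric_values(rows, 1, 0, 1))
# -> [['Lakers', 50], ['Celtics', 45], ['Bulls', 38]]
```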

Benchmark Composition by Dataset

Dataset    | # Tables | # Perturb. / Table | Total Tables | Avg Rows | Avg Cols | Avg Cells | Avg Tokens | Avg # Numeric
FinQA      | 150      | 12                 | 1950         | 5.55     | 2.47     | 13.22     | 119.5      | 33.55
HiTabQA    | 150      | 12                 | 1950         | 20.08    | 5.60     | 115.1     | 434.8      | 102.7
ToTTo      | 150      | 12                 | 1950         | 24.97    | 5.49     | 142.2     | 361.3      | 69.63
OpenML med | 10       | 12                 | 120          | 4.20     | 11.58    | 47.94     | 210.9      | 23.80
MIMIC-IV   | 100      | 12                 | 1200         | 10.58    | 3.94     | 40.84     | 153.5      | 26.29
RotoWire   | 150      | 12                 | 1950         | 10.18    | 5.86     | 59.50     | 146.5      | 14.33
Total      | 710      | –                  | 9120         | –        | –        | –         | –          | –
TabReX‑Bench composition and protocol

Benchmark overview — domains, perturbations, and difficulty tiers for robust evaluation.

How Well Does TabReX Perform?

TabReX-Bench ρS 74.51

Strongest referenceless alignment with expert ranking on the benchmark.

TabReX-Bench τK 64.24

Improved ordinal agreement while preserving fine-grained rank distinctions.

Real-world Text-to-Table RBO 0.41

Best overlap with expert orderings beyond the controlled benchmark.

Alignment with Human Judgment on TabReX‑Bench

TabReX outperforms all traditional and referenceless LLM‑based metrics in aligning with expert rankings on TabReX‑Bench.

Metric          | ρS ↑  | τK ↑  | τw ↑  | RBO ↑ | ζF ↓  | πt ↓
Non-LLM Based (w/ Ref)
EM              | 45.88 | 39.38 | 39.51 | 43.33 | 47.49 | 58.40
chrF            | 41.76 | 34.55 | 31.61 | 39.39 | 49.26 | 1.64
ROUGE-L         | 31.18 | 26.69 | 22.56 | 37.65 | 55.94 | 1.97
BLEURT          | 44.66 | 37.64 | 36.09 | 39.57 | 48.09 | 0.77
BERTScore       | 36.21 | 30.66 | 27.96 | 38.11 | 53.25 | 0.92
H-Score         | 56.87 | 47.97 | 51.73 | 41.11 | 40.02 | 0.99
LLM-Based (w/ Ref)
P-Score         | 49.24 | 40.00 | 37.43 | 40.73 | 43.93 | 7.39
TabEval         | 49.01 | 39.22 | 34.21 | 41.11 | 43.06 | 0.63
TabXEval        | 80.27 | 72.37 | 66.87 | 47.54 | 20.94 | 45.33
(w/o Ref)
QuestEval       | 62.93 | 52.29 | 51.71 | 42.70 | 35.04 | 3.03
TabReX (Ours)   | 74.51 | 64.24 | 62.28 | 44.85 | 27.01 | 13.59

Higher values of Spearman’s rank correlation (ρS), Kendall’s tau (τK), weighted Kendall’s tau (τw), and Rank‑Biased Overlap (RBO) indicate stronger monotonic and positional agreement with human orderings (↑), while lower values of Spearman’s footrule distance (ζF) and tie ratio (πt) denote better rank stability and finer discriminative resolution (↓).
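For reference, the two headline measures can be computed from scratch; this sketch assumes rank vectors without ties (the paper additionally reports τw, RBO, ζF, and πt).

```python
# Spearman's rho and Kendall's tau between two rank vectors (no ties).

def spearman_rho(r1, r2):
    """rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(r1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d2 / (n * (n**2 - 1))

def kendall_tau(r1, r2):
    """(concordant - discordant) pairs over all n*(n-1)/2 pairs."""
    n = len(r1)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (r1[i] - r1[j]) * (r2[i] - r2[j])
            concordant += s > 0
            discordant += s < 0
    return (concordant - discordant) / (n * (n - 1) / 2)

human  = [1, 2, 3, 4, 5]
metric = [1, 3, 2, 4, 5]   # one adjacent swap
print(spearman_rho(human, metric))  # 0.9
print(kendall_tau(human, metric))   # 0.8
```

Note how a single adjacent swap costs more under τ than under ρ: τ counts every discordant pair directly, which is why the paper reports both.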

Do Ensembles Close the Gap?

Ensembles help but don’t close the gap. The best variant, LLM (harmonic), reaches ρS = 56.00 and τK = 46.93, still well behind TabReX (ρS = 74.51, τK = 64.24) and with higher rank dispersion; neither simple mean nor harmonic aggregation can match TabReX’s targeted, referenceless graph reasoning.

Metric          | ρS ↑  | τK ↑  | τw ↑  | RBO ↑ | ζF ↓  | πt ↓
Ensemble Baselines
Lex‑Emb (M)     | 38.43 | 32.65 | 30.17 | 38.52 | 52.15 | 0.49
Lex‑Emb (H)     | 29.80 | 24.00 | 19.68 | 37.65 | 55.04 | 0.63
LLM (M)         | 48.49 | 39.21 | 36.94 | 40.56 | 44.38 | 0.42
LLM (H)         | 56.00 | 46.93 | 50.64 | 40.95 | 40.63 | 0.42
Hybrid (M)      | 32.04 | 24.94 | 20.29 | 37.03 | 51.51 | 1.13
Hybrid (H)      | 54.03 | 42.71 | 32.61 | 42.31 | 40.11 | 1.13
TabReX (Ours)   | 74.51 | 64.24 | 62.28 | 44.85 | 27.01 | 13.59

Ensembles combine metric families using either simple Mean (M) or Harmonic (H) aggregation:

  • Lex‑Emb (lexical + embedding): EM, ROUGE‑L, BERTScore, BLEURT, chrF
  • LLM (LLM‑based): P‑Score, H‑Score
  • Hybrid (reference‑based + referenceless): TabXEval, QuestEval

All ensemble variants fall short of TabReX, which achieves the highest correlation with expert rankings and better rank stability.

Correlation on Real-World Text-to-Table Generation

Beyond TabReX‑Bench, we measure correlation with expert rankings on a real‑world text‑to‑table dataset. TabReX again achieves the highest alignment across correlation metrics, outperforming reference‑based and referenceless baselines.

Metric          | ρS ↑  | τK ↑ | RBO ↑
Standard Metrics (w/ Ref)
EM              | -0.01 | 0.01 | 0.33
ROUGE-L         | 0.33  | 0.25 | 0.29
BERTScore       | 0.26  | 0.19 | 0.38
BLEURT          | 0.29  | 0.20 | 0.39
chrF            | 0.25  | 0.19 | 0.36
LLM-Based (w/ Ref)
TabEval         | 0.25  | 0.19 | 0.36
TabXEval        | 0.24  | 0.17 | 0.37
(w/o Ref)
QuestEval       | 0.28  | 0.20 | 0.39
TabReX (Ours)   | 0.39  | 0.30 | 0.41

Sensitivity–Specificity Under Stress

A robust evaluation metric must remain reliable not only in standard (easy) settings but also under hard perturbations—tables with subtle misalignments, semantic shifts, or fine-grained numeric errors. We use TabReX-Bench to sample both easy and hard cases and compute true-positive (sensitivity) and true-negative (specificity) rates.
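The two rates can be sketched as follows, assuming the metric emits a score per table and a threshold decides accept/reject (the threshold and scores below are illustrative, not from the paper):

```python
# Sensitivity (clean tables accepted) and specificity (perturbed tables
# rejected) of a scoring metric at a given acceptance threshold.

def sensitivity_specificity(pos_scores, neg_scores, threshold):
    tp = sum(s >= threshold for s in pos_scores)  # clean table accepted
    tn = sum(s < threshold for s in neg_scores)   # perturbed table rejected
    return tp / len(pos_scores), tn / len(neg_scores)

pos = [0.92, 0.88, 0.95, 0.70]  # scores on faithful tables
neg = [0.40, 0.55, 0.81, 0.30]  # scores on hard perturbations
print(sensitivity_specificity(pos, neg, 0.75))  # (0.75, 0.75)
```

Raising the threshold trades sensitivity for specificity, which is exactly the tunable trade-off TabReX exposes.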

Sensitivity vs specificity trade-off

Metric Movements Across Difficulty Levels: Arrows show each metric’s shift from easy (blue) to hard (red) perturbations. The green region denotes the balanced ideal zone, and the dashed diagonal marks the optimal trade-off. TabReX stays near this zone, shifting in the right direction even on hard examples.

Model and Prompt Analysis

TabReX's rubric-aware scoring enables coarse to fine-grained comparisons across models (e.g., Gemma 27B vs. 4B) and prompting strategies (e.g., Zero-Shot, Chain-of-Thought, Map&Make), measured at both cell-level and table-level granularity.

Model-vs-prompt alignment analysis

Rubric-wise alignment across models and prompting strategies: The top row shows cell-level agreement, while the bottom row shows table-level agreement.

Key Insights:

  1. Model Size: Larger models (Gemma 27B) show clear gains in local, fine-grained (cell-level) fidelity but not necessarily global (table-level) coherence.
  2. Reasoning Style: Reasoning-oriented ("Thinking") variants improve precision on numeric/structural dimensions but can reduce semantic coverage, favoring accuracy over breadth.
  3. Prompt Design: The prompt strategy (especially Map&Make) contributes as much as model scale to achieving a balanced alignment across all rubric dimensions.

These results illustrate how a referenceless, explainable evaluation metric like TabReX can reveal the strengths and weaknesses of models and prompting strategies across hierarchical levels.


BibTeX

@misc{anvekar2025tabrextabularreferenceless,
  title={TabReX: Tabular Referenceless eXplainable Evaluation},
  author={Tejas Anvekar and Junha Park and Aparna Garimella and Vivek Gupta},
  year={2025},
  eprint={2512.15907},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.15907},
}