TabXEval

Tabular eXhaustive and eXplainable Evaluation

Vihang Pancholi, Jainit Bafna, Tejas Anvekar, Manish Shrivastava, Vivek Gupta

A rubric-driven framework for evaluating generated tables through structural alignment, semantic comparison, and interpretable discrepancy analysis.


Why It Matters

Evaluation needs more than string overlap

Traditional metrics often miss structural errors, schema shifts, and subtle content discrepancies in generated tables.

How It Works

Two-phase alignment and comparison

TabAlign handles structural alignment first, then TabCompare performs semantic and syntactic comparison under a detailed rubric.

What We Built

A benchmark for realistic table perturbations

TabXBench supports controlled evaluation with multi-domain tables and human-annotated judgments.

Highlights

Exhaustive rubric: combines structural descriptors with contextual quantification for comprehensive table comparison.
Explainable evaluation: separates structural alignment from semantic comparison to reveal where systems diverge.
Benchmark-backed analysis: TabXBench captures realistic perturbations across domains for robust metric study.
Human alignment: results show stronger qualitative and quantitative agreement with human judgments than conventional baselines.

2 Core Phases

Explainable Evaluation Traces

Multi-domain Benchmark Coverage

Human Judgment Alignment

System Overview

From table alignment to interpretable scoring

TabXEval turns table evaluation into a structured, rubric-aware process rather than a single opaque matching score.

TabAlign: align reference and candidate tables structurally before comparison.
TabCompare: inspect semantic and syntactic consistency at fine granularity.
Explainability: expose discrepancies clearly enough for qualitative analysis and diagnosis.

TabXEval aligns tables structurally first, then compares them semantically and syntactically for explainable evaluation.
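
The align-then-compare idea can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the TabXEval implementation: the table layout (a dict with "columns" and "rows"), the `Report` structure, and the functions `align` and `compare` are all hypothetical names introduced here, and matching is done by simple string normalization rather than by the rubric described above.

```python
from dataclasses import dataclass, field


def _norm(value):
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(str(value).lower().split())


@dataclass
class Report:
    """Discrepancies collected across both phases (hypothetical structure)."""
    missing_columns: list = field(default_factory=list)
    extra_columns: list = field(default_factory=list)
    missing_rows: list = field(default_factory=list)
    extra_rows: list = field(default_factory=list)
    cell_mismatches: list = field(default_factory=list)


def _match(ref_labels, cand_labels):
    """Pair labels that agree after normalization; list the unmatched ones."""
    ref = {_norm(x): x for x in ref_labels}
    cand = {_norm(x): x for x in cand_labels}
    pairs = [(ref[k], cand[k]) for k in ref if k in cand]
    missing = [ref[k] for k in ref if k not in cand]
    extra = [cand[k] for k in cand if k not in ref]
    return pairs, missing, extra


def align(reference, candidate):
    """Phase 1 (toy): structural alignment of columns and rows."""
    col_pairs, miss_c, extra_c = _match(reference["columns"], candidate["columns"])
    row_pairs, miss_r, extra_r = _match(reference["rows"], candidate["rows"])
    report = Report(missing_columns=miss_c, extra_columns=extra_c,
                    missing_rows=miss_r, extra_rows=extra_r)
    return col_pairs, row_pairs, report


def compare(reference, candidate, col_pairs, row_pairs, report):
    """Phase 2 (toy): compare only the cells that phase 1 aligned."""
    for ref_row, cand_row in row_pairs:
        for ref_col, cand_col in col_pairs:
            ref_val = reference["rows"][ref_row].get(ref_col)
            cand_val = candidate["rows"][cand_row].get(cand_col)
            if _norm(ref_val) != _norm(cand_val):
                report.cell_mismatches.append((ref_row, ref_col, ref_val, cand_val))
    return report


# Illustrative data only; not drawn from TabXBench.
reference = {
    "columns": ["Population", "Capital"],
    "rows": {
        "France": {"Population": "68M", "Capital": "Paris"},
        "Japan": {"Population": "125M", "Capital": "Tokyo"},
    },
}
candidate = {
    "columns": ["population", "Capital", "Currency"],
    "rows": {
        "france": {"population": "68M", "Capital": "Lyon", "Currency": "EUR"},
    },
}

col_pairs, row_pairs, report = align(reference, candidate)
report = compare(reference, candidate, col_pairs, row_pairs, report)
print(report)
```

Keeping the two phases separate means a missing row or an extra column is reported as a structural discrepancy, while a wrong cell value is reported only for cells that were successfully aligned. That separation is what makes the resulting report readable as a diagnosis rather than a single score.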

TabXBench: A benchmark for realistic table evaluation

TabXBench provides a controlled way to study how automatic metrics behave under realistic table perturbations. It supports broad, multi-domain evaluation with human-annotated judgments and enables sensitivity-specificity analysis across a diverse set of failures.

Benchmark composition and perturbation design for robust evaluation across table tasks and domains.
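
As a rough illustration of what a sensitivity-specificity analysis of a metric can look like, the Python sketch below applies the standard definitions to a thresholded score. The page does not describe how the analysis is actually computed, so the thresholding rule, the function name `sensitivity_specificity`, and the example numbers are assumptions made for illustration.

```python
def sensitivity_specificity(scores, has_error, threshold):
    """Treat a candidate as flagged when its metric score falls below threshold.

    scores:    metric score per candidate table (higher means closer to reference)
    has_error: True if the candidate contains a real error, False if the
               perturbation was benign (assumed labeling scheme)
    """
    tp = fn = tn = fp = 0
    for score, err in zip(scores, has_error):
        flagged = score < threshold
        if err and flagged:
            tp += 1      # real error, caught
        elif err:
            fn += 1      # real error, missed
        elif flagged:
            fp += 1      # benign change, wrongly penalized
        else:
            tn += 1      # benign change, correctly tolerated
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up scores for six candidate tables.
scores = [0.95, 0.40, 0.60, 0.85, 0.90, 0.30]
has_error = [False, True, False, True, False, True]
sens, spec = sensitivity_specificity(scores, has_error, threshold=0.7)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Under this framing, a metric with low sensitivity lets real errors through, while a metric with low specificity penalizes harmless changes such as reordered columns or reformatted values.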

How well does TabXEval align with human judgment?

Best Overall Spearman 0.44

Highest rank correlation with human judgments in the comparison set.

Strong Agreement Kendall 0.40

Higher Kendall's τ than every baseline in the comparison set; the next best, P-Score, reaches 0.27.

Explainable Scoring RBO 0.34

Highest rank-biased overlap (RBO) with human rankings in the comparison set, while retaining interpretable error analysis.

Correlation with Human Judgments

Metric	Spearman's ρ ↑	Kendall's τ ↑	W-Kendall's τ† ↑	RBO ↑	Spearman's Footrule ↓
EM	0.18	0.16	0.16	0.26	0.57
chrF	0.12	0.11	0.08	0.25	0.59
H-Score	0.14	0.11	0.09	0.28	0.51
BERTScore	0.19	0.15	0.13	0.25	0.57
ROUGE-L	0.21	0.18	0.40	0.27	0.53
BLEURT	0.29	0.25	0.25	0.27	0.51
TabEval	-0.04	-0.04	-0.03	0.23	0.63
P-Score	0.30	0.27	0.24	0.31	0.39
LLM rubric	0.23	0.16	0.17	0.28	0.47
LLM ranking	0.29	0.24	0.23	0.30	0.41
Multi-prompt	0.29	0.24	0.23	0.30	0.42
Multi-prompt + CoT	0.30	0.25	0.24	0.29	0.45
TabXEval	0.44	0.40	0.38	0.34	0.29
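
For readers unfamiliar with the column headers, the Python sketch below computes textbook versions of four of these measures for two rankings of the same items. It is illustrative only: the page does not state which variants were used, so the tie-free formulas, the footrule normalization, the extrapolated RBO form, and the persistence value p = 0.9 are all assumptions, and the weighted Kendall's τ column is not covered.

```python
from itertools import combinations


def _ranks(ranking):
    """Map each item to its position (0 = best). Assumes no ties."""
    return {item: pos for pos, item in enumerate(ranking)}


def spearman_rho(a, b):
    """Spearman's rho for two tie-free rankings of the same items."""
    ra, rb, n = _ranks(a), _ranks(b), len(a)
    d2 = sum((ra[x] - rb[x]) ** 2 for x in a)
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    ra, rb = _ranks(a), _ranks(b)
    concordant = discordant = 0
    for x, y in combinations(a, 2):
        if (ra[x] - ra[y]) * (rb[x] - rb[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def footrule(a, b):
    """Spearman's footrule, normalized by its maximum floor(n^2 / 2)."""
    ra, rb, n = _ranks(a), _ranks(b), len(a)
    return sum(abs(ra[x] - rb[x]) for x in a) / (n * n // 2)


def rbo(a, b, p=0.9):
    """Extrapolated rank-biased overlap for two equal-length rankings."""
    k = len(a)
    seen_a, seen_b, weighted = set(), set(), 0.0
    overlap = 0
    for d in range(1, k + 1):
        seen_a.add(a[d - 1])
        seen_b.add(b[d - 1])
        overlap = len(seen_a & seen_b)          # agreement at depth d
        weighted += (overlap / d) * p ** d
    return (overlap / k) * p ** k + ((1 - p) / p) * weighted


# Hypothetical rankings of five systems, best first.
human = ["a", "b", "c", "d", "e"]
metric = ["a", "c", "b", "d", "e"]

print(spearman_rho(human, metric))   # higher is better
print(kendall_tau(human, metric))    # higher is better
print(rbo(human, metric))            # higher is better
print(footrule(human, metric))       # lower is better
```

The first three measures reward agreement, so higher is better, while the footrule measures total rank displacement, so lower is better. RBO additionally weights agreement near the top of the ranking more heavily than agreement further down.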