TabReX: Tabular Referenceless eXplainable Evaluation

Why is Evaluating Generated Tables Hard?

Evaluating tables generated by large language models is hard: text-only metrics (like ROUGE, BERTScore) miss the critical structural information of a table.

On the other hand, reference-based metrics (like Exact Match) are too rigid. They fail to generalize across different tasks or schemas, penalizing tables that are factually correct but structured differently from a single "gold" reference.

We need a metric that is referenceless, understands tabular structure, and is interpretable.

How TabReX Works: Evaluation as Graph Reasoning

TabReX approaches evaluation as graph reasoning. Instead of comparing to a reference table, we compare the generated table directly against the source text.

We convert the source text and generated table into canonical KGs, align them with an LLM‑guided matcher, and score them with rubric rules for structure and factual fidelity. The result is interpretable scores, a tunable sensitivity–specificity trade‑off, and cell‑ or table‑level error traces for diagnosis.

TabReX pipeline overview

The TabReX Pipeline: 1. Canonical KG conversion, 2. LLM-guided alignment, 3. Rubric-aware scoring with explainable traces.
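
For intuition, here is a minimal Python sketch of the three stages under simplifying assumptions: exact-match set alignment stands in for the LLM-guided matcher, and the `Triplet`, `align`, and `score` names are illustrative stand-ins rather than TabReX's actual interfaces.

```python
from typing import NamedTuple

class Triplet(NamedTuple):
    entity: str    # row header
    relation: str  # column header
    value: str     # cell value

def table_to_triplets(header: list[str], rows: list[list[str]]) -> set[Triplet]:
    """Stage 1 (table side): flatten a table into [Row Header, Column Header, Cell Value] triplets."""
    return {
        Triplet(row[0], col, val)
        for row in rows
        for col, val in zip(header[1:], row[1:])
    }

def align(source: set[Triplet], generated: set[Triplet]):
    """Stage 2: naive exact alignment; TabReX uses an LLM-guided matcher to handle paraphrases and unit changes."""
    matched = source & generated
    missing = source - generated   # facts stated in the text but absent from the table
    extra = generated - source     # cells unsupported by the text
    return matched, missing, extra

def score(matched, missing, extra) -> dict[str, float]:
    """Stage 3: a toy fidelity score; the real rubric also covers structural and aggregation checks."""
    precision = len(matched) / max(len(matched) + len(extra), 1)
    recall = len(matched) / max(len(matched) + len(missing), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return {"precision": precision, "recall": recall, "fidelity": f1}
```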

Finding Errors: What TabReX Sees

TabReX produces cell-level and table-level error traces that reveal where and why a generated table deviates from the source. These traces arise directly from graph alignment conflicts and rubric checks, helping users diagnose specific issues.

We convert both the Source Text (g1) and the Generated Table (g2) into Knowledge Graphs, then align them to find mismatches. The graph triplets are in [Row Header, Column Header, Cell Value] format.

Example 1: Schema Mismatch

Source Text: "The project involved Tejas, Junha, and Aparna, who are Researchers. Vivek is an Engineer."

Generated Table (g2)
Name | Position | Team
Tejas | Researcher | AI
Junha | Researcher | AI
Aparna | Researcher | AI
Vivek | Engineer | AI
Source Graph (g1) Triplets
[Tejas, Position, Researcher]
[Junha, Position, Researcher]
[Aparna, Position, Researcher]
[Vivek, Position, Engineer]
Table Graph (g2) Triplets
[Tejas, Position, Researcher], [Tejas, Team, AI]
[Junha, Position, Researcher], [Junha, Team, AI]
[Aparna, Position, Researcher], [Aparna, Team, AI]
[Vivek, Position, Engineer], [Vivek, Team, AI]
Alignment Result

❌ Mismatch: Schema Mismatch

Alignment for 'Junha' (example):

  • [Junha/Junha, Position/Position, Researcher/Researcher] (Match)
  • [Junha/Junha, -/Team, -/AI] (Mismatch)

Reason: Relation 'Team' in g2 is an Extra Column not found in g1.

Final Error Trace
Name | Position | Team ❌
Tejas | Researcher | AI
Junha | Researcher | AI
Aparna | Researcher | AI
Vivek | Engineer | AI
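
A minimal sketch of how this extra-column conflict can be surfaced from the two triplet sets; the set-difference check below is a simplification of the LLM-guided alignment.

```python
def extra_relations(source_triplets, table_triplets):
    """Relations (column headers) used in the generated table but never attested in the source graph."""
    source_rels = {rel for _, rel, _ in source_triplets}
    table_rels = {rel for _, rel, _ in table_triplets}
    return table_rels - source_rels

g1 = {(name, "Position", role) for name, role in
      [("Tejas", "Researcher"), ("Junha", "Researcher"),
       ("Aparna", "Researcher"), ("Vivek", "Engineer")]}
g2 = g1 | {(name, "Team", "AI") for name in ["Tejas", "Junha", "Aparna", "Vivek"]}

print(extra_relations(g1, g2))  # {'Team'} -> flag the 'Team' column as a schema mismatch
```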

Example 2: Entity Swap

Source Text: "Sales for Q1 were $1000. Sales for Q2 were $1200."

Generated Table (g2)
Quarter | Sales
Q1 | $1200
Q2 | $1000
Source Graph (g1) Triplets
[Q1, Sales, $1000]
[Q2, Sales, $1200]
Table Graph (g2) Triplets
[Q1, Sales, $1200]
[Q2, Sales, $1000]
Alignment Result

❌ Mismatch: Factual Mismatch (Value Swap)

Alignment for 'Q1':

  • [Q1/Q1, Sales/Sales, $1000/$1200]

Reason: Values for aligned triplet do not match. (Difference: $200)

Final Error Trace
Quarter | Sales
Q1 | $1200 ❌
Q2 | $1000 ❌
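
A sketch of how a value conflict on an aligned (entity, relation) pair could be detected and quantified; the numeric parsing and difference reporting are illustrative, not TabReX's exact rubric.

```python
import re

def parse_number(value: str):
    """Best-effort numeric parse of a cell value such as '$1,200'."""
    m = re.search(r"-?\d[\d,]*\.?\d*", value)
    return float(m.group().replace(",", "")) if m else None

def value_conflicts(source_triplets, table_triplets):
    """Yield (entity, relation) pairs that align across graphs but disagree on the cell value."""
    src = {(e, r): v for e, r, v in source_triplets}
    gen = {(e, r): v for e, r, v in table_triplets}
    for key in src.keys() & gen.keys():
        if src[key] != gen[key]:
            a, b = parse_number(src[key]), parse_number(gen[key])
            diff = abs(a - b) if a is not None and b is not None else None
            yield key, src[key], gen[key], diff

g1 = {("Q1", "Sales", "$1000"), ("Q2", "Sales", "$1200")}
g2 = {("Q1", "Sales", "$1200"), ("Q2", "Sales", "$1000")}
for (entity, relation), want, got, diff in value_conflicts(g1, g2):
    print(f"{entity}.{relation}: source says {want}, table says {got} (difference: {diff})")
```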

Example 3: Aggregation Error

Source Text: "Team A has 3 members. Team B has 4 members. Total members: 7."

Generated Table (g2)
Team | Members
Team A | 3
Team B | 4
Total | 12
Source Graph (g1) Triplets
[Team A, Members, 3]
[Team B, Members, 4]
[Total, Members, 7]
Table Graph (g2) Triplets
[Team A, Members, 3]
[Team B, Members, 4]
[Total, Members, 12]
Alignment Result (Rubric Check)

❌ Mismatch: Aggregation Error

Alignment for 'Total':

  • [Total/Total, Members/Members, 7/12]

Reason: Aggregation rubric failed. Expected '7' (from g1 or SUM(3,4)), but g2 has '12'.

Final Error Trace
Team | Members
Team A | 3
Team B | 4
Total | 12 ❌
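
A sketch of an aggregation rubric that recomputes the total from the component rows; the `Total` row-header convention and function name are assumptions for this example.

```python
def check_total(table_triplets, relation="Members", total_entity="Total"):
    """Aggregation rubric: the reported total should equal the sum of the component rows."""
    parts = [float(v) for e, r, v in table_triplets if r == relation and e != total_entity]
    reported = next(float(v) for e, r, v in table_triplets if e == total_entity and r == relation)
    expected = sum(parts)
    return reported == expected, expected, reported

g2 = {("Team A", "Members", "3"), ("Team B", "Members", "4"), ("Total", "Members", "12")}
print(check_total(g2))  # (False, 7.0, 12.0) -> aggregation error on the 'Total' cell
```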

How Can We Reliably Test a Metric?

TabReX‑Bench stress‑tests metric robustness and generalization. It spans multiple domains and planner‑driven perturbations across three difficulty tiers, enabling controlled, harder‑case analyses.

  • Domains: 6
  • Perturbation Types: 12
  • Difficulty Tiers: 3

We define two complementary perturbation groups: Data‑Preserving (Group 0) alters layout or presentation (e.g., row/header reordering, unit conversion, paraphrasing) without changing factual content; Data‑Altering (Group 1) introduces semantic modifications such as adding or deleting rows/columns, swapping numeric values, or injecting noise and misspellings. Each group is further stratified into three difficulty tiers (Easy, Medium, Hard), supporting controlled analyses of metric robustness as perturbation severity increases.
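
A sketch of how one TabReX-Bench item could be represented; the field names and perturbation labels below paraphrase the description above and are not the benchmark's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    EASY = 1
    MEDIUM = 2
    HARD = 3

# Illustrative subset of perturbations; the benchmark defines 12 types in total.
# Group 0 is data-preserving (layout/presentation only); Group 1 is data-altering (semantic edits).
PERTURBATIONS = {
    0: ["row_reorder", "header_reorder", "unit_conversion", "paraphrase"],
    1: ["add_rows_or_columns", "delete_rows_or_columns", "swap_numeric_values", "inject_noise_misspellings"],
}

@dataclass
class BenchItem:
    dataset: str            # e.g., "FinQA", "RotoWire"
    source_text: str        # text the table should be grounded in
    table: list[list[str]]  # perturbed table, header row first
    group: int              # 0 = data-preserving, 1 = data-altering
    perturbation: str       # one of PERTURBATIONS[group]
    tier: Tier              # perturbation severity
```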

Benchmark Composition by Dataset

Dataset | # of Tables | # Perturb / Table | Total Tables | Avg Row | Avg Col | Avg Cell | Avg Tokens | Avg Num
FinQA | 150 | 12 | 1950 | 5.55 | 2.47 | 13.22 | 119.5 | 33.55
HiTabQA | 150 | 12 | 1950 | 20.08 | 5.60 | 115.1 | 434.8 | 102.7
ToTTo | 150 | 12 | 1950 | 24.97 | 5.49 | 142.2 | 361.3 | 69.63
OpenML med | 10 | 12 | 120 | 4.20 | 11.58 | 47.94 | 210.9 | 23.80
MIMIC-IV | 100 | 12 | 1200 | 10.58 | 3.94 | 40.84 | 153.5 | 26.29
RotoWire | 150 | 12 | 1950 | 10.18 | 5.86 | 59.50 | 146.5 | 14.33
Total | 710 | - | 9120 | - | - | - | - | -
TabReX‑Bench composition and protocol

Benchmark overview — domains, perturbations, and difficulty tiers for robust evaluation.

How Well Does TabReX Perform?

Alignment with Human Judgment on TabReX‑Bench

TabReX outperforms all traditional and referenceless LLM‑based metrics in aligning with expert rankings on TabReX‑Bench.

Metric | ρS ↑ | τK ↑ | τw ↑ | RBO ↑ | ζF ↓ | πt ↓
Non-LLM Based (w/ Ref)
EM | 45.88 | 39.38 | 39.51 | 43.33 | 47.49 | 58.40
chrF | 41.76 | 34.55 | 31.61 | 39.39 | 49.26 | 01.64
ROUGE-L | 31.18 | 26.69 | 22.56 | 37.65 | 55.94 | 01.97
BLEURT | 44.66 | 37.64 | 36.09 | 39.57 | 48.09 | 00.77
BERTScore | 36.21 | 30.66 | 27.96 | 38.11 | 53.25 | 00.92
H-Score | 56.87 | 47.97 | 51.73 | 41.11 | 40.02 | 00.99
LLM-Based (w/ Ref)
P-Score | 49.24 | 40.00 | 37.43 | 40.73 | 43.93 | 07.39
TabEval | 49.01 | 39.22 | 34.21 | 41.11 | 43.06 | 00.63
TabXEval | 80.27 | 72.37 | 66.87 | 47.54 | 20.94 | 45.33
(w/o Ref)
QuestEval | 62.93 | 52.29 | 51.71 | 42.70 | 35.04 | 03.03
TabReX (Ours) | 74.51 | 64.24 | 62.28 | 44.85 | 27.01 | 13.59

Higher values of Spearman’s rank correlation (ρS), Kendall’s tau (τK), weighted Kendall’s tau (τw), and Rank‑Biased Overlap (RBO) indicate stronger monotonic and positional agreement with human orderings (↑), while lower values of Spearman’s footrule distance (ζF) and tie ratio (πt) denote better rank stability and finer discriminative resolution (↓).
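
For reference, a sketch of how such agreement measures can be computed: ρS, τK, and τw come from SciPy, while RBO and the footrule are implemented directly (RBO is truncated at the list depth, so a lower-bound approximation), and the scores below are hypothetical.

```python
import numpy as np
from scipy.stats import kendalltau, rankdata, spearmanr, weightedtau

def footrule(scores_a, scores_b):
    """Spearman's footrule: total absolute rank displacement between two score lists (lower = closer)."""
    return float(np.abs(rankdata(scores_a) - rankdata(scores_b)).sum())

def rbo(ranking_a, ranking_b, p=0.9):
    """Rank-biased overlap of two ordered item lists, truncated at the shorter depth."""
    depth = min(len(ranking_a), len(ranking_b))
    total = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(ranking_a[:d]) & set(ranking_b[:d]))
        total += (overlap / d) * p ** (d - 1)
    return (1 - p) * total

def to_ranking(scores):
    """Item indices ordered from best to worst score."""
    return [i for i, _ in sorted(enumerate(scores), key=lambda t: -t[1])]

human  = [4.0, 3.0, 5.0, 1.0, 2.0]   # hypothetical expert ratings for five tables
metric = [3.8, 4.1, 4.7, 1.5, 1.9]   # hypothetical metric scores for the same tables

rho_s, _ = spearmanr(human, metric)
tau_k, _ = kendalltau(human, metric)
tau_w, _ = weightedtau(human, metric)
print(rho_s, tau_k, tau_w)                         # higher is better
print(rbo(to_ranking(human), to_ranking(metric)))  # higher is better
print(footrule(human, metric))                     # lower is better
```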

Do Ensembles Close the Gap?

Ensembles help but don’t close the gap. The best variant (LLM, harmonic) reaches ρS=0.56 and τK=0.47, still behind TabReX (ρS=0.75, τK=0.64) and with higher rank dispersion—simple averaging or harmonic averaging can’t match TabReX’s targeted, referenceless graph reasoning.

Metric | ρS ↑ | τK ↑ | τw ↑ | RBO ↑ | ζF ↓ | πt ↓
Ensemble Baselines
Lex-Emb (M) | 38.43 | 32.65 | 30.17 | 38.52 | 52.15 | 00.49
Lex-Emb (H) | 29.80 | 24.00 | 19.68 | 37.65 | 55.04 | 00.63
LLM (M) | 48.49 | 39.21 | 36.94 | 40.56 | 44.38 | 00.42
LLM (H) | 56.00 | 46.93 | 50.64 | 40.95 | 40.63 | 00.42
Hybrid (M) | 32.04 | 24.94 | 20.29 | 37.03 | 51.51 | 01.13
Hybrid (H) | 54.03 | 42.71 | 32.61 | 42.31 | 40.11 | 01.13
TabReX (Ours) | 74.51 | 64.24 | 62.28 | 44.85 | 27.01 | 13.59

Ensembles combine metric families using either simple Mean (M) or Harmonic (H) aggregation:

  • Lex‑Emb (lexical + embedding): EM, ROUGE‑L, BERTScore, BLEURT, chrF
  • LLM (LLM‑based): P‑Score, H‑Score
  • Hybrid (reference‑based + referenceless): TabXEval, QuestEval

All ensemble variants fall short of TabReX, which achieves the highest correlation with expert rankings and better rank stability.
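
A sketch of the (M)ean and (H)armonic aggregation used by these ensemble baselines; the per-metric scores below are hypothetical and assumed to be normalized to a comparable 0-1 range.

```python
from statistics import harmonic_mean, mean

def ensemble_score(scores: dict[str, float], how: str = "mean") -> float:
    """Aggregate one table's per-metric scores into a single ensemble score."""
    values = list(scores.values())
    return mean(values) if how == "mean" else harmonic_mean(values)

# Hypothetical normalized scores from the Lex-Emb family for one generated table.
lex_emb = {"EM": 0.40, "ROUGE-L": 0.55, "BERTScore": 0.80, "BLEURT": 0.62, "chrF": 0.58}
print(ensemble_score(lex_emb, "mean"))      # Lex-Emb (M)
print(ensemble_score(lex_emb, "harmonic"))  # Lex-Emb (H)
```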

Correlation on Real-World Text-to-Table Generation

Beyond TabReX‑Bench, we measure correlation with expert rankings on a real‑world text‑to‑table dataset. TabReX again achieves the highest alignment across correlation metrics, outperforming reference‑based and referenceless baselines.

Metric | ρS ↑ | τK ↑ | RBO ↑
Standard Metrics (w/ Ref)
EM | -0.01 | 0.01 | 0.33
ROUGE-L | 0.33 | 0.25 | 0.29
BERTScore | 0.26 | 0.19 | 0.38
BLEURT | 0.29 | 0.20 | 0.39
chrF | 0.25 | 0.19 | 0.36
LLM-Based (w/ Ref)
TabEval | 0.25 | 0.19 | 0.36
TabXEval | 0.24 | 0.17 | 0.37
(w/o Ref)
QuestEval | 0.28 | 0.20 | 0.39
TabReX (Ours) | 0.39 | 0.30 | 0.41

Sensitivity–Specificity Under Stress

A robust evaluation metric must remain reliable not only in standard (easy) settings but also under hard perturbations—tables with subtle misalignments, semantic shifts, or fine-grained numeric errors. We use TabReX-Bench to sample both easy and hard cases and compute true-positive (sensitivity) and true-negative (specificity) rates.
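
A sketch of the sensitivity/specificity computation, assuming each sampled table is labeled as genuinely erroneous (data-altering) or benign (data-preserving) and each metric score is thresholded into a flag; the labels and flags below are hypothetical.

```python
def sensitivity_specificity(is_erroneous: list[bool], flagged: list[bool]) -> tuple[float, float]:
    """
    Sensitivity (true-positive rate): fraction of erroneous tables the metric flags.
    Specificity (true-negative rate): fraction of benign tables the metric accepts.
    """
    tp = sum(e and f for e, f in zip(is_erroneous, flagged))
    fn = sum(e and not f for e, f in zip(is_erroneous, flagged))
    tn = sum((not e) and (not f) for e, f in zip(is_erroneous, flagged))
    fp = sum((not e) and f for e, f in zip(is_erroneous, flagged))
    return tp / max(tp + fn, 1), tn / max(tn + fp, 1)

# Hypothetical labels (data-altering?) and thresholded metric decisions for six tables.
labels = [True, True, False, False, True, False]
flags  = [True, False, False, False, True, True]
print(sensitivity_specificity(labels, flags))  # (0.667, 0.667) approximately
```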

Sensitivity vs specificity trade-off

Metric Movements Across Difficulty Levels: Arrows show each metric’s shift from easy (blue) to hard (red) perturbations. The green region denotes the balanced ideal zone, and the dashed diagonal marks the optimal trade-off. TabReX stays near this zone, maintaining the right direction even for hard examples.

Model and Prompt Analysis

TabReX's rubric-aware scoring enables coarse- to fine-grained comparisons across models (e.g., Gemma 27B vs. 4B) and prompting strategies (e.g., Zero-Shot, Chain-of-Thought, Map&Make), measured at both cell-level and table-level granularity.

Model-vs-prompt alignment analysis

Rubric-wise alignment across models and prompting strategies: The top row shows cell-level agreement, while the bottom row shows table-level agreement.

Key Insights:

  1. Model Size: Larger models (Gemma 27B) show clear gains in local, fine-grained (cell-level) fidelity but not necessarily global (table-level) coherence.
  2. Reasoning Style: Reasoning-oriented ("Thinking") variants improve precision on numeric/structural dimensions but can reduce semantic coverage, favoring accuracy over breadth.
  3. Prompt Design: The prompt strategy (especially Map&Make) contributes as much as model scale to achieving a balanced alignment across all rubric dimensions.

These results illustrate how a referenceless, explainable evaluation metric like TabReX can reveal the strengths and weaknesses of models and prompting strategies across hierarchical levels.


BibTeX