TabReX: Tabular Referenceless eXplainable Evaluation

Why is Evaluating Generated Tables Hard?

Evaluating tables generated by large language models is hard: text-only metrics (like ROUGE, BERTScore) miss the critical structural information of a table.

On the other hand, reference-based metrics (like Exact Match) are too rigid. They fail to generalize across different tasks or schemas, penalizing tables that are factually correct but structured differently from a single "gold" reference.

We need a metric that is referenceless, understands tabular structure, and is interpretable.

How TabReX Works: Evaluation as Graph Reasoning

TabReX approaches evaluation as graph reasoning. Instead of comparing to a reference table, we compare the generated table directly against the source text.

We convert the source text and generated table into canonical KGs, align them with an LLM‑guided matcher, and score them with rubric rules for structure and factual fidelity. The result is interpretable scores, a tunable sensitivity–specificity trade‑off, and cell‑ or table‑level error traces for diagnosis.

TabReX pipeline overview

The TabReX Pipeline: 1. Canonical KG conversion, 2. LLM-guided alignment, 3. Rubric-aware scoring with explainable traces.
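
For intuition, here is a minimal Python sketch of the three stages under simplifying assumptions: exact-match set alignment stands in for the LLM-guided matcher, and the `Triplet`, `align`, and `score` names are illustrative stand-ins rather than TabReX's actual interfaces.

```python
from typing import NamedTuple

class Triplet(NamedTuple):
    entity: str    # row header
    relation: str  # column header
    value: str     # cell value

def table_to_triplets(header: list[str], rows: list[list[str]]) -> set[Triplet]:
    """Stage 1 (table side): flatten a table into [Row Header, Column Header, Cell Value] triplets."""
    return {
        Triplet(row[0], col, val)
        for row in rows
        for col, val in zip(header[1:], row[1:])
    }

def align(source: set[Triplet], generated: set[Triplet]):
    """Stage 2: naive exact alignment; TabReX uses an LLM-guided matcher to handle paraphrases and unit changes."""
    matched = source & generated
    missing = source - generated   # facts stated in the text but absent from the table
    extra = generated - source     # cells unsupported by the text
    return matched, missing, extra

def score(matched, missing, extra) -> dict[str, float]:
    """Stage 3: a toy fidelity score; the real rubric also covers structural and aggregation checks."""
    precision = len(matched) / max(len(matched) + len(extra), 1)
    recall = len(matched) / max(len(matched) + len(missing), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return {"precision": precision, "recall": recall, "fidelity": f1}
```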

Finding Errors: What TabReX Sees

TabReX produces cell-level and table-level error traces that reveal where and why a generated table deviates from the source. These traces arise directly from graph alignment conflicts and rubric checks, helping users diagnose specific issues.

We convert both the Source Text (g1) and the Generated Table (g2) into Knowledge Graphs, then align them to find mismatches. The graph triplets are in [Row Header, Column Header, Cell Value] format.

Example 1: Schema Mismatch

Source Text: "The project involved Tejas, Junha, and Aparna, who are Researchers. Vivek is an Engineer."

Generated Table (g2)
Name | Position | Team
Tejas | Researcher | AI
Junha | Researcher | AI
Aparna | Researcher | AI
Vivek | Engineer | AI
Source Graph (g1) Triplets
[Tejas, Position, Researcher]
[Junha, Position, Researcher]
[Aparna, Position, Researcher]
[Vivek, Position, Engineer]
Table Graph (g2) Triplets
[Tejas, Position, Researcher], [Tejas, Team, AI]
[Junha, Position, Researcher], [Junha, Team, AI]
[Aparna, Position, Researcher], [Aparna, Team, AI]
[Vivek, Position, Engineer], [Vivek, Team, AI]
Alignment Result

❌ Mismatch: Schema Mismatch

Alignment for 'Junha' (example):

  • [Junha/Junha, Position/Position, Researcher/Researcher] (Match)
  • [Junha/Junha, -/Team, -/AI] (Mismatch)

Reason: Relation 'Team' in g2 is an Extra Column not found in g1.

Final Error Trace
Name | Position | Team ❌
Tejas | Researcher | AI
Junha | Researcher | AI
Aparna | Researcher | AI
Vivek | Engineer | AI
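
A minimal sketch of how this extra-column conflict can be surfaced from the two triplet sets; the set-difference check below is a simplification of the LLM-guided alignment.

```python
def extra_relations(source_triplets, table_triplets):
    """Relations (column headers) used in the generated table but never attested in the source graph."""
    source_rels = {rel for _, rel, _ in source_triplets}
    table_rels = {rel for _, rel, _ in table_triplets}
    return table_rels - source_rels

g1 = {(name, "Position", role) for name, role in
      [("Tejas", "Researcher"), ("Junha", "Researcher"),
       ("Aparna", "Researcher"), ("Vivek", "Engineer")]}
g2 = g1 | {(name, "Team", "AI") for name in ["Tejas", "Junha", "Aparna", "Vivek"]}

print(extra_relations(g1, g2))  # {'Team'} -> flag the 'Team' column as a schema mismatch
```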

Example 2: Entity Swap

Source Text: "Sales for Q1 were $1000. Sales for Q2 were $1200."

Generated Table (g2)
Quarter | Sales
Q1 | $1200
Q2 | $1000
Source Graph (g1) Triplets
[Q1, Sales, $1000]
[Q2, Sales, $1200]
Table Graph (g2) Triplets
[Q1, Sales, $1200]
[Q2, Sales, $1000]
Alignment Result

❌ Mismatch: Factual Mismatch (Value Swap)

Alignment for 'Q1':

  • [Q1/Q1, Sales/Sales, $1000/$1200]

Reason: Values for aligned triplet do not match. (Difference: $200)

Final Error Trace
Quarter | Sales
Q1 | $1200 ❌
Q2 | $1000 ❌
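
A sketch of how a value conflict on an aligned (entity, relation) pair could be detected and quantified; the numeric parsing and difference reporting are illustrative, not TabReX's exact rubric.

```python
import re

def parse_number(value: str):
    """Best-effort numeric parse of a cell value such as '$1,200'."""
    m = re.search(r"-?\d[\d,]*\.?\d*", value)
    return float(m.group().replace(",", "")) if m else None

def value_conflicts(source_triplets, table_triplets):
    """Yield (entity, relation) pairs that align across graphs but disagree on the cell value."""
    src = {(e, r): v for e, r, v in source_triplets}
    gen = {(e, r): v for e, r, v in table_triplets}
    for key in src.keys() & gen.keys():
        if src[key] != gen[key]:
            a, b = parse_number(src[key]), parse_number(gen[key])
            diff = abs(a - b) if a is not None and b is not None else None
            yield key, src[key], gen[key], diff

g1 = {("Q1", "Sales", "$1000"), ("Q2", "Sales", "$1200")}
g2 = {("Q1", "Sales", "$1200"), ("Q2", "Sales", "$1000")}
for (entity, relation), want, got, diff in value_conflicts(g1, g2):
    print(f"{entity}.{relation}: source says {want}, table says {got} (difference: {diff})")
```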

Example 3: Aggregation Error

Source Text: "Team A has 3 members. Team B has 4 members. Total members: 7."

Generated Table (g2)
Team | Members
Team A | 3
Team B | 4
Total | 12
Source Graph (g1) Triplets
[Team A, Members, 3]
[Team B, Members, 4]
[Total, Members, 7]
Table Graph (g2) Triplets
[Team A, Members, 3]
[Team B, Members, 4]
[Total, Members, 12]
Alignment Result (Rubric Check)

❌ Mismatch: Aggregation Error

Alignment for 'Total':

  • [Total/Total, Members/Members, 7/12]

Reason: Aggregation rubric failed. Expected '7' (from g1 or SUM(3,4)), but g2 has '12'.

Final Error Trace
Team | Members
Team A | 3
Team B | 4
Total | 12 ❌
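
A sketch of an aggregation rubric that recomputes the total from the component rows; the `Total` row-header convention and function name are assumptions for this example.

```python
def check_total(table_triplets, relation="Members", total_entity="Total"):
    """Aggregation rubric: the reported total should equal the sum of the component rows."""
    parts = [float(v) for e, r, v in table_triplets if r == relation and e != total_entity]
    reported = next(float(v) for e, r, v in table_triplets if e == total_entity and r == relation)
    expected = sum(parts)
    return reported == expected, expected, reported

g2 = {("Team A", "Members", "3"), ("Team B", "Members", "4"), ("Total", "Members", "12")}
print(check_total(g2))  # (False, 7.0, 12.0) -> aggregation error on the 'Total' cell
```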

How Can We Reliably Test a Metric?

TabReX‑Bench stress‑tests metric robustness and generalization. It spans multiple domains and planner‑driven perturbations across three difficulty tiers, enabling controlled, harder‑case analyses.

  • Domains: 6
  • Perturbation Types: 12
  • Difficulty Tiers: 3

We define two complementary perturbation groups: Data‑Preserving (Group 0) alters layout or presentation (e.g., row/header reordering, unit conversion, paraphrasing) without changing factual content; Data‑Altering (Group 1) introduces semantic modifications such as adding or deleting rows/columns, swapping numeric values, or injecting noise and misspellings. Each group is further stratified into three difficulty tiers (Easy, Medium, Hard), supporting controlled analyses of metric robustness as perturbation severity increases.
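
A sketch of how one TabReX-Bench item could be represented; the field names and perturbation labels below paraphrase the description above and are not the benchmark's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    EASY = 1
    MEDIUM = 2
    HARD = 3

# Illustrative subset of perturbations; the benchmark defines 12 types in total.
# Group 0 is data-preserving (layout/presentation only); Group 1 is data-altering (semantic edits).
PERTURBATIONS = {
    0: ["row_reorder", "header_reorder", "unit_conversion", "paraphrase"],
    1: ["add_rows_or_columns", "delete_rows_or_columns", "swap_numeric_values", "inject_noise_misspellings"],
}

@dataclass
class BenchItem:
    dataset: str            # e.g., "FinQA", "RotoWire"
    source_text: str        # text the table should be grounded in
    table: list[list[str]]  # perturbed table, header row first
    group: int              # 0 = data-preserving, 1 = data-altering
    perturbation: str       # one of PERTURBATIONS[group]
    tier: Tier              # perturbation severity
```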

Benchmark Composition by Dataset

Dataset | # of Tables | # Perturb / Table | Total Tables | Avg Row | Avg Col | Avg Cell | Avg Tokens | Avg Num
FinQA | 150 | 12 | 1950 | 5.55 | 2.47 | 13.22 | 119.5 | 33.55
HiTabQA | 150 | 12 | 1950 | 20.08 | 5.60 | 115.1 | 434.8 | 102.7
ToTTo | 150 | 12 | 1950 | 24.97 | 5.49 | 142.2 | 361.3 | 69.63
OpenML med | 10 | 12 | 120 | 4.20 | 11.58 | 47.94 | 210.9 | 23.80
MIMIC-IV | 100 | 12 | 1200 | 10.58 | 3.94 | 40.84 | 153.5 | 26.29
RotoWire | 150 | 12 | 1950 | 10.18 | 5.86 | 59.50 | 146.5 | 14.33
Total | 710 | - | 9120 | - | - | - | - | -
TabReX‑Bench composition and protocol

Benchmark overview — domains, perturbations, and difficulty tiers for robust evaluation.

How Well Does TabReX Perform?

Alignment with Human Judgment on TabReX‑Bench

TabReX outperforms all traditional and referenceless LLM‑based metrics in aligning with expert rankings on TabReX‑Bench.

Metric | ρS ↑ | τK ↑ | τw ↑ | RBO ↑ | ζF ↓ | πt ↓
Non-LLM Based (w/ Ref)
EM | 45.88 | 39.38 | 39.51 | 43.33 | 47.49 | 58.40
chrF | 41.76 | 34.55 | 31.61 | 39.39 | 49.26 | 01.64
ROUGE-L | 31.18 | 26.69 | 22.56 | 37.65 | 55.94 | 01.97
BLEURT | 44.66 | 37.64 | 36.09 | 39.57 | 48.09 | 00.77
BERTScore | 36.21 | 30.66 | 27.96 | 38.11 | 53.25 | 00.92
H-Score | 56.87 | 47.97 | 51.73 | 41.11 | 40.02 | 00.99
LLM-Based (w/ Ref)
P-Score | 49.24 | 40.00 | 37.43 | 40.73 | 43.93 | 07.39
TabEval | 49.01 | 39.22 | 34.21 | 41.11 | 43.06 | 00.63
TabXEval | 80.27 | 72.37 | 66.87 | 47.54 | 20.94 | 45.33
(w/o Ref)
QuestEval | 62.93 | 52.29 | 51.71 | 42.70 | 35.04 | 03.03
TabReX (Ours) | 74.51 | 64.24 | 62.28 | 44.85 | 27.01 | 13.59

Higher values of Spearman’s rank correlation (ρS), Kendall’s tau (τK), weighted Kendall’s tau (τw), and Rank‑Biased Overlap (RBO) indicate stronger monotonic and positional agreement with human orderings (↑), while lower values of Spearman’s footrule distance (ζF) and tie ratio (πt) denote better rank stability and finer discriminative resolution (↓).
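
For reference, a sketch of how such agreement measures can be computed: ρS, τK, and τw come from SciPy, while RBO and the footrule are implemented directly (RBO is truncated at the list depth, so a lower-bound approximation), and the scores below are hypothetical.

```python
import numpy as np
from scipy.stats import kendalltau, rankdata, spearmanr, weightedtau

def footrule(scores_a, scores_b):
    """Spearman's footrule: total absolute rank displacement between two score lists (lower = closer)."""
    return float(np.abs(rankdata(scores_a) - rankdata(scores_b)).sum())

def rbo(ranking_a, ranking_b, p=0.9):
    """Rank-biased overlap of two ordered item lists, truncated at the shorter depth."""
    depth = min(len(ranking_a), len(ranking_b))
    total = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(ranking_a[:d]) & set(ranking_b[:d]))
        total += (overlap / d) * p ** (d - 1)
    return (1 - p) * total

def to_ranking(scores):
    """Item indices ordered from best to worst score."""
    return [i for i, _ in sorted(enumerate(scores), key=lambda t: -t[1])]

human  = [4.0, 3.0, 5.0, 1.0, 2.0]   # hypothetical expert ratings for five tables
metric = [3.8, 4.1, 4.7, 1.5, 1.9]   # hypothetical metric scores for the same tables

rho_s, _ = spearmanr(human, metric)
tau_k, _ = kendalltau(human, metric)
tau_w, _ = weightedtau(human, metric)
print(rho_s, tau_k, tau_w)                         # higher is better
print(rbo(to_ranking(human), to_ranking(metric)))  # higher is better
print(footrule(human, metric))                     # lower is better
```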

Do Ensembles Close the Gap?

Ensembles help but don’t close the gap. The best variant (LLM, harmonic) reaches ρS=0.56 and τK=0.47, still behind TabReX (ρS=0.75, τK=0.64) and with higher rank dispersion—simple averaging or harmonic averaging can’t match TabReX’s targeted, referenceless graph reasoning.

Metric | ρS ↑ | τK ↑ | τw ↑ | RBO ↑ | ζF ↓ | πt ↓
Ensemble Baselines
Lex-Emb (M) | 38.43 | 32.65 | 30.17 | 38.52 | 52.15 | 00.49
Lex-Emb (H) | 29.80 | 24.00 | 19.68 | 37.65 | 55.04 | 00.63
LLM (M) | 48.49 | 39.21 | 36.94 | 40.56 | 44.38 | 00.42
LLM (H) | 56.00 | 46.93 | 50.64 | 40.95 | 40.63 | 00.42
Hybrid (M) | 32.04 | 24.94 | 20.29 | 37.03 | 51.51 | 01.13
Hybrid (H) | 54.03 | 42.71 | 32.61 | 42.31 | 40.11 | 01.13
TabReX (Ours) | 74.51 | 64.24 | 62.28 | 44.85 | 27.01 | 13.59

Ensembles combine metric families using either simple Mean (M) or Harmonic (H) aggregation:

  • Lex‑Emb (lexical + embedding): EM, ROUGE‑L, BERTScore, BLEURT, chrF
  • LLM (LLM‑based): P‑Score, H‑Score
  • Hybrid (reference‑based + referenceless): TabXEval, QuestEval

All ensemble variants fall short of TabReX, which achieves the highest correlation with expert rankings and better rank stability.
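
A sketch of the (M)ean and (H)armonic aggregation used by these ensemble baselines; the per-metric scores below are hypothetical and assumed to be normalized to a comparable 0-1 range.

```python
from statistics import harmonic_mean, mean

def ensemble_score(scores: dict[str, float], how: str = "mean") -> float:
    """Aggregate one table's per-metric scores into a single ensemble score."""
    values = list(scores.values())
    return mean(values) if how == "mean" else harmonic_mean(values)

# Hypothetical normalized scores from the Lex-Emb family for one generated table.
lex_emb = {"EM": 0.40, "ROUGE-L": 0.55, "BERTScore": 0.80, "BLEURT": 0.62, "chrF": 0.58}
print(ensemble_score(lex_emb, "mean"))      # Lex-Emb (M)
print(ensemble_score(lex_emb, "harmonic"))  # Lex-Emb (H)
```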

Correlation on Real-World Text-to-Table Generation

Beyond TabReX‑Bench, we measure correlation with expert rankings on a real‑world text‑to‑table dataset. TabReX again achieves the highest alignment across correlation metrics, outperforming reference‑based and referenceless baselines.

Metric | ρS ↑ | τK ↑ | RBO ↑
Standard Metrics (w/ Ref)
EM | -0.01 | 0.01 | 0.33
ROUGE-L | 0.33 | 0.25 | 0.29
BERTScore | 0.26 | 0.19 | 0.38
BLEURT | 0.29 | 0.20 | 0.39
chrF | 0.25 | 0.19 | 0.36
LLM-Based (w/ Ref)
TabEval | 0.25 | 0.19 | 0.36
TabXEval | 0.24 | 0.17 | 0.37
(w/o Ref)
QuestEval | 0.28 | 0.20 | 0.39
TabReX (Ours) | 0.39 | 0.30 | 0.41

Sensitivity–Specificity Under Stress

A robust evaluation metric must remain reliable not only in standard (easy) settings but also under hard perturbations—tables with subtle misalignments, semantic shifts, or fine-grained numeric errors. We use TabReX-Bench to sample both easy and hard cases and compute true-positive (sensitivity) and true-negative (specificity) rates.
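
A sketch of the sensitivity/specificity computation, assuming each sampled table is labeled as genuinely erroneous (data-altering) or benign (data-preserving) and each metric score is thresholded into a flag; the labels and flags below are hypothetical.

```python
def sensitivity_specificity(is_erroneous: list[bool], flagged: list[bool]) -> tuple[float, float]:
    """
    Sensitivity (true-positive rate): fraction of erroneous tables the metric flags.
    Specificity (true-negative rate): fraction of benign tables the metric accepts.
    """
    tp = sum(e and f for e, f in zip(is_erroneous, flagged))
    fn = sum(e and not f for e, f in zip(is_erroneous, flagged))
    tn = sum((not e) and (not f) for e, f in zip(is_erroneous, flagged))
    fp = sum((not e) and f for e, f in zip(is_erroneous, flagged))
    return tp / max(tp + fn, 1), tn / max(tn + fp, 1)

# Hypothetical labels (data-altering?) and thresholded metric decisions for six tables.
labels = [True, True, False, False, True, False]
flags  = [True, False, False, False, True, True]
print(sensitivity_specificity(labels, flags))  # (0.667, 0.667) approximately
```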

Sensitivity vs specificity trade-off

Metric Movements Across Difficulty Levels: Arrows show each metric’s shift from easy (blue) to hard (red) perturbations. The green region denotes the balanced ideal zone, and the dashed diagonal marks the optimal trade-off. TabReX stays near this zone, maintaining the right direction even for hard examples.

Model and Prompt Analysis

TabReX's rubric-aware scoring enables coarse- to fine-grained comparisons across models (e.g., Gemma 27B vs. 4B) and prompting strategies (e.g., Zero-Shot, Chain-of-Thought, Map&Make), measured at both cell-level and table-level granularity.

Model-vs-prompt alignment analysis

Rubric-wise alignment across models and prompting strategies: The top row shows cell-level agreement, while the bottom row shows table-level agreement.

Key Insights:

  1. Model Size: Larger models (Gemma 27B) show clear gains in local, fine-grained (cell-level) fidelity but not necessarily global (table-level) coherence.
  2. Reasoning Style: Reasoning-oriented ("Thinking") variants improve precision on numeric/structural dimensions but can reduce semantic coverage, favoring accuracy over breadth.
  3. Prompt Design: The prompt strategy (especially Map&Make) contributes as much as model scale to achieving a balanced alignment across all rubric dimensions.

These results illustrate how a referenceless, explainable evaluation metric like TabReX can reveal the strengths and weaknesses of models and prompting strategies across hierarchical levels.


BibTeX