Evaluating tables generated by large language models is hard: text-similarity metrics (like ROUGE and BERTScore) miss the structural information that makes a table a table.
Stricter reference-based metrics (like Exact Match) are too rigid: they fail to generalize across tasks and schemas, penalizing tables that are factually correct but structured differently from the single "gold" reference.
We need a metric that is referenceless, understands tabular structure, and is interpretable.
TabReX approaches evaluation as graph reasoning. Instead of comparing to a reference table, we compare the generated table directly against the source text.
We convert the source text and generated table into canonical KGs, align them with an LLM‑guided matcher, and score them with rubric rules for structure and factual fidelity. The result is interpretable scores, a tunable sensitivity–specificity trade‑off, and cell‑ or table‑level error traces for diagnosis.
The TabReX Pipeline: 1. Canonical KG conversion, 2. LLM-guided alignment, 3. Rubric-aware scoring with explainable traces.
TabReX produces cell-level and table-level error traces that reveal where and why a generated table deviates from the source. These traces arise directly from graph-alignment conflicts and rubric checks, helping users diagnose specific issues.
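To make the three stages concrete, here is a minimal typed sketch of how they compose. The class and callable names (`TableEvalPipeline`, `text_to_kg`, `table_to_kg`, `align_and_score`) are illustrative placeholders, not the actual TabReX implementation, whose converter, matcher, and scorer are LLM-guided.

```python
# A minimal, typed sketch of the three-stage flow; all helpers are placeholders.
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

Triplet = Tuple[str, str, str]   # (row header, column header, cell value)
KG = Set[Triplet]

@dataclass
class TableEvalPipeline:
    text_to_kg: Callable[[str], KG]        # stage 1a: source text -> canonical KG (g1)
    table_to_kg: Callable[[str], KG]       # stage 1b: generated table -> canonical KG (g2)
    align_and_score: Callable[[KG, KG], Dict[str, float]]  # stages 2-3: alignment + rubric scoring

    def evaluate(self, source_text: str, generated_table: str) -> Dict[str, float]:
        g1 = self.text_to_kg(source_text)
        g2 = self.table_to_kg(generated_table)
        return self.align_and_score(g1, g2)   # scores plus material for error traces
```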
We convert both the Source Text (g1) and the Generated Table (g2) into Knowledge Graphs, then align them to find mismatches. The graph triplets are in [Row Header, Column Header, Cell Value] format.
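The source-text side of this conversion is LLM-guided and not reproduced here. As a rough sketch of the table side only, the snippet below turns a Markdown table into such triplets, assuming the first column holds the row headers.

```python
# Sketch only: parse a Markdown table into (row header, column header, cell value)
# triplets, assuming the first column contains the row headers.
def markdown_table_to_triplets(md: str) -> set[tuple[str, str, str]]:
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in md.strip().splitlines()
        if not set(line.strip()) <= set("|-: ")   # drop the |---|---| separator row
    ]
    header, body = rows[0], rows[1:]
    return {
        (row[0], column, value)
        for row in body
        for column, value in zip(header[1:], row[1:])
    }

table_g2 = """
| Name | Position | Team |
|---|---|---|
| Tejas | Researcher | AI |
| Vivek | Engineer | AI |
"""
print(sorted(markdown_table_to_triplets(table_g2)))
# [('Tejas', 'Position', 'Researcher'), ('Tejas', 'Team', 'AI'),
#  ('Vivek', 'Position', 'Engineer'), ('Vivek', 'Team', 'AI')]
```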
Source Text: "The project involved Tejas, Junha, and Aparna, who are Researchers. Vivek is an Engineer."
Generated Table (g2):

| Name | Position | Team |
|---|---|---|
| Tejas | Researcher | AI |
| Junha | Researcher | AI |
| Aparna | Researcher | AI |
| Vivek | Engineer | AI |
g1 (from Source Text): [Tejas, Position, Researcher], [Junha, Position, Researcher], [Aparna, Position, Researcher], [Vivek, Position, Engineer]

g2 (from Generated Table): [Tejas, Position, Researcher], [Tejas, Team, AI], [Junha, Position, Researcher], [Junha, Team, AI], [Aparna, Position, Researcher], [Aparna, Team, AI], [Vivek, Position, Engineer], [Vivek, Team, AI]
❌ Mismatch: Schema Mismatch
Alignment for 'Junha' (example):
[Junha/Junha, Position/Position, Researcher/Researcher] (Match)
[Junha/Junha, -/Team, -/AI] (Mismatch)
Reason: Relation 'Team' in g2 is an Extra Column not found in g1.
| Name | Position | Team ❌ |
|---|---|---|
| Tejas | Researcher | AI |
| Junha | Researcher | AI |
| Aparna | Researcher | AI |
| Vivek | Engineer | AI |
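A minimal sketch of the schema check behind this trace, using the triplets from the example above: compare the sets of column headers (relations) present in each KG.

```python
# Relations (column headers) present in each KG; triplets copied from the example.
g1 = {("Tejas", "Position", "Researcher"), ("Junha", "Position", "Researcher"),
      ("Aparna", "Position", "Researcher"), ("Vivek", "Position", "Engineer")}
g2 = g1 | {("Tejas", "Team", "AI"), ("Junha", "Team", "AI"),
           ("Aparna", "Team", "AI"), ("Vivek", "Team", "AI")}

relations_g1 = {relation for _, relation, _ in g1}
relations_g2 = {relation for _, relation, _ in g2}

extra_columns = relations_g2 - relations_g1     # in the table but unsupported by the text
missing_columns = relations_g1 - relations_g2   # supported by the text but dropped
print(extra_columns, missing_columns)           # {'Team'} set()  -> Schema Mismatch
```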
Source Text: "Sales for Q1 were $1000. Sales for Q2 were $1200."
Generated Table (g2):

| Quarter | Sales |
|---|---|
| Q1 | $1200 |
| Q2 | $1000 |
g1 (from Source Text): [Q1, Sales, $1000], [Q2, Sales, $1200]

g2 (from Generated Table): [Q1, Sales, $1200], [Q2, Sales, $1000]
❌ Mismatch: Factual Mismatch (Value Swap)
Alignment for 'Q1':
[Q1/Q1, Sales/Sales, $1000/$1200] (Mismatch)
Reason: Values for the aligned triplet do not match (difference: $200).
| Quarter | Sales |
|---|---|
| Q1 | $1200 ❌ |
| Q2 | $1000 ❌ |
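A rough sketch of the value check on aligned triplets. The numeric normalization (stripping currency symbols) is an assumption made for illustration; the real fidelity checks are rubric- and LLM-guided.

```python
import re

def to_number(value: str) -> float:
    """Strip currency symbols and separators so '$1,200' compares as 1200.0."""
    return float(re.sub(r"[^0-9.\-]", "", value))

g1 = {("Q1", "Sales", "$1000"), ("Q2", "Sales", "$1200")}
g2 = {("Q1", "Sales", "$1200"), ("Q2", "Sales", "$1000")}

g2_lookup = {(row, rel): val for row, rel, val in g2}
for row, rel, val in sorted(g1):
    aligned = g2_lookup.get((row, rel))
    if aligned is not None and to_number(aligned) != to_number(val):
        diff = abs(to_number(aligned) - to_number(val))
        print(f"[{row}, {rel}] expected {val}, got {aligned} (difference: ${diff:.0f})")
# [Q1, Sales] expected $1000, got $1200 (difference: $200)
# [Q2, Sales] expected $1200, got $1000 (difference: $200)
```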
Source Text: "Team A has 3 members. Team B has 4 members. Total members: 7."
Generated Table (g2):

| Team | Members |
|---|---|
| Team A | 3 |
| Team B | 4 |
| Total | 12 |
g1 (from Source Text): [Team A, Members, 3], [Team B, Members, 4], [Total, Members, 7]

g2 (from Generated Table): [Team A, Members, 3], [Team B, Members, 4], [Total, Members, 12]
❌ Mismatch: Aggregation Error
Alignment for 'Total':
[Total/Total, Members/Members, 7/12] (Mismatch)
Reason: Aggregation rubric failed. Expected '7' (from g1 or SUM(3, 4)), but g2 has '12'.
| Team | Members |
|---|---|
| Team A | 3 |
| Team B | 4 |
| Total | 12 ❌ |
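A minimal sketch of the aggregation rubric from this example: the 'Total' row should agree with either the total stated in the source text or the sum of the component rows. Row names and values are copied from the example above.

```python
rows_g2 = {"Team A": 3, "Team B": 4, "Total": 12}   # generated table
source_total = 7                                    # total stated in the source text (g1)

component_sum = sum(v for k, v in rows_g2.items() if k != "Total")
reported_total = rows_g2["Total"]

if reported_total not in (source_total, component_sum):
    print(f"Aggregation rubric failed: expected {source_total} "
          f"(from g1 or SUM = {component_sum}), but g2 has {reported_total}")
# Aggregation rubric failed: expected 7 (from g1 or SUM = 7), but g2 has 12
```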
TabReX‑Bench stress‑tests metric robustness and generalization. It spans multiple domains and planner‑driven perturbations across three difficulty tiers, enabling controlled, harder‑case analyses.
Domains: 6 · Perturbation types: 12 · Difficulty tiers: 3
We define two complementary perturbation groups: Data‑Preserving (Group 0) alters layout or presentation (e.g., row/header reordering, unit conversion, paraphrasing) without changing factual content; Data‑Altering (Group 1) introduces semantic modifications such as adding or deleting rows/columns, swapping numeric values, or injecting noise and misspellings. Each group is further stratified into three difficulty tiers (Easy, Medium, Hard), supporting controlled analyses of metric robustness as perturbation severity increases.
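As a toy illustration of the two groups, with a table represented as a list of row dicts (the actual perturbations are planner-driven and far richer):

```python
import random

table = [{"Quarter": "Q1", "Sales": 1000}, {"Quarter": "Q2", "Sales": 1200}]

def shuffle_rows(rows, seed=0):
    """Group 0 (data-preserving): change layout; every fact survives."""
    out = list(rows)
    random.Random(seed).shuffle(out)
    return out

def swap_values(rows, column):
    """Group 1 (data-altering): swap a column's values between two rows."""
    out = [dict(r) for r in rows]
    out[0][column], out[1][column] = out[1][column], out[0][column]
    return out

print(shuffle_rows(table))          # same facts, different row order
print(swap_values(table, "Sales"))  # Q1 now carries Q2's sales figure
```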
| Dataset | # of Tables | # Perturb / Table | Total Tables | Avg Row | Avg Col | Avg Cell | Avg Tokens | Avg Num |
|---|---|---|---|---|---|---|---|---|
| FinQA | 150 | 12 | 1950 | 5.55 | 2.47 | 13.22 | 119.5 | 33.55 |
| HiTabQA | 150 | 12 | 1950 | 20.08 | 5.60 | 115.1 | 434.8 | 102.7 |
| ToTTo | 150 | 12 | 1950 | 24.97 | 5.49 | 142.2 | 361.3 | 69.63 |
| OpenML med | 10 | 12 | 120 | 4.20 | 11.58 | 47.94 | 210.9 | 23.80 |
| MIMIC-IV | 100 | 12 | 1200 | 10.58 | 3.94 | 40.84 | 153.5 | 26.29 |
| RotoWire | 150 | 12 | 1950 | 10.18 | 5.86 | 59.50 | 146.5 | 14.33 |
| Total | 710 | - | 9120 | - | - | - | - | - |
Benchmark overview — domains, perturbations, and difficulty tiers for robust evaluation.
TabReX outperforms all traditional and referenceless LLM‑based metrics in aligning with expert rankings on TabReX‑Bench.
| Metric | ρS ↑ | τK ↑ | τw ↑ | RBO ↑ | ζF ↓ | πt ↓ |
|---|---|---|---|---|---|---|
| Non-LLM Based (w/ Ref) | ||||||
| EM | 45.88 | 39.38 | 39.51 | 43.33 | 47.49 | 58.40 |
| chrF | 41.76 | 34.55 | 31.61 | 39.39 | 49.26 | 01.64 |
| ROUGE-L | 31.18 | 26.69 | 22.56 | 37.65 | 55.94 | 01.97 |
| BLEURT | 44.66 | 37.64 | 36.09 | 39.57 | 48.09 | 00.77 |
| BERTScore | 36.21 | 30.66 | 27.96 | 38.11 | 53.25 | 00.92 |
| H-Score | 56.87 | 47.97 | 51.73 | 41.11 | 40.02 | 00.99 |
| LLM-Based (w/ Ref) | ||||||
| P-Score | 49.24 | 40.00 | 37.43 | 40.73 | 43.93 | 07.39 |
| TabEval | 49.01 | 39.22 | 34.21 | 41.11 | 43.06 | 00.63 |
| TabXEval | 80.27 | 72.37 | 66.87 | 47.54 | 20.94 | 45.33 |
| (w/o Ref) | ||||||
| QuestEval | 62.93 | 52.29 | 51.71 | 42.70 | 35.04 | 03.03 |
| TabReX (Ours) | 74.51 | 64.24 | 62.28 | 44.85 | 27.01 | 13.59 |
Higher values of Spearman’s rank correlation (ρS), Kendall’s tau (τK), weighted Kendall’s tau (τw), and Rank‑Biased Overlap (RBO) indicate stronger monotonic and positional agreement with human orderings (↑), while lower values of Spearman’s footrule distance (ζF) and tie ratio (πt) denote better rank stability and finer discriminative resolution (↓).
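As an illustration of how these agreement statistics can be computed (RBO is omitted for brevity), the sketch below uses SciPy on hypothetical metric and expert scores; the footrule distance and tie ratio follow the definitions above in a simplified form.

```python
from itertools import combinations
from scipy.stats import kendalltau, rankdata, spearmanr, weightedtau

expert = [5, 4, 3, 2, 1]              # hypothetical expert ranking (best to worst)
metric = [0.9, 0.9, 0.6, 0.4, 0.1]    # hypothetical metric scores (note one tie)

rho, _ = spearmanr(metric, expert)        # rho_S
tau, _ = kendalltau(metric, expert)       # tau_K
tau_w, _ = weightedtau(metric, expert)    # tau_w (top-weighted)

r_metric, r_expert = rankdata(metric), rankdata(expert)
footrule = sum(abs(a - b) for a, b in zip(r_metric, r_expert))   # zeta_F, lower is better

pairs = list(combinations(metric, 2))
tie_ratio = sum(a == b for a, b in pairs) / len(pairs)           # pi_t, lower is better

print(rho, tau, tau_w, footrule, tie_ratio)
```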
Ensembles help but don’t close the gap. The best variant (LLM, harmonic) reaches ρS=0.56 and τK=0.47, still behind TabReX (ρS=0.75, τK=0.64) and with higher rank dispersion—simple averaging or harmonic averaging can’t match TabReX’s targeted, referenceless graph reasoning.
| Metric | ρS ↑ | τK ↑ | τw ↑ | RBO ↑ | ζF ↓ | πt ↓ |
|---|---|---|---|---|---|---|
| Ensemble Baselines | ||||||
| Lex‑Emb (M) | 38.43 | 32.65 | 30.17 | 38.52 | 52.15 | 00.49 |
| Lex‑Emb (H) | 29.80 | 24.00 | 19.68 | 37.65 | 55.04 | 00.63 |
| LLM (M) | 48.49 | 39.21 | 36.94 | 40.56 | 44.38 | 00.42 |
| LLM (H) | 56.00 | 46.93 | 50.64 | 40.95 | 40.63 | 00.42 |
| Hybrid (M) | 32.04 | 24.94 | 20.29 | 37.03 | 51.51 | 01.13 |
| Hybrid (H) | 54.03 | 42.71 | 32.61 | 42.31 | 40.11 | 01.13 |
| TabReX (Ours) | 74.51 | 64.24 | 62.28 | 44.85 | 27.01 | 13.59 |
Ensemble baselines combine metric families using either a simple mean (M) or a harmonic mean (H) of the family scores, as sketched below.
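A minimal sketch of the two aggregation schemes over illustrative per-family scores (the family names and numbers are made up for the example):

```python
from statistics import harmonic_mean, mean

# Illustrative per-family scores for one generated table (not real numbers).
family_scores = {"lexical": 0.42, "embedding": 0.35, "llm_judge": 0.61}

ensemble_m = mean(family_scores.values())            # "(M)" variants in the table
ensemble_h = harmonic_mean(family_scores.values())   # "(H)" variants in the table

print(f"mean = {ensemble_m:.3f}, harmonic = {ensemble_h:.3f}")
# The harmonic mean is pulled toward the weakest family, which is one reason the
# (M) and (H) rows above can diverge.
```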
All ensemble variants fall short of TabReX, which achieves the highest correlation with expert rankings and better rank stability.
Beyond TabReX‑Bench, we measure correlation with expert rankings on a real‑world text‑to‑table dataset. TabReX again achieves the highest alignment across correlation metrics, outperforming reference‑based and referenceless baselines.
| Metric | ρS ↑ | τK ↑ | RBO ↑ |
|---|---|---|---|
| Standard Metrics (w/ Ref) | |||
| EM | -0.01 | 0.01 | 0.33 |
| ROUGE-L | 0.33 | 0.25 | 0.29 |
| BERTScore | 0.26 | 0.19 | 0.38 |
| BLEURT | 0.29 | 0.20 | 0.39 |
| chrF | 0.25 | 0.19 | 0.36 |
| LLM-Based (w/ Ref) | |||
| TabEval | 0.25 | 0.19 | 0.36 |
| TabXEval | 0.24 | 0.17 | 0.37 |
| (w/o Ref) | |||
| QuestEval | 0.28 | 0.20 | 0.39 |
| TabReX (Ours) | 0.39 | 0.30 | 0.41 |
A robust evaluation metric must remain reliable not only in standard (easy) settings but also under hard perturbations—tables with subtle misalignments, semantic shifts, or fine-grained numeric errors. We use TabReX-Bench to sample both easy and hard cases and compute true-positive (sensitivity) and true-negative (specificity) rates.
Metric Movements Across Difficulty Levels: Arrows show each metric’s shift from easy (blue) to hard (red) perturbations. The green region denotes the balanced ideal zone, and the dashed diagonal marks the optimal trade-off. TabReX stays near this zone and keeps shifting in the right direction even on hard examples.
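A simplified sketch of this computation, assuming a fixed score threshold for deciding whether the metric flags a table as perturbed (the threshold and numbers below are illustrative):

```python
def sensitivity_specificity(scores, is_perturbed, threshold=0.5):
    """scores: metric scores; is_perturbed: 1 if the table was perturbed, else 0."""
    flagged = [s < threshold for s in scores]   # low score => metric flags the table
    tp = sum(f and y == 1 for f, y in zip(flagged, is_perturbed))
    tn = sum((not f) and y == 0 for f, y in zip(flagged, is_perturbed))
    sensitivity = tp / max(1, sum(is_perturbed))                       # true-positive rate
    specificity = tn / max(1, len(is_perturbed) - sum(is_perturbed))   # true-negative rate
    return sensitivity, specificity

scores = [0.20, 0.40, 0.90, 0.80, 0.30, 0.95]   # hypothetical metric scores
labels = [1,    1,    0,    0,    1,    0]      # 1 = perturbed (easy or hard), 0 = faithful
print(sensitivity_specificity(scores, labels))  # (1.0, 1.0) on this toy sample
```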
TabReX's rubric-aware scoring enables coarse to fine-grained comparisons across models (e.g., Gemma 27B vs. 4B) and prompting strategies (e.g., Zero-Shot, Chain-of-Thought, Map&Make), measured at both cell-level and table-level granularity.
Rubric-wise alignment across models and prompting strategies: The top row shows cell-level agreement, while the bottom row shows table-level agreement.
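As a rough illustration of the two granularities (the agreement definitions here are simplified stand-ins for the rubric-wise alignment actually reported):

```python
def cell_level_agreement(tables: list[list[bool]]) -> float:
    """Fraction of individual cell-level rubric decisions that agree."""
    cells = [cell for table in tables for cell in table]
    return sum(cells) / len(cells)

def table_level_agreement(tables: list[list[bool]]) -> float:
    """Fraction of tables whose cell-level decisions all agree."""
    return sum(all(table) for table in tables) / len(tables)

# Each inner list: per-cell agreement flags for one generated table (illustrative).
results = [[True, True, True], [True, False, True], [True, True]]
print(cell_level_agreement(results))   # 0.875
print(table_level_agreement(results))  # ~0.667
```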
These results illustrate how a referenceless, explainable evaluation metric like TabReX can reveal the strengths and weaknesses of models and prompting strategies across hierarchical levels.