In table question answering, getting the final answer right is only part of the story. In many real settings, we also need to verify which cells were actually used. Without this evidence trail, even accurate answers remain hard to trust.
TraceBack addresses this gap by tracing each answer back to supporting cells, including both explicit evidence and intermediate cells that appear in multi-step reasoning. This makes attribution more transparent at the cell level rather than only at coarse row or column granularity.
| Source | Cost (per MWh) | Efficiency (%) | Scalability (score) |
|---|---|---|---|
| Solar Power | 30-50 | 15-20 | 4 |
| Wind Power | 20-40 | 30-45 | 5 |
| Hydropower | 40-70 | 70-90 | 3 |
| Geothermal | 50-80 | 90+ | 2 |
Question: Among renewable sources costing ≤ 50/MWh and scalability ≥ 3, which is most efficient, and what is its efficiency?
Reasoning: Step 1: Cost Filter → Keep Solar, Wind. Step 2: Scalability Filter → Keep Solar, Wind. Step 3: Efficiency Selection → Choose Wind (30-45%).
Answer: Wind Power, 30-45% efficiency.
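To make the evidence trail concrete, here is a minimal Python sketch (the data structures and step names are illustrative, not TraceBack's implementation) that replays the three reasoning steps while logging every cell each step touches, so intermediate cells end up in the trace alongside the answer cells:

```python
# Minimal sketch: replay the three reasoning steps on the example table while
# recording every cell each step touches. "90+" is encoded as (90, 99) here.

table = [
    {"Source": "Solar Power", "Cost": (30, 50), "Efficiency": (15, 20), "Scalability": 4},
    {"Source": "Wind Power",  "Cost": (20, 40), "Efficiency": (30, 45), "Scalability": 5},
    {"Source": "Hydropower",  "Cost": (40, 70), "Efficiency": (70, 90), "Scalability": 3},
    {"Source": "Geothermal",  "Cost": (50, 80), "Efficiency": (90, 99), "Scalability": 2},
]

trace = []  # (step, row, column) triples: the cell-level evidence trail

# Step 1: cost filter. Every Cost cell is examined; keep ranges fully <= 50.
trace += [("cost_filter", r["Source"], "Cost") for r in table]
kept = [r for r in table if r["Cost"][1] <= 50]

# Step 2: scalability filter over the survivors (Solar, Wind).
trace += [("scalability_filter", r["Source"], "Scalability") for r in kept]
kept = [r for r in kept if r["Scalability"] >= 3]

# Step 3: efficiency selection. Compare the Efficiency cells of the survivors.
trace += [("efficiency_selection", r["Source"], "Efficiency") for r in kept]
best = max(kept, key=lambda r: r["Efficiency"][1])

print(best["Source"], best["Efficiency"])  # Wind Power (30, 45)
print(trace)  # includes intermediate Solar cells, not just the answer cells
```

Note that Solar's Cost and Scalability cells appear in the trace even though Solar is absent from the answer: these are exactly the intermediate cells a one-shot attributor tends to drop.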
Rather than attributing evidence in one shot, TraceBack decomposes the attribution process into coordinated agent steps so each stage remains interpretable and scalable.
Pipeline overview: a modular attribution workflow from relevant schema discovery to final cell-level grounding.
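For a rough sense of how such a staged workflow might be wired together, here is a Python sketch. The stage names echo the components ablated later in the paper (table pruning, query decomposition), but the signatures and stub bodies are assumptions rather than the paper's actual interface:

```python
# Schematic sketch of a modular attribution pipeline. The stage order and
# stub bodies are assumptions; in practice each stage would be an agent/LLM call.

Cell = tuple[str, str]  # (row label, column name)

def prune_table(table: list[dict], question: str) -> list[dict]:
    """Drop rows and columns judged irrelevant to the question (stub: keep all)."""
    return table

def decompose_query(question: str) -> list[str]:
    """Split a compound question into atomic subqueries (stub: no split)."""
    return [question]

def ground_cells(table: list[dict], subquery: str) -> set[Cell]:
    """Map one subquery to the cells that answer it (stub: empty)."""
    return set()

def attribute(table: list[dict], question: str) -> set[Cell]:
    """Run the stages in order and union the cell evidence across subqueries."""
    pruned = prune_table(table, question)
    evidence: set[Cell] = set()
    for subquery in decompose_query(question):
        evidence |= ground_cells(pruned, subquery)
    return evidence
```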
To evaluate attribution quality systematically, TraceBack introduces CITEBENCH, combining manual phrase-to-cell annotations with larger silver subsets. The benchmark is built from ToTTo, FetaQA, and AITQA, with a 1,500-example gold set for precise analysis.
| Gold Examples | Datasets | IAA (Cohen's kappa) |
|---|---|---|
| 1,500 | 3 | 0.72 |
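For reference, Cohen's kappa discounts raw annotator agreement by the agreement expected under chance; a minimal computation over binary cell labels is shown below (the paper's annotation protocol may differ):

```python
# Cohen's kappa for two annotators over binary labels (standard formula).

def cohens_kappa(a: list[int], b: list[int]) -> float:
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n    # observed agreement
    p_yes = (sum(a) / n) * (sum(b) / n)            # chance both label 1
    p_no = (1 - sum(a) / n) * (1 - sum(b) / n)     # chance both label 0
    p_e = p_yes + p_no                             # chance-expected agreement
    return (p_o - p_e) / (1 - p_e)

# e.g. two annotators marking whether each of 8 cells is evidence:
print(cohens_kappa([1, 1, 0, 0, 1, 0, 1, 0],
                   [1, 0, 0, 0, 1, 0, 1, 1]))  # 0.5
```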
| Dataset | Total | Gold Set (Human) | Silver Set (Original) |
|---|---|---|---|
| ToTTo | 7,700 | 500 | 7,200 |
| FetaQA | 3,004 | 500 | 2,504 |
| AITQA | 513 | 513 | - |
CITEBENCH statistics reported in the paper (Table 2).
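To make "phrase-to-cell annotations" concrete, the sketch below shows one shape such a gold record could take. Every field name here is hypothetical, not CITEBENCH's actual schema:

```python
# Minimal sketch of a phrase-to-cell annotated record. All field names are
# hypothetical; the benchmark's real schema may differ.

from dataclasses import dataclass, field

@dataclass
class AttributedExample:
    question: str
    answer: str
    table: list[list[str]]  # table as rows of cell strings, row 0 = header
    # each answer phrase mapped to the (row, col) indices that support it
    phrase_to_cells: dict[str, list[tuple[int, int]]] = field(default_factory=dict)

example = AttributedExample(
    question="Which source is most efficient at <= 50/MWh and scalability >= 3?",
    answer="Wind Power, 30-45% efficiency",
    table=[
        ["Source", "Cost (per MWh)", "Efficiency (%)", "Scalability (score)"],
        ["Wind Power", "20-40", "30-45", "5"],
    ],
    phrase_to_cells={"Wind Power": [(1, 0)], "30-45%": [(1, 2)]},
)
```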
Across ToTTo, FetaQA, and AITQA, the paper reports attribution quality at row, column, and cell levels. Viewing all three granularities together reveals how each method behaves from coarse evidence localization to fine-grained grounding.
| Method | ToTTo Precision | ToTTo Recall | FetaQA Precision | FetaQA Recall | AITQA Precision | AITQA Recall |
|---|---|---|---|---|---|---|
| **Row-Level Attribution** | | | | | | |
| SBERT + GPT-4o | 43.38 | 39.09 | 57.02 | 55.40 | 66.82 | 68.10 |
| GenerationPrograms | 50.00 | 31.28 | 75.00 | 40.06 | 59.68 | 71.90 |
| Few-shot + CoT | 17.30 | 12.80 | 34.40 | 30.10 | 4.30 | 4.30 |
| INSEQ | 37.50 | 74.50 | 56.40 | 84.70 | 31.20 | 97.60 |
| TraceBack-Lite | 77.00 | 62.60 | 83.00 | 79.20 | 91.30 | 92.90 |
| TraceBack | 71.19 | 80.38 | 94.30 | 93.36 | 96.65 | 97.12 |
| **Column-Level Attribution** | | | | | | |
| SBERT + GPT-4o | 90.51 | 85.91 | 94.67 | 84.77 | 46.68 | 97.14 |
| GenerationPrograms | 71.81 | 24.71 | 78.42 | 20.00 | 49.86 | 83.81 |
| Few-shot + CoT | 92.70 | 76.40 | 95.80 | 67.30 | 47.50 | 94.30 |
| INSEQ | 73.10 | 74.10 | 82.60 | 65.50 | 34.70 | 99.98 |
| TraceBack-Lite | 88.60 | 48.90 | 94.70 | 54.80 | 79.20 | 85.30 |
| TraceBack | 91.50 | 77.64 | 96.39 | 83.07 | 54.09 | 98.09 |
| **Cell-Level Attribution** | | | | | | |
| SBERT + GPT-4o | 39.78 | 36.97 | 52.08 | 46.16 | 31.96 | 66.67 |
| GenerationPrograms | 29.35 | 13.61 | 50.78 | 15.74 | 30.32 | 67.14 |
| Few-shot + CoT | 14.50 | 10.10 | 27.40 | 17.80 | 2.20 | 4.30 |
| INSEQ | 42.70 | 53.80 | 56.50 | 44.20 | 19.20 | 97.10 |
| TraceBack-Lite | 73.80 | 39.60 | 75.40 | 42.30 | 73.70 | 80.60 |
| TraceBack | 74.20 | 67.05 | 89.81 | 78.84 | 52.37 | 95.22 |
Precision and recall for row-, column-, and cell-level attribution on ToTTo, FetaQA, and AITQA. Blocks correspond to different granularities. In the paper's original table, bold marks the best and underline the second-best method for each dataset–granularity pair.
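The precision and recall figures above follow the standard set-overlap definitions over evidence units, whatever the granularity. A minimal sketch (the paper's exact matching rules may differ):

```python
# Set-overlap precision/recall over evidence units (rows, columns, or cells).

def precision_recall(pred: set, gold: set) -> tuple[float, float]:
    overlap = len(pred & gold)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall

# Cell-level example: predicted vs. gold cells as (row, col) index pairs.
pred_cells = {(1, 0), (1, 2), (2, 2)}
gold_cells = {(1, 0), (1, 2)}
print(precision_recall(pred_cells, gold_cells))  # (0.666..., 1.0)

# Row-level is the same computation after projecting cells onto rows.
pred_rows = {r for r, _ in pred_cells}
gold_rows = {r for r, _ in gold_cells}
print(precision_recall(pred_rows, gold_rows))    # (0.5, 1.0)
```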
Removing query decomposition produces the largest quality drop across all three datasets, which suggests that surfacing intermediate reasoning steps is central to accurate cell-level grounding. Table pruning matters too, though less uniformly: removing it costs precision and recall on ToTTo and FetaQA, while on AITQA precision actually rises without it.
| Method | ToTTo Precision | ToTTo Recall | FetaQA Precision | FetaQA Recall | AITQA Precision | AITQA Recall |
|---|---|---|---|---|---|---|
| **Variants of TraceBack** | | | | | | |
| Query Decomposition before Table Pruning | 63.00 | 60.15 | 71.23 | 75.10 | 85.38 | 92.89 |
| Passing One Subquery at a Time | 61.32 | 60.10 | 69.67 | 73.33 | 85.95 | 92.00 |
| TraceBack | 74.20 | 67.05 | 89.81 | 78.84 | 52.37 | 95.22 |
| **Ablation Study on TraceBack** | | | | | | |
| TraceBack | 74.20 | 67.05 | 89.81 | 78.84 | 52.37 | 95.22 |
| - w/o Table Pruning | 73.14 | 60.10 | 72.89 | 75.78 | 86.33 | 93.10 |
| - w/o Query Decomposition | 56.00 | 56.30 | 65.40 | 68.32 | 47.22 | 91.80 |
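To see why the decomposition step carries so much weight, consider how a compound question breaks into groundable subqueries. The illustration below is hand-written; in practice an LLM would produce the subqueries, and their exact phrasing is invented here:

```python
# Illustrative query decomposition for the running example. The subquery
# wording is invented; an LLM performs this step in practice.

question = ("Among renewable sources costing <= 50/MWh and scalability >= 3, "
            "which is most efficient, and what is its efficiency?")

subqueries = [
    "Which sources cost <= 50/MWh?",           # grounds the Cost cells
    "Of those, which have scalability >= 3?",  # grounds the Scalability cells
    "Of those, which is most efficient?",      # grounds the Efficiency cells
]

# Grounding each subquery separately surfaces the intermediate evidence
# (e.g., Solar's Cost and Scalability cells) that one-shot attribution
# over the full question tends to miss.
```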
TraceBack also introduces FairScore, a referenceless metric that compares atomic facts extracted from predicted cells with atomic facts extracted from answers. This supports scalable evaluation when gold cell labels are limited.
FairScore preserves relative method ordering and separates strong and weak attribution models, while keeping evaluation practical on larger silver subsets.
FairScore concept: convert cell evidence and answers into atomic facts, then align them to estimate precision and recall.
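A minimal sketch of that idea follows, with stubbed fact extraction and exact-match alignment standing in for whatever model-based extraction and softer matching the paper uses:

```python
# Minimal sketch of the FairScore idea: extract atomic facts from predicted
# cells and from the answer, align them, and read off precision/recall.
# Extraction and alignment are stubs here.

def facts_from_cells(cells: list[tuple[str, str, str]]) -> set[str]:
    """One atomic fact per predicted cell: a 'row | column | value' triple."""
    return {f"{row} | {col} | {value}" for row, col, value in cells}

def facts_from_answer(answer_facts: list[str]) -> set[str]:
    """Stub: assume the answer has already been split into atomic facts."""
    return set(answer_facts)

def fairscore(pred_cells, answer_facts) -> tuple[float, float]:
    """Referenceless precision/recall via fact alignment, no gold cells needed."""
    cell_facts = facts_from_cells(pred_cells)
    ans_facts = facts_from_answer(answer_facts)
    matched = cell_facts & ans_facts  # exact match as a stand-in for alignment
    precision = len(matched) / len(cell_facts) if cell_facts else 0.0
    recall = len(matched) / len(ans_facts) if ans_facts else 0.0
    return precision, recall

# Example: two predicted cells, one of which backs a fact stated in the answer.
cells = [("Wind Power", "Efficiency", "30-45"), ("Wind Power", "Cost", "20-40")]
answer = ["Wind Power | Efficiency | 30-45"]
print(fairscore(cells, answer))  # (0.5, 1.0)
```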
Systematic analysis of the FairScore metric (precision / recall per dataset):
| Method | ToTTo (P / R) | FetaQA (P / R) | AITQA (P / R) |
|---|---|---|---|
| Few-shot + CoT | 30.81 / 13.25 | 15.67 / 17.73 | 11.73 / 6.69 |
| SBERT + GPT-4o | 20.51 / 16.84 | 20.05 / 21.87 | 4.51 / 5.47 |
| INSEQ | 16.85 / 18.95 | 15.48 / 17.94 | 10.99 / 16.53 |
| GenerationPrograms | 14.96 / 11.32 | 16.04 / 14.13 | 7.62 / 3.77 |
| TraceBack-Lite | 53.87 / 40.20 | 51.88 / 45.12 | 63.44 / 55.13 |
| TraceBack (full) | 56.89 / 48.39 | 63.73 / 64.15 | 42.20 / 49.93 |