TraceBack

Multi-Agent Decomposition for Fine-Grained Table Attribution

Why Do Correct Answers Still Feel Untrustworthy in Table QA?

In table question answering, getting the final answer right is only part of the story. In many real settings, we also need to verify which cells were actually used. Without this evidence trail, even accurate answers remain hard to trust.

TraceBack addresses this gap by tracing each answer back to supporting cells, including both explicit evidence and intermediate cells that appear in multi-step reasoning. This makes attribution more transparent at the cell level rather than only at coarse row or column granularity.

A Concrete Example of Step-by-Step Cell Attribution

| Source      | Cost (per MWh) | Efficiency (%) | Scalability |
|-------------|----------------|----------------|-------------|
| Solar Power | 30-50          | 15-20          | 4           |
| Wind Power  | 20-40          | 30-45          | 5           |
| Hydropower  | 40-70          | 70-90          | 3           |
| Geothermal  | 50-80          | 90+            | 2           |

Question: Among renewable sources costing ≤ 50/MWh with scalability ≥ 3, which is the most efficient, and what is its efficiency?

Reasoning:

  1. S1, Cost Filter (≤ 50/MWh) → keep Solar, Wind.
  2. S2, Scalability Filter (≥ 3) → keep Solar, Wind.
  3. S3, Efficiency Selection (max efficiency) → choose Wind (30-45%).

Answer: Wind Power, 30-45% efficiency.
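
To make the three steps concrete, here is a minimal Python sketch that reproduces the example above. The tuple representation of ranges and the choice to filter on the upper bound of each cost range are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the worked example; range handling (filtering on the
# upper bound of each cost range) is an assumption made for illustration.
rows = [
    # (source, cost range per MWh, efficiency range %, scalability)
    ("Solar Power", (30, 50), (15, 20), 4),
    ("Wind Power",  (20, 40), (30, 45), 5),
    ("Hydropower",  (40, 70), (70, 90), 3),
    ("Geothermal",  (50, 80), (90, 90), 2),
]

# Step 1 (S1): cost filter, keep rows whose cost range stays within 50/MWh.
s1 = [r for r in rows if r[1][1] <= 50]

# Step 2 (S2): scalability filter, keep rows with scalability >= 3.
s2 = [r for r in s1 if r[3] >= 3]

# Step 3 (S3): efficiency selection, pick the row with the highest efficiency.
answer = max(s2, key=lambda r: r[2][1])

print(answer[0], answer[2])  # Wind Power (30, 45)
```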

How Does TraceBack Reconstruct a Reasoning Chain Step by Step?

Rather than attributing evidence in one shot, TraceBack decomposes the attribution process into coordinated agent steps so each stage remains interpretable and scalable.

  1. Column relevance identification: find columns needed for the answer, including implicit columns for intermediate reasoning.
  2. Evidence span extraction: prune the table to rows that still preserve the reasoning path.
  3. Query decomposition: split the original question into sub-questions aligned with intermediate steps.
  4. Sub-query attribution: map each sub-question to minimal evidence cells.
  5. Final attribution: merge evidence into a coherent final set of supporting cells.
TraceBack multi-agent pipeline

Pipeline overview: a modular attribution workflow from relevant schema discovery to final cell-level grounding.
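
The sketch below shows one way these five stages could be chained as cooperating agents. Everything in it is a hypothetical scaffold for illustration: the `call_llm` helper, the `AttributionState` container, the agent function names, and the prompt strings are placeholders, not the paper's actual implementation.

```python
# Hypothetical sketch of the five-stage attribution pipeline.
from dataclasses import dataclass, field


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (any API client); assumed, not real."""
    raise NotImplementedError


@dataclass
class AttributionState:
    question: str
    table: list[list[str]]                     # rows of cells, row 0 = header
    relevant_columns: list[str] = field(default_factory=list)
    pruned_rows: list[list[str]] = field(default_factory=list)
    sub_questions: list[str] = field(default_factory=list)
    sub_attributions: list[set[tuple[int, int]]] = field(default_factory=list)
    final_cells: set[tuple[int, int]] = field(default_factory=set)


def column_agent(state: AttributionState) -> None:
    # Stage 1: pick columns needed for the answer, including implicit ones.
    reply = call_llm(f"Columns needed for: {state.question}\nHeader: {state.table[0]}")
    state.relevant_columns = [c.strip() for c in reply.split(",")]


def pruning_agent(state: AttributionState) -> None:
    # Stage 2: keep only rows that still preserve the reasoning path.
    keep = call_llm(f"Row indices needed for: {state.question}")
    state.pruned_rows = [state.table[int(i)] for i in keep.split(",") if i.strip()]


def decomposition_agent(state: AttributionState) -> None:
    # Stage 3: split the question into sub-questions, one per reasoning step.
    state.sub_questions = call_llm(f"Decompose: {state.question}").splitlines()


def sub_attribution_agent(state: AttributionState) -> None:
    # Stage 4: map each sub-question to a minimal set of (row, col) cells.
    for sq in state.sub_questions:
        cells = call_llm(f"Cells supporting: {sq}")
        state.sub_attributions.append(
            {tuple(map(int, c.split(":"))) for c in cells.split(";") if c.strip()}
        )


def final_agent(state: AttributionState) -> None:
    # Stage 5: merge per-step evidence into one coherent final cell set.
    state.final_cells = set().union(*state.sub_attributions)


def run_pipeline(question: str, table: list[list[str]]) -> set[tuple[int, int]]:
    state = AttributionState(question=question, table=table)
    for stage in (column_agent, pruning_agent, decomposition_agent,
                  sub_attribution_agent, final_agent):
        stage(state)
    return state.final_cells
```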

How Was the Benchmark Designed for Fine-Grained Attribution?

To evaluate attribution quality systematically, TraceBack introduces CITEBENCH, combining manual phrase-to-cell annotations with larger silver subsets. The benchmark is built from ToTTo, FetaQA, and AITQA, with a 1,500-example gold set for precise analysis.

Gold examples: 1,500 · Datasets: 3 · Inter-annotator agreement (Cohen's kappa): 0.72
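
For reference, Cohen's kappa measures annotator agreement corrected for chance; this is the standard definition, not anything introduced by the paper:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

where p_o is the observed agreement between annotators and p_e is the agreement expected by chance, so a value of 0.72 is commonly read as substantial agreement.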

Benchmark Composition

| Dataset | Total | Gold Set (Human) | Silver Set (Original) |
|---------|-------|------------------|-----------------------|
| ToTTo   | 7,700 | 500              | 7,200                 |
| FetaQA  | 3,004 | 500              | 2,504                 |
| AITQA   | 513   | 513              | -                     |

CITEBENCH statistics reported in the paper (Table 2).

What Do the Experiments Reveal About Attribution Quality?

Across ToTTo, FetaQA, and AITQA, the paper reports attribution quality at row, column, and cell levels. Viewing all three granularities together reveals how each method behaves from coarse evidence localization to fine-grained grounding.

Attribution performance across granularities and datasets

Row-Level Attribution

| Method | ToTTo Precision | ToTTo Recall | FetaQA Precision | FetaQA Recall | AITQA Precision | AITQA Recall |
|--------|-----------------|--------------|------------------|---------------|-----------------|--------------|
| SBERT + GPT-4o | 43.38 | 39.09 | 57.02 | 55.40 | 66.82 | 68.10 |
| GenerationPrograms | 50.00 | 31.28 | 75.00 | 40.06 | 59.68 | 71.90 |
| Few-shot + CoT | 17.30 | 12.80 | 34.40 | 30.10 | 4.30 | 4.30 |
| INSEQ | 37.50 | 74.50 | 56.40 | 84.70 | 31.20 | 97.60 |
| TraceBack-Lite | 77.00 | 62.60 | 83.00 | 79.20 | 91.30 | 92.90 |
| TraceBack | 71.19 | 80.38 | 94.30 | 93.36 | 96.65 | 97.12 |

Column-Level Attribution

| Method | ToTTo Precision | ToTTo Recall | FetaQA Precision | FetaQA Recall | AITQA Precision | AITQA Recall |
|--------|-----------------|--------------|------------------|---------------|-----------------|--------------|
| SBERT + GPT-4o | 90.51 | 85.91 | 94.67 | 84.77 | 46.68 | 97.14 |
| GenerationPrograms | 71.81 | 24.71 | 78.42 | 20.00 | 49.86 | 83.81 |
| Few-shot + CoT | 92.70 | 76.40 | 95.80 | 67.30 | 47.50 | 94.30 |
| INSEQ | 73.10 | 74.10 | 82.60 | 65.50 | 34.70 | 99.98 |
| TraceBack-Lite | 88.60 | 48.90 | 94.70 | 54.80 | 79.20 | 85.30 |
| TraceBack | 91.50 | 77.64 | 96.39 | 83.07 | 54.09 | 98.09 |

Cell-Level Attribution

| Method | ToTTo Precision | ToTTo Recall | FetaQA Precision | FetaQA Recall | AITQA Precision | AITQA Recall |
|--------|-----------------|--------------|------------------|---------------|-----------------|--------------|
| SBERT + GPT-4o | 39.78 | 36.97 | 52.08 | 46.16 | 31.96 | 66.67 |
| GenerationPrograms | 29.35 | 13.61 | 50.78 | 15.74 | 30.32 | 67.14 |
| Few-shot + CoT | 14.50 | 10.10 | 27.40 | 17.80 | 2.20 | 4.30 |
| INSEQ | 42.70 | 53.80 | 56.50 | 44.20 | 19.20 | 97.10 |
| TraceBack-Lite | 73.80 | 39.60 | 75.40 | 42.30 | 73.70 | 80.60 |
| TraceBack | 74.20 | 67.05 | 89.81 | 78.84 | 52.37 | 95.22 |

Precision and recall for row-, column-, and cell-level attribution on ToTTo, FetaQA, and AITQA; each block corresponds to one granularity. In the paper's table, bold marks the best method and underline the second best for each dataset–granularity pair.
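
For context on what these numbers measure, cell-level precision and recall can be computed as set overlap between predicted and gold cell coordinates. The sketch below is a generic illustration under that assumption; it is not the paper's exact evaluation script, which may differ in details such as normalization or handling of empty predictions.

```python
# Generic set-overlap precision/recall over (row, column) cell coordinates.
def cell_precision_recall(predicted: set[tuple[int, int]],
                          gold: set[tuple[int, int]]) -> tuple[float, float]:
    if not predicted or not gold:
        return 0.0, 0.0
    overlap = len(predicted & gold)
    return overlap / len(predicted), overlap / len(gold)


# Example: two of three predicted cells fall inside a gold set of four cells.
p, r = cell_precision_recall({(1, 0), (1, 2), (3, 1)},
                             {(1, 0), (1, 2), (2, 0), (2, 2)})
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.67, recall=0.50
```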

What Changes When We Simplify TraceBack?

Removing query decomposition produces the largest quality drop, which suggests that surfacing intermediate reasoning steps is central to accurate cell-level grounding. Table pruning also matters: it typically reduces noise and improves attribution efficiency while preserving evidence coverage.

Variants of TraceBack

| Method | ToTTo Precision | ToTTo Recall | FetaQA Precision | FetaQA Recall | AITQA Precision | AITQA Recall |
|--------|-----------------|--------------|------------------|---------------|-----------------|--------------|
| Query Decomposition before Table Pruning | 63.00 | 60.15 | 71.23 | 75.10 | 85.38 | 92.89 |
| Passing One Subquery at a Time | 61.32 | 60.10 | 69.67 | 73.33 | 85.95 | 92.00 |
| TraceBack | 74.20 | 67.05 | 89.81 | 78.84 | 52.37 | 95.22 |

Ablation Study on TraceBack

| Method | ToTTo Precision | ToTTo Recall | FetaQA Precision | FetaQA Recall | AITQA Precision | AITQA Recall |
|--------|-----------------|--------------|------------------|---------------|-----------------|--------------|
| TraceBack | 74.20 | 67.05 | 89.81 | 78.84 | 52.37 | 95.22 |
| w/o Table Pruning | 73.14 | 60.10 | 72.89 | 75.78 | 86.33 | 93.10 |
| w/o Query Decomposition | 56.00 | 56.30 | 65.40 | 68.32 | 47.22 | 91.80 |

How Can We Evaluate Attribution Without New Human Labels?

TraceBack also introduces FairScore, a referenceless metric that compares atomic facts extracted from predicted cells with atomic facts extracted from answers. This supports scalable evaluation when gold cell labels are limited.

FairScore preserves relative method ordering and separates strong and weak attribution models, while keeping evaluation practical on larger silver subsets.

FairScore reference-less evaluation concept

FairScore concept: convert cell evidence and answers into atomic facts, then align them to estimate precision and recall.
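
The sketch below illustrates the general idea of this reference-less comparison: decompose the predicted evidence cells and the answer into atomic facts, align them, and derive precision- and recall-style scores. The token-overlap matcher, the 0.5 threshold, and the hand-written facts are simplifications chosen for illustration and are not the paper's actual FairScore implementation, which relies on LLM-based fact extraction.

```python
# Simplified illustration of a reference-less, fact-alignment style score.
def _match(fact_a: str, fact_b: str, threshold: float = 0.5) -> bool:
    # Jaccard token overlap stands in for a learned fact-matching step.
    a, b = set(fact_a.lower().split()), set(fact_b.lower().split())
    return len(a & b) / max(len(a | b), 1) >= threshold


def fairscore_like(cell_facts: list[str], answer_facts: list[str]) -> tuple[float, float]:
    if not cell_facts or not answer_facts:
        return 0.0, 0.0
    # Precision-style: fraction of cell facts supported by some answer fact.
    supported = sum(any(_match(c, a) for a in answer_facts) for c in cell_facts)
    # Recall-style: fraction of answer facts covered by some cell fact.
    covered = sum(any(_match(a, c) for c in cell_facts) for a in answer_facts)
    return supported / len(cell_facts), covered / len(answer_facts)


# Toy usage with hand-written atomic facts.
cells = ["Wind Power efficiency is 30-45", "Wind Power scalability is 5"]
answer = ["Wind Power is the most efficient", "its efficiency is 30-45"]
print(fairscore_like(cells, answer))  # (0.5, 0.5)
```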


Systematic analysis of the FairScore metric.

FairScore-Based Cell-Level Estimates on Gold Sets

| Method | ToTTo (P / R) | FetaQA (P / R) | AITQA (P / R) |
|--------|---------------|----------------|---------------|
| Few-shot + CoT | 30.81 / 13.25 | 15.67 / 17.73 | 11.73 / 6.69 |
| SBERT + GPT-4o | 20.51 / 16.84 | 20.05 / 21.87 | 4.51 / 5.47 |
| INSEQ | 16.85 / 18.95 | 15.48 / 17.94 | 10.99 / 16.53 |
| GenerationPrograms | 14.96 / 11.32 | 16.04 / 14.13 | 7.62 / 3.77 |
| TraceBack-Lite | 53.87 / 40.20 | 51.88 / 45.12 | 63.44 / 55.13 |
| TraceBack (full) | 56.89 / 48.39 | 63.73 / 64.15 | 42.20 / 49.93 |