TraceBack

Multi-Agent Decomposition for Fine-Grained Table Attribution

Arizona State University, Adobe Research

A multi-agent attribution framework that reconstructs answer evidence step by step, introduces CITEBENCH for fine-grained analysis, and uses FairScore for scalable reference-less evaluation.

Why It Matters

Correct answers still need evidence

TraceBack focuses on which table cells support an answer, not just whether the final prediction matches the target.

How It Works

Attribution as coordinated decomposition

The pipeline identifies relevant schema, prunes evidence spans, decomposes queries, and grounds each step in minimal supporting cells.

What We Built

Benchmark plus scalable metric

CITEBENCH supports detailed human analysis, while FairScore extends evaluation beyond settings with full gold annotations.

Highlights

  • Fine-grained attribution: TraceBack localizes the exact cells behind an answer, including intermediate evidence used in multi-step reasoning.
  • Interpretable decomposition: The attribution workflow is modular, making it easier to inspect where grounding succeeds or fails.
  • Benchmark-backed analysis: CITEBENCH combines human gold labels with larger silver subsets across ToTTo, FetaQA, and AITQA.
  • Reference-less evaluation: FairScore compares atomic facts from predicted evidence and answers to scale analysis without new labels.
1,500 Gold Examples
3 Datasets
0.72 IAA (kappa)
Cell-level Grounding Target

Why Do Correct Answers Still Feel Untrustworthy in Table QA?

In table question answering, getting the final answer right is only part of the story. In many real settings, we also need to verify which cells were actually used. Without this evidence trail, even accurate answers remain hard to trust.

TraceBack addresses this gap by tracing each answer back to supporting cells, including both explicit evidence and intermediate cells that appear in multi-step reasoning. This makes attribution more transparent at the cell level rather than only at coarse row or column granularity.

A Concrete Example of Step-by-Step Cell Attribution

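Source      | Cost (per MWh) | Efficiency (%) | Scalability
------------|----------------|----------------|------------
Solar Power | 30-50          | 15-20          | 4
Wind Power  | 20-40          | 30-45          | 5
Hydropower  | 40-70          | 70-90          | 3
Geothermal  | 50-80          | 90+            | 2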

Question: Among renewable sources with cost ≤ 50/MWh and scalability ≥ 3, which is the most efficient, and what is its efficiency?

Reasoning:

  1. Cost filter: keep Solar, Wind.
  2. Scalability filter: keep Solar, Wind.
  3. Efficiency selection: choose Wind (30-45%).

S1: Cost ≤ 50; S2: Scalability ≥ 3; S3: Max Efficiency

Answer: Wind Power, 30-45% efficiency.
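The three reasoning steps in this example can be replayed in a few lines of Python. This is an illustrative sketch, not TraceBack's implementation: the `TABLE` dictionary, the `upper_bound` helper, and the rule that a range is compared by its upper end are assumptions chosen because they reproduce the filtering shown above. Which cells are recorded as evidence at each step is also an assumption.

```python
# Toy replay of the worked example. Names and parsing rules are hypothetical.
TABLE = {
    "Solar Power": {"Cost": "30-50", "Efficiency": "15-20", "Scalability": "4"},
    "Wind Power": {"Cost": "20-40", "Efficiency": "30-45", "Scalability": "5"},
    "Hydropower": {"Cost": "40-70", "Efficiency": "70-90", "Scalability": "3"},
    "Geothermal": {"Cost": "50-80", "Efficiency": "90+", "Scalability": "2"},
}


def upper_bound(value):
    """Upper end of a range such as '30-50'; an open range like '90+' is unbounded."""
    if value.endswith("+"):
        return float("inf")
    return float(value.split("-")[-1])


def trace_example(table):
    """Return (answer row, its efficiency, evidence cells tagged by step)."""
    evidence = []

    # S1: keep rows whose cost range stays at or below 50 (assumed reading).
    after_cost = [r for r in table if upper_bound(table[r]["Cost"]) <= 50]
    evidence += [("S1", r, "Cost") for r in after_cost]

    # S2: keep rows with scalability of at least 3.
    after_scal = [r for r in after_cost if int(table[r]["Scalability"]) >= 3]
    evidence += [("S2", r, "Scalability") for r in after_scal]

    # S3: pick the most efficient remaining row; every compared cell is cited.
    best = max(after_scal, key=lambda r: upper_bound(table[r]["Efficiency"]))
    evidence += [("S3", r, "Efficiency") for r in after_scal]

    return best, table[best]["Efficiency"], evidence


answer, efficiency, evidence = trace_example(TABLE)
print(answer, efficiency)
for step, row, column in evidence:
    print(step, row, column)
```

Tagging each cell with the step that used it keeps intermediate evidence (the cost and scalability cells) visible alongside the cell that carries the final answer.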

How Does TraceBack Reconstruct a Reasoning Chain Step by Step?

Rather than attributing evidence in one shot, TraceBack decomposes the attribution process into coordinated agent steps so each stage remains interpretable and scalable.

  1. Column relevance identification: find columns needed for the answer, including implicit columns for intermediate reasoning.
  2. Evidence span extraction: prune the table to rows that still preserve the reasoning path.
  3. Query decomposition: split the original question into sub-questions aligned with intermediate steps.
  4. Sub-query attribution: map each sub-question to minimal evidence cells.
  5. Final attribution: merge evidence into a coherent final set of supporting cells.

[Figure: TraceBack multi-agent pipeline]

Pipeline overview: a modular attribution workflow from relevant schema discovery to final cell-level grounding.
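The five stages listed above can be pictured as a chain of pluggable functions. The sketch below is a hypothetical skeleton, not the authors' code: the `Agents` container, its field names, the function signatures, and the choice to merge evidence by set union are all assumptions. The toy agents use keyword matching only so the example runs without a model; in practice each stage would presumably wrap a model call.

```python
from dataclasses import dataclass
from typing import Callable

Cell = tuple[int, str]  # (row index, column name)


@dataclass
class Agents:
    """Hypothetical stage interfaces, one callable per pipeline stage."""
    relevant_columns: Callable[[list[dict], str], list[str]]
    prune_rows: Callable[[list[dict], list[str], str], list[int]]
    decompose: Callable[[str], list[str]]
    attribute: Callable[[list[dict], list[int], list[str], str], set[Cell]]


def run_pipeline(table, question, agents):
    columns = agents.relevant_columns(table, question)   # 1. column relevance
    rows = agents.prune_rows(table, columns, question)   # 2. evidence span extraction
    sub_questions = agents.decompose(question)           # 3. query decomposition
    per_step = {                                         # 4. sub-query attribution
        q: agents.attribute(table, rows, columns, q) for q in sub_questions
    }
    final = set().union(*per_step.values())              # 5. final attribution (assumed: union)
    return per_step, final


# Keyword-matching stand-ins so the sketch is runnable end to end.
toy_agents = Agents(
    relevant_columns=lambda table, q: [c for c in table[0] if c.lower() in q.lower()],
    prune_rows=lambda table, cols, q: [
        i for i, row in enumerate(table) if row["Source"].lower() in q.lower()
    ],
    decompose=lambda q: [part.strip() for part in q.split(" and ")],
    attribute=lambda table, rows, cols, sub_q: {
        (i, c) for i in rows for c in cols if c.lower() in sub_q.lower()
    },
)

toy_table = [
    {"Source": "Solar Power", "Cost": "30-50", "Efficiency": "15-20"},
    {"Source": "Wind Power", "Cost": "20-40", "Efficiency": "30-45"},
]
toy_question = "What is the cost of Wind Power and what is the efficiency of Wind Power?"
per_step, final_cells = run_pipeline(toy_table, toy_question, toy_agents)
print(sorted(final_cells))
```

Keeping the per-step evidence separate from the merged set mirrors the idea that each stage stays inspectable on its own.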

How Was the Benchmark Designed for Fine-Grained Attribution?

To evaluate attribution quality systematically, TraceBack introduces CITEBENCH, combining manual phrase-to-cell annotations with larger silver subsets. The benchmark is built from ToTTo, FetaQA, and AITQA, with a 1,500-example gold set for precise analysis.

Gold examples: 1,500
Datasets: 3
IAA (Cohen's kappa): 0.72

Benchmark Composition

Dataset | Total | Gold Set (Human) | Silver Set (Original)
--------|-------|------------------|----------------------
ToTTo   | 7,700 | 500              | 7,200
FetaQA  | 3,004 | 500              | 2,504
AITQA   | 513   | 513              | -

CITEBENCH statistics reported in the paper (Table 2).

What Do the Experiments Reveal About Attribution Quality?

Across ToTTo, FetaQA, and AITQA, the paper reports attribution quality at row, column, and cell levels. Viewing all three granularities together reveals how each method behaves from coarse evidence localization to fine-grained grounding.

Attribution performance across granularities and datasets

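Level  | Method             | ToTTo P | ToTTo R | FetaQA P | FetaQA R | AITQA P | AITQA R
-------|--------------------|---------|---------|----------|----------|---------|--------
Row    | SBERT + GPT-4o     | 43.38   | 39.09   | 57.02    | 55.40    | 66.82   | 68.10
Row    | GenerationPrograms | 50.00   | 31.28   | 75.00    | 40.06    | 59.68   | 71.90
Row    | Few-shot + CoT     | 17.30   | 12.80   | 34.40    | 30.10    | 4.30    | 4.30
Row    | INSEQ              | 37.50   | 74.50   | 56.40    | 84.70    | 31.20   | 97.60
Row    | TraceBack-Lite     | 77.00   | 62.60   | 83.00    | 79.20    | 91.30   | 92.90
Row    | TraceBack          | 71.19   | 80.38   | 94.30    | 93.36    | 96.65   | 97.12
Column | SBERT + GPT-4o     | 90.51   | 85.91   | 94.67    | 84.77    | 46.68   | 97.14
Column | GenerationPrograms | 71.81   | 24.71   | 78.42    | 20.00    | 49.86   | 83.81
Column | Few-shot + CoT     | 92.70   | 76.40   | 95.80    | 67.30    | 47.50   | 94.30
Column | INSEQ              | 73.10   | 74.10   | 82.60    | 65.50    | 34.70   | 99.98
Column | TraceBack-Lite     | 88.60   | 48.90   | 94.70    | 54.80    | 79.20   | 85.30
Column | TraceBack          | 91.50   | 77.64   | 96.39    | 83.07    | 54.09   | 98.09
Cell   | SBERT + GPT-4o     | 39.78   | 36.97   | 52.08    | 46.16    | 31.96   | 66.67
Cell   | GenerationPrograms | 29.35   | 13.61   | 50.78    | 15.74    | 30.32   | 67.14
Cell   | Few-shot + CoT     | 14.50   | 10.10   | 27.40    | 17.80    | 2.20    | 4.30
Cell   | INSEQ              | 42.70   | 53.80   | 56.50    | 44.20    | 19.20   | 97.10
Cell   | TraceBack-Lite     | 73.80   | 39.60   | 75.40    | 42.30    | 73.70   | 80.60
Cell   | TraceBack          | 74.20   | 67.05   | 89.81    | 78.84    | 52.37   | 95.22

P = precision, R = recall.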

Precision and recall for row-, column-, and cell-level attribution on ToTTo, FetaQA, and AITQA. Blocks correspond to different granularities. For each dataset-granularity pair, bold marks best and underline marks second best.
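To make the three granularities concrete, the sketch below computes precision and recall for a single example from sets of predicted and gold cells. It is a minimal illustration under stated assumptions: the `(row, column)` cell encoding, the `project` helper, and deriving row- and column-level sets by projecting the cell sets are hypothetical choices, and how the paper aggregates scores across examples is not covered here.

```python
def precision_recall(predicted, gold):
    """Set-overlap precision and recall; empty sets score 0.0 (assumed convention)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def project(cells, level):
    """Reduce (row, column) cells to the requested granularity."""
    if level == "cell":
        return set(cells)
    if level == "row":
        return {row for row, _ in cells}
    if level == "column":
        return {column for _, column in cells}
    raise ValueError(f"unknown level: {level}")


def scores_by_level(predicted, gold):
    return {
        level: precision_recall(project(predicted, level), project(gold, level))
        for level in ("row", "column", "cell")
    }


gold_cells = {(1, "Cost"), (1, "Efficiency")}
predicted_cells = {(1, "Cost"), (0, "Cost"), (1, "Scalability")}
for level, (p, r) in scores_by_level(predicted_cells, gold_cells).items():
    print(f"{level}: precision={p:.2f} recall={r:.2f}")
```

In this toy case the prediction covers the right row completely but only one of the two gold cells, so row-level recall is higher than cell-level recall, which is the kind of gap the three-level view is meant to expose.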

What Changes When We Simplify TraceBack?

Removing query decomposition produces the largest quality drop, which suggests that surfacing intermediate reasoning steps is central to accurate cell-level grounding. Table pruning also matters: it typically reduces noise and improves attribution efficiency while preserving evidence coverage.

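Group    | Method                                   | ToTTo P | ToTTo R | FetaQA P | FetaQA R | AITQA P | AITQA R
---------|------------------------------------------|---------|---------|----------|----------|---------|--------
Variants | Query Decomposition before Table Pruning | 63.00   | 60.15   | 71.23    | 75.10    | 85.38   | 92.89
Variants | Passing One Subquery at a Time           | 61.32   | 60.10   | 69.67    | 73.33    | 85.95   | 92.00
Variants | TraceBack                                | 74.20   | 67.05   | 89.81    | 78.84    | 52.37   | 95.22
Ablation | TraceBack                                | 74.20   | 67.05   | 89.81    | 78.84    | 52.37   | 95.22
Ablation | w/o Table Pruning                        | 73.14   | 60.10   | 72.89    | 75.78    | 86.33   | 93.10
Ablation | w/o Query Decomposition                  | 56.00   | 56.30   | 65.40    | 68.32    | 47.22   | 91.80

P = precision, R = recall.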
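
For readers who want the size of each drop, the snippet below subtracts the full TraceBack scores from each ablation row. The numbers are copied from the ablation rows of the table; the dictionary layout and the `deltas` helper are illustrative names only.

```python
# (precision, recall) per dataset, copied from the ablation rows.
FULL = {"ToTTo": (74.20, 67.05), "FetaQA": (89.81, 78.84), "AITQA": (52.37, 95.22)}
ABLATIONS = {
    "w/o Table Pruning": {
        "ToTTo": (73.14, 60.10), "FetaQA": (72.89, 75.78), "AITQA": (86.33, 93.10),
    },
    "w/o Query Decomposition": {
        "ToTTo": (56.00, 56.30), "FetaQA": (65.40, 68.32), "AITQA": (47.22, 91.80),
    },
}


def deltas(variant):
    """Change in (precision, recall) relative to full TraceBack, in points."""
    return {
        dataset: (round(p - FULL[dataset][0], 2), round(r - FULL[dataset][1], 2))
        for dataset, (p, r) in ABLATIONS[variant].items()
    }


for variant in ABLATIONS:
    print(variant, deltas(variant))
```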

How Can We Evaluate Attribution Without New Human Labels?

TraceBack also introduces FairScore, a reference-less metric that compares atomic facts extracted from predicted cells with atomic facts extracted from answers. This supports scalable evaluation when gold cell labels are limited.

FairScore preserves relative method ordering and separates strong and weak attribution models, while keeping evaluation practical on larger silver subsets.

[Figure: FairScore reference-less evaluation concept]

FairScore concept: convert cell evidence and answers into atomic facts, then align them to estimate precision and recall.
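
The alignment idea can be sketched with plain sets of facts. This is a simplified stand-in, not the FairScore implementation: the `normalize` helper and exact string matching are assumptions that replace whatever fact extraction and alignment the metric actually uses, and the reading of precision (share of evidence facts that match an answer fact) and recall (share of answer facts covered by evidence) is likewise assumed.

```python
def normalize(fact):
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(fact.lower().split())


def fairscore_sketch(evidence_facts, answer_facts):
    """Estimate precision and recall from overlap between two sets of atomic facts."""
    evidence = {normalize(f) for f in evidence_facts}
    answer = {normalize(f) for f in answer_facts}
    matched = evidence & answer
    precision = len(matched) / len(evidence) if evidence else 0.0
    recall = len(matched) / len(answer) if answer else 0.0
    return precision, recall


# Hypothetical atomic facts written by hand for the renewable-energy example.
evidence_facts = [
    "Wind Power has efficiency 30-45",
    "Wind Power has cost 20-40",
    "Solar Power has cost 30-50",
]
answer_facts = ["wind power has  efficiency 30-45"]
print(fairscore_sketch(evidence_facts, answer_facts))
```

No gold cell labels appear anywhere in this computation, which is what allows this style of metric to run on the larger silver subsets.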

[Figure: Systematic FairScore metric analysis]

Systematic analysis of FairScore across datasets and attribution regimes.

FairScore-Based Cell-Level Estimates on Gold Sets

Method             | ToTTo (P / R) | FetaQA (P / R) | AITQA (P / R)
-------------------|---------------|----------------|--------------
Few-shot + CoT     | 30.81 / 13.25 | 15.67 / 17.73  | 11.73 / 6.69
SBERT + GPT-4o     | 20.51 / 16.84 | 20.05 / 21.87  | 4.51 / 5.47
INSEQ              | 16.85 / 18.95 | 15.48 / 17.94  | 10.99 / 16.53
GenerationPrograms | 14.96 / 11.32 | 16.04 / 14.13  | 7.62 / 3.77
TraceBack-Lite     | 53.87 / 40.20 | 51.88 / 45.12  | 63.44 / 55.13
TraceBack (full)   | 56.89 / 48.39 | 63.73 / 64.15  | 42.20 / 49.93

BibTeX

@article{anvekar2026traceback,
  title={TraceBack: Multi-Agent Decomposition for Fine-Grained Table Attribution},
  author={Tejas Anvekar and Junha Park and Rajat Jha and Devanshu Gupta and Poojah Ganesan and Puneeth Mathur and Vivek Gupta},
  year={2026},
  eprint={2602.13059},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.13059},
}