TraceBack

Multi-Agent Decomposition for Fine-Grained Table Attribution

Tejas Anvekar, Junha Park, Rajat Jha, Devanshu Gupta, Poojah Ganesan, Puneeth Mathur, Vivek Gupta

Arizona State University, Adobe Research

A multi-agent attribution framework that reconstructs answer evidence step by step, introduces CITEBENCH for fine-grained analysis, and uses FairScore for scalable reference-less evaluation.


Why It Matters

Correct answers still need evidence

TraceBack focuses on which table cells support an answer, not just whether the final prediction matches the target.

How It Works

Attribution as coordinated decomposition

The pipeline identifies relevant schema, prunes evidence spans, decomposes queries, and grounds each step in minimal supporting cells.

What We Built

Benchmark plus scalable metric

CITEBENCH supports detailed human analysis, while FairScore extends evaluation beyond settings with full gold annotations.

Highlights

Fine-grained attribution: TraceBack localizes the exact cells behind an answer, including intermediate evidence used in multi-step reasoning.
Interpretable decomposition: The attribution workflow is modular, making it easier to inspect where grounding succeeds or fails.
Benchmark-backed analysis: CITEBENCH combines human gold labels with larger silver subsets across ToTTo, FetaQA, and AITQA.
Reference-less evaluation: FairScore compares atomic facts from predicted evidence and answers to scale analysis without new labels.

1,500 Gold Examples

3 Datasets

0.72 IAA (kappa)

Cell-level Grounding Target

Why Do Correct Answers Still Feel Untrustworthy in Table QA?

In table question answering, getting the final answer right is only part of the story. In many real settings, we also need to verify which cells were actually used. Without this evidence trail, even accurate answers remain hard to trust.

TraceBack addresses this gap by tracing each answer back to supporting cells, including both explicit evidence and intermediate cells that appear in multi-step reasoning. This makes attribution more transparent at the cell level rather than only at coarse row or column granularity.

A Concrete Example of Step-by-Step Cell Attribution

Source	Cost	Efficiency	Scalability
Solar Power	30-50	15-20	4
Wind Power	20-40	30-45	5
Hydropower	40-70	70-90	3
Geothermal	50-80	90+	2

Question: Among renewable sources costing ≤ 50/MWh and scalability ≥ 3, which is most efficient, and what is its efficiency?

Reasoning: Step 1: Cost Filter → Keep Solar, Wind. Step 2: Scalability Filter → Keep Solar, Wind. Step 3: Efficiency Selection → Choose Wind (30-45%).

Steps: S1: Cost ≤ 50; S2: Scalability ≥ 3; S3: Max Efficiency.

Answer: Wind Power, 30-45% efficiency.
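
The three steps can be replayed in a few lines of Python. This is a minimal sketch, not TraceBack itself: the dictionary layout, the function name `trace_example`, the reading of "cost ≤ 50" as a bound on the upper end of each cost range, and the choice of which cells each step cites are all assumptions made for illustration.

```python
# Toy copy of the table above. Ranges are stored as (low, high) tuples;
# Geothermal's "90+" is stored as (90, None), an open upper bound.
TABLE = {
    "Solar Power": {"Cost": (30, 50), "Efficiency": (15, 20), "Scalability": 4},
    "Wind Power": {"Cost": (20, 40), "Efficiency": (30, 45), "Scalability": 5},
    "Hydropower": {"Cost": (40, 70), "Efficiency": (70, 90), "Scalability": 3},
    "Geothermal": {"Cost": (50, 80), "Efficiency": (90, None), "Scalability": 2},
}


def trace_example(table):
    """Replay S1-S3 and record the (row, column) cells behind each step."""
    evidence = {}

    # S1: keep sources whose upper cost bound is at most 50 (assumed reading).
    kept = [s for s, row in table.items() if row["Cost"][1] <= 50]
    evidence["S1"] = [(s, "Cost") for s in kept]

    # S2: among those, keep sources with scalability of at least 3.
    kept = [s for s in kept if table[s]["Scalability"] >= 3]
    evidence["S2"] = [(s, "Scalability") for s in kept]

    # S3: pick the most efficient survivor, compared by lower bound.
    # Every compared efficiency cell is cited (an assumption of this sketch).
    best = max(kept, key=lambda s: table[s]["Efficiency"][0])
    evidence["S3"] = [(s, "Efficiency") for s in kept]

    return best, table[best]["Efficiency"], evidence


best, efficiency, evidence = trace_example(TABLE)
print(best, efficiency)
for step, cells in evidence.items():
    print(step, cells)
```

Keeping the evidence per step, rather than only the final answer cell, is what makes the intermediate Cost and Scalability cells visible in the attribution.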

How Does TraceBack Reconstruct a Reasoning Chain Step by Step?

Rather than attributing evidence in one shot, TraceBack decomposes the attribution process into coordinated agent steps so each stage remains interpretable and scalable.

Column relevance identification: find columns needed for the answer, including implicit columns for intermediate reasoning.
Evidence span extraction: prune the table to rows that still preserve the reasoning path.
Query decomposition: split the original question into sub-questions aligned with intermediate steps.
Sub-query attribution: map each sub-question to minimal evidence cells.
Final attribution: merge evidence into a coherent final set of supporting cells.

Pipeline overview: a modular attribution workflow from relevant schema discovery to final cell-level grounding.
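
The control flow of the five stages can be sketched as follows. In TraceBack each stage is an agent; here every stage is replaced by a simple keyword heuristic so that the example runs on its own. All function names, the toy table, the hand-written sub-questions, and the use of a set union for the final merge are hypothetical and only illustrate how the stages feed each other.

```python
import re


def words(text):
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_columns(rows, question):
    """Stage 1 stand-in: keep columns whose name or values occur in the question."""
    q = words(question)
    return [c for c in rows[0]
            if c.lower() in q or any(r[c].lower() in q for r in rows)]


def prune_rows(rows, columns, question):
    """Stage 2 stand-in: keep rows with at least one selected value in the question."""
    q = words(question)
    return [i for i, r in enumerate(rows) if any(r[c].lower() in q for c in columns)]


def attribute(rows, row_ids, columns, sub_question):
    """Stage 4 stand-in: cite cells whose value or column name occurs in the sub-question."""
    q = words(sub_question)
    return {(i, c) for i in row_ids for c in columns
            if rows[i][c].lower() in q or c.lower() in q}


def run_pipeline(rows, question, decompose):
    columns = select_columns(rows, question)       # 1. column relevance
    row_ids = prune_rows(rows, columns, question)  # 2. evidence span extraction
    sub_questions = decompose(question)            # 3. query decomposition
    per_step = [attribute(rows, row_ids, columns, sq) for sq in sub_questions]  # 4
    final = set().union(*per_step)                 # 5. merge (plain union, assumed)
    return columns, row_ids, per_step, final


# Made-up table and question, used only to exercise the flow.
ROWS = [
    {"Name": "Alpha", "Role": "Engineer", "Office": "North", "Badge": "17"},
    {"Name": "Beta", "Role": "Designer", "Office": "South", "Badge": "42"},
]
QUESTION = "Which office does the engineer Alpha work in?"


def hand_written_decompose(question):
    """Stage 3 stand-in: fixed sub-questions in place of a model call."""
    return ["Who is the engineer Alpha?", "Which office is listed for that row?"]


columns, row_ids, per_step, final = run_pipeline(ROWS, QUESTION, hand_written_decompose)
print(columns, row_ids)
print(per_step)
print(sorted(final))
```

Because each stage returns an inspectable value, a wrong final attribution can be traced to the stage that dropped or added the offending column, row, or cell.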

How Was the Benchmark Designed for Fine-Grained Attribution?

To evaluate attribution quality systematically, TraceBack introduces CITEBENCH, combining manual phrase-to-cell annotations with larger silver subsets. The benchmark is built from ToTTo, FetaQA, and AITQA, with a 1,500-example gold set for precise analysis.

1,500 Gold Examples

3 Datasets

0.72 IAA (Cohen's kappa)

Benchmark Composition

Dataset	Total	Gold Set (Human)	Silver Set (Original)
ToTTo	7,700	500	7,200
FetaQA	3,004	500	2,504
AITQA	513	513	-

CITEBENCH statistics reported in the paper (Table 2).

What Do the Experiments Reveal About Attribution Quality?

Across ToTTo, FetaQA, and AITQA, the paper reports attribution quality at row, column, and cell levels. Viewing all three granularities together reveals how each method behaves from coarse evidence localization to fine-grained grounding.

Attribution performance across granularities and datasets

Method	ToTTo Precision	ToTTo Recall	FetaQA Precision	FetaQA Recall	AITQA Precision	AITQA Recall
Row-Level Attribution
SBERT + GPT-4o	43.38	39.09	57.02	55.40	66.82	68.10
GenerationPrograms	50.00	31.28	75.00	40.06	59.68	71.90
Few-shot + CoT	17.30	12.80	34.40	30.10	4.30	4.30
INSEQ	37.50	74.50	56.40	84.70	31.20	97.60
TraceBack-Lite	77.00	62.60	83.00	79.20	91.30	92.90
TraceBack	71.19	80.38	94.30	93.36	96.65	97.12
Column-Level Attribution
SBERT + GPT-4o	90.51	85.91	94.67	84.77	46.68	97.14
GenerationPrograms	71.81	24.71	78.42	20.00	49.86	83.81
Few-shot + CoT	92.70	76.40	95.80	67.30	47.50	94.30
INSEQ	73.10	74.10	82.60	65.50	34.70	99.98
TraceBack-Lite	88.60	48.90	94.70	54.80	79.20	85.30
TraceBack	91.50	77.64	96.39	83.07	54.09	98.09
Cell-Level Attribution
SBERT + GPT-4o	39.78	36.97	52.08	46.16	31.96	66.67
GenerationPrograms	29.35	13.61	50.78	15.74	30.32	67.14
Few-shot + CoT	14.50	10.10	27.40	17.80	2.20	4.30
INSEQ	42.70	53.80	56.50	44.20	19.20	97.10
TraceBack-Lite	73.80	39.60	75.40	42.30	73.70	80.60
TraceBack	74.20	67.05	89.81	78.84	52.37	95.22

Precision and recall for row-, column-, and cell-level attribution on ToTTo, FetaQA, and AITQA. Blocks correspond to different granularities. In the original table, bold marks the best and underline the second-best result for each dataset-granularity pair; that styling is not reproduced here.
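
To make the three granularities concrete, the sketch below scores one predicted cell set against one gold cell set. It assumes that row- and column-level scores are obtained by projecting cells onto their row indices and column names, and it scores a single example; how the paper aggregates over examples is not stated here, so the function names and the projection itself should be read as assumptions.

```python
def precision_recall(predicted, gold):
    """Set-based precision and recall, returned as percentages."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = 100.0 * hits / len(predicted) if predicted else 0.0
    recall = 100.0 * hits / len(gold) if gold else 0.0
    return precision, recall


def row_ids(cells):
    """Project (row, column) cells onto their row indices."""
    return {row for row, _ in cells}


def column_names(cells):
    """Project (row, column) cells onto their column names."""
    return {column for _, column in cells}


def score_all_levels(predicted_cells, gold_cells):
    """Score one example at row, column, and cell granularity."""
    return {
        "row": precision_recall(row_ids(predicted_cells), row_ids(gold_cells)),
        "column": precision_recall(column_names(predicted_cells), column_names(gold_cells)),
        "cell": precision_recall(predicted_cells, gold_cells),
    }


# Hypothetical example: the right columns, but one cell taken from the wrong row.
gold = {(1, "Cost"), (1, "Efficiency")}
predicted = {(1, "Cost"), (0, "Efficiency")}
scores = score_all_levels(predicted, gold)
print(scores)
```

In this example the column-level score is perfect while the cell-level precision and recall are both 50, which shows how a method can look strong at a coarse granularity and still miss the exact cells.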

What Changes When We Simplify TraceBack?

Removing query decomposition produces the largest quality drop, which suggests that surfacing intermediate reasoning steps is central to accurate cell-level grounding. Table pruning also matters: it typically reduces noise and improves attribution efficiency while preserving evidence coverage.

Method	ToTTo Precision	ToTTo Recall	FetaQA Precision	FetaQA Recall	AITQA Precision	AITQA Recall
Variants of TraceBack
Query Decomposition before Table Pruning	63.00	60.15	71.23	75.10	85.38	92.89
Passing One Subquery at a Time	61.32	60.10	69.67	73.33	85.95	92.00
TraceBack	74.20	67.05	89.81	78.84	52.37	95.22
Ablation Study on TraceBack
TraceBack	74.20	67.05	89.81	78.84	52.37	95.22
- w/o Table Pruning	73.14	60.10	72.89	75.78	86.33	93.10
- w/o Query Decomposition	56.00	56.30	65.40	68.32	47.22	91.80

How Can We Evaluate Attribution Without New Human Labels?

TraceBack also introduces FairScore, a reference-less metric that compares atomic facts extracted from predicted cells with atomic facts extracted from answers. This supports scalable evaluation when gold cell labels are limited.

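FairScore broadly preserves the relative ordering of methods: in the gold-set estimates reported on this page, both TraceBack variants score well above every baseline on all three datasets. This separates strong from weak attribution models while keeping evaluation practical on larger silver subsets.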


FairScore concept: convert cell evidence and answers into atomic facts, then align them to estimate precision and recall.
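
The align-then-count idea can be sketched as below. This is not the FairScore implementation: fact extraction is skipped (the facts are written by hand), alignment is replaced by a Jaccard token-overlap test with an arbitrary 0.5 threshold, and the mapping of precision to "evidence facts matched by the answer" and recall to "answer facts matched by the evidence" is an assumption. The names `aligned` and `fact_overlap_scores` are hypothetical.

```python
import re


def tokens(fact):
    """Lowercased alphanumeric tokens of a fact string."""
    return set(re.findall(r"[a-z0-9]+", fact.lower()))


def aligned(fact_a, fact_b, threshold=0.5):
    """Toy alignment: Jaccard token overlap at or above a threshold."""
    a, b = tokens(fact_a), tokens(fact_b)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold


def fact_overlap_scores(evidence_facts, answer_facts, threshold=0.5):
    """Return (precision, recall) over two lists of atomic facts."""
    # Precision: share of evidence facts that match some answer fact.
    matched_evidence = [e for e in evidence_facts
                        if any(aligned(e, a, threshold) for a in answer_facts)]
    # Recall: share of answer facts that match some evidence fact.
    matched_answer = [a for a in answer_facts
                      if any(aligned(e, a, threshold) for e in evidence_facts)]
    precision = len(matched_evidence) / len(evidence_facts) if evidence_facts else 0.0
    recall = len(matched_answer) / len(answer_facts) if answer_facts else 0.0
    return precision, recall


# Hand-written facts based on the renewable-energy example table.
evidence_facts = [
    "Wind Power efficiency is 30-45",
    "Wind Power cost is 20-40",
    "Hydropower scalability is 3",
]
answer_facts = [
    "Wind Power is the most efficient source",
    "Wind Power efficiency is 30-45",
]
precision, recall = fact_overlap_scores(evidence_facts, answer_facts)
print(round(precision, 3), round(recall, 3))
```

No gold cell labels appear anywhere in this computation, which is the property that lets a metric of this kind run on silver subsets.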

Systematic analysis of FairScore across datasets and attribution regimes.

FairScore-Based Cell-Level Estimates on Gold Sets

Method	ToTTo (P / R)	FetaQA (P / R)	AITQA (P / R)
Few-shot + CoT	30.81 / 13.25	15.67 / 17.73	11.73 / 6.69
SBERT + GPT-4o	20.51 / 16.84	20.05 / 21.87	4.51 / 5.47
INSEQ	16.85 / 18.95	15.48 / 17.94	10.99 / 16.53
GenerationPrograms	14.96 / 11.32	16.04 / 14.13	7.62 / 3.77
TraceBack-Lite	53.87 / 40.20	51.88 / 45.12	63.44 / 55.13
TraceBack (full)	56.89 / 48.39	63.73 / 64.15	42.20 / 49.93

BibTeX

@article{anvekar2026traceback,
  title={TraceBack: Multi-Agent Decomposition for Fine-Grained Table Attribution},
  author={Tejas Anvekar and Junha Park and Rajat Jha and Devanshu Gupta and Poojah Ganesan and Puneeth Mathur and Vivek Gupta},
  year={2026},
  eprint={2602.13059},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.13059},
}
