arXiv 2026  ·  ACL 2026

CRAFT: Training-Free Cascaded Retrieval
for Tabular QA

Zero-shot, modular table retrieval that outperforms fine-tuned SOTA

Adarsh Singh1* Kushal Raj Bhandari2* Jianxi Gao2 Soham Dan3 Vivek Gupta1

1Arizona State University    2Rensselaer Polytechnic Institute    3Microsoft

* Equal Contribution


Abstract

We present CRAFT — a training-free, zero-shot cascaded retrieval framework for Open-Domain Tabular Question Answering. CRAFT enriches table representations with LLM-generated titles and summaries, then narrows a large corpus to the most relevant tables through three progressively precise stages: sparse lexical retrieval with SPLADE, dense semantic reranking with Sentence Transformers, and fine-grained reranking with state-of-the-art text embeddings. Without any fine-tuning, CRAFT achieves state-of-the-art performance on NQ-Tables and strong zero-shot results on OTT-QA — surpassing dedicated fine-tuned retrievers while remaining plug-and-play across new domains and query types.

Introduction

Open-Domain Table Question Answering (TQA) demands finding the single relevant table from corpora containing hundreds of thousands of candidates, then reading it to answer a natural-language question. The retrieval step is the true bottleneck: tables are semi-structured, often lack meaningful titles, and exhibit a lexical gap with natural-language queries.

Prior state-of-the-art systems — DTR, THYME, T-RAG, Re²G — address this via task-specific fine-tuning of dense bi-encoders. While effective, this creates costly barriers: GPU-intensive training, dataset-specific annotation, and poor transfer to unseen domains or paraphrased queries.

Core Insight: Tables can be made textually rich enough for off-the-shelf retrievers to outperform fine-tuned alternatives. CRAFT uses an LLM (Gemini 1.5 Flash) to generate informative titles and summaries for every table, then cascades multiple retrieval stages, each using a stronger (but slower) model on a progressively smaller candidate set.

This modular design lets practitioners swap in new embedding models as they improve, requires no labeled data, and is robust to query paraphrasing — making CRAFT a scalable, future-proof alternative to fine-tuned retrieval pipelines.

The CRAFT Pipeline

Step 0 — Preprocessing

Enriching Queries & Tables

🔍 Query Expansion

Gemini 1.5 Flash generates sub-questions that decompose the original query for better lexical coverage.

📋 Table Enrichment

Each table receives an LLM-generated title and a natural-language summary, bridging the lexical gap.

✂️ Mini-Table Construction

A Sentence Transformer selects the top-5 most query-relevant rows per table, reducing context size by up to 5×.
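The mini-table step can be sketched as follows; simple token overlap stands in for the Sentence Transformer similarity the paper uses, and the k=5 cutoff mirrors the top-5 row selection (the scorer and helper names are illustrative, not CRAFT's actual code):

```python
import re

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def row_scores(query: str, rows: list[tuple]) -> list[int]:
    # Token-overlap score per row: a cheap stand-in for the cosine
    # similarity a Sentence Transformer would assign.
    q = _tokens(query)
    return [len(q & _tokens(" ".join(map(str, row)))) for row in rows]

def build_mini_table(query: str, rows: list[tuple], k: int = 5) -> list[tuple]:
    # Keep the k most query-relevant rows, preserving their original order
    # so the mini-table still reads like a table.
    scores = row_scores(query, rows)
    top = sorted(sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)[:k])
    return [rows[i] for i in top]
```

On a 7-row country/capital table with the query "What is the capital of Japan?", the Japan row survives the cut while the table shrinks to 5 rows.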

CRAFT Pipeline Overview

Figure 1: CRAFT three-stage cascaded retrieval pipeline with preprocessing and answer generation.


Three Retrieval Stages

Progressively Precise Reranking

Stage 1

Sparse Retrieval

SPLADE indexes enriched table representations (table content + title + description) and retrieves the top 5,000 candidates from the full corpus via sparse lexical matching.

Stage 2

Dense Reranking

Sentence Transformer (all-mpnet-base-v2 / Jina v3) scores mini-tables against the expanded query, narrowing 5,000 candidates down to the top 100.

Stage 3

Fine-Grained Reranking

OpenAI text-embedding-3 (small/large) or gemini-embedding-001 performs high-precision reranking of the 100 candidates to select the final top-k tables.
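The three-stage funnel can be expressed as a generic cascade. The toy scorers below (shared-token count and Jaccard overlap) are placeholders for SPLADE, the Sentence Transformer, and the embedding reranker; only the funnel structure is meant to match the pipeline:

```python
def lexical(query: str, doc: str) -> float:
    # Coarse shared-token count: stands in for SPLADE's sparse matching.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def dense(query: str, doc: str) -> float:
    # Jaccard overlap: stands in for a stronger (but slower) dense scorer.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q | d), 1)

def cascade(query: str, corpus: list[str], stages) -> list[str]:
    # Each stage ranks the surviving candidates with its scorer and keeps
    # only the top `keep`, so stronger models see ever-smaller candidate
    # sets (5,000 -> 100 -> top-k in the actual pipeline).
    candidates = list(corpus)
    for scorer, keep in stages:
        candidates.sort(key=lambda doc: scorer(query, doc), reverse=True)
        candidates = candidates[:keep]
    return candidates
```

For example, `cascade("capital of japan", corpus, [(lexical, 3), (dense, 1)])` first prunes the corpus lexically, then lets the finer scorer pick the winner.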


Answer Generation

LLM Reading over Mini-Tables

The top-n mini-tables retrieved by Stage 3 are concatenated and passed to an LLM (Llama-3.1-8B, Qwen2.5-7B, Mistral-7B, GPT-4o, etc.) to generate the final answer. Using mini-tables instead of full tables reduces token consumption by up to 5× while preserving the most relevant evidence rows.
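A minimal sketch of the prompt assembly, assuming each mini-table is a (title, header, rows) triple; the tuple format and the prompt template are illustrative, not the paper's exact prompt:

```python
def render_table(title, header, rows) -> str:
    # Flatten one mini-table into pipe-separated lines.
    lines = [f"Table: {title}", " | ".join(header)]
    lines += [" | ".join(map(str, r)) for r in rows]
    return "\n".join(lines)

def build_prompt(question: str, mini_tables: list[tuple]) -> str:
    # Concatenate the top-n mini-tables, then append the question for
    # the reader LLM to answer.
    context = "\n\n".join(render_table(*t) for t in mini_tables)
    return f"{context}\n\nQuestion: {question}\nAnswer:"
```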

Datasets

🗂 NQ-Tables Single-hop

A large-scale benchmark built on Natural Questions, where each query is answered by a single Wikipedia table. Tables are retrieved from a corpus of Wikipedia HTML tables, making lexical matching challenging due to sparse natural-language context.

169K tables · ~1K test queries · single-hop


🔗 OTT-QA Multi-hop

Open-domain Table&Text QA requiring multi-hop reasoning over both structured tables and unstructured Wikipedia passages. Answers often span heterogeneous evidence, making it a significantly harder retrieval challenge.

419K tables · 2K+ test queries · multi-hop


Results

Table Retrieval

NQ-Tables: New State-of-the-Art

CRAFT Stage 3: R@1 49.84 · R@10 86.83 · R@50 97.17
THYME (fine-tuned): R@1 48.55
| System | Training | R@1 | R@10 | R@50 |
|---|---|---|---|---|
| BM25 | None | 18.3 | 43.2 | 60.1 |
| DTR (DPR) | Fine-tuned | 32.1 | 68.4 | 82.6 |
| SSDR | Fine-tuned | 40.5 | 76.3 | 88.9 |
| THYME | Fine-tuned | 48.55 | 84.90 | 94.61 |
| CRAFT Stage 1 (Ours) | None | 34.38 | 72.90 | — |
| CRAFT Stage 2 (Ours) | None | — | 82.91 | — |
| CRAFT Stage 3 (Ours) | None | 49.84 | 86.83 | 97.17 |

Table 2: Retrieval performance on NQ-Tables test set. CRAFT sets new SOTA without any fine-tuning.


Zero-Shot Transfer

OTT-QA: Competitive Without Fine-Tuning

CRAFT (zero-shot): R@1 55.56 · R@10 89.88 · R@50 96.07
THYME (fine-tuned): R@10 91.10

On the significantly harder OTT-QA benchmark, CRAFT achieves R@10 = 89.88 zero-shot — within 1.2 points of fine-tuned THYME (91.10), demonstrating strong generalization across very different retrieval conditions.


Robustness

Query Paraphrasing: CRAFT Drops Only −0.04

CRAFT avg. drop: −0.04 · DTR avg. drop: −8 to −12

Under paraphrased query evaluation (Table 4 in the paper), fine-tuned DTR degrades by 8–12 points on average. CRAFT Stage 3 is essentially invariant to query reformulation, confirming that semantic embedding-based retrieval generalizes far beyond lexical patterns learned during fine-tuning.


End-to-End QA

NQ-Tables: Answer F1 vs. Context Length

The plots below show the trade-off between answer F1 and average context length (log scale) across different top-n configurations for 7B-scale LLMs. CRAFT Stage 3 consistently reaches high F1 at lower context budgets than fine-tuned baselines (DTR).

Context vs F1 on NQ-Tables (7B models)

Figure 2: Average F1 vs. context length (log scale) on NQ-Tables for Mistral-7B, Llama-3.1-8B, and Qwen2.5-7B. Points are labeled with the optimal top-n value.


End-to-End QA

OTT-QA: Large Model Performance

For large-scale LLMs (Mistral-Small, Llama-3.3-70B, Qwen2.5-72B), CRAFT Stage 3 with Gemini embeddings approaches gold-table performance at modest context lengths, while Stage 2 with mini-tables (MT) offers a lightweight, cost-effective alternative.

Context vs F1 on OTT-QA (large models)

Figure 3: Average F1 vs. context length (log scale) on OTT-QA for large-scale models. "×" marks gold-table upper bound.

Analysis

Token Efficiency

Mini-Tables Reduce Tokens by up to 5×

CRAFT's mini-table construction (top-5 relevant rows per table) dramatically reduces the average token count compared to providing full tables, while preserving the most query-relevant evidence. The table below shows average token counts across different models and top-n values.

Token consumption: sub-table vs full table

Table 7: Average token counts (sub-table vs. full table) across models and top-k values. Sub-tables use 4–8× fewer tokens at every setting.
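As a back-of-the-envelope check on that ratio (the per-row and header token counts below are assumed for illustration, not measured from the paper):

```python
def reduction_factor(n_rows: int, mini_rows: int = 5,
                     header_tokens: int = 10, tokens_per_row: int = 12) -> float:
    # Ratio of full-table tokens to top-5 mini-table tokens under a flat
    # per-row token cost; a 30-row table comes out around 5x, inside the
    # 4-8x range reported above.
    full = header_tokens + n_rows * tokens_per_row
    mini = header_tokens + min(n_rows, mini_rows) * tokens_per_row
    return full / mini
```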

Impact & Key Takeaways

🚀 No Fine-Tuning Required

CRAFT is entirely plug-and-play. Any improvement in upstream embedding models (SPLADE, Sentence Transformers, OpenAI) directly translates to better retrieval.

🛡️ Robust to Paraphrasing

While fine-tuned models degrade significantly under query reformulation, CRAFT's semantic embeddings are near-invariant — critical for real-world deployment.

💡 LLM-Enriched Tables

Generating natural-language titles and summaries for tables closes the lexical gap between structured data and free-text queries — a broadly applicable technique.

Token-Efficient QA

Mini-tables reduce context length by 4–8×, enabling cost-effective inference with smaller LLMs while maintaining competitive answer accuracy.

🌍 Domain Generalization

Strong zero-shot transfer from NQ-Tables to OTT-QA (a completely different benchmark) demonstrates CRAFT's generalizability beyond a single retrieval setting.

🔧 Modular Architecture

Each stage can be replaced independently. Practitioners can swap in custom sparse retrievers, embeddings, or LLMs to adapt CRAFT to new corpora or languages.

BibTeX

If you find CRAFT useful, please cite our work:

@misc{singh2025crafttrainingfreecascadedretrieval,
      title={CRAFT: Training-Free Cascaded Retrieval for Tabular QA},
      author={Adarsh Singh and Kushal Raj Bhandari and Jianxi Gao
              and Soham Dan and Vivek Gupta},
      year={2025},
      eprint={2505.14984},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.14984},
}