Zero-shot, modular table retrieval that outperforms fine-tuned SOTA
* Equal Contribution
¹Arizona State University  ²Rensselaer Polytechnic Institute  ³Microsoft
We present CRAFT — a training-free, zero-shot cascaded retrieval framework for Open-Domain Tabular Question Answering. CRAFT enriches table representations with LLM-generated titles and summaries, then narrows a large corpus to the most relevant tables through three progressively precise stages: sparse lexical retrieval with SPLADE, dense semantic reranking with Sentence Transformers, and fine-grained reranking with state-of-the-art text embeddings. Without any fine-tuning, CRAFT achieves state-of-the-art performance on NQ-Tables and strong zero-shot results on OTT-QA — surpassing dedicated fine-tuned retrievers while remaining plug-and-play across new domains and query types.
Open-Domain Table Question Answering (TQA) demands finding the single relevant table from corpora containing hundreds of thousands of candidates, then reading it to answer a natural-language question. The retrieval step is the true bottleneck: tables are semi-structured, often lack meaningful titles, and exhibit a lexical gap with natural-language queries.
Prior state-of-the-art systems — DTR, THYME, T-RAG, Re²G — address this via task-specific fine-tuning of dense bi-encoders. While effective, this creates costly barriers: GPU-intensive training, dataset-specific annotation, and poor transfer to unseen domains or paraphrased queries.
This modular design lets practitioners swap in new embedding models as they improve, requires no labeled data, and is robust to query paraphrasing — making CRAFT a scalable, future-proof alternative to fine-tuned retrieval pipelines.
Gemini Flash 1.5 generates sub-questions that decompose the original query for better lexical coverage.
Each table receives an LLM-generated title and a natural-language summary, bridging the lexical gap.
A Sentence Transformer selects the top-5 most query-relevant rows per table ("mini-tables"), reducing context size by up to 5×.
Figure 1: CRAFT three-stage cascaded retrieval pipeline with preprocessing and answer generation.
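The mini-table construction step can be sketched as follows. The token-overlap scorer here is a simplified stand-in for the Sentence Transformer similarity the pipeline actually uses, and all function names are illustrative:

```python
def row_score(query: str, row: list[str]) -> int:
    """Stand-in relevance score: number of query tokens appearing in the row.
    CRAFT scores rows with Sentence Transformer similarity instead."""
    query_tokens = set(query.lower().split())
    row_tokens = set(" ".join(row).lower().split())
    return len(query_tokens & row_tokens)

def build_mini_table(query: str, header: list[str],
                     rows: list[list[str]], k: int = 5) -> list[list[str]]:
    """Keep the header plus the k rows most relevant to the query."""
    ranked = sorted(rows, key=lambda r: row_score(query, r), reverse=True)
    return [header] + ranked[:k]

# Toy table: only the most query-relevant rows survive.
header = ["City", "Population", "Country"]
rows = [
    ["Paris", "2.1M", "France"],
    ["Lyon", "0.5M", "France"],
    ["Berlin", "3.6M", "Germany"],
]
mini = build_mini_table("population of Berlin", header, rows, k=2)
```

The mini-table keeps the header row so downstream stages can still interpret the cells.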
SPLADE indexes enriched table representations (table content + title + description) and retrieves the top 5,000 candidates from the full corpus via sparse lexical matching.
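Stage 1 can be sketched in miniature as below. Building the enriched document is shown directly (the exact concatenation template is an assumption), while plain query-token overlap stands in for SPLADE's learned sparse expansion:

```python
def enrich(table_text: str, title: str, summary: str) -> str:
    """Enriched representation: LLM-generated title and summary
    prepended to the table content (template is an assumption)."""
    return f"{title}\n{summary}\n{table_text}"

def sparse_retrieve(query: str, docs: dict[str, str], top_k: int = 5000) -> list[str]:
    """Stand-in for SPLADE: rank documents by query-token overlap."""
    q = set(query.lower().split())
    scored = sorted(
        docs.items(),
        key=lambda kv: len(q & set(kv[1].lower().split())),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:top_k]]

# The title/summary supply the natural-language terms the raw cells lack.
docs = {
    "t1": enrich("Paris 2.1M", "French cities", "Populations of major French cities"),
    "t2": enrich("Goals 30", "Football stats", "Season goal totals"),
}
top = sparse_retrieve("population of french cities", docs, top_k=1)
```

The toy example makes the role of enrichment visible: the raw cells of `t1` share no tokens with the query, but its generated title and summary do.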
A Sentence Transformer (all-mpnet-base-v2 / Jina v3) scores mini-tables against the expanded query, narrowing 5,000 candidates down to the top 100.
OpenAI text-embedding-3 (small/large) or gemini-embedding-001 performs high-precision reranking of the 100 candidates to select the final top-k tables.
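The reranking logic of Stages 2 and 3 reduces to embedding both sides and sorting by cosine similarity. The sketch below uses a toy bag-of-words embedder over a fixed vocabulary as a deterministic stand-in for text-embedding-3 / gemini-embedding-001 API calls; the `rerank` helper and its names are illustrative:

```python
import math

def embed(text: str, vocab: list[str]) -> list[float]:
    """Toy bag-of-words embedder; a stand-in for dense embedding APIs."""
    toks = text.lower().split()
    return [float(toks.count(w)) for w in vocab]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rerank(query: str, candidates: dict[str, str],
           vocab: list[str], top_k: int = 5) -> list[str]:
    """Rerank Stage-2 candidates by cosine similarity to the query."""
    qv = embed(query, vocab)
    ranked = sorted(candidates.items(),
                    key=lambda kv: cosine(qv, embed(kv[1], vocab)),
                    reverse=True)
    return [cid for cid, _ in ranked[:top_k]]

# Two surviving candidates; the relevant one wins the final cut.
candidates = {
    "nq_1": "berlin population 3.6 million",
    "nq_2": "premier league top scorers",
}
query = "population of berlin"
vocab = sorted({w for t in candidates.values() for w in t.split()}
               | set(query.split()))
top = rerank(query, candidates, vocab, top_k=1)
```

Swapping the toy `embed` for a real embedding API call is the only change needed to recover the actual stage, which is what makes this stage plug-and-play.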
The top-n mini-tables retrieved by Stage 3 are concatenated and passed to an LLM (Llama-3.1-8B, Qwen2.5-7B, Mistral-7B, GPT-4o, etc.) to generate the final answer. Using mini-tables instead of full tables reduces token consumption by up to 5× while preserving the most relevant evidence rows.
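Assembling the generation prompt is then straightforward string concatenation. A minimal sketch, assuming a pipe-separated serialization (the paper does not pin down the exact table format or prompt wording):

```python
def format_mini_table(title: str, header: list[str],
                      rows: list[list[str]]) -> str:
    """Serialize a mini-table as pipe-separated text (format is an assumption)."""
    lines = [f"Table: {title}", " | ".join(header)]
    lines += [" | ".join(row) for row in rows]
    return "\n".join(lines)

def build_prompt(question: str, mini_tables: list[str]) -> str:
    """Concatenate the top-n mini-tables and append the question."""
    context = "\n\n".join(mini_tables)
    return (f"Answer using the tables below.\n\n{context}\n\n"
            f"Question: {question}\nAnswer:")

mt = format_mini_table("German cities", ["City", "Population"],
                       [["Berlin", "3.6M"]])
prompt = build_prompt("What is the population of Berlin?", [mt])
```

Because each mini-table carries only its header and top-5 rows, the assembled context stays small even when several tables are concatenated.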
A large-scale benchmark built on Natural Questions, where each query is answered by a single Wikipedia table. Tables are retrieved from a corpus of Wikipedia HTML tables, making lexical matching challenging due to sparse natural-language context.
Open-domain Table&Text QA requiring multi-hop reasoning over both structured tables and unstructured Wikipedia passages. Answers often span heterogeneous evidence, making it a significantly harder retrieval challenge.
| System | Training | R@1 | R@10 | R@50 |
|---|---|---|---|---|
| BM25 | None | 18.3 | 43.2 | 60.1 |
| DTR (DPR) | Fine-tuned | 32.1 | 68.4 | 82.6 |
| SSDR | Fine-tuned | 40.5 | 76.3 | 88.9 |
| THYME | Fine-tuned | 48.55 | 84.90 | 94.61 |
| CRAFT Stage 1 (Ours) | None | 34.38 | 72.90 | — |
| CRAFT Stage 2 (Ours) | None | — | 82.91 | — |
| CRAFT Stage 3 (Ours) | None | 49.84 | 86.83 | 97.17 |
Table 2: Retrieval performance on NQ-Tables test set. CRAFT sets new SOTA without any fine-tuning.
On the significantly harder OTT-QA benchmark, CRAFT achieves R@10 = 89.88 zero-shot — about 1.2 points behind fine-tuned THYME (91.10) — demonstrating strong generalization across very different retrieval conditions.
Under paraphrased query evaluation (Table 4 in the paper), fine-tuned DTR degrades by 8–12 points on average. CRAFT Stage 3 is essentially invariant to query reformulation, confirming that semantic embedding-based retrieval generalizes far beyond lexical patterns learned during fine-tuning.
The plots below show the trade-off between answer F1 and average context length (log scale) across different top-n configurations for 7B-scale LLMs. CRAFT Stage 3 consistently reaches high F1 at lower context budgets than fine-tuned baselines (DTR).
Figure 2: Average F1 vs. context length (log scale) on NQ-Tables for Mistral-7B, Llama-3.1-8B, and Qwen2.5-7B. Points are labeled with the optimal top-n value.
For large-scale LLMs (Mistral-Small, Llama-3.3-70B, Qwen2.5-72B), CRAFT Stage 3 with Gemini embeddings approaches gold-table performance at modest context lengths, while Stage 2 (MT) offers a lightweight, cost-effective alternative.
Figure 3: Average F1 vs. context length (log scale) on OTT-QA for large-scale models. "×" marks gold-table upper bound.
CRAFT's mini-table construction (top-5 relevant rows per table) dramatically reduces the average token count compared to providing full tables, while preserving the most query-relevant evidence. The table below shows average token counts across different models and top-n values.
Table 7: Average token counts (sub-table vs. full table) across models and top-k values. Sub-tables use 4–8× fewer tokens at every setting.
CRAFT is entirely plug-and-play. Any improvement in upstream embedding models (SPLADE, Sentence Transformers, OpenAI) directly translates to better retrieval.
While fine-tuned models degrade significantly under query reformulation, CRAFT's semantic embeddings are near-invariant — critical for real-world deployment.
Generating natural-language titles and summaries for tables closes the lexical gap between structured data and free-text queries — a broadly applicable technique.
Mini-tables reduce context length by 4–8×, enabling cost-effective inference with smaller LLMs while maintaining competitive answer accuracy.
Strong zero-shot transfer from NQ-Tables to OTT-QA (a completely different benchmark) demonstrates CRAFT's generalizability beyond a single retrieval setting.
Each stage can be replaced independently. Practitioners can swap in custom sparse retrievers, embeddings, or LLMs to adapt CRAFT to new corpora or languages.
If you find CRAFT useful, please cite our work:
@misc{singh2025crafttrainingfreecascadedretrieval,
  title={CRAFT: Training-Free Cascaded Retrieval for Tabular QA},
  author={Adarsh Singh and Kushal Raj Bhandari and Jianxi Gao and Soham Dan and Vivek Gupta},
  year={2025},
  eprint={2505.14984},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.14984},
}