No Universal Prompt: Unifying Reasoning through Adaptive Prompting for Temporal Table Reasoning

1University of Utah, 2Arizona State University, 3University of California Riverside, 4University of Pennsylvania
*Equal contribution

About

Temporal Table Reasoning is a critical challenge for Large Language Models (LLMs), requiring effective reasoning to extract relevant insights. Despite the existence of multiple prompting methods, their impact on table reasoning remains largely unexplored. Furthermore, model performance varies drastically across different table and context structures, making it difficult to determine an optimal approach.

This work investigates multiple prompting techniques across diverse table types and finds that performance depends on factors such as entity type, table structure, the need for additional context, and question complexity, with NO single method consistently outperforming the others.

TL;DR: We introduce SEAR, an adaptive prompting framework inspired by human reasoning that dynamically adjusts to context and integrates structured reasoning. Our results demonstrate that SEAR achieves superior performance across all table types compared to baseline prompting techniques. Additionally, we explore the impact of table structure refactoring, finding that a unified representation enhances model reasoning.

Why is Temporal Table Reasoning Challenging?

Temporal table QA requires models to reason over structured data while accounting for time-dependent relationships. This challenge arises from three key factors:

Structural Variability

Tables range from simple grids to hierarchical or semi-structured layouts with merged cells and implicit links (e.g., HiTab's multi-level indexes, HybridQA's tables mixed with text). They also come in diverse file formats (CSV, HTML, Markdown), so parsing must be flexible. SEAR first flattens and standardizes these varied structures, making them easier for downstream reasoning.
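As a rough illustration of this kind of normalization (not the paper's code), the sketch below loads a table given as CSV, HTML, or a Markdown pipe table into a single pandas DataFrame and serializes it into one flat text form. It assumes pandas (with an HTML parser backend) is installed, and the function names are ours.

```python
# Hypothetical normalization step: parse CSV/HTML/Markdown tables into one
# flat representation. Function names are illustrative, not from the paper.
import io
import pandas as pd

def load_table(source: str, fmt: str) -> pd.DataFrame:
    """Parse a table given as a string in one of several file formats."""
    if fmt == "csv":
        return pd.read_csv(io.StringIO(source))
    if fmt == "html":
        return pd.read_html(io.StringIO(source))[0]  # first table in the page
    if fmt == "markdown":
        rows = [r.strip().strip("|").split("|") for r in source.strip().splitlines()]
        rows = [[c.strip() for c in r] for r in rows
                if not set("".join(r)) <= set("-: ")]  # drop the |---| separator row
        return pd.DataFrame(rows[1:], columns=rows[0])
    raise ValueError(f"unsupported format: {fmt}")

def flatten(df: pd.DataFrame) -> str:
    """Serialize to a flat, row-major text form that downstream prompts consume."""
    lines = [" | ".join(map(str, df.columns))]
    lines += [" | ".join(map(str, row)) for row in df.itertuples(index=False)]
    return "\n".join(lines)
```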

Domain-Specific Complexity

Reasoning strategies must adapt to the table's domain. Wikipedia-based datasets like WikiTableQuestions demand general factual reasoning and entity linking. Financial datasets like FinQA or TAT-QA emphasize numerical reasoning, requiring multi-step arithmetic and temporal trend analysis. SEAR dynamically adapts to these needs by identifying relevant entities and values, then applying suitable prompting strategies such as F-CoT or PoT.
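A toy illustration of this kind of domain-aware routing (our simplification, not the paper's selection logic) might look like the following.

```python
# Toy router (illustrative only): map coarse table/question cues to one of the
# prompting strategies discussed in this work.
def pick_strategy(domain: str, needs_arithmetic: bool, has_linked_text: bool) -> str:
    if needs_arithmetic:
        # Financial tables (FinQA / TAT-QA style) favor program-based reasoning
        return "PoT" if domain == "finance" else "F-CoT"
    if has_linked_text:
        # Hybrid table+text settings (HybridQA style) benefit from extracting evidence first
        return "EE"
    # General factual questions over Wikipedia-style tables
    return "CoT"

print(pick_strategy("finance", needs_arithmetic=True, has_linked_text=False))  # -> PoT
```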

Question Complexity

Temporal QA questions range from direct lookups (e.g., "What year did the team win?") to complex reasoning (e.g., "What was the profit two quarters after policy X?"). These often require temporal anchoring, arithmetic, and sequential logic. SEAR addresses this by decomposing questions and tailoring its strategy based on both table and query characteristics.
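For the complex example above, a decomposition might look like the hand-written sketch below; in SEAR the model generates such sub-steps itself, so this is purely illustrative.

```python
# Hand-written illustration of decomposing the example question; SEAR produces
# such sub-steps automatically at run time.
question = "What was the profit two quarters after policy X?"
sub_steps = [
    "Locate the quarter in which policy X took effect (temporal anchoring).",
    "Advance two quarters from that anchor (sequential logic).",
    "Read the profit value for the resulting quarter (lookup / arithmetic).",
]
```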

Figure 1: Examples of Different Table and Contextual Structures

Adaptive Reasoning Framework: SEAR

Inspired by human problem-solving, we propose SEAR (Select-Elaborate-Answer & Reasoning), an adaptive framework that dynamically adjusts its reasoning strategy to the structure and complexity of the given table and question.

SEAR: Three-Step Process

Step 1: Select Crucial Steps

Identify key reasoning steps without answering directly, creating an efficient problem-solving path. This includes:

  • Problem Understanding: Define the question's objective and analyze the table structure
  • Reasoning Process: Select one or more of the following strategies: extracting relevant evidence, decomposing complex queries, applying logical steps, or generating Python code when needed
  • Optimization Tips: Simplify steps, retrieve direct answers when possible, and use code for numerical operations

Figure: SEAR Step 1 - Select Crucial Steps Prompt
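The exact Step 1 prompt is shown in the figure above; the template below is only a paraphrased sketch of its structure, with placeholder names of our own.

```python
# Paraphrased sketch of a Step 1 prompt (not the paper's exact wording).
SELECT_PROMPT = """You are given a table and a question.
Table:
{table}
Question: {question}

Do NOT answer yet. List only the crucial reasoning steps:
1. State the question's objective and the relevant parts of the table structure.
2. Choose one or more strategies: evidence extraction, query decomposition,
   logical reasoning, or Python code for numerical operations.
3. Keep the plan minimal; prefer a direct lookup when one suffices.
"""
```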

Step 2: Elaborate Crucial Steps

Refine and expand the selected steps for clarity and effectiveness:

  • Add contextual details, specify exact table elements, and refine decomposition
  • Ensure a structured and logically coherent flow toward the final answer

Figure: SEAR Step 2 - Elaborate Crucial Steps Prompt
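Again as a paraphrased sketch (the exact wording is in the figure), a Step 2 prompt consumes the output of Step 1 and asks the model to make the plan concrete.

```python
# Paraphrased sketch of a Step 2 prompt; {crucial_steps} is Step 1's output.
ELABORATE_PROMPT = """Table:
{table}
Question: {question}
Crucial steps selected so far:
{crucial_steps}

Elaborate each step: name the exact rows, columns, and cells involved, add any
missing contextual detail, and order the steps into a coherent plan.
Still do NOT produce the final answer.
"""
```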

Step 3: Answer & Reasoning

Execute the structured steps to derive an accurate, well-supported answer:

  • Follow elaborated steps precisely, referencing extracted evidence
  • Justify answers with logical explanations and, when possible, answer directly from the extracted evidence
  • Integrate Python code for calculations when needed

Figure: SEAR Step 3 - Answer & Reasoning Prompt
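Putting the three steps together, a minimal driver might look like the sketch below. It reuses the SELECT_PROMPT and ELABORATE_PROMPT templates sketched above and assumes a user-supplied chat(prompt) -> str helper around whichever LLM is used, so it is illustrative glue code rather than the paper's implementation.

```python
# Minimal three-step SEAR driver (illustrative glue code). `chat` is any
# callable that sends a prompt to an LLM and returns its text reply.
ANSWER_PROMPT = """Table:
{table}
Question: {question}
Elaborated plan:
{plan}

Follow the plan exactly, cite the evidence cells you use, apply Python-style
calculations where the plan calls for them, and state the final answer.
"""

def sear(table: str, question: str, chat) -> str:
    steps = chat(SELECT_PROMPT.format(table=table, question=question))        # Step 1
    plan = chat(ELABORATE_PROMPT.format(table=table, question=question,
                                        crucial_steps=steps))                 # Step 2
    return chat(ANSWER_PROMPT.format(table=table, question=question, plan=plan))  # Step 3
```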

SEAR_Unified: Single-Step Adaptive Prompting

Standard SEAR is a three-step process, which adds overhead and can reduce efficiency. To address this, we propose SEAR_Unified, a single-step adaptive prompt that merges SEAR's structured reasoning into a unified framework.

It dynamically selects and refines reasoning steps based on the query and table structure, retrieving key information, decomposing complex queries when needed, and selectively using Python for numerical operations. SEAR_Unified validates intermediate steps and performs error checks to ensure accuracy while reducing redundant complexity.

Figure: SEAR_Unified Prompt and Reasoning Path
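As with the prompts above, the single-pass variant can be sketched as one template plus one model call; the wording below is a paraphrase of the prompt in the figure, not a copy.

```python
# Paraphrased sketch of a single-pass SEAR_Unified prompt and call.
UNIFIED_PROMPT = """Table:
{table}
Question: {question}

In one pass: (1) identify the key evidence and, if the question is complex,
break it into sub-questions; (2) reason step by step, using Python-style
calculation only where numbers require it; (3) check each intermediate result
against the table before stating the final answer.
"""

def sear_unified(table: str, question: str, chat) -> str:
    return chat(UNIFIED_PROMPT.format(table=table, question=question))
```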

Table Refactoring

We introduce table and context refactoring as a preprocessing step that clarifies headers, aligns data, and removes irrelevant context. This improves retrieval precision, reduces reasoning errors, and enhances adaptability across diverse tabular formats.

Figure: Table Refactoring Example
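The sketch below shows the kind of operations this step performs on a pandas DataFrame: cleaning headers, coercing numeric columns, and dropping columns unrelated to the question. The heuristics (e.g., the 0.8 numeric threshold and the token-overlap filter) are our own illustration, not the paper's procedure.

```python
# Hypothetical refactoring pass: clarify headers, align numeric data, and drop
# columns irrelevant to the question. Heuristics are illustrative only.
import pandas as pd

def refactor(df: pd.DataFrame, question: str) -> pd.DataFrame:
    out = df.copy()
    # Clarify headers: strip whitespace/newlines so column names are readable
    out.columns = [str(c).strip().replace("\n", " ") for c in out.columns]
    # Align data: coerce mostly-numeric columns so downstream arithmetic works
    for col in out.columns:
        cleaned = out[col].astype(str).str.replace(",", "", regex=False)
        numeric = pd.to_numeric(cleaned, errors="coerce")
        if numeric.notna().mean() > 0.8:
            out[col] = numeric
    # Remove irrelevant context: keep columns sharing a token with the question
    tokens = set(question.lower().split())
    keep = [c for c in out.columns if set(str(c).lower().split()) & tokens]
    return out[keep] if keep else out
```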

Experimental Setup

Datasets

We selected eight diverse tabular datasets spanning structured, semi-structured, hierarchical, and hybrid tables to ensure a comprehensive evaluation. These datasets present challenges such as entity relations, numerical reasoning, and textual integration:

  • FeTaQA: Wikipedia tables; long-form answers from discontinuous facts (1,582 questions)
  • FinQA: Financial reports; multi-step numerical reasoning (962 questions)
  • HiTab: Hierarchical tables; fine-grained numeric questions (897 questions)
  • HybridQA: Wiki tables + linked text; hybrid reasoning (1,528 questions)
  • MultiHierTT: Finance; multiple hierarchical tables + long text (1,587 questions)
  • Squall: WikiTableQ + SQL alignments; structured query tasks (774 questions)
  • TAT-QA: Finance; tables + text with arithmetic/counting (2,244 questions)
  • WikiTableQuestions: Wikipedia trivia; factual + numeric Q over large tables (1,504 questions)
Table 1 attributes (binary indicator per dataset): Structure (Flat, Hierarchical, Hybrid); Domain (Wikipedia, Finance); Reasoning (Numerical, Textual); Question Types (Lookup, Multi-step, Temporal); Answer Types (Long-form, SQL). Datasets: FeTaQA, FinQA, HiTab, HybridQA, MultiHierTT, Squall, TAT-QA, WikiTableQuestions.

Table 1: Comparison of Temporal Table QA datasets by structure, domain, reasoning, and question types. HiTab spans Wikipedia and financial domains. Binary indicators simplify complex question types (e.g., SQL, long-form).

Models

We used three state-of-the-art LLMs:

  • GPT-4o-mini
  • Gemini 1.5 Flash
  • LLaMA 3.1 70B

Prompting Methods & Baselines

We evaluated 13 prompting strategies spanning direct, structured, temporal, and agentic approaches:

Baseline | Brief description | Category
Chain-of-Thought (CoT) (Wei et al., 2022) | Step-by-step natural-language rationale | Direct
Evidence Extraction (EE) | Extracts supporting cells first, then answers | Direct
Decomposed Prompting (Decomp) (Khot et al., 2023) | Splits complex queries into simpler sub-prompts | Direct
Faithful CoT (F-CoT) (Lyu et al., 2023) | Adds consistency checks to Chain-of-Thought | Direct
Program-of-Thought (PoT) (Chen et al., 2023) | Generates executable code (e.g., Python) for reasoning | Direct
Self-Discover (Zhou et al., 2024) | Model autonomously picks reasoning modules | Structured
Self-Ask (Press et al., 2023) | Iteratively asks and answers sub-questions | Structured
Plan & Solve (Wang et al., 2023) | Separates plan generation from execution | Structured
C.L.E.A.R. (Deng et al., 2025) | Injects temporal cues for semi-structured tables | Temporal
Narration of Thought (NoT) (Zhang et al., 2024) | Requires chronological narration to keep temporal order | Temporal
Self-Consistency Prompting (SCP) (Wang et al., 2023) | Samples multiple CoTs and votes | Agentic
Tree of Thought (ToT) (Yao et al., 2023) | Searches a tree of reasoning states with pruning | Agentic
Graph of Thought (GoT) (Besta et al., 2023) | Generalises ToT to graph search | Agentic

Table 4: Prompting baselines grouped by category.

Evaluation Metric

We propose the Hybrid Correctness Score (HCS), which balances lexical and semantic accuracy by combining Relaxed Exact Match Score (REMS, F1-based) and Contextual Answer Evaluation (CAE, LLM-based). A response is considered correct if its REMS score exceeds 80% or if CAE deems it correct.
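A minimal sketch of this decision rule is shown below: rems is implemented as a standard token-overlap F1 (the paper's exact REMS definition may differ), and the CAE judgment is passed in as a boolean because it comes from an LLM judge.

```python
# Sketch of the HCS decision rule. REMS is shown as token-overlap F1; the
# CAE verdict comes from an LLM judge and is passed in as a boolean here.
from collections import Counter

def rems(pred: str, gold: str) -> float:
    """Relaxed exact match as a token-level F1 score."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def hcs_correct(pred: str, gold: str, cae_correct: bool) -> bool:
    """Correct if REMS exceeds 0.8 (80%) OR the LLM-based CAE accepts it."""
    return rems(pred, gold) > 0.8 or cae_correct
```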

Results & Analysis

Key Findings

Across all three models, no single baseline prompting method wins consistently across table types. SEAR and SEAR_Unified deliver the strongest overall performance, with SEAR_Unified typically ahead of SEAR, and refactoring tables into a unified representation yields further gains on several datasets.

Performance Tables

Method  WikiTQ  MultiHierTT  HiTab  FinQA  TAT-QA  FeTaQA  Squall  HybridQA
CoT 73.60 58.79 79.04 60.08 87.30 71.30 69.90 80.76
F‑CoT 66.89 60.68 52.06 62.16 78.79 56.13 61.11 17.93
Decomp 78.52 61.00 75.47 62.58 91.67 67.07 67.57 74.67
EE 76.33 60.43 80.82 55.93 92.20 77.62 72.32 80.10
PoT 74.40 61.12 70.68 60.52 79.68 50.88 63.57 38.48
NoT 75.19 46.12 81.60 51.03 86.54 87.89 69.12 79.84
ToT 81.98 58.72 77.81 51.24 91.04 79.26 75.32 82.52
GoT 74.86 56.08 84.05 50.83 90.95 84.57 66.14 81.02
SCP 81.71 60.42 80.93 52.70 91.22 84.32 72.35 84.29
CLEAR 82.71 55.57 79.71 53.95 93.27 84.00 78.81 84.48
Self Ask 78.52 45.43 79.15 64.66 81.42 80.15 70.67 63.48
Plan & Solve 81.72 39.51 67.56 66.32 90.60 81.83 77.00 62.63
Self Discover 80.32 59.42 78.93 65.49 91.35 81.16 74.81 80.43
SEAR 81.45 60.18 79.71 65.90 90.02 82.87 80.23 81.15
SEAR_U 82.18 61.75 82.61 68.71 92.78 79.84 81.52 82.00
SEAR+R 82.71 58.54 81.05 65.49 89.39 84.20 78.04 65.90
SEAR_U+R 83.38 56.58 82.83 67.36 91.53 85.52 77.91 67.08

Table 6: HCS scores (in %) using Gemini 1.5 Flash. R stands for "Refactoring" and U stands for "Unified". Bold represents the best performer and underlined represents the second best performer.

Method  WikiTQ  MultiHierTT  HiTab  FinQA  TAT-QA  FeTaQA  Squall  HybridQA
CoT 78.92 57.97 77.59 64.14 92.91 84.13 67.57 78.21
F‑CoT 71.61 55.32 71.35 64.97 91.04 77.81 56.46 34.62
Decomp 79.79 57.03 76.14 65.18 92.65 78.45 62.40 77.68
EE 80.12 56.77 79.38 56.03 92.81 83.88 66.67 79.58
PoT 79.59 57.91 76.25 56.13 90.15 72.00 72.35 61.98
NoT 65.82 44.54 80.82 50.41 88.01 85.46 52.58 76.83
ToT 81.91 56.89 79.04 55.40 96.60 82.30 66.67 80.49
GoT 71.54 52.04 74.58 51.35 90.90 81.68 53.61 75.58
SCP 79.05 57.59 79.71 55.19 92.29 84.19 66.53 80.01
CLEAR 82.84 58.09 78.26 55.92 85.22 84.00 68.08 82.26
Self Ask 78.66 54.38 79.60 66.11 90.76 83.03 72.09 63.48
Plan & Solve 82.65 56.77 78.26 64.97 90.34 83.92 77.26 62.63
Self Discover 82.71 56.46 79.60 65.70 91.67 84.51 70.28 80.43
SEAR 80.19 57.40 77.37 67.26 92.42 83.38 69.64 75.33
SEAR_U 79.92 61.00 78.93 71.10 92.91 84.89 76.74 78.27
SEAR + R 82.91 56.65 78.82 66.94 91.84 84.77 79.33 68.72
SEAR_U + R 84.18 59.29 80.27 69.75 91.44 84.39 79.20 70.48

Table 7: HCS scores (in %) using GPT-4o mini. R stands for "Refactoring" and U stands for "Unified". Bold represents the best performer and underlined represents the second best performer.

Method  WikiTQ  MultiHierTT  HiTab  FinQA  TAT-QA  FeTaQA  Squall  HybridQA
CoT 81.05 57.59 82.95 66.22 91.00 86.03 75.45 81.66
F‑CoT 66.22 39.82 64.55 51.77 45.12 52.78 61.11 33.31
Decomp 82.85 59.29 81.84 65.28 93.18 84.51 73.51 80.53
EE 81.91 58.92 82.84 61.75 92.54 86.62 80.10 81.07
PoT 76.53 58.98 67.56 66.42 91.40 50.44 68.22 37.76
NoT 55.57 39.76 49.83 42.23 48.57 61.18 44.85 65.32
ToT 84.57 45.35 74.99 57.58 82.67 83.50 78.29 83.18
GoT 71.27 52.61 68.45 40.24 72.73 88.49 59.19 74.80
SCP 82.96 57.80 79.38 52.52 85.22 85.46 74.96 79.75
CLEAR 86.23 54.93 76.39 56.23 92.15 86.97 79.84 79.71
Self Ask 81.98 56.84 82.06 67.46 91.69 85.98 76.10 72.32
Plan & Solve 82.65 55.95 80.39 66.57 92.51 83.96 76.23 70.55
Self Discover 85.77 57.91 83.95 66.11 92.87 86.09 79.33 83.25
SEAR 82.65 59.61 83.05 66.63 92.34 85.52 81.40 79.78
SEAR_U 82.05 62.19 82.39 70.17 93.27 87.04 82.04 80.27
SEAR + R 82.65 57.09 82.39 67.26 91.67 86.85 76.87 67.74
SEAR_U + R 85.11 58.16 83.05 69.67 92.89 87.23 82.49 72.16

Table 8: HCS scores (in %) using LLaMA 3.1 70B. R stands for "Refactoring" and U stands for "Unified". Bold represents the best performer and underlined represents the second best performer.

Error Distribution

Figure 2: Distribution of Error Types Across Datasets

Conclusion

This paper introduces SEAR, an adaptive reasoning strategy for LLMs to tackle Temporal Table QA tasks, along with its consolidated version, SEAR_Unified. Additionally, we take a step toward a unified table representation by incorporating table refactoring as an enhancement.

Our study provides a comprehensive analysis of various reasoning strategies across eight diverse datasets, benchmarking SEAR and SEAR_Unified against multiple baselines. The results demonstrate that SEAR and SEAR_Unified, with and without table refactoring, significantly outperform popular LLM reasoning methods, and that SEAR_Unified surpasses SEAR itself, showcasing its ability to streamline reasoning with minimal overhead.

This highlights the capability of modern LLMs to dynamically adjust reasoning within a single prompt, reducing the need for explicit multi-step processes. Our findings reinforce the importance of adaptive reasoning and structured table representation, paving the way for further advancements in LLM-based temporal table reasoning.

BibTeX

@misc{anonymous2025nouniversal,
      title={No Universal Prompt: Unifying Reasoning through Adaptive Prompting for Temporal Table Reasoning},
      author={Anonymous},
      year={2025},
      eprint={2506.11246},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.11246},
}