No Universal Prompt: Unifying Reasoning through Adaptive Prompting for Temporal Table Reasoning

1University of Utah, 2Arizona State University, 3University of California Riverside, 4University of Pennsylvania
*Equal contribution

About

Temporal Table Reasoning is a critical challenge for Large Language Models (LLMs), requiring effective reasoning to extract relevant insights. Despite the existence of multiple prompting methods, their impact on table reasoning remains largely unexplored. Furthermore, model performance varies drastically across different table and context structures, making it difficult to determine an optimal approach.

This work investigates multiple prompting techniques across diverse table types and finds that performance depends on factors such as entity type, table structure, the need for additional context, and question complexity, with NO single method consistently outperforming the others.

TL;DR: We introduce SEAR, an adaptive prompting framework inspired by human reasoning that dynamically adjusts to context and integrates structured reasoning. Our results demonstrate that SEAR achieves superior performance across all table types compared to baseline prompting techniques. Additionally, we explore the impact of table structure refactoring, finding that a unified representation enhances model reasoning.

Why is Temporal Table Reasoning Challenging?

Temporal table QA requires models to reason over structured data while accounting for time-dependent relationships. This challenge arises from three key factors:

Structural Variability

Tables range from simple grids to hierarchical or semi-structured layouts with merged cells and implicit links (e.g., HiTab's multi-level indexes, HybridQA's tables mixed with text). They also come in diverse file formats (CSV, HTML, Markdown), so parsing must be flexible. SEAR first flattens and standardizes these varied structures, making them easier for downstream reasoning.
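As a rough illustration of this kind of normalization (not the paper's code), the sketch below loads a table given as CSV, HTML, or a Markdown pipe table into a single pandas DataFrame and serializes it into one flat text form. It assumes pandas (with an HTML parser backend) is installed, and the function names are ours.

```python
# Hypothetical normalization step: parse CSV/HTML/Markdown tables into one
# flat representation. Function names are illustrative, not from the paper.
import io
import pandas as pd

def load_table(source: str, fmt: str) -> pd.DataFrame:
    """Parse a table given as a string in one of several file formats."""
    if fmt == "csv":
        return pd.read_csv(io.StringIO(source))
    if fmt == "html":
        return pd.read_html(io.StringIO(source))[0]  # first table in the page
    if fmt == "markdown":
        rows = [r.strip().strip("|").split("|") for r in source.strip().splitlines()]
        rows = [[c.strip() for c in r] for r in rows
                if not set("".join(r)) <= set("-: ")]  # drop the |---| separator row
        return pd.DataFrame(rows[1:], columns=rows[0])
    raise ValueError(f"unsupported format: {fmt}")

def flatten(df: pd.DataFrame) -> str:
    """Serialize to a flat, row-major text form that downstream prompts consume."""
    lines = [" | ".join(map(str, df.columns))]
    lines += [" | ".join(map(str, row)) for row in df.itertuples(index=False)]
    return "\n".join(lines)
```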

Domain-Specific Complexity

Reasoning strategies must adapt to the table's domain. Wikipedia-based datasets like WikiTableQuestions demand general factual reasoning and entity linking. Financial datasets like FinQA or TAT-QA emphasize numerical reasoning, requiring multi-step arithmetic and temporal trend analysis. SEAR dynamically adapts to these needs by identifying relevant entities and values, then applying suitable prompting strategies such as F-CoT or PoT.
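A toy illustration of this kind of domain-aware routing (our simplification, not the paper's selection logic) might look like the following.

```python
# Toy router (illustrative only): map coarse table/question cues to one of the
# prompting strategies discussed in this work.
def pick_strategy(domain: str, needs_arithmetic: bool, has_linked_text: bool) -> str:
    if needs_arithmetic:
        # Financial tables (FinQA / TAT-QA style) favor program-based reasoning
        return "PoT" if domain == "finance" else "F-CoT"
    if has_linked_text:
        # Hybrid table+text settings (HybridQA style) benefit from extracting evidence first
        return "EE"
    # General factual questions over Wikipedia-style tables
    return "CoT"

print(pick_strategy("finance", needs_arithmetic=True, has_linked_text=False))  # -> PoT
```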

Question Complexity

Temporal QA questions range from direct lookups (e.g., "What year did the team win?") to complex reasoning (e.g., "What was the profit two quarters after policy X?"). These often require temporal anchoring, arithmetic, and sequential logic. SEAR addresses this by decomposing questions and tailoring its strategy based on both table and query characteristics.
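For the complex example above, a decomposition might look like the hand-written sketch below; in SEAR the model generates such sub-steps itself, so this is purely illustrative.

```python
# Hand-written illustration of decomposing the example question; SEAR produces
# such sub-steps automatically at run time.
question = "What was the profit two quarters after policy X?"
sub_steps = [
    "Locate the quarter in which policy X took effect (temporal anchoring).",
    "Advance two quarters from that anchor (sequential logic).",
    "Read the profit value for the resulting quarter (lookup / arithmetic).",
]
```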

Figure 1: Examples of Different Table and Contextual Structures

Adaptive Reasoning Framework: SEAR

Inspired by human problem-solving, we propose SEAR (Select-Elaborate-Answer & Reasoning), an adaptive framework that dynamically adjusts its reasoning strategy to the structure and complexity of the given table and question.

SEAR: Three-Step Process

Step 1: Select Crucial Steps

Identify key reasoning steps without answering directly, creating an efficient problem-solving path. This includes:

  • Problem Understanding: Define the question's objective and analyze the table structure
  • Reasoning Process: Select one or more of the following strategies: extracting relevant evidence, decomposing complex queries, applying logical steps, or generating Python code when needed
  • Optimization Tips: Simplify steps, retrieve direct answers when possible, and use code for numerical operations

Figure: SEAR Step 1 - Select Crucial Steps Prompt
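The exact Step 1 prompt is shown in the figure above; the template below is only a paraphrased sketch of its structure, with placeholder names of our own.

```python
# Paraphrased sketch of a Step 1 prompt (not the paper's exact wording).
SELECT_PROMPT = """You are given a table and a question.
Table:
{table}
Question: {question}

Do NOT answer yet. List only the crucial reasoning steps:
1. State the question's objective and the relevant parts of the table structure.
2. Choose one or more strategies: evidence extraction, query decomposition,
   logical reasoning, or Python code for numerical operations.
3. Keep the plan minimal; prefer a direct lookup when one suffices.
"""
```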

Step 2: Elaborate Crucial Steps

Refine and expand the selected steps for clarity and effectiveness:

  • Add contextual details, specify exact table elements, and refine decomposition
  • Ensure a structured and logically coherent flow toward the final answer

Figure: SEAR Step 2 - Elaborate Crucial Steps Prompt
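Again as a paraphrased sketch (the exact wording is in the figure), a Step 2 prompt consumes the output of Step 1 and asks the model to make the plan concrete.

```python
# Paraphrased sketch of a Step 2 prompt; {crucial_steps} is Step 1's output.
ELABORATE_PROMPT = """Table:
{table}
Question: {question}
Crucial steps selected so far:
{crucial_steps}

Elaborate each step: name the exact rows, columns, and cells involved, add any
missing contextual detail, and order the steps into a coherent plan.
Still do NOT produce the final answer.
"""
```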

Step 3: Answer & Reasoning

Execute the structured steps to derive an accurate, well-supported answer:

  • Follow elaborated steps precisely, referencing extracted evidence
  • Justify answers with logical explanations and, when possible, answer directly from the extracted evidence
  • Integrate Python code for calculations when needed

Figure: SEAR Step 3 - Answer & Reasoning Prompt
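Putting the three steps together, a minimal driver might look like the sketch below. It reuses the SELECT_PROMPT and ELABORATE_PROMPT templates sketched above and assumes a user-supplied chat(prompt) -> str helper around whichever LLM is used, so it is illustrative glue code rather than the paper's implementation.

```python
# Minimal three-step SEAR driver (illustrative glue code). `chat` is any
# callable that sends a prompt to an LLM and returns its text reply.
ANSWER_PROMPT = """Table:
{table}
Question: {question}
Elaborated plan:
{plan}

Follow the plan exactly, cite the evidence cells you use, apply Python-style
calculations where the plan calls for them, and state the final answer.
"""

def sear(table: str, question: str, chat) -> str:
    steps = chat(SELECT_PROMPT.format(table=table, question=question))        # Step 1
    plan = chat(ELABORATE_PROMPT.format(table=table, question=question,
                                        crucial_steps=steps))                 # Step 2
    return chat(ANSWER_PROMPT.format(table=table, question=question, plan=plan))  # Step 3
```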

SEAR_Unified: Single-Step Adaptive Prompting

Standard SEAR is a three-step process, which adds overhead and can reduce efficiency. To address this, we propose SEAR_Unified, a single-step adaptive prompt that merges SEAR's structured reasoning into a unified framework.

It dynamically selects and refines reasoning steps based on the query and table structure, retrieving key information, decomposing complex queries when needed, and selectively using Python for numerical operations. SEAR_Unified validates intermediate steps and performs error checks to ensure accuracy while reducing redundant complexity.

Figure: SEAR_Unified Prompt and Reasoning Path
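As with the prompts above, the single-pass variant can be sketched as one template plus one model call; the wording below is a paraphrase of the prompt in the figure, not a copy.

```python
# Paraphrased sketch of a single-pass SEAR_Unified prompt and call.
UNIFIED_PROMPT = """Table:
{table}
Question: {question}

In one pass: (1) identify the key evidence and, if the question is complex,
break it into sub-questions; (2) reason step by step, using Python-style
calculation only where numbers require it; (3) check each intermediate result
against the table before stating the final answer.
"""

def sear_unified(table: str, question: str, chat) -> str:
    return chat(UNIFIED_PROMPT.format(table=table, question=question))
```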

Table Refactoring

We introduce table and context refactoring as a preprocessing step that clarifies headers, aligns data, and removes irrelevant context. This improves retrieval precision, reduces reasoning errors, and enhances adaptability across diverse tabular formats.

Figure: Table Refactoring Example
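The sketch below shows the kind of operations this step performs on a pandas DataFrame: cleaning headers, coercing numeric columns, and dropping columns unrelated to the question. The heuristics (e.g., the 0.8 numeric threshold and the token-overlap filter) are our own illustration, not the paper's procedure.

```python
# Hypothetical refactoring pass: clarify headers, align numeric data, and drop
# columns irrelevant to the question. Heuristics are illustrative only.
import pandas as pd

def refactor(df: pd.DataFrame, question: str) -> pd.DataFrame:
    out = df.copy()
    # Clarify headers: strip whitespace/newlines so column names are readable
    out.columns = [str(c).strip().replace("\n", " ") for c in out.columns]
    # Align data: coerce mostly-numeric columns so downstream arithmetic works
    for col in out.columns:
        cleaned = out[col].astype(str).str.replace(",", "", regex=False)
        numeric = pd.to_numeric(cleaned, errors="coerce")
        if numeric.notna().mean() > 0.8:
            out[col] = numeric
    # Remove irrelevant context: keep columns sharing a token with the question
    tokens = set(question.lower().split())
    keep = [c for c in out.columns if set(str(c).lower().split()) & tokens]
    return out[keep] if keep else out
```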

Experimental Setup

Datasets

We selected eight diverse tabular datasets spanning structured, semi-structured, hierarchical, and hybrid tables to ensure a comprehensive evaluation. These datasets present challenges such as entity relations, numerical reasoning, and textual integration:

  • FeTaQA: Wikipedia tables; long-form answers from discontinuous facts (1,582 questions)
  • FinQA: Financial reports; multi-step numerical reasoning (962 questions)
  • HiTab: Hierarchical tables; fine-grained numeric questions (897 questions)
  • HybridQA: Wiki tables + linked text; hybrid reasoning (1,528 questions)
  • MultiHierTT: Finance; multiple hierarchical tables + long text (1,587 questions)
  • Squall: WikiTableQ + SQL alignments; structured query tasks (774 questions)
  • TAT-QA: Finance; tables + text with arithmetic/counting (2,244 questions)
  • WikiTableQuestions: Wikipedia trivia; factual + numeric Q over large tables (1,504 questions)
Table 1 attributes (binary indicator per dataset): Structure (Flat, Hierarchical, Hybrid); Domain (Wikipedia, Finance); Reasoning (Numerical, Textual); Question Types (Lookup, Multi-step, Temporal); Answer Types (Long-form, SQL). Datasets: FeTaQA, FinQA, HiTab, HybridQA, MultiHierTT, Squall, TAT-QA, WikiTableQuestions.

Table 1: Comparison of Temporal Table QA datasets by structure, domain, reasoning, and question types. HiTab spans Wikipedia and financial domains. Binary indicators simplify complex question types (e.g., SQL, long-form).

Models

We used three state-of-the-art LLMs:

  • GPT-4o-mini
  • Gemini 1.5 Flash
  • LLaMA 3.1 70B

Prompting Methods & Baselines

We evaluated 13 prompting strategies spanning direct, structured, temporal, and agentic approaches:

Baseline | Brief description | Category
Chain-of-Thought (CoT) (Wei et al., 2022) | Step-by-step natural-language rationale | Direct
Evidence Extraction (EE) | Extracts supporting cells first, then answers | Direct
Decomposed Prompting (Decomp) (Khot et al., 2023) | Splits complex queries into simpler sub-prompts | Direct
Faithful CoT (F-CoT) (Lyu et al., 2023) | Adds consistency checks to Chain-of-Thought | Direct
Program-of-Thought (PoT) (Chen et al., 2023) | Generates executable code (e.g., Python) for reasoning | Direct
Self-Discover (Zhou et al., 2024) | Model autonomously picks reasoning modules | Structured
Self-Ask (Press et al., 2023) | Iteratively asks and answers sub-questions | Structured
Plan & Solve (Wang et al., 2023) | Separates plan generation from execution | Structured
C.L.E.A.R. (Deng et al., 2025) | Injects temporal cues for semi-structured tables | Temporal
Narration of Thought (NoT) (Zhang et al., 2024) | Requires chronological narration to keep temporal order | Temporal
Self-Consistency Prompting (SCP) (Wang et al., 2023) | Samples multiple CoTs and votes | Agentic
Tree of Thought (ToT) (Yao et al., 2023) | Searches a tree of reasoning states with pruning | Agentic
Graph of Thought (GoT) (Besta et al., 2023) | Generalises ToT to graph search | Agentic

Table 4: Prompting baselines grouped by category.

Evaluation Metric

We propose the Hybrid Correctness Score (HCS), which balances lexical and semantic accuracy by combining Relaxed Exact Match Score (REMS, F1-based) and Contextual Answer Evaluation (CAE, LLM-based). A response is considered correct if its REMS score exceeds 80% or if CAE deems it correct.
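A minimal sketch of this decision rule is shown below: rems is implemented as a standard token-overlap F1 (the paper's exact REMS definition may differ), and the CAE judgment is passed in as a boolean because it comes from an LLM judge.

```python
# Sketch of the HCS decision rule. REMS is shown as token-overlap F1; the
# CAE verdict comes from an LLM judge and is passed in as a boolean here.
from collections import Counter

def rems(pred: str, gold: str) -> float:
    """Relaxed exact match as a token-level F1 score."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def hcs_correct(pred: str, gold: str, cae_correct: bool) -> bool:
    """Correct if REMS exceeds 0.8 (80%) OR the LLM-based CAE accepts it."""
    return rems(pred, gold) > 0.8 or cae_correct
```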

Results & Analysis

Key Findings

Across all three models, no single baseline prompting method wins consistently across table types. SEAR and SEAR_Unified deliver the strongest overall performance, with SEAR_Unified typically ahead of SEAR, and refactoring tables into a unified representation yields further gains on several datasets.

Performance Tables

Method  WikiTQ  MultiHierTT  HiTab  FinQA  TAT-QA  FeTaQA  Squall  HybridQA
CoT 73.60 58.79 79.04 60.08 87.30 71.30 69.90 80.76
F‑CoT 66.89 60.68 52.06 62.16 78.79 56.13 61.11 17.93
Decomp 78.52 61.00 75.47 62.58 91.67 67.07 67.57 74.67
EE 76.33 60.43 80.82 55.93 92.20 77.62 72.32 80.10
PoT 74.40 61.12 70.68 60.52 79.68 50.88 63.57 38.48
NoT 75.19 46.12 81.60 51.03 86.54 87.89 69.12 79.84
ToT 81.98 58.72 77.81 51.24 91.04 79.26 75.32 82.52
GoT 74.86 56.08 84.05 50.83 90.95 84.57 66.14 81.02
SCP 81.71 60.42 80.93 52.70 91.22 84.32 72.35 84.29
CLEAR 82.71 55.57 79.71 53.95 93.27 84.00 78.81 84.48
Self Ask 78.52 45.43 79.15 64.66 81.42 80.15 70.67 63.48
Plan & Solve 81.72 39.51 67.56 66.32 90.60 81.83 77.00 62.63
Self Discover 80.32 59.42 78.93 65.49 91.35 81.16 74.81 80.43
SEAR 81.45 60.18 79.71 65.90 90.02 82.87 80.23 81.15
SEAR_U 82.18 61.75 82.61 68.71 92.78 79.84 81.52 82.00
SEAR+R 82.71 58.54 81.05 65.49 89.39 84.20 78.04 65.90
SEAR_U+R 83.38 56.58 82.83 67.36 91.53 85.52 77.91 67.08

Table 6: HCS scores (in %) using Gemini 1.5 Flash. R stands for "Refactoring" and U stands for "Unified". Bold represents the best performer and underlined represents the second best performer.

Method  WikiTQ  MultiHierTT  HiTab  FinQA  TAT-QA  FeTaQA  Squall  HybridQA
CoT 78.92 57.97 77.59 64.14 92.91 84.13 67.57 78.21
F‑CoT 71.61 55.32 71.35 64.97 91.04 77.81 56.46 34.62
Decomp 79.79 57.03 76.14 65.18 92.65 78.45 62.40 77.68
EE 80.12 56.77 79.38 56.03 92.81 83.88 66.67 79.58
PoT 79.59 57.91 76.25 56.13 90.15 72.00 72.35 61.98
NoT 65.82 44.54 80.82 50.41 88.01 85.46 52.58 76.83
ToT 81.91 56.89 79.04 55.40 96.60 82.30 66.67 80.49
GoT 71.54 52.04 74.58 51.35 90.90 81.68 53.61 75.58
SCP 79.05 57.59 79.71 55.19 92.29 84.19 66.53 80.01
CLEAR 82.84 58.09 78.26 55.92 85.22 84.00 68.08 82.26
Self Ask 78.66 54.38 79.60 66.11 90.76 83.03 72.09 63.48
Plan & Solve 82.65 56.77 78.26 64.97 90.34 83.92 77.26 62.63
Self Discover 82.71 56.46 79.60 65.70 91.67 84.51 70.28 80.43
SEAR 80.19 57.40 77.37 67.26 92.42 83.38 69.64 75.33
SEAR_U 79.92 61.00 78.93 71.10 92.91 84.89 76.74 78.27
SEAR + R 82.91 56.65 78.82 66.94 91.84 84.77 79.33 68.72
SEAR_U + R 84.18 59.29 80.27 69.75 91.44 84.39 79.20 70.48

Table 7: HCS scores (in %) using GPT-4o mini. R stands for "Refactoring" and U stands for "Unified". Bold represents the best performer and underlined represents the second best performer.

Method  WikiTQ  MultiHierTT  HiTab  FinQA  TAT-QA  FeTaQA  Squall  HybridQA
CoT 81.05 57.59 82.95 66.22 91.00 86.03 75.45 81.66
F‑CoT 66.22 39.82 64.55 51.77 45.12 52.78 61.11 33.31
Decomp 82.85 59.29 81.84 65.28 93.18 84.51 73.51 80.53
EE 81.91 58.92 82.84 61.75 92.54 86.62 80.10 81.07
PoT 76.53 58.98 67.56 66.42 91.40 50.44 68.22 37.76
NoT 55.57 39.76 49.83 42.23 48.57 61.18 44.85 65.32
ToT 84.57 45.35 74.99 57.58 82.67 83.50 78.29 83.18
GoT 71.27 52.61 68.45 40.24 72.73 88.49 59.19 74.80
SCP 82.96 57.80 79.38 52.52 85.22 85.46 74.96 79.75
CLEAR 86.23 54.93 76.39 56.23 92.15 86.97 79.84 79.71
Self Ask 81.98 56.84 82.06 67.46 91.69 85.98 76.10 72.32
Plan & Solve 82.65 55.95 80.39 66.57 92.51 83.96 76.23 70.55
Self Discover 85.77 57.91 83.95 66.11 92.87 86.09 79.33 83.25
SEAR 82.65 59.61 83.05 66.63 92.34 85.52 81.40 79.78
SEAR_U 82.05 62.19 82.39 70.17 93.27 87.04 82.04 80.27
SEAR + R 82.65 57.09 82.39 67.26 91.67 86.85 76.87 67.74
SEAR_U + R 85.11 58.16 83.05 69.67 92.89 87.23 82.49 72.16

Table 8: HCS scores (in %) using LLaMA 3.1 70B. R stands for "Refactoring" and U stands for "Unified". Bold represents the best performer and underlined represents the second best performer.

Error Distribution

Figure 2: Distribution of Error Types Across Datasets

Conclusion

This paper introduces SEAR, an adaptive reasoning strategy for LLMs to tackle Temporal Table QA tasks, along with its consolidated version, SEAR_Unified. Additionally, we take a step toward a unified table representation by incorporating table refactoring as an enhancement.

Our study provides a comprehensive analysis of various reasoning strategies across eight diverse datasets, benchmarking SEAR and SEAR_Unified against multiple baselines. The results demonstrate that SEAR and SEAR_Unified, with and without table refactoring, significantly outperform popular LLM reasoning methods, and that SEAR_Unified surpasses SEAR itself, showcasing its ability to streamline reasoning with minimal overhead.

This highlights the capability of modern LLMs to dynamically adjust reasoning within a single prompt, reducing the need for explicit multi-step processes. Our findings reinforce the importance of adaptive reasoning and structured table representation, paving the way for further advancements in LLM-based temporal table reasoning.

BibTeX

@misc{anonymous2025nouniversal,
      title={No Universal Prompt: Unifying Reasoning through Adaptive Prompting for Temporal Table Reasoning},
      author={Anonymous},
      year={2025},
      eprint={2506.11246},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.11246},
}