Map&Make: Schema Guided Text to Table Generation

Arizona State University
*Equal contribution

ACL 2025 (Main)

Map&Make vs Traditional Methods

Map&Make scales test-time compute through modular decomposition for improved text-to-table generation

Abstract

Transforming dense, unstructured text into interpretable tables—commonly referred to as Text-to-Table generation—is a key task in information extraction. Existing methods often overlook what complex information to extract and how to infer it from text. We present Map&Make, a versatile approach that decomposes text into atomic propositions to infer latent schemas, which are then used to generate tables capturing both qualitative nuances and quantitative facts. We evaluate our method on three challenging datasets: Rotowire, known for its complex, multi-table schema; Livesum, which requires numerical aggregation; and Wiki40B, which requires open text extraction from multiple domains. By correcting hallucination errors in Rotowire, we also provide a cleaner benchmark. Our method shows significant gains in both accuracy and interpretability across comprehensive comparative and referenceless metrics. Finally, ablation studies highlight the key factors driving performance and validate the utility of our approach in structured summarization.

Introduction

Text-to-table generation requires extracting structured information from unstructured narratives—a task that challenges current LLMs in determining what to extract and how to infer missing information. Map&Make addresses these challenges through a modular pipeline that scales test-time computation: atomizing text into propositions, extracting schemas, and generating tables.

This approach improves performance across datasets with different characteristics: from sports summaries requiring complex schemas, to live commentary demanding numerical reasoning, to open-domain content needing flexible extraction.

Datasets

We evaluate on three diverse benchmarks that stress-test different aspects of text-to-table generation:

🏀 RotoWire (Wiseman et al., 2017): 728 NBA game summaries requiring complex multi-table schemas with player and team statistics. Following concerns raised by Wu et al. (2022) and Struc-Bench (Tang et al., 2024), we provide a corrected test set that addresses hallucination errors in the original annotations; it is available on Hugging Face.

⚽ Livesum (Deng et al., 2021): 1,462 line-by-line football commentaries requiring numerical aggregation into team tables, testing numerical reasoning across diverse event categories.

📚 Wiki40B [EN] (Guo et al., 2020): 500 open-domain Wikipedia articles spanning diverse topics, requiring flexible schema extraction without predefined structures.

Evaluation

We employ a comprehensive evaluation suite combining String-Similarity metrics and Specialized metrics:

  • String-Similarity Metrics: Exact Match, chrF (Popović, ACL 2015), and BERTScore (Zhang et al., ICLR 2020) for cell-level accuracy against gold annotations. While these metrics are suitable for matching table structure, extracted attributes, and player names (row and column headers), they fail to capture discrepancies in cell-level values.
  • TabEval (Ramu et al., EMNLP 2024): Entailment-based evaluation that captures table semantics by comparing atomic statements, providing stronger correlation with human judgments.
  • AutoQA (Jain et al., ACL 2024): Question-answering-based fidelity that measures information coverage without requiring gold tables.
  • Numerical Metrics: Error rates and RMSE for aggregation accuracy on Livesum; a minimal sketch of these metrics follows this list.
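
For concreteness, here is a minimal sketch of the Livesum numerical metrics computed over count tables. The dict-of-cells table representation and the function names are our illustration, not the paper's implementation:

```python
# Minimal sketch of the Livesum numerical metrics. Tables are represented
# as dicts mapping (row, column) cells to integer counts; this layout and
# the function names are illustrative, not the paper's implementation.
import math

def error_rate(pred: dict, gold: dict) -> float:
    """Fraction of gold cells whose predicted count is wrong or missing."""
    wrong = sum(1 for cell, value in gold.items() if pred.get(cell) != value)
    return wrong / len(gold)

def rmse(pred: dict, gold: dict) -> float:
    """Root-mean-square error over gold cells; a missing prediction counts as 0."""
    sq_errors = [(pred.get(cell, 0) - value) ** 2 for cell, value in gold.items()]
    return math.sqrt(sum(sq_errors) / len(sq_errors))

gold = {("Arsenal", "Shots"): 14, ("Arsenal", "Fouls"): 9}
pred = {("Arsenal", "Shots"): 13, ("Arsenal", "Fouls"): 9}
print(error_rate(pred, gold), rmse(pred, gold))  # 0.5 0.7071...
```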

Key Idea: Map&Make Pipeline

🎯 Core Innovation: Scaling test-time compute through modular decomposition

Map&Make is a modular 3-step prompting pipeline that enables step-by-step reasoning:

1️⃣ Atomization: Decomposes text into atomic propositions for granular information extraction

2️⃣ Schema Extraction: Identifies latent table structures from atomized facts

3️⃣ Table Generation: Populates cells using extracted schema and information

This decomposition enables better information coverage and integration compared to end-to-end approaches, allowing models to reason explicitly about what to extract and how to structure it.
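
The pipeline can be read as three chained prompts. Below is a minimal sketch, assuming an OpenAI-compatible chat API; the prompt wording and model name are placeholders, not the paper's exact prompts:

```python
# Minimal sketch of the three-step Map&Make prompting pipeline.
# Assumes an OpenAI-compatible chat API; the prompts and model name
# below are placeholders, not the paper's exact prompts.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable chat model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def map_and_make(text: str) -> str:
    # 1. Atomization: decompose the passage into self-contained facts.
    atoms = ask(f"Decompose this text into atomic propositions, one per line:\n{text}")
    # 2. Schema extraction: infer latent table structure from the atoms.
    schema = ask(f"Propose table schemas (table names, row and column headers) that cover these facts:\n{atoms}")
    # 3. Table generation: populate the schemas using only the atomic facts.
    return ask(f"Fill these schemas using only the facts given.\nSchemas:\n{schema}\nFacts:\n{atoms}")
```

Because each step is a separate call, intermediate outputs can be inspected, and test-time compute scales stage by stage rather than in one opaque generation.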

Map&Make Framework

Results

🏀 RotoWire: String-Similarity Metrics

Map&Make outperforms Chain-of-Thought and T3 baselines across cell-level metrics (Exact Match, chrF, BERTScore) on both GPT-4o and Gemini-2.0. The modular pipeline's step-by-step reasoning improves extraction accuracy, particularly for complex multi-table schemas with player and team statistics.

🏀 RotoWire: Specialized Metrics

Using TabEval (entailment-based evaluation) and AutoQA (question-answering fidelity, referenceless), Map&Make demonstrates superior table-text consistency and information coverage. These specialized metrics confirm that our pipeline generates tables more faithful to the source text, beyond just matching gold annotations.

⚽ Livesum: Numerical Aggregation

On Livesum, which requires aggregating line-by-line football commentary into team statistics, Map&Make achieves lower error rates and RMSE across different event categories. The pipeline's atomization step helps identify individual events before aggregation, reducing both undercounting and overcounting errors in numerical reasoning.
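
To illustrate why atomization helps here: once commentary is reduced to (team, event) propositions, building the team table is plain counting. The event labels below are examples, not Livesum's full category set:

```python
# Illustrative aggregation step: atomized commentary events reduce the
# table-building problem to counting. Event labels are examples only.
from collections import Counter

atoms = [
    ("Arsenal", "Shots"), ("Arsenal", "Shots"),
    ("Chelsea", "Fouls"), ("Arsenal", "Corner Kicks"),
]
table = Counter(atoms)               # (team, event) -> aggregated count
print(table[("Arsenal", "Shots")])   # 2
```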

📚 Wiki40B: Open-Domain Extraction

Wiki40B tests open-domain table extraction from diverse Wikipedia articles without predefined schemas. Map&Make's schema extraction step enables flexible adaptation to varied content types, demonstrating the pipeline's generalizability beyond sports and numerical aggregation tasks to arbitrary structured information extraction.

Analysis

📊 Error Patterns on RotoWire

Analyzing error patterns reveals that Map&Make reduces both row and column errors compared to baselines, demonstrating improved schema coverage and entity recognition. The modular approach helps maintain consistency across table dimensions.

🔢 Numerical Reasoning on Livesum

We analyze overcounting (values exceeding ground truth) vs. undercounting (values below ground truth). Map&Make shows balanced RMSE across both error types, indicating more reliable numerical aggregation through its atomization-based approach.
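
A sketch of this decomposition, computing RMSE separately over overcounted and undercounted cells; the table representation matches the earlier metric sketch and is ours, not the paper's:

```python
# Sketch of the overcounting / undercounting analysis: RMSE computed
# separately over cells the model overshoots vs. undershoots.
import math

def over_under_rmse(pred: dict, gold: dict) -> tuple[float, float]:
    def rmse(xs: list[float]) -> float:
        return math.sqrt(sum(x * x for x in xs) / len(xs)) if xs else 0.0

    diffs = [pred.get(cell, 0) - gold[cell] for cell in gold]
    over = [d for d in diffs if d > 0]    # model overcounted
    under = [d for d in diffs if d < 0]   # model undercounted
    return rmse(over), rmse(under)
```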

📈 Schema Coverage Scaling

As table sizes increase on RotoWire, Map&Make maintains higher schema coverage compared to baselines. The explicit schema extraction step enables the model to identify and populate more diverse column types, especially for complex multi-table scenarios.
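
One plausible way to quantify schema coverage is the fraction of gold headers the predicted table recovers. This simplified proxy, with only whitespace stripping and lowercasing as normalization, is our reading of the analysis rather than the paper's exact script:

```python
# Simplified schema-coverage proxy: fraction of gold row/column headers
# present in the predicted table, after minimal normalization.
def schema_coverage(pred_headers: set[str], gold_headers: set[str]) -> float:
    def norm(headers: set[str]) -> set[str]:
        return {h.strip().lower() for h in headers}

    gold = norm(gold_headers)
    return len(norm(pred_headers) & gold) / len(gold)

print(schema_coverage({"Points", "Assists"}, {"Points", "Assists", "Rebounds"}))  # ~0.67
```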

Fine-Tuning Smaller Models

  • AutoQA Score: 48.65 (vs. 41.42 with CoT)
  • Player Recall: 61.58% (vs. 57.74% with CoT)

Beyond inference-time prompting, Map&Make serves as a scalable data generation engine. We used Gemini 2.0 with our pipeline to generate high-quality supervision on the RotoWire training set, then fine-tuned LLaMA 3 8B Instruct. Results show that fine-tuning with Map&Make-generated data yields clear gains over Chain-of-Thought prompting in AutoQA and player-level recall.

However, team-level recall drops (41.61 vs. 56.26), indicating challenges in aggregating broader content—likely due to formatting consistency and long-context reasoning limitations in smaller models. While the full Map&Make pipeline (inference-time) still outperforms both alternatives, these results demonstrate the framework's effectiveness as a data generation tool for low-resource and multilingual settings where fine-tuning is necessary.
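
As a hedged sketch of the data generation recipe: pipeline outputs on training summaries become (instruction, response) pairs for supervised fine-tuning. The file layout and field names below are assumptions for illustration; the exact format used to fine-tune LLaMA 3 8B Instruct may differ:

```python
# Illustrative data-generation loop: run the pipeline over training
# summaries and emit JSONL pairs for supervised fine-tuning. Field
# names and file layout are assumptions, not the paper's exact recipe.
import json

def write_sft_data(summaries: list[str], path: str = "rotowire_sft.jsonl") -> None:
    with open(path, "w") as f:
        for text in summaries:
            tables = map_and_make(text)  # pipeline sketch from above
            pair = {
                "instruction": f"Generate tables for this game summary:\n{text}",
                "response": tables,
            }
            f.write(json.dumps(pair) + "\n")
```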

BibTeX

@inproceedings{ahuja-etal-2025-map,
    title = "Map{\&}Make: Schema Guided Text to Table Generation",
    author = "Ahuja, Naman  and
      Bardoliya, Fenil  and
      Baral, Chitta  and
      Gupta, Vivek",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.1460/",
    pages = "30249--30262",
    ISBN = "979-8-89176-251-0",
    abstract = "Transforming dense, unstructured text into interpretable tables{---}commonly referred to as Text-to-Table generation{---}is a key task in information extraction. Existing methods often overlook what complex information to extract and how to infer it from text. We present Map{\&}Make, a versatile approach that decomposes text into atomic propositions to infer latent schemas, which are then used to generate tables capturing both qualitative nuances and quantitative facts. We evaluate our method on three challenging datasets: Rotowire, known for its complex, multi-table schema; Livesum which requires numerical aggregation; and Wiki40 which require open text extraction from mulitple domains. By correcting hallucination errors in Rotowire, we also provide a cleaner benchmark. Our method shows significant gains in both accuracy and interpretability across comprehensive comparative and referenceless metrics. Finally, ablation studies highlight the key factors driving performance and validate the utility of our approach in structured summarization. Code and data are available at: https://coral-lab-asu.github.io/map-make."
}