Moneyball with LLMs: Analyzing Tabular Summarization in Sports Narratives

Decomposition helps, but the main bottleneck is consistent multi-entity memory over long contexts.

Benchmark

The benchmark isolates long-context state tracking in two multi-entity settings.

Each event updates structured state for the right entity and role.
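The state-tracking task above can be pictured as folding a stream of events into a table keyed by entity and role. The following is only a minimal sketch under stated assumptions: the `Event` fields, the role labels, and the statistic names are hypothetical illustrations, not the benchmark's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One narrated event. All field names are assumptions for illustration."""
    actor: str   # entity the update applies to
    role: str    # hypothetical role label, e.g. "batter" or "bowler"
    stat: str    # hypothetical statistic name
    amount: int  # how much the statistic changes


def apply_events(events):
    """Fold events into a table keyed by (entity, role)."""
    table = defaultdict(lambda: defaultdict(int))
    for ev in events:
        # The same event text often implies updates for several entities;
        # here each update is already split into its own Event.
        table[(ev.actor, ev.role)][ev.stat] += ev.amount
    return {key: dict(stats) for key, stats in table.items()}


events = [
    Event("Player A", "batter", "runs", 4),
    Event("Player B", "bowler", "runs_conceded", 4),
    Event("Player A", "batter", "runs", 1),
    Event("Player B", "bowler", "runs_conceded", 1),
    Event("Player B", "bowler", "wickets", 1),
]
final_table = apply_events(events)
```

A model that credits an update to the wrong entity or the wrong role ends up with a final table that differs from this fold, even if every individual event was read correctly.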

Decomposition helps. EntityCoT best supports the memory bottleneck claim.

Gains over the CoT baseline, in percentage points (pp):

Divide: +14.9pp

EntityCoT: +15.1pp

T3: +20.6pp

Perturbations change failure mode, not just score.

HOI: Perturbations shift error direction

Gemini, often Qwen → hallucination-heavy

GPT-4.1 → omission-heavy

Entanglement can flip direction in cricket
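To make the two error directions concrete, the sketch below compares a predicted table with a gold table and reports which kind of error dominates. It does not compute HOI, which is not defined in this summary. The function name, the cell keys, and the rule of comparing over-counts with under-counts are all assumptions chosen for illustration.

```python
def error_direction(gold, predicted):
    """Label a prediction by which error type dominates.

    Hypothetical rule (not the source's HOI definition):
    over-counted mass is treated as hallucination,
    under-counted or missing mass as omission.
    """
    keys = set(gold) | set(predicted)
    over = sum(max(predicted.get(k, 0) - gold.get(k, 0), 0) for k in keys)
    under = sum(max(gold.get(k, 0) - predicted.get(k, 0), 0) for k in keys)
    if over > under:
        return "hallucination-heavy"
    if under > over:
        return "omission-heavy"
    return "balanced"


gold = {("Player A", "runs"): 5, ("Player B", "wickets"): 1}
too_much = {("Player A", "runs"): 7, ("Player B", "wickets"): 1}
too_little = {("Player A", "runs"): 3}

label_over = error_direction(gold, too_much)
label_under = error_direction(gold, too_little)
```

Two outputs can have the same overall accuracy and still fall on opposite sides of this split, which is why a perturbation can change the failure mode without changing the score much.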

Exposure effects are real, but they differ by model and domain.

Decomposition improves scores. Stable multi-entity memory remains the main bottleneck.