Moneyball with LLMs: Analyzing Tabular Summarization in Sports Narratives
Decomposition helps, but the main bottleneck is consistent multi-entity memory over long contexts.
Benchmark
The benchmark isolates long-context state tracking in two multi-entity settings.
Each event updates structured state for the right entity and role.
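As a minimal sketch of what "updating structured state for the right entity and role" could look like, the Python below folds a stream of events into a per-(entity, role) stat table. The `Event` fields, the role names, and the `build_table` helper are hypothetical and chosen for illustration; they are not taken from the benchmark itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One narrated event (hypothetical schema, for illustration only)."""
    entity: str  # e.g. a player name
    role: str    # e.g. "batter" or "bowler"
    stat: str    # e.g. "runs" or "balls"
    value: int   # amount to add to the running total


def build_table(events):
    """Accumulate each event into the row keyed by (entity, role)."""
    table = defaultdict(lambda: defaultdict(int))
    for ev in events:
        # The same entity can appear under several roles, so the role
        # is part of the row key rather than a column.
        table[(ev.entity, ev.role)][ev.stat] += ev.value
    # Convert nested defaultdicts to plain dicts for a clean result.
    return {key: dict(stats) for key, stats in table.items()}


events = [
    Event("Player A", "batter", "runs", 4),
    Event("Player B", "bowler", "balls", 1),
    Event("Player A", "batter", "runs", 1),
    Event("Player A", "bowler", "balls", 1),
]
table = build_table(events)
for key in sorted(table):
    print(key, table[key])
```

Keying rows on the (entity, role) pair keeps a player's batting and bowling totals apart, which is the kind of bookkeeping a model has to maintain implicitly over a long narrative.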
Results
Decomposition helps. EntityCoT best supports the memory bottleneck claim.
Method       Change vs. CoT
CoT          baseline
Divide       +14.9pp
EntityCoT    +15.1pp
T3           +20.6pp
- Decomposition always helps.
- EntityCoT reduces multi-entity interference.
- T3 is best overall, but expensive.
Robustness
Perturbations change failure mode, not just score.
HOI: Perturbations shift error direction
Gemini and, often, Qwen → hallucination-heavy
GPT-4.1 → omission-heavy
Entanglement can flip direction in cricket
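To make "error direction" concrete, here is a rough sketch that compares a predicted table against a gold table and labels the result as hallucination-heavy or omission-heavy. This is an illustrative assumption, not the definition of HOI used in the work: the cell representation, the `error_direction` helper, and the simple majority rule are all hypothetical.

```python
def error_direction(gold, pred):
    """Count omitted and hallucinated cells and label the dominant direction.

    Both tables map a (entity, stat) cell key to a value. This is a
    hypothetical simplification, not the HOI metric itself.
    """
    # Omission: a gold cell the prediction never produced.
    omitted = sum(1 for cell in gold if cell not in pred)
    # Hallucination: a predicted cell with no counterpart in gold.
    hallucinated = sum(1 for cell in pred if cell not in gold)
    # Cells present in both tables but with a different value.
    wrong_value = sum(
        1 for cell, value in pred.items()
        if cell in gold and gold[cell] != value
    )

    if hallucinated > omitted:
        label = "hallucination-heavy"
    elif omitted > hallucinated:
        label = "omission-heavy"
    else:
        label = "balanced"
    return {
        "omitted": omitted,
        "hallucinated": hallucinated,
        "wrong_value": wrong_value,
        "label": label,
    }


gold = {("A", "runs"): 5, ("B", "runs"): 2, ("B", "wickets"): 1}
pred = {("A", "runs"): 5, ("B", "runs"): 2,
        ("C", "runs"): 3, ("C", "wickets"): 1}
result = error_direction(gold, pred)
print(result)
```

Two predictions with the same score can still differ in direction under this kind of breakdown, which is why a perturbation can change the failure mode without moving the headline number much.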
Memorization
Exposure effects are real, but they differ by model and domain.
Conclusion
Decomposition improves scores. Stable multi-entity memory remains the main bottleneck.