Moneyball with LLMs: Analyzing Tabular Summarization in Sports Narratives

Decomposition helps, but the main bottleneck is consistent multi-entity memory over long contexts.

Benchmark

The benchmark isolates long-context state tracking in two multi-entity settings.

Ball-by-ball commentary updates batsman and bowler tables over time.

Each event updates structured state for the right entity and role.
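As a minimal sketch of what "updating structured state for the right entity and role" can look like in the cricket setting, the code below folds a list of ball-by-ball events into a batsman table and a bowler table. The `Ball` record, its field names, and the table columns are assumptions made for illustration, not the benchmark's actual schema, and the sketch ignores extras (wides, no-balls) and run-outs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Ball:
    """One delivery (hypothetical schema, for illustration only)."""
    batsman: str
    bowler: str
    runs: int            # runs scored off the bat on this delivery
    wicket: bool = False  # True if the bowler dismissed the batsman


def new_batting_row():
    return {"runs": 0, "balls": 0, "out": False}


def new_bowling_row():
    return {"balls": 0, "runs_conceded": 0, "wickets": 0}


def summarize(events):
    """Fold events into per-batsman and per-bowler tables."""
    batting = defaultdict(new_batting_row)
    bowling = defaultdict(new_bowling_row)
    for ball in events:
        # Same event, two roles: the batsman's row and the bowler's row
        # are both updated, each with role-specific fields.
        bat = batting[ball.batsman]
        bat["runs"] += ball.runs
        bat["balls"] += 1
        bat["out"] = bat["out"] or ball.wicket

        bowl = bowling[ball.bowler]
        bowl["balls"] += 1
        bowl["runs_conceded"] += ball.runs
        bowl["wickets"] += int(ball.wicket)
    return dict(batting), dict(bowling)
```

A model reading the commentary has to perform the equivalent bookkeeping implicitly, for every player at once, which is where multi-entity interference can arise.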

Results

Decomposition helps. EntityCoT best supports the memory bottleneck claim.

CoT baseline
Divide +14.9pp
EntityCoT +15.1pp
T3 +20.6pp
  • Decomposition always helps.
  • EntityCoT reduces multi-entity interference.
  • T3 is best overall, but expensive.

Robustness

Perturbations change failure mode, not just score.

HOI: Perturbations shift error direction
  • Gemini, often Qwen → hallucination-heavy
  • GPT-4.1 → omission-heavy
  • Entanglement can flip direction in cricket
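One way to make "error direction" concrete is to compare a predicted table with the gold table cell by cell and count which kind of error dominates. The sketch below is an assumption-laden illustration, not the benchmark's metric: the function names, the `(entity, field)` cell keys, and the rule "more extra cells than missing cells means hallucination-heavy" are all hypothetical.

```python
def error_profile(gold, pred):
    """Compare two tables given as {(entity, field): value} dicts.

    Assumed definitions (illustrative only):
      omitted      - cell is in gold but absent from pred
      hallucinated - cell is in pred but absent from gold
      wrong        - cell is in both, with different values
    """
    omitted = [k for k in gold if k not in pred]
    hallucinated = [k for k in pred if k not in gold]
    wrong = [k for k in gold if k in pred and pred[k] != gold[k]]
    return {
        "omitted": len(omitted),
        "hallucinated": len(hallucinated),
        "wrong": len(wrong),
    }


def error_direction(profile):
    """Label which error type dominates under the assumed rule."""
    if profile["hallucinated"] > profile["omitted"]:
        return "hallucination-heavy"
    if profile["omitted"] > profile["hallucinated"]:
        return "omission-heavy"
    return "balanced"
```

Running the same comparison on clean and perturbed inputs would show whether a perturbation moves a model from one label to the other, rather than only lowering its score.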

Memorization

Exposure effects are real, but they differ by model and domain.

Conclusion

Decomposition improves scores. Stable multi-entity memory remains the main bottleneck.