We evaluate on three diverse benchmarks that stress-test different aspects of text-to-table generation:
🏀 RotoWire (Wiseman et al., 2017): 728 NBA game summaries requiring complex multi-table schemas with player and team statistics. Following concerns raised by Wu et al. (2022) and Strucbench (2024) about hallucination errors in the original annotations, we release a corrected test set on Hugging Face.
⚽ Livesum (Deng et al., 2021): 1,462 line-by-line football commentaries requiring numerical aggregation into team tables, testing numerical reasoning across diverse event categories.
📚 Wiki40B [EN] (Guo et al., 2020): 500 open-domain Wikipedia articles spanning diverse topics, requiring flexible schema extraction without predefined structures.
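To make the Livesum-style aggregation concrete, the core step — turning per-event commentary mentions into a per-team count table — can be sketched as follows. This is a toy illustration with invented team names and event labels, not the benchmark's actual pipeline or schema:

```python
from collections import Counter

# Hypothetical commentary events: each line of commentary is assumed to
# yield one (team, event_type) pair after extraction.
events = [
    ("Arsenal", "shot"),
    ("Chelsea", "foul"),
    ("Arsenal", "corner"),
    ("Arsenal", "shot"),
    ("Chelsea", "shot"),
]

def aggregate(events):
    """Aggregate (team, event) pairs into a per-team table of event counts."""
    table = {}
    for team, event in events:
        table.setdefault(team, Counter())[event] += 1
    return table

table = aggregate(events)
print(table["Arsenal"]["shot"])  # 2
```

Models are expected to perform this aggregation implicitly from raw text, which is what makes the benchmark a test of numerical reasoning rather than simple extraction.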