Data

Datasets

NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models

Pranshu Pandya, Vatsal Gupta, Agney S Talwarr, Tushar Kataria, Dan Roth, Vivek Gupta NAACL 2025

NTSEBENCH is a benchmark dataset of 2,728 multiple-choice questions with 4,642 images across 26 categories, sourced from India's National Talent Search Examination (NTSE). It evaluates cognitive multimodal reasoning skills, covering puzzles, series, analogies, and aptitude tasks that go beyond rote learning.
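
Since the items are image-grounded multiple-choice questions, a minimal record sketch helps make the format concrete. The field names below are illustrative, not the dataset's released schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NTSEItem:
    """One multiple-choice item; field names are illustrative, not the released schema."""
    question: str
    options: List[str]              # answer choices
    answer: str                     # correct option label, e.g. "C"
    category: str                   # one of the 26 categories (series, analogy, ...)
    images: List[str] = field(default_factory=list)  # paths to the item's figures

item = NTSEItem(
    question="Which figure completes the series?",
    options=["A", "B", "C", "D"],
    answer="C",
    category="figure-series",
    images=["q_123_stem.png", "q_123_options.png"],
)
```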

TempTabQA: Temporal Table Question Answering

Irwin Deng, Kushagra Dixit, Dan Roth, Vivek Gupta NAACL 2025

TempTabQA is a dataset of 11,454 question–answer pairs from 1,208 Wikipedia Infobox tables across 90+ domains, designed to test temporal reasoning in semi-structured data. Evaluations show that leading NLP models lag human performance by over 13.5 F1 points. It serves as a benchmark to advance models’ ability to handle temporal information in complex data structures.
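
The headline gap is measured in F1. As a point of reference, here is the standard SQuAD-style token-overlap F1 commonly used for short-answer QA; TempTabQA's exact scoring script may differ in its normalization details.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-overlap F1 between a predicted and a gold answer."""
    pred_toks, gold_toks = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(token_f1("14 March 1879", "March 1879"))  # 0.8
```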

MapWise: Spatial Reasoning over Maps

Srija Mukhopadhyay, Abhishek Rajgaria, Prerana Khatiwada, Manish Shrivastava, Dan Roth, Vivek Gupta NAACL 2025

MAPWise is a multimodal benchmark comprising 3,000 manually annotated question–answer pairs (1,000 per country) derived from official statistics of India, the USA, and China. The maps are generated in multiple formats—2 legend types (discrete, continuous), 2 annotation settings (with and without annotations), hatching textures, 3 colormap variations, and 2 background colors—systematically varying visual complexity. Questions span 3 types (binary, direct value extraction, region association) and 6 answer formats (Yes/No, single word, count, list, range, ranking), enabling fine-grained evaluation of models' numerical, categorical, and geospatial reasoning.
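
The visual variations compose multiplicatively, so each underlying map yields a whole grid of chart variants. A quick sketch of that combinatorics, with made-up labels for the axes (the benchmark's actual naming will differ):

```python
from itertools import product

# Hypothetical labels for the variation axes described above.
legends     = ["discrete", "continuous"]
annotations = ["with", "without"]
colormaps   = ["cmap1", "cmap2", "cmap3"]
backgrounds = ["light", "dark"]

variants = list(product(legends, annotations, colormaps, backgrounds))
print(len(variants))  # 24 base combinations per map, before hatched-texture variants
```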

TempTabQA-C

Irwin Deng, Kushagra Dixit, Dan Roth, Vivek Gupta NAACL 2025

TEMPTABQA-C is a large-scale, semi-automatically generated benchmark of 200,000 question–answer pairs derived from Wikipedia infoboxes and stored in a relational schema to test temporal tabular reasoning. Questions are annotated along three axes: Original vs. Counterfactual (fact perturbations), Small vs. Large tables, and Easy → Hard reasoning difficulty. The dataset supports evaluation of direct prompting versus symbolic SQL generation, showing that SQL integration significantly improves robustness to fact changes, table size, and reasoning complexity.
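
The symbolic route is easy to picture with a toy relational layout: facts live as rows, the model emits SQL, and the same query stays correct even when a fact is counterfactually perturbed. A minimal sqlite3 sketch under that assumption (the benchmark's real schema is richer):

```python
import sqlite3

# Toy layout: one row per (entity, attribute, value) triple.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE infobox (entity TEXT, attribute TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO infobox VALUES (?, ?, ?)",
    [
        ("Roger Federer", "turned_pro", "1998"),
        ("Roger Federer", "retired", "2022"),
    ],
)

# A model-emitted query for "How many years was he a professional?".
# Perturbing either year above changes the answer, not the query.
query = """
SELECT CAST(r.value AS INTEGER) - CAST(p.value AS INTEGER)
FROM infobox p JOIN infobox r ON p.entity = r.entity
WHERE p.attribute = 'turned_pro' AND r.attribute = 'retired'
  AND p.entity = 'Roger Federer'
"""
print(conn.execute(query).fetchone()[0])  # 24
```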

MMTabQA: Multimodal Table Question Answering

Suyash Vardhan Mathur, Jainit Sushil Bafna, Kunal Kartik, Harshita Khandelwal, Manish Shrivastava, Vivek Gupta, Mohit Bansal, Dan Roth EMNLP 2024

MMTabQA is a large-scale multimodal table QA dataset containing 69,740 questions over 25,026 tables, created by augmenting four existing datasets—WikiSQL (21,472 Qs, 9,784 tables), WikiTableQuestions (10,052 Qs, 1,259 tables), FeTaQA (7,476 Qs, 5,898 tables), and HybridQA (30,470 Qs, 8,085 tables). It integrates multiple reasoning styles: SQL-based parsing, complex multi-row/column reasoning, long-form answers, and hybrid table–text reasoning with contextual passages. Questions are categorized into 4 types—Explicit (24,797), Implicit (21,453), Visual (5,763), and Answer-Mention (17,727)—with an average of 14.10 images per table, enabling evaluation of entity parsing, visual grounding, and multimodal reasoning skills.
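
What makes the tables multimodal is that cells can carry images in place of, or alongside, entity text. A minimal sketch of such a record, with illustrative field names rather than the released format:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Cell:
    """A table cell that may hold text, an image reference, or both."""
    text: Optional[str] = None
    image: Optional[str] = None   # path or URL of an embedded entity image

@dataclass
class MMTabQAExample:
    table: List[List[Cell]]
    question: str
    answer: str
    qtype: str   # "explicit" | "implicit" | "visual" | "answer-mention"
```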

FlowVQA: Visual Question Answering over Flowcharts

Shubhankar Singh, Purvi Chaurasia, Yerram Varun, Pranshu Pandya, Vatsal Gupta, Vivek Gupta, Dan Roth ACL 2024

FlowVQA is a visual question answering benchmark built around 2,272 human-verified flowchart images and 22,413 Q&A pairs, sourced from WikiHow (1,121), Instructables (701), and FloCo (450). Questions span four categories—Fact Retrieval, Applied Scenario, Flow Referential, and Topological—to assess multimodal LLM skills in visual grounding, logical progression, and spatial reasoning. Baseline evaluations show even the best-performing approach (GPT-4 with directive-based few-shot prompting) reaches only 68.42% majority voting accuracy, highlighting the challenge of structured visual logic understanding.
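
The reported number is accuracy after majority voting over several sampled answers per question. A minimal sketch of that aggregation, assuming exact-match comparison (the paper's evaluation details may differ):

```python
from collections import Counter
from typing import List

def majority_vote_accuracy(samples: List[List[str]], gold: List[str]) -> float:
    """Accuracy after taking the most common answer among samples per question."""
    correct = 0
    for answers, g in zip(samples, gold):
        voted, _ = Counter(answers).most_common(1)[0]
        correct += voted == g
    return correct / len(gold)

print(majority_vote_accuracy([["yes", "yes", "no"], ["4", "5", "5"]], ["yes", "4"]))  # 0.5
```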

InfoSync: Synchronizing Information Across Tables

Sidharth Khincha, Chelsi Jain, Vivek Gupta, Tushar Kataria, Shuo Zhang ACL 2023

InfoSync is a large-scale multilingual table synchronization dataset containing ~99,440 Wikipedia infoboxes (1,078,717 rows) across 14 languages, spanning 22 diverse categories such as Airport, Album, City, Company, Country, and Person. It includes both translation-based and native-annotated test sets (~3,500 table pairs) for evaluating two core tasks: Information Alignment (row mapping across languages) and Information Update (propagating missing or outdated rows). Human-assisted Wikipedia edits using InfoSync updates achieved a 77.28% acceptance rate, demonstrating its utility for improving cross-lingual consistency in semi-structured data.
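
The alignment task boils down to matching rows across language versions of the same infobox. A minimal sketch of the idea via key translation into a pivot language; translate below is a toy lexicon stand-in for any MT system, and InfoSync's actual method also exploits value overlap and fuzzy matching:

```python
def translate(key: str, src: str, tgt: str = "en") -> str:
    """Toy stand-in for machine translation of infobox keys."""
    lexicon = {("hauptstadt", "de"): "capital", ("einwohner", "de"): "population"}
    return lexicon.get((key.lower(), src), key.lower())

def align_rows(table_a: dict, table_b: dict, lang_b: str) -> list:
    """Pair rows whose keys agree after translating table_b's keys to English."""
    aligned = []
    for key_a, val_a in table_a.items():
        for key_b, val_b in table_b.items():
            if key_a.lower() == translate(key_b, lang_b):
                aligned.append((key_a, key_b, val_a, val_b))
    return aligned

en = {"Capital": "Berlin", "Population": "83M"}
de = {"Hauptstadt": "Berlin", "Einwohner": "84M"}   # out-of-date row to update
print(align_rows(en, de, "de"))
```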

InfoTabS: Inference on Tables as Semi-structured Data

Vivek Gupta, Maitrey Mehta, Pegah Nokhiz, Vivek Srikumar ACL 2020

INFOTABS is a dataset of 23,738 premise–hypothesis pairs, where each premise is a Wikipedia infobox and each hypothesis is a short sentence. It contains 2,540 unique infoboxes drawn from Wikipedia articles across various categories, with hypotheses written by Amazon Mechanical Turk workers. Determining the correct label—entailment, contradiction, or neutral—often requires combining inferences across table rows with world knowledge, and human verification confirms the dataset's high quality.
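
Because the premise is a table rather than running text, models typically consume a linearized version of the infobox. A minimal sketch of one common linearization template (illustrative, not the paper's exact wording):

```python
def linearize(title: str, infobox: dict) -> str:
    """Turn an infobox into premise sentences for a standard NLI model."""
    return " ".join(f"The {key.lower()} of {title} is {value}." for key, value in infobox.items())

premise = linearize("Albert Einstein", {"Born": "14 March 1879", "Field": "Physics"})
hypothesis = "Einstein was born in the 19th century."   # label: entailment
print(premise)
```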

Demo

Software

PRAISE: Enhancing Product Descriptions with LLM-Driven Structured Insights

Adnan Qidwai, Srija Mukhopadhyay, Prerana Khatiwada, Dan Roth, Vivek Gupta ACL 2025

Accurate and complete product descriptions are crucial for e-commerce, yet seller-provided information often falls short. Customer reviews offer valuable details but are laborious to sift through manually. We present PRAISE: Product Review Attribute Insight Structuring Engine, a novel system that uses Large Language Models (LLMs) to automatically extract, compare, and structure insights from customer reviews and seller descriptions. PRAISE provides users with an intuitive interface to identify missing, contradictory, or partially matching details between these two sources, presenting the discrepancies in a clear, structured format alongside supporting evidence from reviews.
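
The comparison step is the heart of the system: attributes extracted from reviews are checked against the listing, and each mismatch is surfaced with a status. A rule-based sketch of that step's output shape, with illustrative status labels and JSON fields rather than PRAISE's actual pipeline (the LLM extraction step is omitted):

```python
import json

def compare_attributes(seller: dict, review_extracted: dict) -> list:
    """Flag attributes that reviews mention but the listing omits or contradicts."""
    findings = []
    for attr, review_val in review_extracted.items():
        seller_val = seller.get(attr)
        if seller_val is None:
            findings.append({"attribute": attr, "status": "missing_from_listing",
                             "review_value": review_val})
        elif seller_val != review_val:
            findings.append({"attribute": attr, "status": "contradictory",
                             "listing_value": seller_val, "review_value": review_val})
    return findings

seller = {"battery_life": "10 hours"}
reviews = {"battery_life": "6 hours", "water_resistant": "yes"}
print(json.dumps(compare_attributes(seller, reviews), indent=2))
```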