MAMMQA

Rethinking Information Synthesis in Multimodal Question Answering — A Multi‑Agent Perspective

Tejas Anvekar, Krishna Singh Rajput, Chitta Baral, Vivek Gupta

Overview

Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine‑tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a single, generalized reasoning strategy, overlooking the unique characteristics of each modality—ultimately limiting both accuracy and interpretability.

MAMMQA is a multi‑agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text‑based Large Language Model (LLM) agent. The first VLM decomposes the user query into sub‑questions and sequentially retrieves partial answers from each modality. The second VLM synthesizes and refines these results through cross‑modal reasoning. Finally, the LLM integrates the insights into a cohesive answer. This modular design enhances interpretability by making the reasoning process transparent and allows each agent to operate within its domain of expertise. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi‑agent framework consistently outperforms existing baselines in both accuracy and robustness.

Why is Multimodal QA Hard?

Existing multimodal QA often uses a single, generalized reasoning strategy or flattens inputs, overlooking the unique characteristics of text, tables, and images. This limits both accuracy and interpretability.

MAMMQA addresses these limitations with a cooperative, multi‑agent design that makes the reasoning process transparent while letting each agent operate in its domain of expertise.

How MAMMQA Works

MAMMQA uses two Visual Language Model (VLM) agents and one text‑based Large Language Model (LLM). The first VLM decomposes the user query into sub‑questions and retrieves partial answers from each modality. The second VLM synthesizes these results via cross‑modal reasoning. Finally, the LLM integrates insights into a cohesive answer.

Figure: MAMMQA architecture — two VLM agents (decompose, cross‑modal synthesis) and one LLM aggregator.
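
To make the three‑stage flow concrete, here is a minimal sketch in Python. The `call_vlm` and `call_llm` callables, the prompt wording, and the "ABSTAIN" convention are illustrative assumptions, not MAMMQA's released prompts or API.

```python
# Minimal sketch of the three-stage MAMMQA-style pipeline described above.
# `call_vlm` / `call_llm` are hypothetical stand-ins for whatever VLM/LLM
# backend is used; prompt wording is illustrative, not the paper's.
from typing import Callable, Dict


def mammqa_pipeline(
    question: str,
    context: Dict[str, str],          # e.g. {"text": ..., "table": ..., "image": ...}
    call_vlm: Callable[[str], str],   # stages 1-2: visual language model
    call_llm: Callable[[str], str],   # stage 3: text-only aggregator
) -> str:
    # Stage 1: modality experts decompose the question and probe each
    # available modality independently for grounded partial evidence.
    partial = {}
    for modality, content in context.items():
        partial[modality] = call_vlm(
            f"You are a {modality} expert. Decompose the question into "
            f"sub-questions and answer only from the {modality} evidence below. "
            f"Reply 'ABSTAIN' if the evidence is insufficient.\n"
            f"Question: {question}\nEvidence: {content}"
        )

    # Stage 2: the cross-modal expert reconciles the partial answers,
    # checks for conflicts, and produces one consolidated evidence set.
    consolidated = call_vlm(
        "Reconcile the following per-modality findings, flag any conflicts, "
        "and return a consolidated evidence summary.\n"
        + "\n".join(f"[{m}] {a}" for m, a in partial.items())
    )

    # Stage 3: the aggregator answers strictly from the consolidated
    # evidence, abstaining when it does not support an answer.
    return call_llm(
        f"Answer strictly from the evidence; abstain if unsupported.\n"
        f"Question: {question}\nEvidence: {consolidated}"
    )
```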

Agents and Roles

  • Modality Experts (VLM): Decompose the question; probe text, table, and image independently; extract grounded evidence per modality.
  • Cross‑Modal Expert (VLM): Join and reconcile signals across modalities; perform conflict checks and produce a consolidated evidence set.
  • Aggregator (LLM): Answer strictly from the consolidated evidence; can abstain when evidence is insufficient (a question‑agnostic variant improves calibration). A prompt sketch for these roles follows this list.
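
The sketch below shows how a single role‑consistent template could be instantiated per agent; the template text and role phrasing are our own illustration, not the paper's exact prompts.

```python
# Illustrative only: the role descriptions paraphrase the agent list above;
# the exact prompt wording used in MAMMQA is not reproduced here.
UNIFIED_TEMPLATE = (
    "You are the {role}.\n"
    "Task: {task}\n"
    "Rules: ground every claim in the given evidence; reply 'ABSTAIN' if the "
    "evidence is insufficient.\n"
    "Input:\n{payload}"
)

ROLES = {
    "text_expert":  "modality expert for text passages",
    "table_expert": "modality expert for tables",
    "image_expert": "modality expert for images",
    "cross_modal":  "cross-modal expert that reconciles per-modality findings",
    "aggregator":   "aggregator that answers strictly from consolidated evidence",
}


def build_prompt(role_key: str, task: str, payload: str) -> str:
    """Instantiate the shared template for one agent role."""
    return UNIFIED_TEMPLATE.format(role=ROLES[role_key], task=task, payload=payload)


# Example: the table expert probing a flattened table for one sub-question.
print(build_prompt(
    "table_expert",
    "Answer the sub-question using only the table.",
    "Sub-question: ...\nTable: ...",
))
```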

What’s New in MAMMQA

  • MAMMQA: A fully prompt‑driven, multi‑agent MMQA framework that splits reasoning into three interpretable stages — modality experts, cross‑modal synthesis, and evidence‑grounded aggregation — without any fine‑tuning.
  • Unified, role‑consistent agents: A single prompt template reused across text, table, and image experts, enabling dynamic activation, efficient inference, and transparent error tracing (see the sketch after this list).
  • Zero‑shot performance: Outperforms CoT, CapCoT, and ToT baselines and matches or exceeds several fine‑tuned models on MultiModalQA and ManyModalQA across both proprietary and open‑source LLMs.
  • Robustness and calibration: Static agents beat dynamic search (e.g., ToT) by over 10 points while reducing over‑confident, ungrounded answers, aided by a question‑agnostic aggregator variant.
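
The sketch referenced in the second bullet: one way dynamic activation and per‑agent error tracing could be implemented on top of the shared template. The bookkeeping shown here is an assumption for illustration, not the authors' code.

```python
# Hypothetical helper: invoke only the experts whose modality is present in
# the instance, and tag each output with its agent so errors can be traced
# back to a specific stage.
from typing import Callable, Dict, List, Tuple


def run_active_experts(
    context: Dict[str, str],                   # modality -> raw content ("" if absent)
    experts: Dict[str, Callable[[str], str]],  # modality -> expert callable
) -> List[Tuple[str, str]]:
    trace: List[Tuple[str, str]] = []
    for modality, content in context.items():
        if not content:   # dynamic activation: skip modalities with no content
            continue
        trace.append((f"{modality}_expert", experts[modality](content)))
    return trace          # (agent name, output) pairs for transparent error tracing
```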

Qualitative Results

Representative Cases

  • Modality disambiguation: For questions where the relevant source is not explicit, modality experts independently probe text, table, and image. The aggregator selects the best‑supported answer or abstains when evidence is insufficient. Takeaway: avoids guessing; prefers grounded evidence.
  • Cross‑modal synthesis: Table values are verified against textual statements and visual cues (e.g., labels in figures). The cross‑modal expert resolves conflicts before aggregation. Takeaway: improves factual consistency across modalities.
  • Failure/abstention behavior: Under broken context (e.g., shuffled text), agents abstain rather than hallucinate, propagating abstention to the aggregator (a toy sketch follows this list). Takeaway: calibrated responses under uncertainty.
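
The toy sketch below illustrates how abstention might propagate from experts to the aggregator; the "ABSTAIN" token and the decision rule are assumptions, not MAMMQA's exact mechanism.

```python
# Toy illustration of abstention propagation; token and rule are assumptions.
ABSTAIN = "ABSTAIN"


def aggregate(per_modality_answers: dict) -> str:
    grounded = {m: a for m, a in per_modality_answers.items() if a != ABSTAIN}
    if not grounded:
        # Every expert abstained (e.g. shuffled/broken context): the aggregator
        # abstains too instead of guessing from priors.
        return ABSTAIN
    # Otherwise answer from the best-supported modality (here: any grounded one).
    return next(iter(grounded.values()))


print(aggregate({"text": ABSTAIN, "table": "1947", "image": ABSTAIN}))   # -> "1947"
print(aggregate({"text": ABSTAIN, "table": ABSTAIN, "image": ABSTAIN}))  # -> "ABSTAIN"
```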


The Benchmark

  • ManyModalQA: Ambiguous modality questions across text, table, image — stresses modality disambiguation and targeted retrieval.
  • MultiModalQA: 29,918 QA pairs; 35.7% cross‑modal — stresses cross‑modal composition and information fusion.

Dataset Snapshot

ManyModalQA
  Questions: 10,190
  Images: 2,873
  Passages: 3,789
  Tables: 3,528
  Focus: ambiguous-modality questions
  Splits: 2,036 train / 3,055 dev

MultiModalQA
  QA pairs: 29,918
  Cross-modal share: 35.7%
  Focus: cross-modal composition
  Splits: 23,817 train / 2,442 dev / 3,660 test

At‑a‑glance stats; cross‑modal proportion shown for MultiModalQA.

Quantitative Results

Alignment with Human Judgment

MAMMQA’s modular agents yield stronger agreement with expert judgments across benchmarks, improving accuracy and robustness over single‑strategy baselines.

Results on ManyModalQA

Methods                     Text    Table   Image   Total
Human                       92.00   89.60   94.00   91.60
Voting                      23.70   22.90   15.50   21.10
MMQA                        48.60   40.40   27.20   39.70
MMQA†                       59.30   46.30   29.00   46.30

UniMMQA (finetuned T5)
  Base                      46.60   60.70   30.20   45.40
  Large                     48.50   67.50   34.90   50.00
  3B                        49.80   58.30   40.90   52.10

OpenAI 4o‑mini
  CoT                       87.20   94.23   57.33   81.21
  CoT*                      68.22   70.51   59.42   66.54
  CapCoT                    87.68   94.05   68.26   84.41
  ToT                       84.94   93.19   72.90   84.70
  Ours                      92.50   96.78   78.02   89.90

Gemini 1.5‑Flash 8B
  CoT                       86.05   91.52   68.77   82.81
  CoT*                      54.93   61.15   34.77   51.41
  CapCoT                    85.74   91.40   63.14   81.34
  ToT                       86.08   86.81   62.81   79.80
  Ours                      89.76   94.52   77.33   87.91

Qwen 2.5 VL 7B Instruct
  CoT                       59.84   68.71   45.47   58.87
  CoT*                      61.80   66.73   54.53   61.46
  CapCoT                    83.50   92.86   71.07   83.41
  ToT                       81.95   90.41   69.29   81.89
  Ours                      87.11   96.31   77.56   87.61

Qwen 2.5 VL 3B Instruct
  CoT                       70.08   75.61   50.70   66.54
  CoT*                      58.77   64.55   59.51   58.77
  CapCoT                    80.79   91.38   67.13   80.63
  ToT                       82.66   86.14   68.11   80.42
  Ours                      88.79   94.90   72.67   86.37

Quantitative results on ManyModalQA. Superscript † denotes oracle; * indicates the no‑context (open‑book QA) setting.

Quantitative Analysis on MultiModalQA

  • Best overall on 4o‑mini, Gemini‑8B, Qwen‑7B; competitive on Qwen‑3B.
  • Largest gains on table‑centric and cross‑modal (Tb, Tb | Txt, Tb | Img).
  • Comparable on image‑only/text–image; gains come from stronger table grounding and fusion.

Method                Img      Tb | Img   Tb | Txt   Tb       Txt | Img   Txt      Total

OpenAI 4o Mini
  CoT                 33.15    53.81      66.67      84.55    55.95       77.67    64.60
  CapCoT              53.91    64.98      69.05      84.14    61.90       77.33    70.39
  ToT                 54.97    63.35      64.37      67.70    61.11       69.65    64.88
  Ours                61.31    70.30      81.58      89.16    59.75       85.57    76.37

Gemini 1.5‑Flash 8B
  CoT                 47.41    53.38      58.88      74.73    46.43       72.82    62.16
  CapCoT              47.84    50.02      55.87      74.88    39.29       72.42    60.66
  ToT                 36.93    43.06      52.32      53.72    33.33       70.61    53.10
  Ours                51.23    54.12      57.42      83.69    42.86       79.47    65.84

Qwen 2.5 VL 7B Instruct
  CoT                 29.11    32.58      30.66      38.75    17.86       38.28    33.84
  CapCoT              48.10    53.94      60.56      71.52    41.67       71.31    61.54
  ToT                 55.90    47.82      52.50      60.83    41.64       64.44    57.12
  Ours                50.74    55.88      63.68      81.35    53.26       80.51    67.56

Qwen 2.5 VL 3B Instruct
  CoT                 11.86    23.71      22.14      32.25    14.29       25.52    23.15
  CapCoT              48.10    42.08      47.08      64.94    39.29       65.04    53.98
  ToT                 42.01    43.65      48.40      52.57    33.74       66.51    52.91
  Ours                33.73    43.10      45.33      62.29    35.52       67.73    52.12

Columns: Img, Tb | Img, Tb | Txt, Tb, Txt | Img, Txt, Total. Metric is answer accuracy (%), higher is better. Rows group prompting strategies under each backbone.

Dynamic vs Static (Table 1): On Qwen‑7B, ToT 57.12 vs Ours 67.56 (+10.44); a lean static 3‑agent pipeline is both more accurate and compute‑efficient. See details in Ablations.

Overall Comparison on MultiModalQA

  • Overall vs CoT: +28.9 (3B) / +33.7 (7B).
  • 7B: 67.56 Overall; 73.16 Single (competitive with finetuned).
  • Multi trails top finetuned but gap narrows without task‑specific training.

Model                  Single   Multi   Overall

Finetuned models
  AutoRouting          51.7     34.2    44.7
  ImplicitDecomp       51.6     44.6    48.8
  Binder               –        –       51.0
  SKURG                66.1     52.5    59.8
  PERQA                69.7     54.7    62.8
  Solar                69.7     55.5    59.8
  UniRaG               71.7     62.3    67.4
  AETGA                69.8     64.7    68.8
  PReasM L             –        –       59.0
  MMQA‑T5 L            –        –       57.9
  UniMMQA (T5 B)       –        –       67.9
  UniMMQA (T5 L)       –        –       71.3
  UniMMQA (T5 3B)      –        –       75.5

Zero‑shot models
  CoT Qwen 3B          23.75    22.24   23.15
  CoT Qwen 7B          36.07    30.91   33.84
  Our Agent 3B         57.72    43.39   52.12
  Our Agent 7B         73.16    58.93   67.56

Columns: Single (one‑modality), Multi (cross‑modal), Overall (full test mix). Metric is answer accuracy (%), higher is better. Top block shows finetuned systems; bottom compares zero‑shot CoT vs our agent across 3B/7B backbones.

Ablations & Robustness

Static vs. Dynamic Agents

Compared to Tree‑of‑Thoughts (dynamic search), our lean, static 3‑agent pipeline is more accurate and calibrated (Qwen‑7B: 57.12 → 67.56; +10.44; see Table 1), avoiding confidently incorrect answers while reducing compute.

Robustness Under Perturbations

Model (7B)          Original   Text Shuffle       Irrelevant Context
TreeOfThoughts      57.12      33.01 (−42.21%)    52.45 (−8.18%)
CoT                 33.84      31.18 (−7.86%)     29.54 (−12.71%)
CapCoT              61.54      37.47 (−39.11%)    55.39 (−9.99%)
OurAgent            67.56      5.92 (−91.24%)     63.74 (−5.65%)

Model (3B)          Original   Text Shuffle       Irrelevant Context
TreeOfThoughts      52.91      49.22 (−6.97%)     47.11 (−10.96%)
CoT                 23.15      20.48 (−11.53%)    19.62 (−15.25%)
CapCoT              53.98      49.22 (−8.82%)     47.12 (−12.71%)
OurAgent            52.12      7.66 (−85.30%)     48.05 (−7.81%)

Robustness of different reasoning strategies under perturbations across model sizes.
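
For context, a sketch of how the two perturbations in the table above could be constructed; this is an assumption about the setup, not the study's actual perturbation code.

```python
# Illustrative perturbation generators; the exact construction used in the
# robustness study may differ.
import random


def text_shuffle(passage: str, seed: int = 0) -> str:
    """Shuffle word order so the passage no longer supports coherent grounding."""
    words = passage.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def add_irrelevant_context(passage: str, distractor: str) -> str:
    """Append an off-topic passage as a distractor alongside the original context."""
    return passage + "\n\n" + distractor
```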

Calibration via Question‑Agnostic Aggregator

CoT often answers confidently without grounded evidence. Our agents separate extraction from generation and allow abstention; making the aggregator question‑agnostic further reduces bias toward priors and improves factuality.
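
A minimal sketch of the contrast between a question‑aware and a question‑agnostic aggregator prompt; the wording is ours, and only the idea of withholding the original question comes from the description above.

```python
# Two hypothetical aggregator prompts: the question-agnostic variant withholds
# the original question, so the model must rely on the consolidated evidence
# rather than its prior about the likely answer.
def question_aware_prompt(question: str, evidence: str) -> str:
    return (
        f"Question: {question}\n"
        f"Evidence: {evidence}\n"
        "Answer strictly from the evidence; abstain if it is insufficient."
    )


def question_agnostic_prompt(evidence: str) -> str:
    return (
        f"Evidence: {evidence}\n"
        "State the single factual conclusion best supported by this evidence, "
        "or abstain if none is supported."
    )
```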

Figure: Aggregator Agent performance with and without the original question on MultiModalQA.

BibTeX