MAMMQA

Rethinking Information Synthesis in Multimodal Question Answering — A Multi‑Agent Perspective

Tejas Anvekar, Krishna Singh Rajput, Chitta Baral, Vivek Gupta

Overview

Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine‑tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a single, generalized reasoning strategy, overlooking the unique characteristics of each modality—ultimately limiting both accuracy and interpretability.

MAMMQA is a multi‑agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text‑based Large Language Model (LLM) agent. The first VLM decomposes the user query into sub‑questions and sequentially retrieves partial answers from each modality. The second VLM synthesizes and refines these results through cross‑modal reasoning. Finally, the LLM integrates the insights into a cohesive answer. This modular design enhances interpretability by making the reasoning process transparent and allows each agent to operate within its domain of expertise. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi‑agent framework consistently outperforms existing baselines in both accuracy and robustness.

Why is Multimodal QA Hard?

Existing multimodal QA often uses a single, generalized reasoning strategy or flattens inputs, overlooking the unique characteristics of text, tables, and images. This limits both accuracy and interpretability.

MAMMQA addresses these limitations with a cooperative, multi‑agent design that makes the reasoning process transparent while letting each agent operate in its domain of expertise.

How MAMMQA Works

MAMMQA uses two Visual Language Model (VLM) agents and one text‑based Large Language Model (LLM). The first VLM decomposes the user query into sub‑questions and retrieves partial answers from each modality. The second VLM synthesizes these results via cross‑modal reasoning. Finally, the LLM integrates insights into a cohesive answer.

Figure: MAMMQA architecture — two VLM agents (decompose, cross‑modal synthesis) and one LLM aggregator.
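
To make the three‑stage flow concrete, here is a minimal sketch in Python. The `call_vlm` and `call_llm` callables, the prompt wording, and the "ABSTAIN" convention are illustrative assumptions, not MAMMQA's released prompts or API.

```python
# Minimal sketch of the three-stage MAMMQA-style pipeline described above.
# `call_vlm` / `call_llm` are hypothetical stand-ins for whatever VLM/LLM
# backend is used; prompt wording is illustrative, not the paper's.
from typing import Callable, Dict


def mammqa_pipeline(
    question: str,
    context: Dict[str, str],          # e.g. {"text": ..., "table": ..., "image": ...}
    call_vlm: Callable[[str], str],   # stages 1-2: visual language model
    call_llm: Callable[[str], str],   # stage 3: text-only aggregator
) -> str:
    # Stage 1: modality experts decompose the question and probe each
    # available modality independently for grounded partial evidence.
    partial = {}
    for modality, content in context.items():
        partial[modality] = call_vlm(
            f"You are a {modality} expert. Decompose the question into "
            f"sub-questions and answer only from the {modality} evidence below. "
            f"Reply 'ABSTAIN' if the evidence is insufficient.\n"
            f"Question: {question}\nEvidence: {content}"
        )

    # Stage 2: the cross-modal expert reconciles the partial answers,
    # checks for conflicts, and produces one consolidated evidence set.
    consolidated = call_vlm(
        "Reconcile the following per-modality findings, flag any conflicts, "
        "and return a consolidated evidence summary.\n"
        + "\n".join(f"[{m}] {a}" for m, a in partial.items())
    )

    # Stage 3: the aggregator answers strictly from the consolidated
    # evidence, abstaining when it does not support an answer.
    return call_llm(
        f"Answer strictly from the evidence; abstain if unsupported.\n"
        f"Question: {question}\nEvidence: {consolidated}"
    )
```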

Agents and Roles

  • Modality Experts (VLM): Decompose the question; probe text, table, and image independently; extract grounded evidence per modality.
  • Cross‑Modal Expert (VLM): Join and reconcile signals across modalities; perform conflict checks and produce a consolidated evidence set.
  • Aggregator (LLM): Answer strictly from the consolidated evidence; can abstain when evidence is insufficient (a question‑agnostic variant improves calibration). A prompt sketch for these roles follows this list.
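
The sketch below shows how a single role‑consistent template could be instantiated per agent; the template text and role phrasing are our own illustration, not the paper's exact prompts.

```python
# Illustrative only: the role descriptions paraphrase the agent list above;
# the exact prompt wording used in MAMMQA is not reproduced here.
UNIFIED_TEMPLATE = (
    "You are the {role}.\n"
    "Task: {task}\n"
    "Rules: ground every claim in the given evidence; reply 'ABSTAIN' if the "
    "evidence is insufficient.\n"
    "Input:\n{payload}"
)

ROLES = {
    "text_expert":  "modality expert for text passages",
    "table_expert": "modality expert for tables",
    "image_expert": "modality expert for images",
    "cross_modal":  "cross-modal expert that reconciles per-modality findings",
    "aggregator":   "aggregator that answers strictly from consolidated evidence",
}


def build_prompt(role_key: str, task: str, payload: str) -> str:
    """Instantiate the shared template for one agent role."""
    return UNIFIED_TEMPLATE.format(role=ROLES[role_key], task=task, payload=payload)


# Example: the table expert probing a flattened table for one sub-question.
print(build_prompt(
    "table_expert",
    "Answer the sub-question using only the table.",
    "Sub-question: ...\nTable: ...",
))
```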

What’s New in MAMMQA

  • MAMMQA: A fully prompt‑driven, multi‑agent MMQA framework that splits reasoning into three interpretable stages — modality experts, cross‑modal synthesis, and evidence‑grounded aggregation — without any fine‑tuning.
  • Unified, role‑consistent agents: A single prompt template reused across text, table, and image experts, enabling dynamic activation, efficient inference, and transparent error tracing (see the sketch after this list).
  • Zero‑shot performance: Outperforms CoT, CapCoT, and ToT baselines and matches or exceeds several fine‑tuned models on MultiModalQA and ManyModalQA across both proprietary and open‑source LLMs.
  • Robustness and calibration: Static agents beat dynamic search (e.g., ToT) by over 10 points while reducing over‑confident, ungrounded answers, aided by a question‑agnostic aggregator variant.
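
The sketch referenced in the second bullet: one way dynamic activation and per‑agent error tracing could be implemented on top of the shared template. The bookkeeping shown here is an assumption for illustration, not the authors' code.

```python
# Hypothetical helper: invoke only the experts whose modality is present in
# the instance, and tag each output with its agent so errors can be traced
# back to a specific stage.
from typing import Callable, Dict, List, Tuple


def run_active_experts(
    context: Dict[str, str],                   # modality -> raw content ("" if absent)
    experts: Dict[str, Callable[[str], str]],  # modality -> expert callable
) -> List[Tuple[str, str]]:
    trace: List[Tuple[str, str]] = []
    for modality, content in context.items():
        if not content:   # dynamic activation: skip modalities with no content
            continue
        trace.append((f"{modality}_expert", experts[modality](content)))
    return trace          # (agent name, output) pairs for transparent error tracing
```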

Qualitative Results

Representative Cases

  • Modality disambiguation: For questions where the relevant source is not explicit, modality experts independently probe text, table, and image. The aggregator selects the best‑supported answer or abstains when evidence is insufficient. Takeaway: avoids guessing; prefers grounded evidence.
  • Cross‑modal synthesis: Table values are verified against textual statements and visual cues (e.g., labels in figures). The cross‑modal expert resolves conflicts before aggregation. Takeaway: improves factual consistency across modalities.
  • Failure/abstention behavior: Under broken context (e.g., shuffled text), agents abstain rather than hallucinate, propagating abstention to the aggregator (a toy sketch follows this list). Takeaway: calibrated responses under uncertainty.
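
The toy sketch below illustrates how abstention might propagate from experts to the aggregator; the "ABSTAIN" token and the decision rule are assumptions, not MAMMQA's exact mechanism.

```python
# Toy illustration of abstention propagation; token and rule are assumptions.
ABSTAIN = "ABSTAIN"


def aggregate(per_modality_answers: dict) -> str:
    grounded = {m: a for m, a in per_modality_answers.items() if a != ABSTAIN}
    if not grounded:
        # Every expert abstained (e.g. shuffled/broken context): the aggregator
        # abstains too instead of guessing from priors.
        return ABSTAIN
    # Otherwise answer from the best-supported modality (here: any grounded one).
    return next(iter(grounded.values()))


print(aggregate({"text": ABSTAIN, "table": "1947", "image": ABSTAIN}))   # -> "1947"
print(aggregate({"text": ABSTAIN, "table": ABSTAIN, "image": ABSTAIN}))  # -> "ABSTAIN"
```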


The Benchmark

  • ManyModalQA: Ambiguous modality questions across text, table, image — stresses modality disambiguation and targeted retrieval.
  • MultiModalQA: 29,918 QA pairs; 35.7% cross‑modal — stresses cross‑modal composition and information fusion.

Dataset Snapshot

ManyModalQA
  Questions: 10,190
  Images: 2,873
  Passages: 3,789
  Tables: 3,528
  Focus: ambiguous-modality questions
  Splits: 2,036 train / 3,055 dev

MultiModalQA
  QA pairs: 29,918
  Cross-modal share: 35.7%
  Focus: cross-modal composition
  Splits: 23,817 train / 2,442 dev / 3,660 test

At‑a‑glance stats; cross‑modal proportion shown for MultiModalQA.

Quantitative Results

Alignment with Human Judgment

MAMMQA’s modular agents yield stronger agreement with expert judgments across benchmarks, improving accuracy and robustness over single‑strategy baselines.

Results on ManyModalQA

Methods                     Text    Table   Image   Total
Human                       92.00   89.60   94.00   91.60
Voting                      23.70   22.90   15.50   21.10
MMQA                        48.60   40.40   27.20   39.70
MMQA†                       59.30   46.30   29.00   46.30

UniMMQA (finetuned T5)
  Base                      46.60   60.70   30.20   45.40
  Large                     48.50   67.50   34.90   50.00
  3B                        49.80   58.30   40.90   52.10

OpenAI 4o‑mini
  CoT                       87.20   94.23   57.33   81.21
  CoT*                      68.22   70.51   59.42   66.54
  CapCoT                    87.68   94.05   68.26   84.41
  ToT                       84.94   93.19   72.90   84.70
  Ours                      92.50   96.78   78.02   89.90

Gemini 1.5‑Flash 8B
  CoT                       86.05   91.52   68.77   82.81
  CoT*                      54.93   61.15   34.77   51.41
  CapCoT                    85.74   91.40   63.14   81.34
  ToT                       86.08   86.81   62.81   79.80
  Ours                      89.76   94.52   77.33   87.91

Qwen 2.5 VL 7B Instruct
  CoT                       59.84   68.71   45.47   58.87
  CoT*                      61.80   66.73   54.53   61.46
  CapCoT                    83.50   92.86   71.07   83.41
  ToT                       81.95   90.41   69.29   81.89
  Ours                      87.11   96.31   77.56   87.61

Qwen 2.5 VL 3B Instruct
  CoT                       70.08   75.61   50.70   66.54
  CoT*                      58.77   64.55   59.51   58.77
  CapCoT                    80.79   91.38   67.13   80.63
  ToT                       82.66   86.14   68.11   80.42
  Ours                      88.79   94.90   72.67   86.37

Quantitative results on ManyModalQA. Superscript † denotes oracle; * indicates the no‑context (open‑book QA) setting.

Quantitative Analysis on MultiModalQA

  • Best overall on 4o‑mini, Gemini‑8B, Qwen‑7B; competitive on Qwen‑3B.
  • Largest gains on table‑centric and cross‑modal (Tb, Tb | Txt, Tb | Img).
  • Comparable on image‑only/text–image; gains come from stronger table grounding and fusion.

Method                Img      Tb | Img   Tb | Txt   Tb       Txt | Img   Txt      Total

OpenAI 4o Mini
  CoT                 33.15    53.81      66.67      84.55    55.95       77.67    64.60
  CapCoT              53.91    64.98      69.05      84.14    61.90       77.33    70.39
  ToT                 54.97    63.35      64.37      67.70    61.11       69.65    64.88
  Ours                61.31    70.30      81.58      89.16    59.75       85.57    76.37

Gemini 1.5‑Flash 8B
  CoT                 47.41    53.38      58.88      74.73    46.43       72.82    62.16
  CapCoT              47.84    50.02      55.87      74.88    39.29       72.42    60.66
  ToT                 36.93    43.06      52.32      53.72    33.33       70.61    53.10
  Ours                51.23    54.12      57.42      83.69    42.86       79.47    65.84

Qwen 2.5 VL 7B Instruct
  CoT                 29.11    32.58      30.66      38.75    17.86       38.28    33.84
  CapCoT              48.10    53.94      60.56      71.52    41.67       71.31    61.54
  ToT                 55.90    47.82      52.50      60.83    41.64       64.44    57.12
  Ours                50.74    55.88      63.68      81.35    53.26       80.51    67.56

Qwen 2.5 VL 3B Instruct
  CoT                 11.86    23.71      22.14      32.25    14.29       25.52    23.15
  CapCoT              48.10    42.08      47.08      64.94    39.29       65.04    53.98
  ToT                 42.01    43.65      48.40      52.57    33.74       66.51    52.91
  Ours                33.73    43.10      45.33      62.29    35.52       67.73    52.12

Columns: Img, Tb | Img, Tb | Txt, Tb, Txt | Img, Txt, Total. Metric is answer accuracy (%), higher is better. Rows group prompting strategies under each backbone.

Dynamic vs Static (Table 1): On Qwen‑7B, ToT 57.12 vs Ours 67.56 (+10.44); a lean static 3‑agent pipeline is both more accurate and compute‑efficient. See details in Ablations.

Overall Comparison on MultiModalQA

  • Overall vs CoT: +28.9 (3B) / +33.7 (7B).
  • 7B: 67.56 Overall; 73.16 Single (competitive with finetuned).
  • Multi trails top finetuned but gap narrows without task‑specific training.

Model                  Single   Multi   Overall

Finetuned models
  AutoRouting          51.7     34.2    44.7
  ImplicitDecomp       51.6     44.6    48.8
  Binder               –        –       51.0
  SKURG                66.1     52.5    59.8
  PERQA                69.7     54.7    62.8
  Solar                69.7     55.5    59.8
  UniRaG               71.7     62.3    67.4
  AETGA                69.8     64.7    68.8
  PReasM L             –        –       59.0
  MMQA‑T5 L            –        –       57.9
  UniMMQA (T5 B)       –        –       67.9
  UniMMQA (T5 L)       –        –       71.3
  UniMMQA (T5 3B)      –        –       75.5

Zero‑shot models
  CoT Qwen 3B          23.75    22.24   23.15
  CoT Qwen 7B          36.07    30.91   33.84
  Our Agent 3B         57.72    43.39   52.12
  Our Agent 7B         73.16    58.93   67.56

Columns: Single (one‑modality), Multi (cross‑modal), Overall (full test mix). Metric is answer accuracy (%), higher is better. Top block shows finetuned systems; bottom compares zero‑shot CoT vs our agent across 3B/7B backbones.

Ablations & Robustness

Static vs. Dynamic Agents

Compared to Tree‑of‑Thoughts (dynamic search), our lean, static 3‑agent pipeline is more accurate and calibrated (Qwen‑7B: 57.12 → 67.56; +10.44; see Table 1), avoiding confidently incorrect answers while reducing compute.

Robustness Under Perturbations

Model (7B)          Original   Text Shuffle       Irrelevant Context
TreeOfThoughts      57.12      33.01 (−42.21%)    52.45 (−8.18%)
CoT                 33.84      31.18 (−7.86%)     29.54 (−12.71%)
CapCoT              61.54      37.47 (−39.11%)    55.39 (−9.99%)
OurAgent            67.56      5.92 (−91.24%)     63.74 (−5.65%)

Model (3B)          Original   Text Shuffle       Irrelevant Context
TreeOfThoughts      52.91      49.22 (−6.97%)     47.11 (−10.96%)
CoT                 23.15      20.48 (−11.53%)    19.62 (−15.25%)
CapCoT              53.98      49.22 (−8.82%)     47.12 (−12.71%)
OurAgent            52.12      7.66 (−85.30%)     48.05 (−7.81%)

Robustness of different reasoning strategies under perturbations across model sizes.
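
For context, a sketch of how the two perturbations in the table above could be constructed; this is an assumption about the setup, not the study's actual perturbation code.

```python
# Illustrative perturbation generators; the exact construction used in the
# robustness study may differ.
import random


def text_shuffle(passage: str, seed: int = 0) -> str:
    """Shuffle word order so the passage no longer supports coherent grounding."""
    words = passage.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def add_irrelevant_context(passage: str, distractor: str) -> str:
    """Append an off-topic passage as a distractor alongside the original context."""
    return passage + "\n\n" + distractor
```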

Calibration via Question‑Agnostic Aggregator

CoT often answers confidently without grounded evidence. Our agents separate extraction from generation and allow abstention; making the aggregator question‑agnostic further reduces bias toward priors and improves factuality.
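
A minimal sketch of the contrast between a question‑aware and a question‑agnostic aggregator prompt; the wording is ours, and only the idea of withholding the original question comes from the description above.

```python
# Two hypothetical aggregator prompts: the question-agnostic variant withholds
# the original question, so the model must rely on the consolidated evidence
# rather than its prior about the likely answer.
def question_aware_prompt(question: str, evidence: str) -> str:
    return (
        f"Question: {question}\n"
        f"Evidence: {evidence}\n"
        "Answer strictly from the evidence; abstain if it is insufficient."
    )


def question_agnostic_prompt(evidence: str) -> str:
    return (
        f"Evidence: {evidence}\n"
        "State the single factual conclusion best supported by this evidence, "
        "or abstain if none is supported."
    )
```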

Figure: Aggregator Agent performance with and without the original question on MultiModalQA.

BibTeX