MAMMQA

Rethinking Information Synthesis in Multimodal Question Answering

Multi-agent reasoning across text, tables, and images
Tejas Anvekar, Krishna Singh Rajput, Chitta Baral, Vivek Gupta

MAMMQA replaces one-size-fits-all multimodal reasoning with a cooperative pipeline of modality experts, cross-modal synthesis, and evidence-grounded aggregation for stronger accuracy, transparency, and robustness.

Why It Matters

Reasoning should respect modality

MAMMQA avoids flattening text, tables, and images into one generic workflow, which improves both faithfulness and interpretability.

How It Works

Three-agent cooperative pipeline

Two VLM agents decompose and synthesize grounded evidence, then an LLM aggregator produces the final answer.

What We Found

Static agents beat dynamic search

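The lean multi-agent design outperforms CoT, CapCoT, and ToT on ManyModalQA with every tested backbone, and on MultiModalQA with every backbone except the smallest (Qwen 2.5 VL 3B), where CapCoT and ToT score slightly higher overall.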

Highlights

  • Interpretable decomposition: MAMMQA splits retrieval, synthesis, and answer generation into explicit stages.
  • Prompt-only framework: No fine-tuning is required; a shared agent template adapts across modalities.
  • Strong zero-shot results: The method outperforms standard prompting baselines on both ManyModalQA and MultiModalQA.
  • Better calibration: The aggregator can abstain when evidence is weak, reducing overconfident unsupported answers.

3 cooperating agents
29,918 MultiModalQA pairs
10.44-point gain over ToT on Qwen-7B
No fine-tuning required

Overview

A modular answer synthesis pipeline

Recent multimodal QA systems often rely on a single generalized reasoning policy. MAMMQA instead distributes the task across specialized agents so each modality can be analyzed on its own terms before fusion.

  • Modality experts: Probe text, tables, and images for grounded evidence.
  • Cross-modal synthesis: Reconcile evidence and resolve conflicts before answering.
  • LLM aggregation: Generate a final response strictly from the consolidated evidence.

Figure: MAMMQA architecture diagram.

Two VLM agents handle decomposition and synthesis, while a final LLM aggregator produces an evidence-grounded answer.

Why is Multimodal QA Hard?

Existing multimodal QA pipelines often use one generalized reasoning strategy or flatten all modalities into a shared format, which obscures the unique structure of text, tables, and images.

This weakens both answer quality and trustworthiness. MAMMQA addresses that gap with a cooperative multi-agent design that preserves modality-specific reasoning while making the intermediate steps explicit.

How MAMMQA Works

MAMMQA uses two Vision-Language Model (VLM) agents and one text-only Large Language Model (LLM). The first VLM decomposes the question and retrieves partial evidence from each modality. The second VLM synthesizes those findings through cross-modal reasoning. The LLM then integrates the synthesized evidence into a final answer.

Step 01

Decompose and Retrieve

Break the question into modality-aware subproblems and gather grounded evidence from text, table, and image sources.

Step 02

Cross-Modal Synthesis

Join partial findings, verify consistency across modalities, and surface a consolidated evidence set.

Step 03

Aggregate or Abstain

Generate the final answer only from the synthesized evidence, with the option to abstain if the evidence is insufficient.
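
The three steps above can be sketched as a short orchestration function. This is only an illustrative sketch, not the authors' implementation: the `ModelFn` type, the `ABSTAIN` sentinel, the prompt wording, and the `run_pipeline` name are all assumptions, and each source is represented as a plain string (an image would in practice be passed to the VLM as an image, not as text).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical type: any function that maps a prompt string to a model reply.
ModelFn = Callable[[str], str]

# Hypothetical sentinel the aggregator returns when evidence is insufficient.
ABSTAIN = "INSUFFICIENT_EVIDENCE"


@dataclass
class PipelineResult:
    evidence: Dict[str, str]   # per-modality findings from step 1
    synthesis: str             # consolidated evidence from step 2
    answer: Optional[str]      # None means the aggregator abstained


def run_pipeline(question: str, sources: Dict[str, str], expert: ModelFn,
                 synthesizer: ModelFn, aggregator: ModelFn) -> PipelineResult:
    # Step 01: probe each modality independently for grounded evidence.
    evidence = {
        modality: expert(
            f"Modality: {modality}\nQuestion: {question}\nSource: {content}"
        )
        for modality, content in sources.items()
    }

    # Step 02: join the partial findings and ask for a consolidated evidence set.
    findings = "\n".join(f"[{m}] {e}" for m, e in evidence.items())
    synthesis = synthesizer(f"Question: {question}\nFindings:\n{findings}")

    # Step 03: answer from the synthesized evidence only, or abstain.
    reply = aggregator(
        f"Evidence:\n{synthesis}\nAnswer from the evidence, or reply {ABSTAIN}."
    ).strip()
    answer = None if reply == ABSTAIN else reply
    return PipelineResult(evidence, synthesis, answer)
```

Because every stage returns its intermediate output, a wrong answer can be traced back to the stage that introduced it, which is the interpretability benefit the page describes.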

Agents and Roles

  • Modality Experts (VLM): Decompose the question, inspect each modality independently, and extract grounded evidence.
  • Cross-Modal Expert (VLM): Reconcile signals across modalities and resolve conflicts before answering.
  • Aggregator (LLM): Produce the final response from evidence only, with a question-agnostic variant that improves calibration.
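
One way to picture the question-agnostic aggregator variant is as a prompt builder that can omit the question, so the aggregator sees only the consolidated evidence. The function name and prompt wording below are hypothetical and are not taken from the paper; this is a sketch of the idea only.

```python
from typing import Optional


def build_aggregator_prompt(evidence: str, question: Optional[str] = None) -> str:
    """Build the aggregator input.

    Passing question=None gives the question-agnostic variant: the model
    sees only the consolidated evidence. The wording is illustrative.
    """
    parts = []
    if question is not None:
        parts.append(f"Question: {question}")
    parts.append(f"Consolidated evidence:\n{evidence}")
    parts.append(
        "State the answer supported by the evidence, "
        "or abstain if the evidence is insufficient."
    )
    return "\n\n".join(parts)
```

Calling `build_aggregator_prompt(evidence)` without a question keeps the aggregator from leaning on answer priors that the question wording alone might trigger.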

What’s New in MAMMQA

Framework

Prompt-driven multi-agent MMQA

MAMMQA organizes multimodal reasoning into interpretable stages without relying on task-specific fine-tuning.

Agents

Unified role-consistent prompting

A single prompt template is reused across text, table, and image experts, enabling transparent error tracing and efficient inference.
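
As a rough illustration of role-consistent prompting, a single template can be filled in once per modality. The template text and the `expert_prompt` helper are assumptions made for this sketch; the actual prompts used by MAMMQA are not reproduced on this page.

```python
from string import Template

# Hypothetical shared template; only the modality name and source change.
EXPERT_TEMPLATE = Template(
    "You are the $modality expert.\n"
    "Question: $question\n"
    "Use only the $modality source below and report grounded evidence.\n"
    "Source:\n$source"
)


def expert_prompt(modality: str, question: str, source: str) -> str:
    """Fill the shared template for one modality expert."""
    return EXPERT_TEMPLATE.substitute(
        modality=modality, question=question, source=source
    )


# The same template yields one prompt per modality.
prompts = {
    m: expert_prompt(m, "Which year is shown?", f"<{m} content>")
    for m in ("text", "table", "image")
}
```

Keeping one template means a faulty expert output can be compared directly against the other modalities' outputs, since all three received the same instructions.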

Performance

Stronger zero-shot results

The framework outperforms CoT, CapCoT, and ToT on major multimodal QA benchmarks and rivals several fine-tuned systems.

Robustness

Improved calibration under noise

Question-agnostic aggregation and explicit evidence routing reduce unsupported confident answers.

Qualitative Results

Representative Cases

  • Modality disambiguation: Experts independently probe text, table, and image sources before the aggregator selects the best-supported answer or abstains.
  • Cross-modal synthesis: Table values are verified against textual claims and visual evidence before final answer generation.
  • Failure behavior: When context is broken or shuffled, the system prefers abstention over hallucination.


The Benchmark

Dataset 01

ManyModalQA

Ambiguous-modality questions across text, tables, and images. This benchmark stresses modality disambiguation and targeted retrieval.

 Questions 10,190
 Images 2,873
 Passages 3,789
 Tables 3,528
 Ambiguous modality

Splits: 2,036 train / 3,055 dev

Dataset 02

MultiModalQA

29,918 question-answer pairs, with 35.7% requiring cross-modal reasoning, making it a strong test for composition and synthesis.

 QA Pairs 29,918
 Cross-modal share 35.7%
 Cross-modal composition

Splits: 23,817 train / 2,441 dev / 3,660 test

Quantitative Results

ManyModalQA

MAMMQA improves answer accuracy across modalities and consistently outperforms standard prompting baselines under several backbone models.

Methods    Text    Table   Image   Total
Human      92.00   89.60   94.00   91.60
Voting     23.70   22.90   15.50   21.10
MMQA       48.60   40.40   27.20   39.70
MMQA†      59.30   46.30   29.00   46.30

UniMMQA Finetuned T5 Model
Base       46.60   60.70   30.20   45.40
Large      48.50   67.50   34.90   50.00
3B         49.80   58.30   40.90   52.10

OpenAI 4o-mini
CoT        87.20   94.23   57.33   81.21
CoT*       68.22   70.51   59.42   66.54
CapCoT     87.68   94.05   68.26   84.41
ToT        84.94   93.19   72.90   84.70
Ours       92.50   96.78   78.02   89.90

Gemini 1.5-Flash 8B
CoT        86.05   91.52   68.77   82.81
CoT*       54.93   61.15   34.77   51.41
CapCoT     85.74   91.40   63.14   81.34
ToT        86.08   86.81   62.81   79.80
Ours       89.76   94.52   77.33   87.91

Qwen 2.5 VL 7B Instruct
CoT        59.84   68.71   45.47   58.87
CoT*       61.80   66.73   54.53   61.46
CapCoT     83.50   92.86   71.07   83.41
ToT        81.95   90.41   69.29   81.89
Ours       87.11   96.31   77.56   87.61

Qwen 2.5 VL 3B Instruct
CoT        70.08   75.61   50.70   66.54
CoT*       58.77   64.55   59.51   58.77
CapCoT     80.79   91.38   67.13   80.63
ToT        82.66   86.14   68.11   80.42
Ours       88.79   94.90   72.67   86.37

Superscript † denotes oracle; * indicates the no-context setting.
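
To make the size of the improvement concrete, the snippet below computes the margin of "Ours" over the strongest prompting baseline for each backbone, using the Total column of the ManyModalQA table above. The function name is hypothetical, and the no-context CoT* rows are left out because they answer under a different setting.

```python
# Total-column scores copied from the ManyModalQA table (CoT* rows excluded).
totals = {
    "OpenAI 4o-mini": {"CoT": 81.21, "CapCoT": 84.41, "ToT": 84.70, "Ours": 89.90},
    "Gemini 1.5-Flash 8B": {"CoT": 82.81, "CapCoT": 81.34, "ToT": 79.80, "Ours": 87.91},
    "Qwen 2.5 VL 7B Instruct": {"CoT": 58.87, "CapCoT": 83.41, "ToT": 81.89, "Ours": 87.61},
    "Qwen 2.5 VL 3B Instruct": {"CoT": 66.54, "CapCoT": 80.63, "ToT": 80.42, "Ours": 86.37},
}


def margin_over_best_baseline(scores):
    """Return (best baseline name, Ours minus that baseline) in points."""
    baselines = {name: s for name, s in scores.items() if name != "Ours"}
    best = max(baselines, key=baselines.get)
    return best, round(scores["Ours"] - baselines[best], 2)


margins = {backbone: margin_over_best_baseline(s) for backbone, s in totals.items()}
for backbone, (best, margin) in margins.items():
    print(f"{backbone}: +{margin:.2f} over {best}")
```

Under these numbers the margin is 5.20 points over ToT on 4o-mini, 5.10 over CoT on Gemini, 4.20 over CapCoT on Qwen 7B, and 5.74 over CapCoT on Qwen 3B.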

MultiModalQA

  • Best overall on 4o-mini, Gemini-8B, and Qwen-7B, with especially strong gains on table-centric and cross-modal questions.
  • The method stays competitive on smaller backbones while preserving interpretability.

Modality   Img     Tb|Img  Tb|Txt  Tb      Txt|Img  Txt     Total

OpenAI 4o Mini
CoT        33.15   53.81   66.67   84.55   55.95    77.67   64.60
CapCoT     53.91   64.98   69.05   84.14   61.90    77.33   70.39
ToT        54.97   63.35   64.37   67.70   61.11    69.65   64.88
Ours       61.31   70.30   81.58   89.16   59.75    85.57   76.37

Gemini 1.5-Flash 8B
CoT        47.41   53.38   58.88   74.73   46.43    72.82   62.16
CapCoT     47.84   50.02   55.87   74.88   39.29    72.42   60.66
ToT        36.93   43.06   52.32   53.72   33.33    70.61   53.10
Ours       51.23   54.12   57.42   83.69   42.86    79.47   65.84

Qwen 2.5 VL 7B Instruct
CoT        29.11   32.58   30.66   38.75   17.86    38.28   33.84
CapCoT     48.10   53.94   60.56   71.52   41.67    71.31   61.54
ToT        55.90   47.82   52.50   60.83   41.64    64.44   57.12
Ours       50.74   55.88   63.68   81.35   53.26    80.51   67.56

Qwen 2.5 VL 3B Instruct
CoT        11.86   23.71   22.14   32.25   14.29    25.52   23.15
CapCoT     48.10   42.08   47.08   64.94   39.29    65.04   53.98
ToT        42.01   43.65   48.40   52.57   33.74    66.51   52.91
Ours       33.73   43.10   45.33   62.29   35.52    67.73   52.12

Static vs dynamic search: On Qwen-7B, MAMMQA reaches 67.56 overall versus 57.12 for ToT, a gain of 10.44 points with a leaner reasoning pipeline.
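
The 10.44-point figure is the Total-column difference on Qwen 2.5 VL 7B. The short check below breaks the same comparison down by question type, using the Qwen-7B rows of the MultiModalQA table. The column labels are an assumed reading of the table header (Img, Tb|Img, Tb|Txt, Tb, Txt|Img, Txt, Total).

```python
# Qwen 2.5 VL 7B rows from the MultiModalQA table; column names are an
# assumed reading of the flattened header.
columns = ["Img", "Tb|Img", "Tb|Txt", "Tb", "Txt|Img", "Txt", "Total"]
tot = [55.90, 47.82, 52.50, 60.83, 41.64, 64.44, 57.12]
ours = [50.74, 55.88, 63.68, 81.35, 53.26, 80.51, 67.56]

# Point difference (Ours minus ToT) for every column.
deltas = {c: round(o - t, 2) for c, o, t in zip(columns, ours, tot)}

for column, delta in deltas.items():
    print(f"{column:8s} {delta:+.2f}")
```

The breakdown shows the gain is not uniform: the largest difference is on table questions (+20.52), while ToT stays ahead on image-only questions (-5.16).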

Overall Comparison and Robustness

Model              Single   Multi    Overall

Finetuned Models
AutoRouting        51.7     34.2     44.7
ImplicitDecomp     51.6     44.6     48.8
Binder             -        -        51.0
SKURG              66.1     52.5     59.8
PERQA              69.7     54.7     62.8
Solar              69.7     55.5     59.8
UniRaG             71.7     62.3     67.4
AETGA              69.8     64.7     68.8
PReasM L           -        -        59.0
MMQA-T5 L          -        -        57.9
UniMMQA (T5 B)     -        -        67.9
UniMMQA (T5 L)     -        -        71.3
UniMMQA (T5 3B)    -        -        75.5

Zero-Shot Models
CoT Qwen 3B        23.75    22.24    23.15
CoT Qwen 7B        36.07    30.91    33.84
Our Agent 3B       57.72    43.39    52.12
Our Agent 7B       73.16    58.93    67.56

Model (7B)       Original   Text Shuffle       Irrelevant Context
TreeOfThoughts   57.12      33.01 (-42.21%)    52.45 (-8.18%)
CoT              33.84      31.18 (-7.86%)     29.54 (-12.71%)
CapCoT           61.54      37.47 (-39.11%)    55.39 (-9.99%)
OurAgent         67.56      5.92 (-91.24%)     63.74 (-5.65%)
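
The percentages in parentheses are relative changes with respect to the Original column. The sketch below recomputes them from the scores in the robustness table; the helper name is hypothetical.

```python
def relative_change_pct(original: float, perturbed: float) -> float:
    """Relative change of the perturbed score versus the original, in percent."""
    return round(100.0 * (perturbed - original) / original, 2)


# (original, text shuffle, irrelevant context) for each 7B method.
robustness = {
    "TreeOfThoughts": (57.12, 33.01, 52.45),
    "CoT": (33.84, 31.18, 29.54),
    "CapCoT": (61.54, 37.47, 55.39),
    "OurAgent": (67.56, 5.92, 63.74),
}

changes = {
    name: (relative_change_pct(orig, shuffled), relative_change_pct(orig, irrelevant))
    for name, (orig, shuffled, irrelevant) in robustness.items()
}
for name, (shuffle_pct, irrelevant_pct) in changes.items():
    print(f"{name}: shuffle {shuffle_pct:+.2f}%, irrelevant {irrelevant_pct:+.2f}%")
```

The recomputed values match the table. The large drop for OurAgent under text shuffling is consistent with the abstention behavior described under Representative Cases, since an abstention is scored as a miss, while its drop under irrelevant context is the smallest of the four methods.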

Figure: Aggregator performance with and without question.

Question-agnostic aggregation improves calibration by separating evidence synthesis from answer priors.

BibTeX

@inproceedings{anvekar-etal-2025-rethinking,
  title = "Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective",
  author = "Anvekar, Tejas  and
    Rajput, Krishna Singh  and
    Baral, Chitta  and
    Gupta, Vivek",
  editor = "Inui, Kentaro  and
    Sakti, Sakriani  and
    Wang, Haofen  and
    Wong, Derek F.  and
    Bhattacharyya, Pushpak  and
    Banerjee, Biplab  and
    Ekbal, Asif  and
    Chakraborty, Tanmoy  and
    Singh, Dhirendra Pratap",
  booktitle = "Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics",
  month = dec,
  year = "2025",
  address = "Mumbai, India",
  publisher = "The Asian Federation of Natural Language Processing and The Association for Computational Linguistics",
  url = "https://aclanthology.org/2025.ijcnlp-long.192/",
  pages = "3674--3686",
  ISBN = "979-8-89176-298-5"
}