MA-EgoQA: Multi-Agent Egocentric Video Question Answering

Abstract

As embodied models become powerful, humans will collaborate with multiple embodied AI agents at their workplace or home in the future. To ensure better communication between human users and the multi-agent system, it is crucial to interpret incoming information from agents in parallel and refer to the appropriate context for each query.

In this work, we formally define the novel problem of understanding multiple long-horizon egocentric videos simultaneously collected from embodied agents. We introduce MultiAgent-EgoQA (MA-EgoQA), a benchmark designed to systematically evaluate existing models in this scenario. MA-EgoQA provides 1,741 questions unique to multiple egocentric streams, spanning five categories: social interaction, task coordination, theory-of-mind, temporal reasoning, and environmental interaction.

We further propose EgoMAS, a simple baseline that leverages shared memory across embodied agents and agent-wise dynamic retrieval. Comprehensive evaluation reveals that current approaches are unable to effectively handle multiple egocentric streams, highlighting the need for future advances in system-level understanding across agents.

MA-EgoQA Benchmark

MA-EgoQA is built on the EgoLife dataset, where 6 people lived together for 7 days wearing egocentric cameras. Every question requires information from more than two agents, making it fundamentally different from single-agent benchmarks.

Five Benchmark Categories

💬

Social Interaction (SI)

Localizing and grounding casual conversations and affiliative behaviors across multiple video streams. Questions cover group behaviors, interpersonal engagements, and meaningful information exchanged during interactions.

🤝

Task Coordination (TC)

How agents divide roles, sequence actions, and make decisions toward shared goals. Reflects practical importance of multi-agent collaboration and task planning.

🧠

Theory of Mind (ToM)

Reasoning about mental states: what an agent believed, misunderstood, or intended. Requires diverse perspectives to capture contextual cues and infer latent cognitive states. The hardest category.

⏱️

Temporal Reasoning (TR)

Aligning timelines across egocentric streams. Includes concurrency (what was each agent doing simultaneously?) and comparison (relative ordering of events across agents).

🏠

Environmental Interaction (EI)

Tracking object usage distributed across multiple agents, including frequency of use, first-time use, and which agent interacted most with shared objects in the environment.

QA Examples

Representative QA examples from each category. Note that all questions are unique to multi-agent settings: they cannot be answered by looking at a single agent's video and often require reasoning across multiple timestamps.

Dataset Statistics

By Category

By Question Type

By Recording Day

By Agent

MA-EgoQA questions are well-balanced across all five categories, seven recording days, six agents, and diverse question types (What / Who / Why / How / When / Which), ensuring comprehensive and unbiased evaluation.

Benchmark Construction Pipeline

MA-EgoQA uses a rigorous multi-stage pipeline to ensure question quality and difficulty:
GPT-based generation → LLM filtering → human verification.

Benchmark construction pipeline. QA pairs are generated for each category through its dedicated process, and refined through LLM filtering and manual verification.

1

GPT-based Generation

Single-span (5-min windows), multi-span (grouped by semantic similarity), and template-based QA generation using GPT-4o and GPT-5. Generates 5 options per question (1 correct + 4 distractors).

2

LLM Filtering (3 Stages)

Zero-shot filtering removes trivially answerable questions. Single-agent filtering removes questions solvable from one agent's memory. Cross-model validation uses Gemini-2.5-Flash and Claude-Sonnet-4 to catch biases.

3

Human Verification

4 human verifiers reviewed 3,436 candidates with full access to captions, transcripts, and videos. 1,741 high-quality samples passed all stages.

EgoMAS: Our Baseline

EgoMAS (Egocentric Multi-Agent System) is a training-free baseline that tackles multi-agent egocentric reasoning through two key components: event-based shared memory and agent-wise dynamic retrieval.

EgoMAS architecture. A centralized manager integrates 10-minute captions from all agents into a shared event-based memory using 4W1H (When, What, Where, Who, How) structured records. Given a query, EgoMAS first retrieves relevant system-level memories, then dynamically generates agent-specific sub-queries for fine-grained retrieval.

🗂️ Event-based Shared Memory

Every 10 minutes, each agent submits a caption of its observations. A centralized manager integrates these into a structured 4W1H record:

When: timestamp What: event description Where: location Who: agents involved How: manner of action

This structured format aligns agent perspectives while preserving critical details, enabling coherent global understanding across all agents.

🔍 Agent-wise Dynamic Retrieval

Given a query, EgoMAS performs a two-stage retrieval:

System-level retrieval: top-n shared memory entries via BM25 ranking to identify relevant events globally
Agent-level retrieval: dynamically generates per-agent sub-queries and retrieves top-k memories from each relevant agent's memory (filtered by relevance threshold τ)

This selective approach uses only 4–7k tokens of context per query, compared to 128k–1M tokens for non-retrieval baselines, while achieving better accuracy.

Evaluation Results

We evaluate 16 baselines across three categories (caption concat, frame concat, RAG-based) plus EgoMAS on MA-EgoQA. Random chance is 20% (5 options per question).

Model	Context	SI	TC	ToM	TR	EI	Avg.
Random Chance	—	20.0	20.0	20.0	20.0	20.0	20.0
All Caption Concat Baselines
Gemini-2.5-Flash	1M	41.2	36.4	24.3	46.6	34.0	36.9
GPT-5	272k	36.2	33.9	22.6	39.7	38.7	34.8
Qwen2.5-7B-1M	500k	27.1	28.9	20.9	24.0	26.2	26.1
Qwen3-30b-a3b	240k	29.5	27.7	16.6	23.0	26.5	25.6
gpt-oss-120b	128k	28.7	27.5	22.6	28.2	25.6	26.8
Llama-3.1-Nemotron-8B	1M	20.5	22.9	20.4	22.7	21.2	21.7
All Frame Concat Baselines
Qwen2.5-VL-7B	128k	26.3	26.5	21.3	22.2	25.1	25.2
VideoChat-Flash	32k	23.7	25.0	20.0	24.4	20.7	23.1
VideoXL-2	32k	20.2	19.8	19.2	20.9	21.7	20.4
RAG Baselines
BM25 + Qwen3VL-8B-Inst	8.1k	44.7	37.6	30.2	33.5	30.6	36.0
WorldMM-8B	4.1k	29.3	34.5	21.7	25.1	22.6	27.6
VideoRAG	8.0k	27.1	26.0	20.9	25.1	25.6	25.3
EgoRAG	—	22.3	21.1	18.3	15.3	18.9	19.6
Ours (EgoMAS)
EgoMAS (Gemini-2.5-Flash)	4.6k	41.5	41.3	33.6	39.4	48.2	41.4
EgoMAS (Qwen3VL-8B-Thinking)	5.4k	38.0	39.9	28.9	47.4	44.9	40.3
EgoMAS (Qwen3VL-8B-Instruct)	7.4k	43.1	39.1	24.3	35.5	40.7	37.7
EgoMAS (Qwen2.5VL-7B-Instruct)	5.4k	37.8	38.6	26.0	34.2	36.5	35.6
Oracle (Upper Bound)
Oracle (Gemini-2.5-Flash)	—	86.4	97.6	70.2	88.2	81.3	83.8
Oracle (Qwen3VL-8B)	—	76.9	81.0	57.0	77.4	69.9	74.0

Key Takeaways

⚠️

Even the best model (Gemini-2.5-Flash with 1M context) achieves only 36.9%, barely above random with all captions concatenated.

💡

Retrieval-based approaches consistently outperform non-retrieval methods, while using 100× less context (4–8k vs 128k–1M tokens).

🏆

EgoMAS with a small 8B model (Qwen3VL-8B-Thinking) surpasses both Gemini-2.5-Flash and GPT-5 baselines at 40.3%.

📈

The large gap to Oracle (~83.8%) shows there is substantial room for improvement. This remains a challenging open problem.

Analysis

Does MA-EgoQA Require Multi-Agent Memory?

Restricting EgoMAS to a single agent's memory causes a substantial performance drop across all categories, confirming that MA-EgoQA genuinely requires integrating information from multiple agents, not just one.

Impact of Number of Required Agents

Performance vs Number of Required Agents

Performance consistently decreases as the number of agents required to answer a question increases. This highlights that current multi-agent knowledge fusion approaches remain limited; building coherent global understanding is an open challenge.

Multi-span Questions Are Significantly Harder

Models achieve substantially lower accuracy on queries grounded in multiple timestamps (Multi-span in SI/TC, Comparison in TR), showing that identifying and relating multiple events across seven days of video requires stronger retrieval capabilities.

Efficiency: Accuracy vs. Latency

EgoMAS achieves the highest accuracy among retrieval-based models at only 1.3 seconds per query. Non-retrieval models incur orders-of-magnitude higher latency, making EgoMAS a practical choice for real-world deployment.

Ablation Study on EgoMAS Sub-modules

Table 4 compares memory construction strategies: the event-based 4W1H memory clearly outperforms alternatives, demonstrating its ability to abstract and fuse events across agents. Table 5 compares retrieval methods: while NV-Embed-v2 achieves the highest accuracy, it requires 7B parameters with substantial overhead. BM25 delivers competitive performance with lightweight keyword-based retrieval, making EgoMAS practical for real-world deployment.

Case Study

Gemini-2.5-Flash and VideoChat-Flash receive excessive information and fail to focus on relevant events. WorldMM retrieves memories but without cross-agent aggregation, it fails. EgoMAS uses shared memory to locate target events, performs dynamic retrieval for details, and answers correctly.

BibTeX

If you find MA-EgoQA or EgoMAS useful in your research, please cite our paper:

@misc{kim2026maegoqa,
  title={MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents}, 
  author={Kangsan Kim and Yanlai Yang and Suji Kim and Woongyeong Yeo and Youngwan Lee and Mengye Ren and Sung Ju Hwang},
  year={2026},
  eprint={2603.09827},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.09827}, 
}