IndustryEQA: Pushing the Frontiers of Embodied Question Answering in Industrial Scenarios

¹Michigan State University, ²University of North Carolina at Chapel Hill, ³Independent Researcher, ⁴Arizona State University
*Equal contribution

IndustryEQA introduces the first benchmark for embodied question answering in safety-critical warehouse environments, bridging the gap between household EQA and real-world industrial applications.

1,344 QA Pairs Total
76 Episodic Memory Videos
6 Categories Covered
50%+ Safety-Focused QAs

Abstract

Existing Embodied Question Answering (EQA) benchmarks focus primarily on household environments, overlooking the safety-critical aspects and reasoning processes essential to industrial settings. This limitation leaves agent readiness for real-world industrial applications largely unevaluated.

To bridge this gap, we introduce IndustryEQA, the first benchmark dedicated to evaluating embodied agent capabilities within safety-critical warehouse scenarios. Built upon the NVIDIA Isaac Sim platform, IndustryEQA provides high-fidelity episodic memory videos featuring diverse industrial assets, dynamic human agents, and carefully designed hazardous situations inspired by real-world safety guidelines.

The benchmark includes rich annotations covering six categories: equipment safety, human safety, object recognition, attribute recognition, temporal understanding, and spatial understanding. It also provides a supplementary reasoning evaluation built on these categories.

We propose a comprehensive evaluation framework and benchmark a range of baseline models, assessing their general perception and reasoning abilities in industrial environments. IndustryEQA aims to steer EQA research towards developing more robust, safety-aware, and practically applicable embodied agents for complex industrial environments.

Benchmark Overview

[Figure] Comparison of small and large warehouse environments across three scenarios: empty, without humans, and with humans.

🛡️ Safety Categories

Equipment Safety: Risks associated with warehouse machinery, pathway obstructions, improper stacking, and equipment placement

Human Safety: Direct risks to people, including potential collisions, falling hazards, ergonomic issues, and PPE usage

🔍 Perception Categories

Object & Attribute Recognition: Identifying objects and their characteristics (color, size, shape, state)

Spatial & Temporal Understanding: Object positions, distances, directions, and sequence of events

Data Generation Pipeline

[Figure] The IndustryEQA data generation pipeline, a three-step process: Video Capture → QA Generation & Refinement → Human Filtering.
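In code form, the pipeline reduces to three composable stages. The sketch below is illustrative only: the function names, the `QAPair` schema, and the orchestration loop are our assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    category: str            # e.g., "human_safety", "spatial_understanding"
    requires_reasoning: bool

def capture_video(scene_config: dict) -> str:
    """Stage 1: render an episodic-memory video of the configured
    warehouse scene in NVIDIA Isaac Sim; returns the video path.
    (Placeholder: rendering happens inside Isaac Sim.)"""
    raise NotImplementedError

def generate_and_refine_qa(video_path: str) -> list[QAPair]:
    """Stage 2: draft candidate QA pairs from the video, then refine
    them in a second pass. (Placeholder.)"""
    raise NotImplementedError

def human_filter(candidates: list[QAPair]) -> list[QAPair]:
    """Stage 3: human reviewers discard ambiguous, trivial, or
    incorrect pairs. (Placeholder.)"""
    raise NotImplementedError

def build_benchmark(scene_configs: list[dict]) -> list[QAPair]:
    # Orchestrate the three stages over all scene configurations.
    dataset: list[QAPair] = []
    for config in scene_configs:
        video = capture_video(config)
        dataset.extend(human_filter(generate_and_refine_qa(video)))
    return dataset
```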

Dataset Statistics

[Figure] Question category distribution across Equipment Safety, Human Safety, Object Recognition, Attribute Recognition, Temporal Understanding, and Spatial Understanding.
| Warehouse Type  | Videos | QA Pairs | Reasoning QAs | Avg. Duration (s) |
|-----------------|--------|----------|---------------|-------------------|
| Small Warehouse | 60     | 971      | ~650          | 85.2              |
| Large Warehouse | 16     | 373      | ~250          | 240.0             |
| Total           | 76     | 1,344    | ~900          | -                 |
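The per-split numbers above can be recomputed from per-video annotations. A minimal sketch, assuming a hypothetical `videos.json` layout (with `warehouse_type`, `duration_s`, and a `qa_pairs` list carrying a `requires_reasoning` flag); the benchmark's actual file format may differ.

```python
import json
from collections import defaultdict

# Hypothetical schema: the benchmark's real file layout may differ.
with open("videos.json") as f:
    videos = json.load(f)

stats = defaultdict(lambda: {"videos": 0, "qa": 0, "reasoning": 0, "dur": 0.0})
for v in videos:
    s = stats[v["warehouse_type"]]  # "small" or "large"
    s["videos"] += 1
    s["qa"] += len(v["qa_pairs"])
    s["reasoning"] += sum(q["requires_reasoning"] for q in v["qa_pairs"])
    s["dur"] += v["duration_s"]

for wtype, s in stats.items():
    # Videos, QA pairs, reasoning QAs, and average duration per split.
    print(wtype, s["videos"], s["qa"], s["reasoning"],
          round(s["dur"] / s["videos"], 1))
```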

Example QA Pairs

Leaderboard

Comprehensive evaluation results showing Direct and Reasoning scores across different model categories, human presence scenarios, and warehouse sizes on the IndustryEQA benchmark.
| Method | Direct: Human | Direct: No Human | Direct: Small | Direct: Large | Reasoning: Human | Reasoning: No Human | Reasoning: Small | Reasoning: Large |
|--------|--------------|------------------|---------------|---------------|------------------|---------------------|------------------|------------------|
| **Blind LLMs** | | | | | | | | |
| GPT-4o-2024-11-20 | 38.10 | 41.68 | 40.06 | 36.53 | 28.67 | 32.18 | 30.54 | 29.52 |
| Gemini-2.0-Flash | 35.99 | 40.88 | 38.67 | 33.38 | 28.67 | 33.06 | 31.01 | 28.21 |
| DeepSeek-R1 | 37.81 | 40.51 | 39.29 | 33.91 | 27.08 | 30.87 | 29.10 | 27.91 |
| DeepSeek-V3-0324 | 36.10 | 43.98 | 40.42 | 33.18 | 27.83 | 32.84 | 30.50 | 27.51 |
| **Multi-Frame VLLMs** | | | | | | | | |
| LLaMA-4-Scout | 51.25 | 50.99 | 51.11 | 52.80 | 46.25 | 40.18 | 43.02 | 42.01 |
| Qwen2.5-VL-72B | 52.62 | 55.31 | 54.09 | 53.42 | 44.00 | 47.29 | 45.75 | 40.06 |
| InternVL2.5-78B | 60.71 | 59.73 | 60.17 | 58.58 | 55.00 | 50.44 | 52.57 | 49.60 |
| Claude-3.5-Haiku | 54.10 | 55.31 | 54.76 | 53.22 | 47.08 | 50.15 | 48.71 | 44.18 |
| GPT-4o-2024-11-20 | 57.23 | 57.52 | 57.39 | 61.39 | 51.50 | 49.49 | 50.43 | 46.39 |
| GPT-4.1-2025-04-14 | 63.95 | 63.53 | 63.72 | 66.42 | 60.33 | 52.49 | 56.16 | 55.22 |
| o4-mini-2025-04-16 | 70.22 | 69.22 | 69.67 | 69.03 | 67.58 | 67.82 | 67.71 | 63.25 |
| **Video VLLMs** | | | | | | | | |
| Gemini-2.0-Flash | 56.95 | 59.87 | 58.55 | 65.82 | 38.00 | 38.64 | 38.34 | 54.72 |
| Gemini-2.5-Flash | 65.21 | 68.05 | 66.76 | 70.24 | 60.67 | 59.68 | 60.14 | 61.45 |
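Direct and Reasoning scores are produced by an LLM judge (the ablation below checks sensitivity to the judge choice). A minimal sketch of judge-based scoring, assuming an OpenAI-style chat API and a 0-100 rubric; the paper's actual judge prompt and scale are not reproduced here.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative rubric; the benchmark's real judge prompt is an assumption.
JUDGE_PROMPT = """\
You are grading an embodied-QA answer against a reference answer.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Reply with a single integer from 0 (wrong) to 100 (fully correct)."""

def judge_score(question: str, reference: str, prediction: str) -> int:
    """Ask an LLM judge to grade one predicted answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # one of the judges tested in the ablation
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(
                       question=question, reference=reference,
                       prediction=prediction)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```

Per-question scores would then be averaged within each split (Human / No Human, Small / Large) to populate the table above.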

Key Findings

🎯 Visual Grounding is Critical

VLLMs substantially outperform Blind LLMs: the leading multimodal models exceed 65 on Direct Score, while the best Blind LLM peaks near 44, confirming that visual grounding is essential.

🧠 Reasoning Remains Challenging

Reasoning Scores trail Direct Scores for nearly every model, revealing persistent difficulty with the complex causal, spatial, and temporal understanding crucial for safety awareness.

🏆 Leading Architectures

Gemini-2.5-Flash, o4-mini, and GPT-4.1 demonstrate superior performance across both direct and reasoning tasks.

⚠️ Safety Comprehension Gaps

Models perform comparably on Equipment Safety and Human Safety questions, yet both domains leave substantial room for improvement.

Category-wise Performance Analysis

[Figure] Category-wise performance breakdown across all six categories for the top-performing models.

Ablation Studies

[Figure] Impact of sampled frame density: performance vs. number of sampled frames (5-50).
[Figure] LLM judge sensitivity: scoring consistency across judges (GPT-4o-mini vs. Gemini-2.0-Flash).
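The frame-density ablation varies how many frames are fed to multi-frame VLLMs. A minimal sketch of the sampling step using OpenCV, assuming uniform spacing (the exact sampling strategy is not stated here and is our assumption).

```python
import cv2  # OpenCV: pip install opencv-python
import numpy as np

def sample_frames(video_path: str, num_frames: int = 25) -> list[np.ndarray]:
    """Uniformly sample `num_frames` frames from an episodic-memory video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices spanning the whole video.
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```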

BibTeX

@article{li2025industryeqa,
  title={IndustryEQA: Pushing the Frontiers of Embodied Question Answering in Industrial Scenarios},
  author={Li, Yifan and Chen, Yuhang and Dao, Anh and Li, Lichi and Cai, Zhongyi and Tan, Zhen and Chen, Tianlong and Kong, Yu},
  journal={arXiv preprint arXiv:2505.20640},
  year={2025}
}