IndustryEQA: Pushing the Frontiers of Embodied Question Answering in Industrial Scenarios

¹Michigan State University, ²University of North Carolina at Chapel Hill, ³Independent Researcher, ⁴Arizona State University
*Equal contribution

IndustryEQA introduces the first benchmark for embodied question answering in safety-critical warehouse environments, bridging the gap between household EQA and real-world industrial applications.

1,344 QA Pairs Total
76 Episodic Memory Videos
6 Categories Covered
50%+ Safety-Focused QAs

Abstract

Existing Embodied Question Answering (EQA) benchmarks focus primarily on household environments, overlooking the safety-critical aspects and reasoning processes essential to industrial settings. This limitation leaves agent readiness for real-world industrial applications largely unevaluated.

To bridge this gap, we introduce IndustryEQA, the first benchmark dedicated to evaluating embodied agent capabilities within safety-critical warehouse scenarios. Built upon the NVIDIA Isaac Sim platform, IndustryEQA provides high-fidelity episodic memory videos featuring diverse industrial assets, dynamic human agents, and carefully designed hazardous situations inspired by real-world safety guidelines.

The benchmark includes rich annotations covering six categories: equipment safety, human safety, object recognition, attribute recognition, temporal understanding, and spatial understanding. It also provides a supplementary reasoning evaluation built on these categories.

We propose a comprehensive evaluation framework and benchmark a range of baseline models, assessing their general perception and reasoning abilities in industrial environments. IndustryEQA aims to steer EQA research towards developing more robust, safety-aware, and practically applicable embodied agents for complex industrial environments.

Benchmark Overview

[Figure] Comparison of small and large warehouse environments across three scenarios: empty, without humans, and with humans.

🛡️ Safety Categories

Equipment Safety: Risks associated with warehouse machinery, pathway obstructions, improper stacking, and equipment placement

Human Safety: Direct risks to people, including potential collisions, falling hazards, ergonomic issues, and PPE usage

🔍 Perception Categories

Object & Attribute Recognition: Identifying objects and their characteristics (color, size, shape, state)

Spatial & Temporal Understanding: Object positions, distances, directions, and sequence of events

Data Generation Pipeline

[Figure] The IndustryEQA data generation pipeline, a three-step process: Video Capture → QA Generation & Refinement → Human Filtering.
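In code form, the pipeline reduces to three composable stages. The sketch below is illustrative only: the function names, the `QAPair` schema, and the orchestration loop are our assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    category: str            # e.g., "human_safety", "spatial_understanding"
    requires_reasoning: bool

def capture_video(scene_config: dict) -> str:
    """Stage 1: render an episodic-memory video of the configured
    warehouse scene in NVIDIA Isaac Sim; returns the video path.
    (Placeholder: rendering happens inside Isaac Sim.)"""
    raise NotImplementedError

def generate_and_refine_qa(video_path: str) -> list[QAPair]:
    """Stage 2: draft candidate QA pairs from the video, then refine
    them in a second pass. (Placeholder.)"""
    raise NotImplementedError

def human_filter(candidates: list[QAPair]) -> list[QAPair]:
    """Stage 3: human reviewers discard ambiguous, trivial, or
    incorrect pairs. (Placeholder.)"""
    raise NotImplementedError

def build_benchmark(scene_configs: list[dict]) -> list[QAPair]:
    # Orchestrate the three stages over all scene configurations.
    dataset: list[QAPair] = []
    for config in scene_configs:
        video = capture_video(config)
        dataset.extend(human_filter(generate_and_refine_qa(video)))
    return dataset
```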

Dataset Statistics

[Figure] Question category distribution across Equipment Safety, Human Safety, Object Recognition, Attribute Recognition, Temporal Understanding, and Spatial Understanding.
| Warehouse Type  | Videos | QA Pairs | Reasoning QAs | Avg. Duration (s) |
|-----------------|--------|----------|---------------|-------------------|
| Small Warehouse | 60     | 971      | ~650          | 85.2              |
| Large Warehouse | 16     | 373      | ~250          | 240.0             |
| Total           | 76     | 1,344    | ~900          | -                 |
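The per-split numbers above can be recomputed from per-video annotations. A minimal sketch, assuming a hypothetical `videos.json` layout (with `warehouse_type`, `duration_s`, and a `qa_pairs` list carrying a `requires_reasoning` flag); the benchmark's actual file format may differ.

```python
import json
from collections import defaultdict

# Hypothetical schema: the benchmark's real file layout may differ.
with open("videos.json") as f:
    videos = json.load(f)

stats = defaultdict(lambda: {"videos": 0, "qa": 0, "reasoning": 0, "dur": 0.0})
for v in videos:
    s = stats[v["warehouse_type"]]  # "small" or "large"
    s["videos"] += 1
    s["qa"] += len(v["qa_pairs"])
    s["reasoning"] += sum(q["requires_reasoning"] for q in v["qa_pairs"])
    s["dur"] += v["duration_s"]

for wtype, s in stats.items():
    # Videos, QA pairs, reasoning QAs, and average duration per split.
    print(wtype, s["videos"], s["qa"], s["reasoning"],
          round(s["dur"] / s["videos"], 1))
```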

Example QA Pairs

Leaderboard

Comprehensive evaluation results showing Direct and Reasoning scores across different model categories, human presence scenarios, and warehouse sizes on the IndustryEQA benchmark.
| Method | Direct: Human | Direct: No Human | Direct: Small | Direct: Large | Reasoning: Human | Reasoning: No Human | Reasoning: Small | Reasoning: Large |
|--------|--------------|------------------|---------------|---------------|------------------|---------------------|------------------|------------------|
| **Blind LLMs** | | | | | | | | |
| GPT-4o-2024-11-20 | 38.10 | 41.68 | 40.06 | 36.53 | 28.67 | 32.18 | 30.54 | 29.52 |
| Gemini-2.0-Flash | 35.99 | 40.88 | 38.67 | 33.38 | 28.67 | 33.06 | 31.01 | 28.21 |
| DeepSeek-R1 | 37.81 | 40.51 | 39.29 | 33.91 | 27.08 | 30.87 | 29.10 | 27.91 |
| DeepSeek-V3-0324 | 36.10 | 43.98 | 40.42 | 33.18 | 27.83 | 32.84 | 30.50 | 27.51 |
| **Multi-Frame VLLMs** | | | | | | | | |
| LLaMA-4-Scout | 51.25 | 50.99 | 51.11 | 52.80 | 46.25 | 40.18 | 43.02 | 42.01 |
| Qwen2.5-VL-72B | 52.62 | 55.31 | 54.09 | 53.42 | 44.00 | 47.29 | 45.75 | 40.06 |
| InternVL2.5-78B | 60.71 | 59.73 | 60.17 | 58.58 | 55.00 | 50.44 | 52.57 | 49.60 |
| Claude-3.5-Haiku | 54.10 | 55.31 | 54.76 | 53.22 | 47.08 | 50.15 | 48.71 | 44.18 |
| GPT-4o-2024-11-20 | 57.23 | 57.52 | 57.39 | 61.39 | 51.50 | 49.49 | 50.43 | 46.39 |
| GPT-4.1-2025-04-14 | 63.95 | 63.53 | 63.72 | 66.42 | 60.33 | 52.49 | 56.16 | 55.22 |
| o4-mini-2025-04-16 | 70.22 | 69.22 | 69.67 | 69.03 | 67.58 | 67.82 | 67.71 | 63.25 |
| **Video VLLMs** | | | | | | | | |
| Gemini-2.0-Flash | 56.95 | 59.87 | 58.55 | 65.82 | 38.00 | 38.64 | 38.34 | 54.72 |
| Gemini-2.5-Flash | 65.21 | 68.05 | 66.76 | 70.24 | 60.67 | 59.68 | 60.14 | 61.45 |
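Direct and Reasoning scores are produced by an LLM judge (the ablation below checks sensitivity to the judge choice). A minimal sketch of judge-based scoring, assuming an OpenAI-style chat API and a 0-100 rubric; the paper's actual judge prompt and scale are not reproduced here.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative rubric; the benchmark's real judge prompt is an assumption.
JUDGE_PROMPT = """\
You are grading an embodied-QA answer against a reference answer.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Reply with a single integer from 0 (wrong) to 100 (fully correct)."""

def judge_score(question: str, reference: str, prediction: str) -> int:
    """Ask an LLM judge to grade one predicted answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # one of the judges tested in the ablation
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(
                       question=question, reference=reference,
                       prediction=prediction)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```

Per-question scores would then be averaged within each split (Human / No Human, Small / Large) to populate the table above.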

Key Findings

🎯 Visual Grounding is Critical

VLLMs substantially outperform Blind LLMs: the leading multimodal models exceed 65 on Direct Score, while the best Blind LLM peaks near 44, confirming that visual grounding is essential.

🧠 Reasoning Remains Challenging

Reasoning Scores trail Direct Scores for nearly every model, revealing persistent difficulty with the complex causal, spatial, and temporal understanding crucial for safety awareness.

🏆 Leading Architectures

Gemini-2.5-Flash, o4-mini, and GPT-4.1 demonstrate superior performance across both direct and reasoning tasks.

⚠️ Safety Comprehension Gaps

Models perform comparably on Equipment Safety and Human Safety questions, yet both domains leave substantial room for improvement.

Category-wise Performance Analysis

[Figure] Category-wise performance breakdown across all six categories for the top-performing models.

Ablation Studies

[Figure] Impact of sampled frame density: performance vs. number of sampled frames (5-50).
[Figure] LLM judge sensitivity: scoring consistency across judges (GPT-4o-mini vs. Gemini-2.0-Flash).
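The frame-density ablation varies how many frames are fed to multi-frame VLLMs. A minimal sketch of the sampling step using OpenCV, assuming uniform spacing (the exact sampling strategy is not stated here and is our assumption).

```python
import cv2  # OpenCV: pip install opencv-python
import numpy as np

def sample_frames(video_path: str, num_frames: int = 25) -> list[np.ndarray]:
    """Uniformly sample `num_frames` frames from an episodic-memory video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices spanning the whole video.
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```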

BibTeX

@article{li2025industryeqa,
  title={IndustryEQA: Pushing the Frontiers of Embodied Question Answering in Industrial Scenarios},
  author={Li, Yifan and Chen, Yuhang and Dao, Anh and Li, Lichi and Cai, Zhongyi and Tan, Zhen and Chen, Tianlong and Kong, Yu},
  journal={arXiv preprint arXiv:2505.20640},
  year={2025}
}