When Agents Get Lost: Dissecting Failure Modes in Graph-Based Navigation Instruction Evaluation

Aalto University (1 Built Environment; 2 Computer Science)

Abstract: Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions for spatial reasoning, yet evaluating instruction quality remains challenging when agents fail. To address this, we present a taxonomy of navigation instruction failures that clusters failure cases into four categories: (i) linguistic properties, (ii) topological constraints, (iii) agent limitations, and (iv) execution barriers. We then introduce a dataset of 492 annotated navigation failure traces collected from GROKE, a vision-free evaluation framework that uses OpenStreetMap (OSM) data. Our analysis shows that agent limitations (74.2%) constitute the dominant error category, with stop-location errors and planning failures as the most frequent subcategories. Together, the dataset and taxonomy provide actionable insights: instruction generation systems can identify and avoid under-specification patterns, and evaluation frameworks can systematically distinguish instruction quality issues from agent-specific artifacts.

Task & Motivation

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse language understanding tasks, yet their spatial reasoning abilities remain fundamentally limited. In Vision-and-Language Navigation (VLN), this gap manifests through systematic failures in instruction following. When navigation fails, a critical question arises: is the failure attributable to agent limitations, instruction quality, or their interaction?

Existing failure analysis frameworks predominantly focus on agent-level diagnostics, assuming that navigation instructions themselves are correct and complete. The instruction-level failure modes—such as linguistic ambiguity, topological inaccuracy, or insufficient spatial detail—remain largely unexamined. Vision-free evaluation frameworks like GROKE offer an opportunity to isolate instruction properties from perceptual challenges, enabling direct assessment of whether instructions contain sufficient, correct, and actionable spatial information.

Failure Taxonomy

We developed a hierarchical taxonomy by analyzing failure patterns and agentic system design methodologies reported in prior navigation research. The taxonomy covers four dimensions:

  • Linguistic (L): Under-specification, over-specification, temporal ambiguity, numerical inconsistency, referential complexity, negation confusion, directional ambiguity, scale/distance vagueness, and landmark co-reference.
  • Topological (T): Path ambiguity, landmark displacement, scale mismatch, stop-location errors, and junction complexity.
  • Agent (A): POI grounding failure, heading initialization, sub-goal segmentation, context window saturation, perception failures, planning and reasoning errors, memory and state tracking, and multi-agent coordination.
  • Execution (E): Looping behavior, exploration inefficiency, action execution errors, timing and temporal errors, and verification failures.
Figure 1: Overview of the four failure taxonomy dimensions (Linguistic, Topological, Agent, Execution) and their subcategories.
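The taxonomy above can be treated as a flat lookup from dimension code to subcategory list, which is convenient for validating annotation labels. The sketch below is an illustrative encoding, not the authors' annotation tooling; the `validate_label` helper and the `"A: ..."` label format are assumptions.

```python
# Hypothetical encoding of the four-dimension failure taxonomy.
# Dimension codes and subcategory names follow the lists above;
# the data structure itself is illustrative, not the authors' code.
TAXONOMY = {
    "L": [  # Linguistic
        "under-specification", "over-specification", "temporal ambiguity",
        "numerical inconsistency", "referential complexity", "negation confusion",
        "directional ambiguity", "scale/distance vagueness", "landmark co-reference",
    ],
    "T": [  # Topological
        "path ambiguity", "landmark displacement", "scale mismatch",
        "stop-location errors", "junction complexity",
    ],
    "A": [  # Agent
        "POI grounding failure", "heading initialization", "sub-goal segmentation",
        "context window saturation", "perception failures",
        "planning and reasoning errors", "memory and state tracking",
        "multi-agent coordination",
    ],
    "E": [  # Execution
        "looping behavior", "exploration inefficiency", "action execution errors",
        "timing and temporal errors", "verification failures",
    ],
}

def validate_label(label: str) -> bool:
    """Check that a label like 'A: planning and reasoning errors'
    names a known dimension and subcategory."""
    dim, _, sub = label.partition(": ")
    return sub in TAXONOMY.get(dim, [])
```

A validator like this catches labels filed under the wrong dimension (e.g. a topological subcategory tagged as linguistic) before annotations enter the dataset.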

Data Exploration

Our annotated dataset of 492 navigation failure traces can be explored through the interactive Data Explorer tool. Each trace includes the agent's reasoning at each step, identified sub-goals, extracted POIs, the map with annotated paths, and the graph network showing the traversed path with marked POIs.
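Each trace bundles the components listed above into one record; a minimal schema sketch follows. The class name, field names, and types are hypothetical, chosen only to mirror the description (per-step reasoning, sub-goals, POIs, traversed path, taxonomy tags).

```python
from dataclasses import dataclass, field

# Illustrative schema for one annotated failure trace. All field names
# are assumptions that mirror the components described in the text.
@dataclass
class FailureTrace:
    trace_id: str
    step_reasoning: list[str]    # the agent's reasoning at each step
    sub_goals: list[str]         # sub-goals identified from the instruction
    extracted_pois: list[str]    # POIs extracted from the instruction
    traversed_nodes: list[int]   # graph nodes along the traversed path
    failure_labels: list[str] = field(default_factory=list)  # taxonomy tags, e.g. "A", "E"

# Example record (contents invented for illustration):
trace = FailureTrace(
    trace_id="trace-001",
    step_reasoning=["Head north toward the square."],
    sub_goals=["reach the square"],
    extracted_pois=["the square"],
    traversed_nodes=[101, 102, 103],
    failure_labels=["A", "E"],
)
```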

Key Findings

Our analysis of 492 failed navigation instances reveals a clear distribution of error sources:

  • Agent Limitations (74.2%) are the dominant error category. Stop-location errors (49.9% of agent failures) and planning/reasoning errors (32.1%) are the most frequent subcategories.
  • Execution Failures (46.5%) are the second most common, with timing and temporal errors (49.8%) dominating this category.
  • Linguistic Properties (23.2%) contribute notably, with over-specification (42.1%) as the leading linguistic failure mode.
  • Topological Constraints (12.4%) appear least frequently, though junction complexity accounts for 73.8% within this dimension.

Notably, half of all samples exhibit errors spanning multiple dimensions, indicating that navigation failures frequently result from compounded issues rather than isolated problems.
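The multi-dimension overlap reported above (and the co-occurrence matrix in Figure 3) can be derived from per-sample label sets. A minimal sketch, with a toy five-sample list standing in for the 492 annotated traces:

```python
from collections import Counter
from itertools import combinations

# Toy per-sample dimension label sets (invented for illustration;
# the real dataset has 492 annotated traces).
samples = [
    {"A"}, {"A", "E"}, {"A", "E", "L"}, {"E", "T"}, {"A", "L"},
]

# Count how often each unordered pair of dimensions co-occurs in a sample.
pair_counts = Counter()
for labels in samples:
    for pair in combinations(sorted(labels), 2):
        pair_counts[pair] += 1

# Fraction of samples whose errors span more than one dimension.
multi = sum(len(s) > 1 for s in samples)
print(pair_counts[("A", "E")])   # → 2
print(multi / len(samples))      # → 0.8
```

With multi-label annotations like these, per-dimension frequencies naturally sum to more than 100%, which is why the category percentages in the findings above exceed 100% in total.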

Figure 2: Breakdown of error frequencies across all taxonomy subcategories.
Figure 3: Co-occurrence matrix showing how failure dimensions overlap across samples.
Figure 4: Cascade patterns showing how failures propagate across navigation steps.

BibTeX

@misc{shami2026grokevisionfreenavigationinstruction,
      title={GROKE: Vision-Free Navigation Instruction Evaluation via Graph Reasoning on OpenStreetMap},
      author={Farzad Shami and Subhrasankha Dey and Nico Van de Weghe and Henrikki Tenkanen},
      year={2026},
      eprint={2601.07375},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.07375},
}