Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse language understanding tasks, yet their spatial reasoning abilities remain fundamentally limited. In Vision-and-Language Navigation (VLN), this gap manifests as systematic failures in instruction following. When navigation fails, a critical question arises: is the failure attributable to agent limitations, instruction quality, or their interaction?
Existing failure analysis frameworks predominantly focus on agent-level diagnostics, assuming that navigation instructions themselves are correct and complete. Instruction-level failure modes, such as linguistic ambiguity, topological inaccuracy, or insufficient spatial detail, remain largely unexamined. Vision-free evaluation frameworks like GROKE offer an opportunity to isolate instruction properties from perceptual challenges, enabling direct assessment of whether instructions contain sufficient, correct, and actionable spatial information.
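To make the distinction concrete, an instruction-level diagnostic could be sketched as a small taxonomy of failure modes plus heuristic checks over the instruction text alone, with no perceptual input. The taxonomy names, term lists, and the `diagnose` function below are illustrative assumptions for exposition, not GROKE's actual interface or any published tool:

```python
from enum import Enum, auto

class InstructionFailure(Enum):
    """Hypothetical instruction-level failure modes (illustrative labels)."""
    LINGUISTIC_AMBIGUITY = auto()        # vague referents, e.g. "somewhere", "it"
    INSUFFICIENT_SPATIAL_DETAIL = auto() # no directional or relational terms

# Toy lexicons; a real diagnostic would need far richer linguistic analysis.
AMBIGUOUS_TERMS = {"somewhere", "around", "it", "there"}
SPATIAL_TERMS = {"left", "right", "forward", "straight", "past",
                 "through", "upstairs", "downstairs"}

def diagnose(instruction: str) -> list[InstructionFailure]:
    """Flag instruction-level failure modes using simple lexical heuristics."""
    tokens = instruction.lower().split()
    failures = []
    if any(t in AMBIGUOUS_TERMS for t in tokens):
        failures.append(InstructionFailure.LINGUISTIC_AMBIGUITY)
    if not any(t in SPATIAL_TERMS for t in tokens):
        failures.append(InstructionFailure.INSUFFICIENT_SPATIAL_DETAIL)
    return failures
```

Because the checks consume only the instruction string, any failure they flag is attributable to the instruction itself rather than to the agent's perception or control, which is the separation a vision-free evaluation aims for.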