StreetReaderAI: Towards making street view accessible via context-aware multimodal AI

StreetReaderAI is a research prototype designed to make immersive street-level imagery accessible to the blind and low-vision community through multimodal AI. By integrating real-time scene analysis with context-aware geographic data, the system transforms visual mapping data into an interactive, audio-first experience. This framework allows users to virtually explore environments and plan routes with a level of detail and independence previously unavailable through traditional screen readers.

Navigation and Spatial Awareness

The system offers an immersive, first-person exploration interface that mimics the mechanics of accessible gaming.

  • Users navigate using keyboard shortcuts or voice commands, taking "virtual steps" forward or backward and panning their view in 360 degrees.
  • Real-time audio feedback provides cardinal and intercardinal directions, such as "Now facing North," to maintain spatial orientation.
  • Distance tracking informs the user how far they have traveled between panoramic images, while "teleport" features allow quick jumps to specific addresses or landmarks (a minimal navigation sketch follows this list).
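The mechanics above reduce to a small amount of navigation state: a position, a heading, and some spherical geometry for announcing directions and distances. The Python sketch below is illustrative only; the names (VirtualWalker, heading_to_compass, haversine_m) are hypothetical and not taken from the prototype.

```python
# Minimal sketch of the state behind "virtual steps" and 360-degree panning.
# Names and structure are assumptions, not StreetReaderAI's implementation.
import math
from dataclasses import dataclass

COMPASS = ["North", "Northeast", "East", "Southeast",
           "South", "Southwest", "West", "Northwest"]

def heading_to_compass(heading_deg: float) -> str:
    """Map a heading in degrees (0 = North) to a cardinal/intercardinal name."""
    return COMPASS[round(heading_deg % 360 / 45) % 8]

def haversine_m(lat1, lng1, lat2, lng2) -> float:
    """Great-circle distance in meters between two panorama locations."""
    r = 6_371_000
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

@dataclass
class VirtualWalker:
    lat: float
    lng: float
    heading: float = 0.0  # degrees clockwise from North

    def pan(self, delta_deg: float) -> str:
        """Rotate the view and announce the new facing direction."""
        self.heading = (self.heading + delta_deg) % 360
        return f"Now facing {heading_to_compass(self.heading)}"

    def step_to(self, new_lat: float, new_lng: float) -> str:
        """Move to the next panorama and report the distance traveled."""
        dist = haversine_m(self.lat, self.lng, new_lat, new_lng)
        self.lat, self.lng = new_lat, new_lng
        return f"Moved {dist:.0f} meters"

# Example usage:
walker = VirtualWalker(lat=40.7580, lng=-73.9855, heading=0.0)
print(walker.pan(90))                     # "Now facing East"
print(walker.step_to(40.7583, -73.9851))  # "Moved 47 meters"
```

In the prototype, each pan and virtual step also refreshes the imagery and geographic context handed to the AI components described next.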

Context-Aware AI Describer

At the core of the tool is the AI Describer, a Gemini-backed subsystem that synthesizes visual and geographic data into scene descriptions.

  • The AI Describer combines the current field-of-view image with dynamic metadata about nearby roads, intersections, and points of interest (see the prompt-assembly sketch after this list).
  • Two distinct modes cater to different user needs: a "Default" mode focusing on pedestrian safety and navigation, and a "Tour Guide" mode that provides historical and architectural details.
  • The system uses Gemini to proactively suggest follow-up questions relevant to the specific scene, such as details about crosswalks or building entrances.
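As a rough illustration of how such a request might be assembled, the sketch below combines a field-of-view image, nearby map metadata, and a mode-specific instruction into a single Gemini call. It assumes the google-genai Python SDK; the model name, prompt wording, and metadata schema are placeholders, not the ones used by StreetReaderAI.

```python
# Sketch of a context-aware description request. The prompts and metadata
# fields below are illustrative assumptions, not the prototype's actual ones.
from google import genai
from google.genai import types

MODE_INSTRUCTIONS = {
    "default": (
        "Describe this street scene for a blind pedestrian. Prioritize "
        "sidewalks, crosswalks, traffic signals, obstacles, and building entrances."
    ),
    "tour_guide": (
        "Describe this street scene like a tour guide, highlighting notable "
        "architecture, history, and nearby points of interest."
    ),
}

def describe_scene(image_jpeg: bytes, geo_context: dict, mode: str = "default") -> str:
    """Combine the current field-of-view image with nearby map data and a
    mode-specific instruction, then request a description from Gemini."""
    client = genai.Client()  # reads the API key from the environment
    geo_lines = [
        f"Camera heading: {geo_context['heading']} degrees",
        f"Nearby roads: {', '.join(geo_context['roads'])}",
        f"Nearby places: {', '.join(geo_context['places'])}",
    ]
    prompt = MODE_INSTRUCTIONS[mode] + "\n\nGeographic context:\n" + "\n".join(geo_lines)
    response = client.models.generate_content(
        model="gemini-2.0-flash",  # placeholder model name
        contents=[
            types.Part.from_bytes(data=image_jpeg, mime_type="image/jpeg"),
            prompt,
        ],
    )
    return response.text
```

Keeping the geographic context in the prompt, rather than relying on the image alone, is what lets the describer name specific streets and places instead of guessing from pixels.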

Interactive Dialogue and Session Memory

StreetReaderAI uses the Gemini Multimodal Live API to support real-time, natural-language conversation about the environment.

  • The AI Chat agent operates over a context window of 1,048,576 tokens (roughly one million), allowing it to retain a "memory" of up to 4,000 previous images and interactions.
  • This memory allows users to ask retrospective spatial questions, such as "Where was that bus stop I just passed?", with the agent providing relative directions based on the user's current location (see the sketch after this list).
  • By tracking every pan and movement, the agent can provide specific details about the environment that were captured in previous steps of the virtual walk.
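A heavily simplified sketch of this kind of session memory and retrospective lookup is shown below. In the actual prototype the "memory" lives in the Live session's context window rather than in explicit data structures; the classes, labels, and angle thresholds here are assumptions for illustration.

```python
# Hypothetical session memory with relative-direction answers.
import math
from dataclasses import dataclass, field

def haversine_m(lat1, lng1, lat2, lng2) -> float:
    """Great-circle distance in meters between two points."""
    r = 6_371_000
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def initial_bearing(lat1, lng1, lat2, lng2) -> float:
    """Compass bearing in degrees from point 1 to point 2."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lng2 - lng1)
    y = math.sin(dl) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dl)
    return math.degrees(math.atan2(y, x)) % 360

@dataclass
class MemoryEntry:
    label: str   # e.g. "bus stop", noted during an earlier step of the walk
    lat: float
    lng: float

@dataclass
class SessionMemory:
    entries: list[MemoryEntry] = field(default_factory=list)

    def remember(self, label: str, lat: float, lng: float) -> None:
        self.entries.append(MemoryEntry(label, lat, lng))

    def locate(self, label: str, cur_lat: float, cur_lng: float, cur_heading: float) -> str:
        """Answer a retrospective question relative to the user's current pose."""
        match = next((e for e in reversed(self.entries) if label in e.label), None)
        if match is None:
            return f"I don't have a {label} in memory."
        relative = (initial_bearing(cur_lat, cur_lng, match.lat, match.lng) - cur_heading) % 360
        side = ("ahead of you" if relative < 45 or relative >= 315 else
                "to your right" if relative < 135 else
                "behind you" if relative < 225 else
                "to your left")
        dist = haversine_m(cur_lat, cur_lng, match.lat, match.lng)
        return f"The {match.label} is about {dist:.0f} meters {side}."

# Example usage:
memory = SessionMemory()
memory.remember("bus stop", 37.4221, -122.0841)  # logged while passing it
print(memory.locate("bus stop", 37.4225, -122.0846, cur_heading=90.0))
# -> "The bus stop is about 63 meters to your right."
```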

User Evaluation and Practical Application

Testing with blind screen reader users confirmed the system's utility in practical, real-world scenarios.

  • Participants successfully used the prototype to evaluate potential walking routes, identifying critical environmental features like the presence of benches or shelters at bus stops.
  • The study also highlighted the value of multimodal input: combining image recognition with structured map data yields more accurate and reliable descriptions than image analysis alone.

While StreetReaderAI remains a proof-of-concept, it demonstrates that the integration of multimodal LLMs and spatial data can bridge significant accessibility gaps in digital mapping. Future implementation of these technologies could transform how visually impaired individuals interact with the world, turning static street imagery into a functional tool for independent mobility and exploration.