Sensible Agent: A framework for unobtrusive interaction with proactive AR agents
Sensible Agent is a research prototype designed to move AR agents beyond explicit voice commands toward proactive, context-aware assistance. By sensing the user's environment and physical state in real time across multiple modalities, the framework delivers help through whichever interaction channels are least disruptive at that moment. The goal is to anticipate what the user needs while keeping cognitive and social disruption to a minimum, rather than waiting for an explicit request.
Contextual Understanding via Multimodal Parsing
The framework first analyzes the user's immediate surroundings and physical state to determine whether, and how, assistance should be offered.
- A Vision-Language Model (VLM) processes egocentric camera feeds from the AR headset to identify high-level activities and locations.
- YAMNet, a pre-trained audio event classifier, monitors environmental noise levels to determine if audio feedback is appropriate.
- The system synthesizes these inputs into a parsed context that accounts for situational impairments, such as when a user’s hands are occupied; a simplified version of this parsing stage is sketched below.
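The sketch below shows one way these signals could be fused. The YAMNet usage follows its public TensorFlow Hub interface, but `vlm_describe`, `ParsedContext`, and the hands-busy heuristic are illustrative assumptions rather than the prototype's actual API.

```python
# Minimal sketch of the context-parsing stage. The YAMNet usage follows its
# public TensorFlow Hub interface; vlm_describe, ParsedContext, and the
# hands-busy heuristic are illustrative assumptions, not the prototype's API.
import csv
from dataclasses import dataclass

import numpy as np
import tensorflow_hub as hub

# YAMNet maps a 16 kHz mono waveform to per-frame scores over 521 AudioSet classes.
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")
class_map_path = yamnet.class_map_path().numpy().decode("utf-8")
with open(class_map_path) as f:
    class_names = [row["display_name"] for row in csv.DictReader(f)]


def vlm_describe(frame) -> str:
    """Placeholder for the vision-language model call on an egocentric frame.

    In the prototype this would return a high-level description of the
    activity, location, and what the user's hands are doing.
    """
    raise NotImplementedError


@dataclass
class ParsedContext:
    activity: str       # free-text activity/location description from the VLM
    ambient_audio: str  # dominant YAMNet label, e.g. "Speech" or "Silence"
    noisy: bool         # whether audio output is likely inappropriate
    hands_busy: bool    # situational impairment inferred from the caption


def parse_context(frame, waveform_16k: np.ndarray) -> ParsedContext:
    caption = vlm_describe(frame)

    # Average per-frame scores over the clip and take the top audio class.
    scores, _, _ = yamnet(waveform_16k)
    label = class_names[int(np.mean(scores.numpy(), axis=0).argmax())]

    return ParsedContext(
        activity=caption,
        ambient_audio=label,
        noisy=label != "Silence",
        hands_busy=any(w in caption.lower() for w in ("holding", "carrying")),
    )
```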
Reasoning with Proactive Query Generation
Once the context is established, the system determines what specific assistance is needed through a prompt-based reasoning process.
- The framework uses chain-of-thought (CoT) reasoning to decompose complex problems into intermediate logical steps.
- Few-shot learning, guided by examples from data collection studies, helps the model decide between actions like providing translations or displaying a grocery list.
- The generator outputs a structured suggestion that includes the specific action, the query format (e.g., binary choice or icons), and the presentation modality (visual, audio, or both), as illustrated in the sketch below.
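The following sketch illustrates what the structured suggestion and a few-shot, chain-of-thought prompt could look like; the field names, example context, and prompt wording are assumptions chosen for illustration, not the published prompt.

```python
# Sketch of the proactive query generator's output schema and a few-shot,
# chain-of-thought prompt. Field names, the example, and the prompt wording
# are illustrative assumptions, not the published prompt.
from dataclasses import dataclass
from typing import Literal


@dataclass
class ProactiveSuggestion:
    action: str                                               # e.g. "show_grocery_list"
    query_format: Literal["binary", "multi_choice", "icons"]  # how the agent asks
    presentation: Literal["visual", "audio", "visual_audio"]  # how it is delivered


FEW_SHOT_EXAMPLE = """\
Context: user is in a supermarket aisle, both hands on a cart, ambient speech.
Reasoning: hands are occupied and the aisle is noisy, so prefer a visual panel
with a binary yes/no query the user can confirm with a head nod.
Suggestion: {"action": "show_grocery_list", "query_format": "binary",
             "presentation": "visual"}
"""


def build_prompt(parsed_context: str) -> str:
    # Chain-of-thought prompting: the worked example spells out intermediate
    # reasoning before the structured suggestion, and the model is asked to
    # follow the same pattern for the new context.
    return (
        "You proactively assist an AR user without interrupting them.\n\n"
        + FEW_SHOT_EXAMPLE
        + f"\nContext: {parsed_context}\nReasoning:"
    )
```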
Dynamic Modality and Interaction Management
The final stage of the framework manages how the agent communicates with the user and how the user can respond without breaking their current flow.
- The prototype, built on Android XR and WebXR, utilizes a UI Manager to render visual panels or generate text-to-speech (TTS) prompts based on the agent's decision.
- An Input Modality Manager activates the most discreet response methods available, such as head gestures (nods), hand gestures (thumbs up), or gaze tracking.
- This adaptive selection ensures that if a user is in a noisy room or a social setting, the agent can switch from verbal interaction to subtle visual cues and gesture-based confirmations; a simplified version of this selection logic appears below.
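The sketch below captures the spirit of this adaptive selection; the rule set and names (`InteractionPlan`, `choose_modalities`) are assumptions intended to convey the idea, not the prototype's actual policy.

```python
# Sketch of adaptive modality selection. The rules and names (InteractionPlan,
# choose_modalities) are assumptions that convey the idea, not the prototype's
# actual policy.
from dataclasses import dataclass


@dataclass
class InteractionPlan:
    output: str        # "visual", "tts", or "visual+tts"
    inputs: list[str]  # discreet response methods to activate


def choose_modalities(noisy: bool, social_setting: bool, hands_busy: bool) -> InteractionPlan:
    # In noisy or social settings, avoid speech in both directions and fall
    # back to a visual panel with gesture- or gaze-based confirmation.
    if noisy or social_setting:
        output, inputs = "visual", ["head_nod", "gaze_dwell"]
    else:
        output, inputs = "visual+tts", ["voice", "head_nod"]

    # A thumbs-up is an additional low-effort confirmation when hands are free.
    if not hands_busy:
        inputs.append("hand_thumbs_up")
    return InteractionPlan(output=output, inputs=inputs)
```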
By prioritizing social awareness and context-sensitivity, Sensible Agent provides a blueprint for AR systems that feel like helpful companions rather than intrusive tools. Implementing such frameworks is essential for making proactive digital assistants practical and acceptable for long-term, everyday use in public and private spaces.