vision-language-models

2 posts

google

Google Earth AI: Unlocking geospatial insights with foundation models and cross-modal reasoning

Google Earth AI introduces a framework of geospatial foundation models and reasoning agents designed to solve complex, planetary-scale challenges through cross-modal reasoning. By integrating Gemini-powered orchestrators with specialized imagery, population, and environmental models, the system deconstructs multifaceted queries into actionable multi-step plans. This approach enables a holistic understanding of real-world events, such as disaster response and disease forecasting, by grounding AI insights in diverse geospatial data.

## Geospatial Reasoning Agents

* Uses Gemini models as intelligent orchestrators to manage complex queries that require data from multiple domains.
* The agent deconstructs a high-level question, such as predicting hurricane landfalls and community vulnerability, into a sequence of smaller, executable tasks.
* It executes these plans by autonomously calling specialized foundation models, querying vast datastores, and using geospatial tools to fuse disparate data points into a single, cohesive answer (a sketch of this orchestration loop follows the post summary below).

## Remote Sensing and Imagery Foundations

* Employs vision-language models and open-vocabulary object detection trained on a large corpus of high-resolution overhead imagery paired with text descriptions.
* Enables "zero-shot" capabilities, allowing users to find specific objects like "flooded roads" or "building damage" using natural language without retraining the model for specific classes (see the retrieval sketch below).
* Technical evaluations show a 16% average improvement on text-based image search tasks and more than double the baseline accuracy for detecting novel objects in a zero-shot setting.

## Population Dynamics and Mobility

* Focuses on the interplay between people and places using globally consistent embeddings across 17 countries.
* Includes embeddings updated monthly that capture shifting human activity patterns, which are essential for time-sensitive forecasting.
* Research conducted with the University of Oxford showed that incorporating these population embeddings into a dengue fever forecasting model in Brazil improved the R² metric from 0.456 to 0.656 for long-range 12-month predictions (see the forecasting sketch below).

## Environmental and Disaster Forecasting

* Integrates established Google research into weather nowcasting, flood forecasting, and wildfire boundary mapping.
* Provides the reasoning agent with the data it needs to evaluate environmental risks alongside population density and infrastructure imagery.
* Aims to provide Search and Maps users with real-time, accurate alerts about natural disasters, grounded in planetary-scale environmental data.

Developers and enterprises looking to solve high-level geospatial problems can now express interest in accessing these capabilities through Google Earth and Google Cloud. By leveraging these foundation models, organizations can automate the analysis of satellite imagery and human mobility data to better prepare for environmental and social challenges.
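The orchestration loop described under "Geospatial Reasoning Agents" above can be illustrated with a minimal sketch: a planner model decomposes a question into steps, each step is dispatched to a specialized tool, and the results are fused. All names here (`call_gemini`, the tool labels) are hypothetical placeholders, not Google Earth AI APIs.

```python
# Minimal sketch of a planner/orchestrator pattern: decompose a high-level
# geospatial question into steps, dispatch each step to a specialized model,
# then fuse the results. Tool names and `call_gemini` are placeholders.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    tool: str            # which specialized capability to invoke
    query: str           # natural-language or structured sub-query
    result: object = None


def call_gemini(prompt: str) -> List[Step]:
    """Placeholder planner: a real system would prompt an LLM to emit a plan."""
    return [
        Step("weather_model", "predicted hurricane landfall zones, next 72h"),
        Step("population_model", "population density in predicted landfall zones"),
        Step("imagery_model", "critical infrastructure visible in those zones"),
    ]


def run_agent(question: str, tools: Dict[str, Callable[[str], object]]) -> str:
    plan = call_gemini(f"Decompose into executable steps: {question}")
    for step in plan:                      # execute the plan step by step
        step.result = tools[step.tool](step.query)
    # Fuse the intermediate results into one answer (here: a simple summary).
    fused = "; ".join(f"{s.tool}: {s.result}" for s in plan)
    return f"Answer to '{question}' -> {fused}"


if __name__ == "__main__":
    dummy_tools = {name: (lambda q, n=name: f"<{n} output for '{q}'>")
                   for name in ("weather_model", "population_model", "imagery_model")}
    print(run_agent("Which communities are most vulnerable to the incoming hurricane?",
                    dummy_tools))
```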
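The zero-shot, open-vocabulary retrieval described under "Remote Sensing and Imagery Foundations" typically relies on a dual encoder that maps imagery and text prompts into a shared embedding space. The sketch below assumes that style of model; the encoders are random stand-ins, not the actual Google Earth AI models.

```python
# Sketch of open-vocabulary retrieval over overhead imagery: embed tiles and a
# free-form text query (e.g. "flooded roads") into one space, rank by cosine
# similarity. Encoders here are random placeholders for illustration only.
import numpy as np

DIM = 512
rng = np.random.default_rng(0)


def embed_image(tile: np.ndarray) -> np.ndarray:
    """Stand-in image encoder; a real system would use a trained vision tower."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)


def embed_text(prompt: str) -> np.ndarray:
    """Stand-in text encoder; a real system would use a trained text tower."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)


def search(tiles: list, prompt: str, top_k: int = 3) -> list:
    image_matrix = np.stack([embed_image(t) for t in tiles])   # (N, DIM)
    query = embed_text(prompt)                                 # (DIM,)
    scores = image_matrix @ query                              # cosine similarity
    return np.argsort(-scores)[:top_k].tolist()                # best-matching tile indices


if __name__ == "__main__":
    tiles = [np.zeros((256, 256, 3)) for _ in range(10)]       # dummy imagery tiles
    print(search(tiles, "flooded roads"))
```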
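Finally, the forecasting result under "Population Dynamics and Mobility" follows a common experiment pattern: add the population embeddings as extra regression features and compare R² against a baseline. The sketch below uses synthetic data and an off-the-shelf ridge regression purely to show the pattern; it is not the Oxford study's model or data.

```python
# Sketch of augmenting a forecasting model with population-dynamics embeddings
# and comparing R^2 with and without them. All data here is synthetic.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
n_regions, emb_dim = 500, 32

climate = rng.standard_normal((n_regions, 4))             # baseline covariates
pop_embeddings = rng.standard_normal((n_regions, emb_dim))
# Synthetic target that partly depends on the embeddings.
y = climate @ rng.standard_normal(4) + 0.8 * (pop_embeddings @ rng.standard_normal(emb_dim))

split = int(0.8 * n_regions)
for name, X in [("climate only", climate),
                ("climate + population embeddings", np.hstack([climate, pop_embeddings]))]:
    model = Ridge(alpha=1.0).fit(X[:split], y[:split])
    print(name, "R^2:", round(r2_score(y[split:], model.predict(X[split:])), 3))
```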

google

Sensible Agent: A framework for unobtrusive interaction with proactive AR agents

Sensible Agent is a research prototype designed to move AR agents beyond explicit voice commands toward proactive, context-aware assistance. By leveraging real-time multimodal sensing of a user's environment and physical state, the framework ensures digital help is delivered unobtrusively through the most appropriate interaction modalities. This approach reshapes human-computer interaction by anticipating user needs while minimizing cognitive and social disruption.

## Contextual Understanding via Multimodal Parsing

The framework begins by analyzing the user's immediate surroundings to establish a baseline for assistance.

* A vision-language model (VLM) processes egocentric camera feeds from the AR headset to identify high-level activities and locations.
* YAMNet, a pre-trained audio event classifier, monitors environmental noise levels to determine whether audio feedback is appropriate.
* The system synthesizes these inputs into a parsed context that accounts for situational impairments, such as when a user's hands are occupied.

## Reasoning with Proactive Query Generation

Once the context is established, the system determines the specific type of assistance required through a structured reasoning process.

* The framework uses chain-of-thought (CoT) reasoning to decompose complex problems into intermediate logical steps.
* Few-shot learning, guided by examples from data collection studies, helps the model decide between actions like providing translations or displaying a grocery list.
* The generator outputs a structured suggestion that includes the specific action, the query format (e.g., binary choice or icons), and the presentation modality (visual, audio, or both); a sketch of these two stages follows the post summary below.

## Dynamic Modality and Interaction Management

The final stage of the framework manages how the agent communicates with the user and how the user can respond without breaking their current flow.

* The prototype, built on Android XR and WebXR, uses a UI Manager to render visual panels or generate text-to-speech (TTS) prompts based on the agent's decision.
* An Input Modality Manager activates the most discreet response methods available, such as head gestures (nods), hand gestures (thumbs up), or gaze tracking.
* This adaptive selection ensures that if a user is in a noisy room or a social setting, the agent can switch from verbal interaction to subtle visual cues and gesture-based confirmations (see the modality-selection sketch below).

By prioritizing social awareness and context-sensitivity, Sensible Agent provides a blueprint for AR systems that feel like helpful companions rather than intrusive tools. Implementing such frameworks is essential for making proactive digital assistants practical and acceptable for long-term, everyday use in public and private spaces.
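The two upstream stages described above, multimodal context parsing and proactive query generation, can be sketched as a small pipeline. The sensing functions and the suggestion rules below are illustrative placeholders; the actual prototype classifies egocentric video with a VLM, audio with YAMNet, and generates suggestions with a CoT/few-shot prompted model.

```python
# Sketch: parse a multimodal context, then emit a structured suggestion with an
# action, a query format, and a presentation modality. Sensing and generation
# are placeholders standing in for the VLM, YAMNet, and the LLM generator.
from dataclasses import dataclass


@dataclass
class ParsedContext:
    activity: str        # e.g. "grocery shopping" (from the VLM)
    noise_db: float      # ambient noise estimate (from the audio classifier)
    hands_busy: bool     # situational impairment flag


@dataclass
class Suggestion:
    action: str          # what help to offer
    query_format: str    # "binary" | "icons" | "multiple_choice"
    modality: str        # "visual" | "audio" | "visual+audio"


def parse_context(frame, audio_chunk) -> ParsedContext:
    # Placeholder sensing: a real system classifies the camera frame with a VLM
    # and the audio chunk with an audio event classifier, then merges results.
    return ParsedContext(activity="grocery shopping", noise_db=72.0, hands_busy=True)


def generate_suggestion(ctx: ParsedContext) -> Suggestion:
    # Stand-in for the few-shot CoT generator: a real prompt would include
    # worked examples mapping contexts to (action, format, modality) triples.
    action = "show shopping list" if "shopping" in ctx.activity else "offer translation"
    modality = "visual" if ctx.noise_db > 65 else "visual+audio"   # loud -> skip TTS
    query_format = "binary" if ctx.hands_busy else "icons"
    return Suggestion(action, query_format, modality)


if __name__ == "__main__":
    ctx = parse_context(frame=None, audio_chunk=None)
    print(generate_suggestion(ctx))
```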
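The final stage, dynamic modality and interaction management, amounts to choosing an unobtrusive output channel and a feasible input method from the parsed context. The rules and thresholds in this sketch are assumptions for illustration, not the prototype's actual policy.

```python
# Sketch of modality selection: pick an output channel and an input method that
# respect noise levels, social setting, and situational impairments.
def select_output_modality(noise_db: float, in_social_setting: bool) -> str:
    if in_social_setting or noise_db > 65:
        return "visual_panel"      # avoid TTS when speech is disruptive or drowned out
    return "tts_prompt"


def select_input_modality(hands_busy: bool, in_social_setting: bool) -> str:
    if hands_busy and in_social_setting:
        return "gaze_dwell"        # most discreet: no hands, no voice, no large motion
    if hands_busy:
        return "head_nod"          # confirm or decline with a nod or shake
    return "thumbs_up_gesture"     # quick hand gesture when hands are free


if __name__ == "__main__":
    # Example: noisy grocery store, hands holding a basket, other shoppers nearby.
    print(select_output_modality(noise_db=72.0, in_social_setting=True),
          select_input_modality(hands_busy=True, in_social_setting=True))
```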