Google Research / gemini

28 posts

google

Gemini provides automated feedback for theoretical computer scientists at STOC 2026

Google Research launched an experimental program for the STOC 2026 conference using a specialized Gemini model to provide automated, rigorous feedback on theoretical computer science submissions. By identifying critical logical errors and proof gaps within a 24-hour window, the tool demonstrated that advanced AI can serve as a powerful pre-vetting collaborator for high-level mathematical research. The overwhelmingly positive reception from authors indicates that AI can effectively augment the human peer-review process by improving paper quality before formal submission.

## Advanced Reasoning via Inference Scaling

- The tool utilized an advanced version of Gemini 2.5 Deep Think specifically optimized for mathematical rigor.
- It employed inference scaling methods, allowing the model to explore and combine multiple possible solutions and reasoning traces simultaneously.
- This non-linear approach to problem-solving helps the model focus on the most salient technical issues while significantly reducing the likelihood of hallucinations.

## Structured Technical Feedback

- Feedback was delivered in a structured format that included a high-level summary of the paper's core contributions.
- The model provided a detailed analysis of potential mistakes, specifically targeting errors within lemmas, theorems, and logical proofs.
- Authors also received a categorized list of minor corrections, such as inconsistent variable naming and typographical errors.

## Identified Technical Issues and Impact

- The pilot saw high engagement, with over 80% of STOC 2026 submitters opting in for the AI-generated review.
- The tool successfully identified "critical bugs" and calculation errors that had previously evaded human authors for months.
- Survey results showed that 97% of participants found the feedback helpful, and 81% reported that the tool improved the overall clarity and readability of their work.

## Expert Verification and Hallucinations

- Because the users were domain experts, they were able to act as a filter, distinguishing between deep technical insights and occasional model hallucinations.
- While the model sometimes struggled to parse complex notation or interpret figures, authors valued the "neutral tone" and the speed of the 24-hour turnaround.
- The feedback was used as a starting point for human verification, allowing researchers to refine their arguments rather than blindly following the model's output.

## Future Outlook and Educational Potential

- Beyond professional research, 75% of surveyed authors see significant educational value in using the tool to train students in mathematical rigor.
- The experiment's success has led to 88% of participants expressing interest in having continuous access to such a tool throughout their entire research and drafting process.

The success of the STOC 2026 pilot suggests that researchers should consider integrating specialized LLMs early in the drafting phase to catch "embarrassing" or logic-breaking errors. While the human expert remains the final arbiter of truth, these tools provide a necessary layer of automated verification that can accelerate the pace of scientific discovery.

Generative UI: A rich, custom, visual interactive user experience for any prompt

Google Research has introduced a novel Generative UI framework that enables AI models to dynamically construct bespoke, interactive user experiences (including web pages, games, and functional tools) in response to any natural language prompt. This shift from static, predefined interfaces to AI-generated environments allows for highly customized digital spaces that adapt to a user's specific intent and context. Evaluated through human testing, these custom-generated interfaces are strongly preferred over traditional, text-heavy LLM outputs, signaling a fundamental evolution in human-computer interaction.

### Product Integration in Gemini and Google Search

The technology is currently being deployed as an experimental feature across Google’s main AI consumer platforms to enhance how users visualize and interact with data.

* **Dynamic View and Visual Layout:** These experiments in the Gemini app use agentic coding capabilities to design and code a complete interactive response for every prompt.
* **AI Mode in Google Search:** Available for Google AI Pro and Ultra subscribers, this feature uses Gemini 3’s multimodal understanding to build instant, bespoke interfaces for complex queries.
* **Contextual Customization:** The system differentiates between user needs, such as providing a simplified interface for a child learning about the microbiome versus a data-rich layout for an adult.
* **Task-Specific Tools:** Beyond text, the system generates functional applications like fashion advisors, event planners, and science simulations for topics like RNA transcription.

### Technical Architecture and Implementation

The Generative UI implementation relies on a multi-layered approach centered around the Gemini 3 Pro model to ensure the generated code is both functional and accurate.

* **Tool Access:** The model is connected to server-side tools, including image generation and real-time web search, to enrich the UI with external data.
* **System Instructions:** Detailed guidance provides the model with specific goals, formatting requirements, and technical specifications to avoid common coding errors.
* **Agentic Coding:** The model acts as both a designer and a developer, writing the necessary code to render the UI on the fly based on its interpretation of the user’s prompt.
* **Post-Processing:** Outputs undergo a series of automated checks to address common issues and refine the final visual experience before it reaches the browser.

### The Shift from Static to Generative Interfaces

This research represents a move away from the traditional software paradigm where users must navigate a fixed catalog of applications to find the tool they need.

* **Prompt-Driven UX:** Interfaces are generated from prompts as simple as a single word or as complex as multi-paragraph instructions.
* **Interactive Comprehension:** By building simulations on the fly, the system creates a dynamic environment optimized for in-depth learning and task completion.
* **Preference Benchmarking:** Research indicates that when generation speed is excluded as a factor, users significantly prefer these custom-built visual tools over standard, static AI responses.

To experience this new paradigm, users can select the "Thinking" option from the model menu in Google Search’s AI Mode or engage with the Dynamic View experiment in the Gemini app to generate tailored tools for specific learning or productivity tasks.

StreetReaderAI: Towards making street view accessible via context-aware multimodal AI

StreetReaderAI is a research prototype designed to make immersive street-level imagery accessible to the blind and low-vision community through multimodal AI. By integrating real-time scene analysis with context-aware geographic data, the system transforms visual mapping data into an interactive, audio-first experience. This framework allows users to virtually explore environments and plan routes with a level of detail and independence previously unavailable through traditional screen readers.

### Navigation and Spatial Awareness

The system offers an immersive, first-person exploration interface that mimics the mechanics of accessible gaming.

* Users navigate using keyboard shortcuts or voice commands, taking "virtual steps" forward or backward and panning their view in 360 degrees.
* Real-time audio feedback provides cardinal and intercardinal directions, such as "Now facing North," to maintain spatial orientation.
* Distance tracking informs the user how far they have traveled between panoramic images, while "teleport" features allow for quick jumps to specific addresses or landmarks.

### Context-Aware AI Describer

At the core of the tool is a subsystem backed by Gemini that synthesizes visual and geographic data to generate descriptions.

* The AI Describer combines the current field-of-view image with dynamic metadata about nearby roads, intersections, and points of interest.
* Two distinct modes cater to different user needs: a "Default" mode focusing on pedestrian safety and navigation, and a "Tour Guide" mode that provides historical and architectural details.
* The system utilizes Gemini to proactively predict and suggest follow-up questions relevant to the specific scene, such as details about crosswalks or building entrances.

### Interactive Dialogue and Session Memory

StreetReaderAI utilizes the Multimodal Live API to facilitate real-time, natural language conversations about the environment.

* The AI Chat agent maintains a large context window of 1,048,576 tokens (about one million), allowing it to retain a "memory" of up to 4,000 previous images and interactions.
* This memory allows users to ask retrospective spatial questions, such as "Where was that bus stop I just passed?", with the agent providing relative directions based on the user's current location.
* By tracking every pan and movement, the agent can provide specific details about the environment that were captured in previous steps of the virtual walk.

### User Evaluation and Practical Application

Testing with blind screen reader users confirmed the system's utility in practical, real-world scenarios.

* Participants successfully used the prototype to evaluate potential walking routes, identifying critical environmental features like the presence of benches or shelters at bus stops.
* The study highlighted the importance of multimodal inputs, combining image recognition with structured map data, to provide a more accurate and reliable description than image analysis alone could offer.

While StreetReaderAI remains a proof-of-concept, it demonstrates that the integration of multimodal LLMs and spatial data can bridge significant accessibility gaps in digital mapping. Future implementation of these technologies could transform how visually impaired individuals interact with the world, turning static street imagery into a functional tool for independent mobility and exploration.
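
The cardinal and intercardinal announcements described under Navigation and Spatial Awareness can be illustrated with a small sketch. The function names, the 45-degree sector rule, and the convention that 0 degrees means North are assumptions made for illustration; the summary does not describe how StreetReaderAI computes its announcements.

```python
# Hypothetical helpers: names and the sector rule are illustrative assumptions,
# not StreetReaderAI's actual implementation.
COMPASS_POINTS = [
    "North", "Northeast", "East", "Southeast",
    "South", "Southwest", "West", "Northwest",
]


def pan(heading_degrees: float, delta_degrees: float) -> float:
    """Rotate the view and keep the heading inside [0, 360)."""
    return (heading_degrees + delta_degrees) % 360


def facing_announcement(heading_degrees: float) -> str:
    """Map a heading (0 = North, increasing clockwise) to a spoken phrase."""
    # Each of the 8 compass points owns a 45-degree sector centred on it,
    # so shift by half a sector (22.5 degrees) before dividing.
    sector = int((heading_degrees % 360 + 22.5) // 45) % 8
    return f"Now facing {COMPASS_POINTS[sector]}"


if __name__ == "__main__":
    heading = 0.0
    for _ in range(4):
        print(facing_announcement(heading))
        heading = pan(heading, 45)
```

Centring each sector on its compass point means small pans around due North do not flip the announcement between two neighbouring labels.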

Google Earth AI: Unlocking geospatial insights with foundation models and cross-modal reasoning

Google Earth AI introduces a framework of geospatial foundation models and reasoning agents designed to solve complex, planetary-scale challenges through cross-modal reasoning. By integrating Gemini-powered orchestrators with specialized imagery, population, and environmental models, the system deconstructs multifaceted queries into actionable multi-step plans. This approach enables a holistic understanding of real-world events, such as disaster response and disease forecasting, by grounding AI insights in diverse geospatial data.

## Geospatial Reasoning Agents

* Utilizes Gemini models as intelligent orchestrators to manage complex queries that require data from multiple domains.
* The agent deconstructs a high-level question (such as predicting hurricane landfalls and community vulnerability) into a sequence of smaller, executable tasks.
* It executes these plans by autonomously calling specialized foundation models, querying vast datastores, and utilizing geospatial tools to fuse disparate data points into a single, cohesive answer.

## Remote Sensing and Imagery Foundations

* Employs vision-language models and open-vocabulary object detection trained on a large corpus of high-resolution overhead imagery paired with text descriptions.
* Enables "zero-shot" capabilities, allowing users to find specific objects like "flooded roads" or "building damage" using natural language without needing to retrain the model for specific classes.
* Technical evaluations show a 16% average improvement on text-based image search tasks and more than double the baseline accuracy for detecting novel objects in a zero-shot setting.

## Population Dynamics and Mobility

* Focuses on the interplay between people and places using globally-consistent embeddings across 17 countries.
* Includes monthly updated embeddings that capture shifting human activity patterns, which are essential for time-sensitive forecasting.
* Research conducted with the University of Oxford showed that incorporating these population embeddings into a Dengue fever forecasting model in Brazil improved the R² metric from 0.456 to 0.656 for long-range 12-month predictions.

## Environmental and Disaster Forecasting

* Integrates established Google research into weather nowcasting, flood forecasting, and wildfire boundary mapping.
* Provides the reasoning agent with the data necessary to evaluate environmental risks alongside population density and infrastructure imagery.
* Aims to provide Search and Maps users with real-time, accurate alerts regarding natural disasters grounded in planetary-scale environmental data.

Developers and enterprises looking to solve high-level geospatial problems can now express interest in accessing these capabilities through Google Earth and Google Cloud. By leveraging these foundation models, organizations can automate the analysis of satellite imagery and human mobility data to better prepare for environmental and social challenges.
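
To put the reported R² change in context, the sketch below implements the standard coefficient of determination and works through the arithmetic on the two published values. It assumes the study used the usual definition (one minus the ratio of residual to total sum of squares); the summary does not say so explicitly. Under that assumption, moving from 0.456 to 0.656 is an absolute gain of 0.200, which removes roughly 37% of the variance the baseline model left unexplained.

```python
def r_squared(observed, predicted):
    """Standard coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the observations are constant")
    return 1.0 - ss_res / ss_tot


# The two values reported in the summary above.
baseline_r2 = 0.456
improved_r2 = 0.656

absolute_gain = improved_r2 - baseline_r2
# Fraction of the baseline's unexplained variance that the new model explains.
share_of_unexplained_removed = absolute_gain / (1.0 - baseline_r2)

if __name__ == "__main__":
    print(f"absolute gain: {absolute_gain:.3f}")
    print(f"share of unexplained variance removed: {share_of_unexplained_removed:.1%}")
```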

Teaching Gemini to spot exploding stars with just a few examples

Researchers have demonstrated that Google’s Gemini model can classify cosmic events with 93% accuracy, rivaling specialized machine learning models while providing human-readable explanations. By utilizing few-shot learning with only 15 examples per survey, the model addresses the "black box" limitation of traditional convolutional neural networks used in astronomy. This approach enables scientists to efficiently process the millions of alerts generated by modern telescopes while maintaining a transparent and interactive reasoning process.

## Bottlenecks in Modern Transient Astronomy

* Telescopes like the Vera C. Rubin Observatory are expected to generate up to 10 million alerts per night, making manual verification impossible.
* The vast majority of these alerts are "bogus" signals caused by satellite trails, cosmic rays, or instrumental artifacts rather than real supernovae.
* Existing specialized models often provide binary "real" or "bogus" labels without context, forcing astronomers to either blindly trust the output or spend hours on manual verification.

## Multimodal Few-Shot Learning for Classification

* The research utilized few-shot learning, providing Gemini with only 15 annotated examples for three major surveys: Pan-STARRS, MeerLICHT, and ATLAS.
* Input data consisted of image triplets (a "new" alert image, a "reference" image of the same sky patch, and a "difference" image), each 100x100 pixels in size.
* The model successfully generalized across different telescopes with varying pixel scales, ranging from 0.25" per pixel for Pan-STARRS to 1.8" per pixel for ATLAS.
* Beyond simple labels, Gemini generates a textual description of observed features and an interest score to help astronomers prioritize follow-up observations.

## Expert Validation and Self-Assessment

* A panel of 12 professional astronomers evaluated the model using a 0–5 coherence rubric, confirming that Gemini’s logic aligned with expert reasoning.
* The study found that Gemini can effectively assess its own uncertainty; low self-assigned "coherence scores" were strong indicators of likely classification errors.
* This ability to flag its own potential mistakes allows the model to act as a reliable partner, alerting scientists when a specific case requires human intervention.

The transition from "black box" classifiers to interpretable AI assistants allows the astronomical community to scale with the data flood of next-generation telescopes. By combining high-accuracy classification with transparent reasoning, researchers can maintain scientific rigor while processing millions of cosmic events in real time.
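
The input format described above can be sketched as follows. The class names, the instruction text, and the prompt layout are hypothetical, chosen only to show how 15 annotated triplets might precede one unlabeled query; the summary does not give the researchers' actual prompt. The arithmetic on stamp size uses only the figures quoted above: a 100-pixel stamp spans 25 arcseconds at 0.25" per pixel and 180 arcseconds at 1.8" per pixel, so the same pixel grid covers very different patches of sky.

```python
from dataclasses import dataclass

STAMP_SIZE_PX = 100  # each image in a triplet is 100x100 pixels (from the summary)

# Pixel scales quoted in the summary; MeerLICHT's value is not stated there.
PIXEL_SCALE_ARCSEC = {"Pan-STARRS": 0.25, "ATLAS": 1.8}


def stamp_width_arcsec(survey: str) -> float:
    """Angular width of one stamp on the sky, in arcseconds."""
    return STAMP_SIZE_PX * PIXEL_SCALE_ARCSEC[survey]


@dataclass
class Triplet:
    """Hypothetical container for one alert: three aligned image stamps."""
    new: object         # the alert image
    reference: object   # earlier image of the same sky patch
    difference: object  # difference image derived from the other two


@dataclass
class AnnotatedExample:
    triplet: Triplet
    label: str        # "real" or "bogus"
    explanation: str  # human-written reasoning for the label


def build_few_shot_prompt(examples, query):
    """Interleave annotated examples, then append the unlabeled query triplet."""
    # The instruction wording is an assumption for illustration only.
    parts = ["Classify each image triplet as real or bogus and explain why."]
    for example in examples:
        parts.append(example.triplet)
        parts.append(f"Label: {example.label}. {example.explanation}")
    parts.append(query)
    return parts
```

With 15 examples the prompt holds one instruction, 30 example parts, and one query, which keeps the per-survey setup small enough to swap out when a new telescope is added.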

XR Blocks: Accelerating AI + XR innovation

XR Blocks is an open-source, cross-platform framework designed to bridge the technical gap between mature AI development ecosystems and high-friction extended reality (XR) prototyping. By providing a modular architecture and high-level abstractions, the toolkit enables creators to rapidly build and deploy intelligent, immersive web applications without managing low-level system integration. Ultimately, the framework empowers developers to move from concept to interactive prototype across both desktop simulators and mobile XR devices using a unified codebase.

### Core Design Principles

* **Simplicity and Readability:** Drawing inspiration from the "Zen of Python," the framework prioritizes human-readable abstractions where a developer’s script reflects a high-level description of the experience rather than complex boilerplate code.
* **Creator-Centric Workflow:** The architecture is designed to handle the "plumbing" of XR (such as sensor fusion, AI model integration, and cross-platform logic), allowing creators to focus entirely on user interaction and experience.
* **Pragmatic Modularity:** Rather than attempting to be a perfect, all-encompassing system, XR Blocks favors an adaptable and simple architecture that can evolve alongside the rapidly changing fields of AI and spatial computing.

### The Reality Model Abstractions

* **The Script Primitive:** Acts as the logical center of an application, separating the "what" of an interaction from the "how" of its underlying technical implementation.
* **User and World:** Provides built-in support for tracking hands, gaze, and avatars while allowing the system to query the physical environment for depth, estimated lighting conditions, and object recognition.
* **AI and Agents:** Facilitates the integration of intelligent assistants, such as the "Sensible Agent," which can provide proactive, context-aware suggestions within the XR environment.
* **Virtual Interfaces:** Offers tools to augment blended reality with virtual UI elements that respond to the user's physical context.

### Technical Implementation and Integration

* **Web-Based Foundation:** The framework is built upon accessible, standard technologies including WebXR, three.js, and LiteRT (formerly TFLite) to ensure a low barrier to entry for web developers.
* **Advanced AI Support:** It features native integration with Gemini for high-level reasoning and context-aware applications.
* **Cross-Platform Deployment:** Developers can prototype depth-aware, physics-based interactions in a desktop simulator and deploy the exact same code to Android XR devices.
* **Open-Source Resources:** The project includes a comprehensive suite of templates and live demos covering specific use cases like depth mapping, gesture modeling, and lighting estimation.

By lowering the barrier to entry for intelligent XR development, XR Blocks serves as a practical starting point for researchers and developers aiming to explore the next generation of human-centered computing. Interested creators can access the source code on GitHub to begin building immersive, AI-driven applications that function seamlessly across the web and specialized XR hardware.

The anatomy of a personal health agent

Google researchers have developed the Personal Health Agent (PHA), an LLM-powered prototype designed to provide evidence-based, personalized health insights by analyzing multimodal data from wearables and blood biomarkers. By utilizing a specialized multi-agent architecture, the system deconstructs complex health queries into specific tasks to ensure statistical accuracy and clinical grounding. The study demonstrates that this modular approach significantly outperforms standard large language models in providing reliable, data-driven wellness support.

## Multi-Agent System Architecture

* The PHA framework adopts a "team-based" approach, utilizing three specialist sub-agents: a Data Science agent, a Domain Expert agent, and a Health Coach.
* The system was validated using a real-world dataset from 1,200 participants, featuring longitudinal Fitbit data, health questionnaires, and clinical blood test results.
* This architecture was designed after a user-centered study of 1,300 health queries, identifying four key needs: general knowledge, data interpretation, wellness advice, and symptom assessment.
* Evaluation involved over 1,100 hours of human expert effort across 10 benchmark tasks to ensure the system outperformed base models like Gemini.

## The Data Science Agent

* This agent specializes in "contextualized numerical insights," transforming ambiguous queries (e.g., "How is my fitness trending?") into formal statistical analysis plans.
* It operates through a two-stage process: first interpreting the user's intent and data sufficiency, then generating executable code to analyze time-series data.
* In benchmark testing, the agent achieved a 75.6% score in analysis planning, significantly higher than the 53.7% score achieved by the base model.
* The agent's code generation was validated against 173 rigorous unit tests written by human data scientists to ensure accuracy in handling wearable sensor data.

## The Domain Expert Agent

* Designed for high-stakes medical accuracy, this agent functions as a grounded source of health knowledge using a multi-step reasoning framework.
* It utilizes a "toolbox" approach, granting the LLM access to authoritative external databases such as the National Center for Biotechnology Information (NCBI) to provide verifiable facts.
* The agent is specifically tuned to tailor information to the user’s unique profile, including specific biomarkers and pre-existing medical conditions.
* Performance was measured through board certification and coaching exam questions, as well as its ability to provide accurate differential diagnoses compared to human clinicians.

While currently a research framework rather than a public product, the PHA demonstrates that a modular, specialist-driven AI architecture is essential for safe and effective personal health management. Developers of future health-tech tools should prioritize grounding LLMs in external clinical databases and implementing rigorous statistical validation stages to move beyond the limitations of general-purpose chatbots.
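
The two-stage process described for the Data Science agent (check intent and data sufficiency, then run an analysis over time-series data) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names, the 14-day minimum, and the choice of an ordinary least-squares slope over daily step counts as the answer to "How is my fitness trending?". The PHA generates its analysis code with an LLM; this hand-written version only shows the kind of plan such code might carry out.

```python
def linear_trend(values):
    """Ordinary least-squares slope of values against their index (per step)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


def analyze_fitness_trend(daily_steps, min_days=14):
    """Stage 1: check data sufficiency. Stage 2: compute the trend."""
    # min_days is a hypothetical threshold, not a value from the study.
    if len(daily_steps) < min_days:
        return {"status": "insufficient_data", "days_available": len(daily_steps)}
    slope = linear_trend(daily_steps)
    if slope > 0:
        direction = "increasing"
    elif slope < 0:
        direction = "decreasing"
    else:
        direction = "flat"
    return {"status": "ok", "slope_per_day": slope, "direction": direction}


if __name__ == "__main__":
    # Made-up step counts, for demonstration only.
    steps = [5000 + 10 * day for day in range(14)]
    print(analyze_fitness_trend(steps))
```

Returning an explicit "insufficient_data" status instead of a number mirrors the sufficiency check in stage one: a trend fitted to three days of data would look precise while meaning very little.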

AI as a research partner: Advancing theoretical computer science with AlphaEvolve

AlphaEvolve, an LLM-powered coding agent developed by Google DeepMind, facilitates mathematical discovery by evolving code to find complex combinatorial structures that are difficult to design manually. By utilizing a "lifting" technique, the system discovers finite structures that can be plugged into existing proof frameworks to establish new universal theorems in complexity theory. This methodology has successfully produced state-of-the-art results for the MAX-4-CUT problem and tightened bounds on the hardness of certifying properties in random graphs.

## The Role of AlphaEvolve in Mathematical Research

* The system uses an iterative feedback loop to morph code snippets, evaluating the resulting mathematical structures and refining the code toward better solutions.
* AlphaEvolve operates as a tool-based assistant that generates specific proof elements, which can then be automatically verified by computer programs to ensure mathematical correctness.
* By focusing on verifiable finite structures, the agent overcomes the common "hallucination" issues of LLMs, as the final output is a computationally certified object rather than a speculative text-based proof.

## Bridging Finite Discovery and Universal Statements through Lifting

* Theoretical computer science often requires proofs that hold true for all problem sizes ($\forall n$), a scale that AI systems typically struggle to address directly.
* The "lifting" technique treats a proof as a modular structure where a specific finite component, such as a combinatorial gadget, can be replaced with a more efficient version while keeping the rest of the proof intact.
* When AlphaEvolve finds a superior finite structure, the improvement is "lifted" through the existing mathematical framework to yield a stronger universal theorem without requiring a human to redesign the entire logical architecture.

## Optimizing Gadget Reductions and MAX-k-CUT

* Researchers applied the agent to "gadget reductions," which are recipes used to map known intractable problems to new ones to prove computational hardness (NP-hardness).
* AlphaEvolve discovered complex gadgets that were previously unknown because they were too intricate for researchers to construct by hand.
* These discoveries led to a new state-of-the-art inapproximability result for the MAX-4-CUT problem, defining more precise limits on how accurately the problem can be solved by any efficient algorithm.

## Advancing Average-Case Hardness in Random Graphs

* The agent was tasked with uncovering structures related to the average-case hardness of certifying properties within random graphs.
* By evolving better combinatorial structures for these specific instances, the team was able to tighten existing mathematical bounds, providing a clearer picture of when certain graph properties become computationally intractable to verify.

This research demonstrates that LLM-based agents can serve as genuine research partners by focusing on the discovery of verifiable, finite components within broader theoretical frameworks. For researchers in mathematics and computer science, this "lifting" approach provides a practical roadmap for using AI to solve bottleneck problems that were previously restricted by the limits of manual construction.
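
The "propose, verify exactly, keep only improvements" loop that underlies this work can be shown on a toy problem. The sketch below is not AlphaEvolve: it swaps in a plain greedy local search over vertex colorings where AlphaEvolve uses an LLM to evolve code, and it optimizes a small MAX-k-CUT instance rather than searching for a hardness gadget. What it shares with the description above is the role of the verifier: `cut_value` is an exact, mechanical check, so any structure the search returns is certified by computation and cannot be a hallucination. All function names are hypothetical.

```python
from itertools import combinations


def cut_value(edges, assignment):
    """Exact verifier: count edges whose endpoints land in different parts."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


def greedy_max_k_cut(num_vertices, edges, k):
    """Toy local search: recolor one vertex at a time, keep strict improvements."""
    assignment = [0] * num_vertices  # start with every vertex in part 0
    best = cut_value(edges, assignment)
    improved = True
    while improved:
        improved = False
        for vertex in range(num_vertices):
            for part in range(k):
                previous = assignment[vertex]
                if part == previous:
                    continue
                assignment[vertex] = part  # propose a change
                score = cut_value(edges, assignment)  # verify it exactly
                if score > best:
                    best = score  # keep the improvement
                    improved = True
                else:
                    assignment[vertex] = previous  # otherwise roll back
    return assignment, best


if __name__ == "__main__":
    # Complete graph on 4 vertices: with k = 4 every edge can be cut.
    k4_edges = list(combinations(range(4), 2))
    parts, value = greedy_max_k_cut(4, k4_edges, k=4)
    print(parts, value)
```

Because each accepted move strictly increases an integer score bounded by the number of edges, the loop always terminates; it may stop at a local optimum on harder instances, which is exactly where a stronger proposal mechanism earns its keep.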

Towards better health conversations: Research insights on a “wayfinding” AI agent based on Gemini

Google Research has developed "Wayfinding AI," a research prototype based on Gemini designed to transform health information seeking from a passive query-response model into a proactive, context-seeking dialogue. By prioritizing clarifying questions and iterative guidance, the agent addresses the common struggle users face when attempting to articulate complex or ambiguous medical concerns. User studies indicate that this proactive approach results in health information that participants find significantly more helpful, relevant, and tailored to their specific needs than traditional AI responses.

### Challenges in Digital Health Navigation

* Formative research involving 33 participants highlighted that users often struggle to articulate health concerns because they lack the clinical background to know which details are medically relevant.
* The study found that users typically "throw words" at a search engine and sift through generic, impersonal results that do not account for their unique context.
* Initial UX testing revealed a strong user preference for a "deferred-answer" approach, where the AI mimics a medical professional by asking clarifying questions before jumping to a conclusion.

### Core Design Principles of Wayfinding AI

* **Proactive Conversational Guidance:** At every turn, the agent asks up to three targeted questions to reduce ambiguity and help users systematically share their "health story."
* **Best-Effort Answers:** To ensure immediate utility, the AI provides the best possible information based on the data available at that moment, while noting that the answer will improve as the user provides more context.
* **Transparent Reasoning:** The system explicitly explains how the user’s most recent answers have helped refine the previous response, making the AI’s internal logic understandable.

### Split-Stream User Interface

* To prevent clarifying questions from being buried in long paragraphs, the prototype uses a two-column layout.
* The left column is dedicated to the interactive chat and specific follow-up questions to keep the user focused on the dialogue.
* The right column displays the "best information so far" and detailed explanations, allowing users to dive into the technical content only when they feel enough context has been established.

### Comparative Evaluation and Performance

* A randomized study with 130 participants compared the Wayfinding AI against a baseline Gemini 2.5 Flash model.
* Participants interacted with both models for at least three minutes regarding a personal health question and rated them across six dimensions: helpfulness, question relevance, tailoring, goal understanding, ease of use, and efficiency.
* The proactive agent outperformed the baseline significantly, with participants reporting that the context-seeking behavior felt more professional and increased their confidence in the AI's suggestions.

The research suggests that for sensitive and complex topics like health, AI should move beyond being a passive knowledge base. By adopting a "wayfinding" strategy that guides users through their own information needs, AI agents can provide more personalized and empowering experiences that better mirror expert human consultation.
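
One way to picture how the three design principles and the two-column layout fit together is as a single per-turn data structure. The sketch below is purely illustrative: the class, its field names, and the rendering method are assumptions, and only the cap of three questions per turn and the left/right split come from the summary above.

```python
from dataclasses import dataclass, field

MAX_QUESTIONS_PER_TURN = 3  # "up to three targeted questions" per turn


@dataclass
class WayfindingTurn:
    """Hypothetical shape of one agent turn in a split-stream interface."""
    best_effort_answer: str  # best information available so far
    reasoning_note: str      # how the user's latest answers refined the response
    clarifying_questions: list[str] = field(default_factory=list)

    def __post_init__(self):
        if len(self.clarifying_questions) > MAX_QUESTIONS_PER_TURN:
            raise ValueError(
                f"at most {MAX_QUESTIONS_PER_TURN} clarifying questions per turn"
            )

    def render_columns(self):
        """Return (left, right): questions in the chat, details alongside."""
        left = "\n".join(f"- {question}" for question in self.clarifying_questions)
        right = f"{self.best_effort_answer}\n\nWhy this changed: {self.reasoning_note}"
        return left, right


if __name__ == "__main__":
    # Placeholder strings, not output from the prototype.
    turn = WayfindingTurn(
        best_effort_answer="General information based on what has been shared so far.",
        reasoning_note="The duration you mentioned narrowed the likely topics.",
        clarifying_questions=["When did it start?", "Is it constant or intermittent?"],
    )
    left_column, right_column = turn.render_columns()
    print(left_column)
    print(right_column)
```

Keeping the questions and the answer in separate fields is what makes the split-stream layout possible: the interface can route each to its own column instead of parsing them out of one block of text.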