computer-vision

9 posts

toss

Toss's AI Technology Recognized

Toss ML Engineer Jin-woo Lee presents FedLPA, a novel Federated Learning algorithm accepted at NeurIPS 2025 that addresses the critical challenges of data sovereignty and non-uniform data distributions. By allowing AI models to learn from localized data without transferring sensitive information across borders, this research provides a technical foundation for expanding services like Toss Face Pay into international markets with strict privacy regulations.

### The Challenge of Data Sovereignty in Global AI

* Traditional AI development requires centralizing data on a single server, which is often impossible due to international privacy laws and data sovereignty regulations.
* Federated Learning offers a solution by sending the model to the user’s device (client) rather than moving the data, ensuring raw biometric information never leaves the local environment.
* Standard Federated Learning fails in real-world scenarios where data is non-IID (not independent and identically distributed), meaning user patterns in different countries or regions vary significantly.

### Overcoming Limitations in Category Discovery

* Existing models assume all users share similar data distributions and that all data classes are known beforehand, which leads to performance degradation when encountering new demographics.
* FedLPA incorporates Generalized Category Discovery (GCD) to identify both known classes and entirely "novel classes" (e.g., new fraud patterns or ethnic features) that were not present in the initial training set.
* This approach prevents the model from becoming obsolete as it encounters new environments, allowing it to adapt to local characteristics autonomously.

### The FedLPA Three-Step Learning Pipeline

* **Confidence-guided Local Structure Discovery (CLSD):** The system builds a similarity graph by comparing feature vectors of local data. It refines these connections using "high-confidence" samples—data points the model is certain about—to strengthen the quality of the relational map.
* **InfoMap Clustering:** Instead of requiring a human to pre-define the number of categories, the algorithm uses the InfoMap community detection method. This allows the client to automatically estimate the number of unique categories within its own local data through random walks on the similarity graph.
* **Local Prior Alignment (LPA):** The model uses self-distillation to ensure consistent predictions across different views of the same data. Most importantly, an LPA regularizer forces the model’s prediction distribution to align with the "Empirical Prior" discovered in the clustering phase, preventing the model from becoming biased toward over-represented classes (an illustrative sketch of such a regularizer appears at the end of this entry).

### Business Implications and Strategic Value

* **Regulatory Compliance:** FedLPA removes technical barriers to entry for markets like the EU or Southeast Asia by maintaining high model performance while strictly adhering to local data residency requirements.
* **Hyper-personalization:** Financial services such as Fraud Detection Systems (FDS) and Credit Scoring Systems (CSS) can be trained on local patterns, allowing for more accurate detection of region-specific scams or credit behaviors.
* **Operational Efficiency:** By enabling models to self-detect and learn from new patterns without manual labeling or central intervention, the system significantly reduces the cost and time required for global maintenance.
Implementing localized Federated Learning architectures like FedLPA is a recommended strategy for tech organizations seeking to scale AI services internationally while navigating the complex landscape of global privacy regulations and diverse data distributions.
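To make the Local Prior Alignment idea concrete, here is a minimal sketch of such a regularizer in PyTorch: it penalizes divergence between the batch-averaged prediction distribution and a locally estimated class prior (for example, derived from InfoMap cluster sizes). The function name, loss form, and weighting are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def lpa_regularizer(logits: torch.Tensor, empirical_prior: torch.Tensor) -> torch.Tensor:
    """Hypothetical Local Prior Alignment term: align the batch-averaged
    prediction distribution with a locally estimated class prior."""
    probs = F.softmax(logits, dim=1)        # per-sample class probabilities, shape (batch, classes)
    avg_pred = probs.mean(dim=0)            # batch-level prediction distribution, shape (classes,)
    # KL(prior || avg_pred): penalizes drifting away from the local empirical prior
    return torch.sum(empirical_prior * (torch.log(empirical_prior + 1e-8) - torch.log(avg_pred + 1e-8)))

# Hypothetical use inside a client's local update:
# prior = cluster_sizes / cluster_sizes.sum()          # e.g., derived from InfoMap community sizes
# loss = supervised_loss + self_distillation_loss + lam * lpa_regularizer(logits, prior)
```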

google

Separating natural forests from other tree cover with AI for deforestation-free supply chains

Researchers from Google DeepMind and Google Research have developed "Natural Forests of the World 2020," an AI-powered global map that distinguishes natural ecosystems from commercial tree plantations. By utilizing high-resolution satellite data and machine learning, the project provides a critical 10-meter resolution baseline to support deforestation-free supply chain regulations like the EUDR. This tool enables governments and companies to monitor biodiversity-rich areas with unprecedented accuracy, ensuring that natural forests are protected from industrial degradation.

**The Limitation of Traditional Tree Cover Maps**

* Existing maps frequently conflate all woody vegetation into a generic "tree cover" category, leading to "apples-to-oranges" comparisons between different land types.
* This lack of distinction makes it difficult to differentiate between the harvesting of short-term plantations and the permanent loss of ancient, biodiversity-rich natural forests.
* Precise mapping is now a legal necessity due to regulations like the European Union Regulation on Deforestation-free Products (EUDR), which bans products from land deforested or degraded after December 31, 2020.

**The MTSViT Modeling Approach**

* To accurately identify forest types, researchers developed the Multi-modal Temporal-Spatial Vision Transformer (MTSViT).
* Rather than relying on a single snapshot, the AI "observes" 1280 x 1280 meter patches over the course of a year to identify seasonal, spectral, and textural signatures.
* The model integrates multi-modal data, including Sentinel-2 satellite imagery, topographical information (such as elevation and slope), and specific geographical coordinates (see the input-assembly sketch after this entry).
* This temporal-spatial analysis allows the AI to recognize the complex patterns of natural forests that distinguish them from the uniform, fast-growing structures of commercial plantations.

**Dataset Scale and Global Validation**

* The model was trained on a massive dataset comprising over 1.2 million global patches at 10-meter resolution.
* The final map provides seamless global coverage, achieving a best-in-class validation accuracy of 92.2% against an independent global dataset.
* The research was a collaborative effort involving the World Resources Institute and the International Institute for Applied Systems Analysis to ensure scientific rigor and practical utility.

The "Natural Forests of the World 2020" dataset is publicly available via Google Earth Engine and other open repositories. Organizations should leverage this high-resolution baseline to conduct environmental due diligence, support government monitoring, and target conservation efforts in preparation for global climate milestones like COP30.
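The post does not publish the exact preprocessing, but the sketch below shows, under assumed shapes and normalization constants, how a yearly Sentinel-2 time series, static topography, and patch coordinates could be packed into one tensor for a temporal-spatial transformer. `build_mtsvit_input` and all dimensions are hypothetical.

```python
import numpy as np

def build_mtsvit_input(s2_timeseries, elevation, slope, lat, lon):
    """Hypothetical packing of multi-modal inputs for a temporal-spatial ViT.

    s2_timeseries: (T, H, W, B) yearly stack of Sentinel-2 composites
    elevation, slope: (H, W) static topography layers
    lat, lon: patch-center coordinates in degrees
    Shapes and normalization constants are assumptions, not the published pipeline.
    """
    T, H, W, _ = s2_timeseries.shape
    topo = np.stack([elevation / 9000.0, slope / 90.0], axis=-1)      # crude normalization to ~[0, 1]
    topo = np.broadcast_to(topo, (T, H, W, 2))                        # repeat static layers per timestep
    coords = np.zeros((T, H, W, 2), dtype=np.float32)
    coords[..., 0] = lat / 90.0                                       # normalized latitude channel
    coords[..., 1] = lon / 180.0                                      # normalized longitude channel
    return np.concatenate([s2_timeseries, topo, coords], axis=-1)     # (T, H, W, B + 4)
```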

google

StreetReaderAI: Towards making street view accessible via context-aware multimodal AI

StreetReaderAI is a research prototype designed to make immersive street-level imagery accessible to the blind and low-vision community through multimodal AI. By integrating real-time scene analysis with context-aware geographic data, the system transforms visual mapping data into an interactive, audio-first experience. This framework allows users to virtually explore environments and plan routes with a level of detail and independence previously unavailable through traditional screen readers.

### Navigation and Spatial Awareness

The system offers an immersive, first-person exploration interface that mimics the mechanics of accessible gaming.

* Users navigate using keyboard shortcuts or voice commands, taking "virtual steps" forward or backward and panning their view in 360 degrees.
* Real-time audio feedback provides cardinal and intercardinal directions, such as "Now facing North," to maintain spatial orientation.
* Distance tracking informs the user how far they have traveled between panoramic images, while "teleport" features allow for quick jumps to specific addresses or landmarks.

### Context-Aware AI Describer

At the core of the tool is a subsystem backed by Gemini that synthesizes visual and geographic data to generate descriptions.

* The AI Describer combines the current field-of-view image with dynamic metadata about nearby roads, intersections, and points of interest (a prompt-assembly sketch follows this entry).
* Two distinct modes cater to different user needs: a "Default" mode focusing on pedestrian safety and navigation, and a "Tour Guide" mode that provides historical and architectural details.
* The system utilizes Gemini to proactively predict and suggest follow-up questions relevant to the specific scene, such as details about crosswalks or building entrances.

### Interactive Dialogue and Session Memory

StreetReaderAI utilizes the Multimodal Live API to facilitate real-time, natural language conversations about the environment.

* The AI Chat agent maintains a large context window of approximately 1,048,576 tokens, allowing it to retain a "memory" of up to 4,000 previous images and interactions.
* This memory allows users to ask retrospective spatial questions, such as "Where was that bus stop I just passed?", with the agent providing relative directions based on the user's current location.
* By tracking every pan and movement, the agent can provide specific details about the environment that were captured in previous steps of the virtual walk.

### User Evaluation and Practical Application

Testing with blind screen reader users confirmed the system's utility in practical, real-world scenarios.

* Participants successfully used the prototype to evaluate potential walking routes, identifying critical environmental features like the presence of benches or shelters at bus stops.
* The study highlighted the importance of multimodal inputs—combining image recognition with structured map data—to provide a more accurate and reliable description than image analysis alone could offer.

While StreetReaderAI remains a proof-of-concept, it demonstrates that the integration of multimodal LLMs and spatial data can bridge significant accessibility gaps in digital mapping. Future implementation of these technologies could transform how visually impaired individuals interact with the world, turning static street imagery into a functional tool for independent mobility and exploration.
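As a rough illustration of the context-aware describer, the sketch below assembles a text prompt from the user's heading, nearby points of interest, and the current image reference, and keeps a simple session memory of past steps. The class and function names (`PanoStep`, `build_describer_prompt`) are hypothetical and do not reflect the actual Multimodal Live API integration.

```python
from dataclasses import dataclass, field

@dataclass
class PanoStep:
    heading_deg: float
    address: str
    nearby_places: list
    image_ref: str                      # reference to the current field-of-view frame

@dataclass
class SessionMemory:
    steps: list = field(default_factory=list)

    def remember(self, step: PanoStep) -> None:
        self.steps.append(step)         # retained so later questions can refer back to earlier views

def build_describer_prompt(step: PanoStep, mode: str = "default") -> str:
    """Hypothetical prompt assembly combining the image with geographic context."""
    persona = ("Focus on pedestrian safety: crosswalks, curb cuts, and obstacles."
               if mode == "default"
               else "Act as a tour guide: describe history and architecture.")
    places = "; ".join(step.nearby_places)
    return (f"{persona}\n"
            f"The user is facing {step.heading_deg:.0f} degrees near {step.address}.\n"
            f"Nearby points of interest: {places}.\n"
            f"Describe the attached street-level image ({step.image_ref}) for a blind user.")
```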

google

Introducing interactive on-device segmentation in Snapseed

Google has introduced a new "Object Brush" feature in Snapseed that enables intuitive, real-time selective photo editing through a novel on-device segmentation technology. By leveraging a high-performance interactive AI model, users can isolate complex subjects with simple touch gestures in under 20 milliseconds, bridging the gap between professional-grade editing and mobile convenience. This breakthrough is achieved through a sophisticated teacher-student training architecture that prioritizes both pixel-perfect accuracy and low-latency performance on consumer hardware.

### High-Performance On-Device Inference

* The system is powered by the Interactive Segmenter model, which is integrated directly into the Snapseed "Adjust" tool to facilitate immediate object-based modifications.
* To ensure a fluid user experience, the model utilizes the MediaPipe framework and LiteRT’s GPU acceleration to process selections in less than 20ms.
* The interface supports dynamic refinement, allowing users to provide real-time feedback by tracing lines or tapping to add or subtract specific areas of an image.

### Teacher-Student Model Distillation

* The development team first created "Interactive Segmenter: Teacher," a large-scale model fine-tuned on 30,000 high-quality, pixel-perfect manual annotations across more than 350 object categories.
* Because the Teacher model’s size and computational requirements are prohibitive for mobile use, researchers developed "Interactive Segmenter: Edge" through knowledge distillation.
* This distillation process utilized a dataset of over 2 million weakly annotated images, allowing the smaller Edge model to inherit the generalization capabilities of the Teacher model while maintaining a footprint suitable for mobile devices.

### Training via Synthetic User Prompts

* To make the model universally capable across all object types, the training process uses a class-agnostic approach based on the Big Transfer (BiT) strategy.
* The model learns to interpret user intent through "prompt generation," which simulates real-world interactions such as random scribbles, taps, and lasso (box) selections.
* During training, both the Teacher and Edge models receive identical prompts—such as red foreground scribbles and blue background scribbles—to ensure the student model learns to produce high-quality masks even from imprecise user input (a minimal distillation sketch follows this entry).

This advancement significantly lowers the barrier to entry for complex photo manipulation by moving heavy-duty AI processing directly onto the mobile device. Users can expect a more responsive and precise editing experience that handles everything from fine-tuning a subject's lighting to isolating specific environmental elements like clouds or clothing.
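Below is a minimal sketch of the distillation idea, assuming both models take the image concatenated with foreground/background prompt maps and the student is trained to reproduce the Teacher's soft mask. The channel layout, loss choice, and training loop are assumptions for illustration, not Google's actual pipeline.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, image, fg_scribble, bg_scribble, optimizer):
    """Hypothetical distillation step: both models receive the image plus identical
    foreground/background prompt maps; the Edge (student) model learns to reproduce
    the Teacher's soft mask."""
    prompted = torch.cat([image, fg_scribble, bg_scribble], dim=1)   # stack prompt channels onto the image
    with torch.no_grad():
        teacher_mask = torch.sigmoid(teacher(prompted))              # soft target from the large Teacher model
    student_mask = torch.sigmoid(student(prompted))
    loss = F.binary_cross_entropy(student_mask, teacher_mask)        # match the Teacher's mask pixel-wise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```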

google

From massive models to mobile magic: The tech behind YouTube real-time generative AI effects

YouTube has successfully deployed over 20 real-time generative AI effects by distilling the capabilities of massive cloud-based models into compact, mobile-ready architectures. By utilizing a "teacher-student" training paradigm, the system overcomes the computational bottlenecks of high-fidelity generative AI while ensuring the output remains responsive on mobile hardware. This approach allows for complex transformations, such as cartoon style transfer and makeup application, to run frame-by-frame on-device without sacrificing the user’s identity.

### Data Curation and Diversity

* The foundation of the effects pipeline relies on high-quality, properly licensed face datasets.
* Datasets are meticulously filtered to ensure a uniform distribution across different ages, genders, and skin tones.
* The Monk Skin Tone Scale is used as a benchmark to ensure the effects work equitably for all users.

### The Teacher-Student Framework

* **The Teacher:** A large, powerful pre-trained model (initially StyleGAN2 with StyleCLIP, later transitioning to Google DeepMind’s Imagen) acts as the "expert" that generates high-fidelity visual effects.
* **The Student:** A lightweight UNet-based architecture designed for mobile efficiency. It utilizes a MobileNet backbone for both the encoder and decoder to ensure fast frame-by-frame processing.
* The distillation process narrows the scope of the massive teacher model into a student model focused on a single, specific task.

### Iterative Distillation and Training

* **Data Generation:** The teacher model processes thousands of images to create "before and after" pairs. These are augmented with synthetic elements like AR glasses, sunglasses, and hand occlusions to improve real-world robustness.
* **Optimization:** The student model is trained using a sophisticated combination of loss functions, including L1, LPIPS, Adaptive, and Adversarial loss, to balance numerical accuracy with aesthetic quality (a loss-combination sketch follows this entry).
* **Architecture Search:** Neural architecture search is employed to tune "depth" and "width" multipliers, identifying the most efficient model structure for different mobile hardware constraints.

### Addressing the Inversion Problem

* A major challenge in real-time effects is the "inversion problem," where the model struggles to represent a real face in latent space, leading to a loss of the user's identity (e.g., changes in skin tone or clothing).
* YouTube uses Pivotal Tuning Inversion (PTI) to ensure that the user's specific features are preserved during the generative process.
* By editing images in the latent space—a compressed numerical representation—the system can apply stylistic changes while maintaining the core characteristics of the original video stream.

By combining advanced model distillation with on-device optimization via MediaPipe, YouTube demonstrates a practical path for bringing heavy generative AI research into consumer-facing mobile applications.
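Since the post names the loss functions explicitly, here is a hedged sketch of how the L1, LPIPS, and adversarial terms might be combined for the student. The weights, the discriminator interface, and `lpips_fn` (standing in for a perceptual-similarity model such as the `lpips` package) are assumptions, and the adaptive loss term is omitted.

```python
import torch
import torch.nn.functional as F

def student_loss(student_out, teacher_out, discriminator, lpips_fn,
                 w_l1=1.0, w_lpips=0.5, w_adv=0.1):
    """Hypothetical combination of the distillation losses named in the post."""
    l1 = F.l1_loss(student_out, teacher_out)                    # pixel-level fidelity to the teacher output
    perceptual = lpips_fn(student_out, teacher_out).mean()      # perceptual (LPIPS) similarity
    adv = -torch.log(torch.sigmoid(discriminator(student_out)) + 1e-8).mean()  # non-saturating adversarial term
    return w_l1 * l1 + w_lpips * perceptual + w_adv * adv
```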

line

The Present State of LY Corporation's

Tech-Verse 2025 showcased LY Corporation’s strategic shift toward an AI-integrated ecosystem following the merger of LINE and Yahoo Japan. The event focused on the practical hurdles of deploying generative AI, concluding that the transition from experimental models to production-ready services requires sophisticated evaluation frameworks and deep contextual integration into developer workflows.

## AI-Driven Engineering with Ark Developer

LY Corporation’s internal "Ark Developer" solution demonstrates how AI can be embedded directly into the software development life cycle.

* The system utilizes a Retrieval-Augmented Generation (RAG) based code assistant to handle tasks such as code completion, security reviews, and automated test generation.
* Rather than treating codebases as simple text documents, the tool performs graph analysis on directory structures to maintain structural context during code synthesis (see the illustrative repository-graph sketch after this entry).
* Real-world application includes a seamless integration with GitHub for automated Pull Request (PR) creation, with internal users reporting higher satisfaction compared to off-the-shelf tools like GitHub Copilot.

## Quantifying Quality in Generative AI

A significant portion of the technical discussion centered on moving away from subjective "vibes-based" assessments toward rigorous, multi-faceted evaluation of AI outputs.

* To measure the quality of generated images, developers utilized traditional metrics like Fréchet Inception Distance (FID) and Inception Score (IS) alongside LAION’s Aesthetic Score.
* Advanced evaluation techniques were introduced, including CLIP-IQA, Q-Align, and Visual Question Answering (VQA) based on video-language models to analyze image accuracy.
* Technical challenges in image translation and inpainting were highlighted, specifically the difficulty of restoring layout and text structures naturally after optical character recognition (OCR) and translation.

## Global Technical Exchange and Implementation

The conference served as a collaborative hub for engineers across Japan, Taiwan, and Korea to discuss the implementation of emerging standards like the Model Context Protocol (MCP).

* Sessions emphasized the "how-to" of overcoming deployment hurdles rather than just following technical trends.
* Poster sessions (Product Street) and interactive Q&A segments allowed developers to share localized insights on LLM agent performance and agentic workflows.
* The recurring theme across diverse teams was that the "evaluation and verification" stage is now the primary driver of quality in generative AI services.

For organizations looking to scale AI, the key recommendation is to move beyond simple implementation and invest in "evaluation-driven development." By building internal tools that leverage graph-based context and quantitative metrics like Aesthetic Scores and VQA, teams can ensure that generative outputs meet professional service standards.
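To illustrate the graph-based context idea, the sketch below builds a simple containment graph over a repository with `networkx` and retrieves structurally nearby files. It is a stand-in for, not a description of, Ark Developer's internal graph analysis.

```python
import os
import networkx as nx

def build_repo_graph(root: str) -> nx.DiGraph:
    """Hypothetical directory graph for structural RAG context.

    Nodes are files and directories; edges follow containment, so a retriever can
    pull in structurally adjacent files rather than treating code as flat text.
    """
    g = nx.DiGraph()
    for dirpath, dirnames, filenames in os.walk(root):
        g.add_node(dirpath, kind="dir")
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            g.add_node(path, kind="dir" if name in dirnames else "file")
            g.add_edge(dirpath, path)
    return g

def structural_neighbors(g: nx.DiGraph, file_path: str, hops: int = 2):
    """Return files within a few hops of file_path, a crude proxy for structural context."""
    nearby = nx.single_source_shortest_path_length(g.to_undirected(), file_path, cutoff=hops)
    return [n for n in nearby if g.nodes[n].get("kind") == "file" and n != file_path]
```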

line

How should we evaluate AI-generated

To optimize the Background Person Removal (BPR) feature in image editing services, the LY Corporation AMD team evaluated various generative AI inpainting models to determine which automated metrics best align with human judgment. While traditional research benchmarks often fail to reflect performance in high-resolution, real-world scenarios, this study identifies a framework for selecting models that produce the most natural results. The research highlights that as the complexity and size of the masked area increase, the gap between model performance becomes more pronounced, requiring more sophisticated evaluation strategies.

### Background Person Removal Workflow

* **Instance Segmentation:** The process begins by identifying individual pixels to classify objects such as people, buildings, or trees within the input image.
* **Salient Object Detection:** This step distinguishes the main subjects of the photo from background elements to ensure only unwanted figures are targeted for removal.
* **Inpainting Execution:** Once the background figures are removed, inpainting technology is used to reconstruct the empty space so it blends seamlessly with the surrounding environment.

### Comparison of Inpainting Technologies

* **Diffusion-based Models:** These models, such as FLUX.1-Fill-dev, restore damaged areas by gradually removing noise. While they excel at restoring complex details, they are generally slower than GANs and can occasionally generate artifacts.
* **GAN-based Models:** Using a generator-discriminator architecture, models like LaMa and HINT offer faster generation speeds and competitive performance for lower-resolution or smaller inpainting tasks.
* **Performance Discrepancy:** Experiments showed that while most models perform well on small areas, high-resolution images with large missing sections reveal significant quality differences that are not always captured in standard academic benchmarks.

### Evaluation Methodology and Metrics

* **BPR Evaluation Dataset:** The team curated a specific dataset of 10 images with high variance in quality to test 11 different inpainting models released between 2022 and 2024.
* **Single Image Quality Metrics:** Evaluated models using LAION Aesthetics score-v2, CLIP-IQA, and Q-Align to measure the aesthetic quality of individual generated frames.
* **Preference and Reward Models:** Utilized PickScore, ImageReward, and HPS v2 to determine which generated images would be most preferred by human users.
* **Objective:** The goal of these tests was to find an automated evaluation method that minimizes the need for expensive and time-consuming human reviews while maintaining high reliability (one possible aggregation scheme is sketched after this entry).

Selecting an inpainting model based solely on paper-presented metrics is insufficient for production-level services. For features like BPR, it is critical to implement an evaluation pipeline that combines both aesthetic scoring and human preference models to ensure consistent quality across diverse, high-resolution user photos.
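One plausible way to combine the listed automated metrics is sketched below: per-model metric means are min-max normalized across models and averaged into a single ranking score. The aggregation scheme is an assumption for illustration, not the AMD team's published methodology.

```python
import numpy as np

def rank_inpainting_models(scores: dict) -> list:
    """Hypothetical aggregation of automated metrics into a single ranking.

    `scores` maps model name -> {metric name -> list of per-image scores}, and every
    model is assumed to be scored on the same metrics (e.g., aesthetic score,
    CLIP-IQA, Q-Align, PickScore). Each metric is min-max normalized across models
    before averaging so that no single scale dominates the ranking.
    """
    metrics = sorted(next(iter(scores.values())).keys())
    means = {name: {m: float(np.mean(per_metric[m])) for m in metrics}
             for name, per_metric in scores.items()}
    ranking = {}
    for name in scores:
        normalized = []
        for m in metrics:
            column = [means[other][m] for other in scores]
            lo, hi = min(column), max(column)
            normalized.append((means[name][m] - lo) / (hi - lo + 1e-12))
        ranking[name] = sum(normalized) / len(normalized)
    return sorted(ranking, key=ranking.get, reverse=True)   # best model first
```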

google

AMIE gains vision: A research AI agent for multimodal diagnostic dialogue

Google Research and DeepMind have introduced multimodal AMIE, an advanced research AI agent designed to conduct diagnostic medical dialogues that integrate text, images, and clinical documents. By building on Gemini 2.0 Flash and a novel state-aware reasoning framework, the system can intelligently request and interpret visual data such as skin photos or ECGs to refine its diagnostic hypotheses. This evolution moves AI diagnostic tools closer to real-world clinical practice, where visual evidence is often essential for accurate patient assessment and management.

### Enhancing AMIE with Multimodal Perception

To move beyond text-only limitations, researchers integrated vision capabilities that allow the agent to process complex medical information during a conversation.

* The system uses Gemini 2.0 Flash as its core component to interpret diverse data types, including dermatology images and laboratory reports.
* By incorporating multimodal perception, the agent can resolve diagnostic ambiguities that cannot be addressed through verbal descriptions alone.
* Preliminary testing with Gemini 2.5 Flash suggests that further scaling the underlying model continues to improve the agent's reasoning and diagnostic accuracy.

### Emulating Clinical Workflows via State-Aware Reasoning

A key technical contribution is the state-aware phase transition framework, which helps the AI mimic the structured yet flexible approach used by experienced clinicians (a simplified state-machine sketch follows this entry).

* The framework orchestrates the conversation through three distinct phases: History Taking, Diagnosis & Management, and Follow-up.
* The agent maintains a dynamic internal state that tracks known information about the patient and identifies specific "knowledge gaps."
* When the system detects uncertainty, it strategically requests multimodal artifacts—such as a photo of a rash or an image of a lab result—to update its differential diagnosis.
* Transitions between conversation phases are only triggered once the system assesses that the objectives of the current phase have been sufficiently met.

### Evaluation through Simulated OSCEs

To validate the agent’s performance, the researchers developed a robust simulation environment to facilitate rapid iteration and standardized testing.

* The system was tested using patient scenarios grounded in real-world datasets, including the SCIN dataset for dermatology and PTB-XL for ECG measurements.
* Evaluation was conducted using a modified version of Objective Structured Clinical Examinations (OSCEs), the global standard for assessing medical students and professionals.
* In comparative studies, AMIE's performance was measured against primary care physicians (PCPs) to ensure its behavior, accuracy, and tone aligned with clinical standards.

This research demonstrates that multimodal AI agents can effectively navigate the complexities of a medical consultation by combining linguistic empathy with the technical ability to interpret visual clinical evidence. As these systems continue to evolve, they offer a promising path toward high-quality, accessible diagnostic assistance that mirrors the multimodal nature of human medicine.
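The state-aware phase-transition framework can be pictured as a small state machine: the sketch below tracks known facts and knowledge gaps, requests a multimodal artifact when a gap is detected, and only advances phases once objectives are met. Class names and the transition logic are illustrative simplifications, not Google's implementation.

```python
from enum import Enum, auto

class Phase(Enum):
    HISTORY_TAKING = auto()
    DIAGNOSIS_AND_MANAGEMENT = auto()
    FOLLOW_UP = auto()

class DialogueState:
    """Hypothetical state tracker for the phase-transition idea."""
    def __init__(self):
        self.phase = Phase.HISTORY_TAKING
        self.known_facts: dict = {}
        self.knowledge_gaps: set = set()

    def next_action(self) -> str:
        # If the agent is uncertain about a finding, ask for a multimodal artifact.
        if self.knowledge_gaps:
            gap = next(iter(self.knowledge_gaps))
            return f"request_artifact: please share an image or document showing {gap}"
        return "ask_question" if self.phase is Phase.HISTORY_TAKING else "advise_patient"

    def maybe_transition(self, objectives_met: bool) -> None:
        # Advance only once the current phase's objectives are judged complete.
        if objectives_met and self.phase is Phase.HISTORY_TAKING:
            self.phase = Phase.DIAGNOSIS_AND_MANAGEMENT
        elif objectives_met and self.phase is Phase.DIAGNOSIS_AND_MANAGEMENT:
            self.phase = Phase.FOLLOW_UP
```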

google

Geospatial Reasoning: Unlocking insights with generative AI and multiple foundation models

Google Research is introducing Geospatial Reasoning, a new framework that integrates generative AI with specialized foundation models to streamline complex geographical problem-solving. By combining large language models like Gemini with domain-specific data, the initiative seeks to make large-scale spatial analysis accessible to sectors like public health, urban development, and climate resilience. This research effort moves beyond traditional data silos, enabling agentic workflows that can interpret diverse data types—from satellite imagery to population dynamics—through natural language.

### Specialized Foundation Models for Human Activity

* The Population Dynamics Foundation Model (PDFM) captures the complex interplay between human behaviors and their local environments.
* A dedicated trajectory-based mobility foundation model has been developed to process and analyze movement patterns.
* While initially tested in the US, experimental datasets are expanding to include the UK, Australia, Japan, Canada, and Malawi for selected partners.

### Remote Sensing and Vision Architectures

* New models utilize advanced architectures including masked autoencoders, SigLIP, MaMMUT, and OWL-ViT, specifically adapted for the remote sensing domain.
* Training involves high-resolution satellite and aerial imagery paired with text descriptions and bounding box annotations to enable precise object detection.
* The models support zero-shot classification and retrieval, allowing users to locate specific features—such as "residential buildings with solar panels"—using flexible natural language queries (a retrieval sketch follows this entry).
* Internal evaluations show state-of-the-art performance across multiple benchmarks, including image segmentation and post-disaster damage assessment.

### Agentic Workflows and Industry Collaboration

* The Geospatial Reasoning framework utilizes LLMs like Gemini to manage complex datasets and orchestrate "agentic" workflows.
* These workflows are grounded in geospatial data to ensure that the insights generated are both useful and contextually accurate.
* Google is collaborating with inaugural industry partners, including Airbus, Maxar, Planet Labs, and WPP, to test these capabilities in real-world scenarios.

Organizations interested in accelerating their geospatial analysis should consider applying for the trusted tester program to explore how these foundation models can be fine-tuned for specific proprietary data and use cases.
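As a concrete picture of zero-shot retrieval over remote-sensing embeddings, the sketch below ranks image tiles by cosine similarity against a text query embedding from a SigLIP-style dual encoder. The function, embedding shapes, and workflow are assumptions rather than the released models' API.

```python
import numpy as np

def zero_shot_retrieval(text_embedding: np.ndarray,
                        tile_embeddings: np.ndarray,
                        tile_ids: list, top_k: int = 10):
    """Hypothetical zero-shot retrieval over precomputed remote-sensing embeddings.

    Assumes a dual encoder has produced one embedding per image tile and one for a
    query such as "residential buildings with solar panels"; ranking is plain
    cosine similarity between L2-normalized vectors.
    """
    t = text_embedding / np.linalg.norm(text_embedding)
    tiles = tile_embeddings / np.linalg.norm(tile_embeddings, axis=1, keepdims=True)
    sims = tiles @ t                                  # cosine similarity per tile
    order = np.argsort(-sims)[:top_k]                 # best-matching tiles first
    return [(tile_ids[i], float(sims[i])) for i in order]
```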