mediapipe

2 posts

google

Introducing interactive on-device segmentation in Snapseed

Google has introduced a new "Object Brush" feature in Snapseed that enables intuitive, real-time selective photo editing through novel on-device segmentation technology. By leveraging a high-performance interactive AI model, users can isolate complex subjects with simple touch gestures in under 20 milliseconds, bridging the gap between professional-grade editing and mobile convenience. This is achieved through a teacher-student training architecture that prioritizes both pixel-perfect accuracy and low-latency performance on consumer hardware.

### High-Performance On-Device Inference

* The system is powered by the Interactive Segmenter model, which is integrated directly into the Snapseed "Adjust" tool to enable immediate object-based edits.
* To keep the experience fluid, the model runs on the MediaPipe framework with LiteRT's GPU acceleration, processing selections in less than 20 ms (a usage sketch of the public MediaPipe task API follows this summary).
* The interface supports dynamic refinement: users can trace lines or tap to add or subtract specific areas of an image in real time.

### Teacher-Student Model Distillation

* The development team first created "Interactive Segmenter: Teacher," a large model fine-tuned on 30,000 high-quality, pixel-perfect manual annotations spanning more than 350 object categories.
* Because the Teacher model's size and computational requirements are prohibitive for mobile use, the researchers developed "Interactive Segmenter: Edge" through knowledge distillation.
* The distillation process used a dataset of over 2 million weakly annotated images, allowing the smaller Edge model to inherit the Teacher's generalization capabilities while keeping a footprint suitable for mobile devices.

### Training via Synthetic User Prompts

* To make the model capable across arbitrary object types, training follows a class-agnostic approach based on the Big Transfer (BiT) strategy.
* The model learns to interpret user intent through "prompt generation," which simulates real-world interactions such as random scribbles, taps, and lasso (box) selections.
* During training, both the Teacher and Edge models receive identical prompts, such as red foreground scribbles and blue background scribbles, so the student learns to produce high-quality masks even from imprecise user input.

This advancement lowers the barrier to complex photo manipulation by moving heavy AI processing directly onto the mobile device. Users can expect a more responsive and precise editing experience that handles everything from fine-tuning a subject's lighting to isolating specific environmental elements like clouds or clothing.
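Below is a minimal Python sketch of tap-driven interactive segmentation using MediaPipe's publicly available Interactive Segmenter task, which accepts keypoint (tap) and scribble prompts. It illustrates the prompt-plus-mask workflow described above rather than the Snapseed-internal Teacher/Edge models; the model file name and tap coordinates are placeholders.

```python
# Minimal sketch: tap-prompted segmentation with MediaPipe's public
# Interactive Segmenter task. "magic_touch.tflite" and the tap coordinates
# are placeholders; the Snapseed Teacher/Edge models are not distributed here.
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision
from mediapipe.tasks.python.components import containers

RegionOfInterest = vision.InteractiveSegmenterRegionOfInterest
NormalizedKeypoint = containers.keypoint.NormalizedKeypoint

options = vision.InteractiveSegmenterOptions(
    base_options=python.BaseOptions(model_asset_path="magic_touch.tflite"),
    output_category_mask=True,
)

with vision.InteractiveSegmenter.create_from_options(options) as segmenter:
    image = mp.Image.create_from_file("photo.jpg")

    # A single tap at normalized image coordinates serves as the foreground prompt.
    roi = RegionOfInterest(
        format=RegionOfInterest.Format.KEYPOINT,
        keypoint=NormalizedKeypoint(x=0.5, y=0.5),
    )

    result = segmenter.segment(image, roi)
    # Per-pixel mask separating the tapped object from the background.
    mask = result.category_mask.numpy_view()
```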

google

From massive models to mobile magic: The tech behind YouTube real-time generative AI effects

YouTube has deployed more than 20 real-time generative AI effects by distilling the capabilities of massive cloud-based models into compact, mobile-ready architectures. A "teacher-student" training paradigm overcomes the computational bottlenecks of high-fidelity generative AI while keeping the output responsive on mobile hardware. This approach allows complex transformations, such as cartoon style transfer and makeup application, to run frame-by-frame on-device without sacrificing the user's identity.

### Data Curation and Diversity

* The effects pipeline is built on high-quality, properly licensed face datasets.
* Datasets are meticulously filtered to ensure a uniform distribution across ages, genders, and skin tones.
* The Monk Skin Tone Scale is used as a benchmark to ensure the effects work equitably for all users.

### The Teacher-Student Framework

* **The Teacher:** A large, powerful pre-trained model (initially StyleGAN2 with StyleCLIP, later Google DeepMind's Imagen) acts as the "expert" that generates high-fidelity visual effects.
* **The Student:** A lightweight UNet-based architecture designed for mobile efficiency, using a MobileNet backbone for both the encoder and decoder to keep frame-by-frame processing fast.
* The distillation process narrows the scope of the massive teacher model into a student model focused on a single, specific task.

### Iterative Distillation and Training

* **Data generation:** The teacher model processes thousands of images to create "before and after" pairs, which are augmented with synthetic elements such as AR glasses, sunglasses, and hand occlusions to improve real-world robustness.
* **Optimization:** The student model is trained with a combination of loss functions, including L1, LPIPS, adaptive, and adversarial losses, to balance numerical accuracy with aesthetic quality (see the training-step sketch after this summary).
* **Architecture search:** Neural architecture search tunes "depth" and "width" multipliers to identify the most efficient model structure for different mobile hardware constraints.

### Addressing the Inversion Problem

* A major challenge in real-time effects is the "inversion problem": the model struggles to represent a real face in latent space, which can lead to a loss of the user's identity (for example, changes in skin tone or clothing).
* YouTube uses Pivotal Tuning Inversion (PTI) to ensure that the user's specific features are preserved during the generative process.
* By editing images in the latent space, a compressed numerical representation, the system can apply stylistic changes while maintaining the core characteristics of the original video stream.

By combining advanced model distillation with on-device optimization via MediaPipe, YouTube demonstrates a practical path for bringing heavy generative AI research into consumer-facing mobile applications.
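As a rough illustration of the distillation recipe above, the sketch below trains a toy student network on teacher-generated "before and after" pairs using an L1 plus LPIPS (perceptual) objective in PyTorch. The `TinyStudent` architecture, loss weights, and optimizer settings are placeholders rather than YouTube's MobileNet-based UNet or actual hyperparameters, and the adaptive and adversarial loss terms are omitted for brevity.

```python
# Hypothetical distillation step: a toy student learns to reproduce the
# teacher's "after" images from the "before" frames. Architecture and
# loss weights are illustrative, not YouTube's production setup.
import torch
import torch.nn as nn
import torch.nn.functional as F
import lpips  # pip install lpips (perceptual similarity loss)

class TinyStudent(nn.Module):
    """Stand-in for the lightweight MobileNet-backed UNet student."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

student = TinyStudent()
perceptual = lpips.LPIPS(net="vgg")  # frozen VGG-based LPIPS metric
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

def distillation_step(before, teacher_after):
    """One training step on a teacher-generated pair.
    Both tensors have shape (N, 3, H, W) with values in [-1, 1]."""
    pred = student(before)
    l1_loss = F.l1_loss(pred, teacher_after)              # pixel-level accuracy
    lpips_loss = perceptual(pred, teacher_after).mean()   # perceptual quality
    loss = l1_loss + 0.5 * lpips_loss                     # 0.5 is a placeholder weight
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random stand-in frames:
before = torch.rand(2, 3, 256, 256) * 2 - 1
teacher_after = torch.rand(2, 3, 256, 256) * 2 - 1
print(distillation_step(before, teacher_after))
```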