google May 11, 2025

Bringing 3D shoppable products online with generative AI

gen-ai diffusion-models veo neural-radiance-fields novel-view-synthesis score-distillation-sampling 3d-reconstruction

Google has developed a series of generative AI techniques to transform standard 2D product images into immersive, interactive 3D visualizations for online shopping. By evolving from early neural reconstruction methods to state-of-the-art video generation models like Veo, Google can now produce high-quality 360-degree spins from as few as three images. This progression significantly reduces the cost and complexity for businesses to create shoppable 3D experiences at scale across diverse product categories.

First Generation: Neural Radiance Fields (NeRFs)

Launched in 2022, this initial approach utilized NeRF technology to synthesize novel views and 360° spins, specifically for footwear on Google Search.
The system required five or more images and relied on several complex sub-processes, including background removal, XYZ coordinate prediction in a Normalized Object Coordinate Space (NOCS), and camera position estimation.
While a breakthrough, the technology struggled with "noisy" signals and complex geometries, such as the thin structures found in sandals or high heels.
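For readers unfamiliar with how a NeRF turns a learned volume into a novel view, the rendering step can be sketched as the standard alpha-compositing sum along one camera ray. This is a minimal illustration of the textbook NeRF quadrature, not Google's production pipeline; the function name `composite_ray` and the sample values in the usage note are assumptions made for the example.

```python
import math


def composite_ray(densities, colors, deltas):
    """Blend per-sample colors along one camera ray (standard NeRF quadrature).

    densities: volume density sigma_i at each sample (>= 0)
    colors:    (r, g, b) tuple at each sample
    deltas:    distance between consecutive samples
    Returns the pixel color and the leftover transmittance.
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weight = transmittance * alpha          # contribution of this sample
        for c in range(3):
            pixel[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
    return tuple(pixel), transmittance
```

Calling `composite_ray([0.0, 1e9, 5.0], [(1, 0, 0), (0, 1, 0), (0, 0, 1)], [0.1, 0.1, 0.1])` passes through the empty first sample, stops at the opaque second one, and never reaches the third. Because each pixel depends on samples landing on the surface, it is plausible that very thin structures are harder to recover, although the source does not spell out the mechanism.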

Second Generation: View-Conditioned Diffusion

Introduced in 2023, this version addressed previous limitations by using a diffusion-based architecture to predict unseen viewpoints from limited data.
The model used Score Distillation Sampling (SDS): views rendered from the 3D model are compared against targets generated by the diffusion model for the same viewpoint, and the difference is used to iteratively refine the 3D model's parameters for better realism.
This approach allowed Google to scale 3D visualizations to the majority of shoes viewed on Google Shopping, handling more diverse and difficult footwear styles.
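The render, compare, and refine loop described above can be sketched with a deliberately tiny toy. Everything here is an assumption made for illustration: the "3D model" is a list of one number per view, `render` simply reads that number, and `generate_target` is a hypothetical stand-in for the view-conditioned diffusion model that returns made-up values. Real SDS backpropagates a diffusion model's denoising signal through a differentiable renderer rather than a plain squared error, so this shows only the shape of the loop.

```python
import random

# Made-up per-view appearance that the stand-in "generator" returns.
TRUE_APPEARANCE = [0.2, 0.8, 0.5, 0.9]


def render(params, view):
    """Toy renderer: the image seen from `view` is a single stored value."""
    return params[view]


def generate_target(view):
    """Hypothetical stand-in for the view-conditioned diffusion model."""
    return TRUE_APPEARANCE[view]


def refine(params, steps=2000, lr=0.1, seed=0):
    """Iteratively nudge the parameters toward the generated targets."""
    rng = random.Random(seed)
    params = list(params)
    for _ in range(steps):
        view = rng.randrange(len(params))   # sample a random camera view
        rendered = render(params, view)     # what the current model shows
        target = generate_target(view)      # what the generator expects
        grad = 2.0 * (rendered - target)    # gradient of the squared error
        params[view] -= lr * grad           # gradient-descent update
    return params
```

Starting from `refine([0.0, 0.0, 0.0, 0.0])`, the parameters move toward the values in `TRUE_APPEARANCE`, mirroring how each comparison informs the next optimization step.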

Third Generation: Generalizing with Veo

The current advancement leverages Google’s Veo video generation model to transform product images into consistent, high-fidelity 360° videos.
By training on millions of synthetic 3D assets, Veo captures complex interactions between light, texture, and geometry, making it effective for shiny surfaces and diverse categories like electronics and furniture.
This method removes the need for precise camera pose estimation, increasing reliability across different environments.
The model can generate a 3D representation from a single image, but it must then "hallucinate" the details that image does not show; supplying three images significantly reduces these errors and yields a higher-fidelity result.
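Once a product is represented as a 360° video, an interactive spin reduces to choosing which frame to display for a requested viewing angle. The sketch below is one simple way a viewer might do that. It assumes the frames are evenly spaced over one full turn and that frame 0 corresponds to 0°; the source states neither, and `frame_for_angle` is a hypothetical helper, not part of any Google API.

```python
def frame_for_angle(angle_deg, num_frames):
    """Pick the video frame closest to a requested viewing angle.

    Assumes frames are evenly spaced over 360 degrees, starting at 0.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    step = 360.0 / num_frames                 # degrees covered per frame
    wrapped = angle_deg % 360.0               # fold any angle into [0, 360)
    return round(wrapped / step) % num_frames
```

With a 36-frame spin, `frame_for_angle(90, 36)` selects frame 9, and negative or large angles wrap around, so a continuous drag gesture keeps rotating the product.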

These technological milestones mark a shift from specialized 3D reconstruction toward generalized AI models that make digital products feel tangible and interactive for consumers.