
How should we evaluate AI-generated images?

To optimize the Background Person Removal (BPR) feature in image editing services, the LY Corporation AMD team evaluated various generative AI inpainting models to determine which automated metrics best align with human judgment. While traditional research benchmarks often fail to reflect performance in high-resolution, real-world scenarios, this study identifies a framework for selecting models that produce the most natural results. The research highlights that as the complexity and size of the masked area increase, the gap between model performance becomes more pronounced, requiring more sophisticated evaluation strategies.

Background Person Removal Workflow

  • Instance Segmentation: The process begins by classifying each pixel of the input image into object instances, such as people, buildings, or trees.
  • Salient Object Detection: This step distinguishes the main subjects of the photo from background elements to ensure only unwanted figures are targeted for removal.
  • Inpainting Execution: Once the background figures are removed, inpainting technology is used to reconstruct the empty space so it blends seamlessly with the surrounding environment.
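The three stages above can be sketched as a simple pipeline. This is a minimal illustration with stand-in stage functions and a toy fill rule, not the production models; the function names, instance labels, and fill strategy are all hypothetical.

```python
import numpy as np

def instance_segmentation(image: np.ndarray) -> np.ndarray:
    """Stand-in: label each pixel with an instance id (0 = background)."""
    labels = np.zeros(image.shape[:2], dtype=np.int32)
    labels[10:20, 10:20] = 1  # pretend instance 1 is a background person
    labels[30:40, 30:40] = 2  # pretend instance 2 is the main subject
    return labels

def salient_instances(labels: np.ndarray) -> set[int]:
    """Stand-in: decide which instances are the photo's main subjects."""
    return {2}

def inpaint(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Stand-in: fill masked pixels (here, with the unmasked mean color)."""
    out = image.copy()
    out[mask] = image[~mask].mean(axis=0)
    return out

def remove_background_people(image: np.ndarray) -> np.ndarray:
    labels = instance_segmentation(image)
    keep = salient_instances(labels)
    # Mask every detected instance that is NOT a salient subject.
    mask = (labels > 0) & ~np.isin(labels, list(keep))
    return inpaint(image, mask)

image = np.random.rand(64, 64, 3)
result = remove_background_people(image)
```

In the real feature each stand-in is a dedicated model; the point here is only the data flow: segmentation produces per-pixel instance labels, saliency selects which instances survive, and inpainting reconstructs the remainder.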

Comparison of Inpainting Technologies

  • Diffusion-based Models: These models, such as FLUX.1-Fill-dev, restore damaged areas by gradually removing noise. While they excel at restoring complex details, they are generally slower than GANs and can occasionally generate artifacts.
  • GAN-based Models: Using a generator-discriminator architecture, models like LaMa and HINT offer faster generation speeds and competitive performance for lower-resolution or smaller inpainting tasks.
  • Performance Discrepancy: Experiments showed that while most models perform well on small areas, high-resolution images with large missing sections reveal significant quality differences that are not always captured in standard academic benchmarks.
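Because the quality gap widens with mask size, a service could route requests by masked-area ratio: small fills go to a fast GAN model, large fills to a slower diffusion model. This routing heuristic is a sketch; the 5% threshold and model labels are illustrative assumptions, not values from the study.

```python
import numpy as np

# Hypothetical threshold: masks covering more than 5% of the frame
# are sent to the slower but more capable diffusion model.
AREA_THRESHOLD = 0.05

def choose_inpainter(mask: np.ndarray) -> str:
    """Route by the fraction of masked pixels in the frame."""
    ratio = mask.mean()  # boolean mask -> fraction of True pixels
    return "diffusion" if ratio > AREA_THRESHOLD else "gan"

small = np.zeros((100, 100), dtype=bool)
small[:10, :10] = True   # 1% of the frame masked
large = np.zeros((100, 100), dtype=bool)
large[:40, :40] = True   # 16% of the frame masked

print(choose_inpainter(small), choose_inpainter(large))  # gan diffusion
```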

Evaluation Methodology and Metrics

  • BPR Evaluation Dataset: The team curated a dedicated dataset of 10 images with high variance in quality to test 11 different inpainting models released between 2022 and 2024.
  • Single Image Quality Metrics: Evaluated models using LAION Aesthetics score-v2, CLIP-IQA, and Q-Align to measure the aesthetic quality of individual generated frames.
  • Preference and Reward Models: Utilized PickScore, ImageReward, and HPS v2 to determine which generated images would be most preferred by human users.
  • Objective: The goal of these tests was to find an automated evaluation method that minimizes the need for expensive and time-consuming human reviews while maintaining high reliability.
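Since these metrics live on different scales, comparing models requires normalizing each metric before combining them. A minimal sketch, assuming z-score normalization per metric followed by averaging; the model names and score values below are made up for illustration, and the aggregation rule is an assumption, not the team's published method.

```python
import statistics

# Hypothetical scores: metric names follow the article, values are invented.
scores = {
    "model_a": {"clip_iqa": 0.71, "q_align": 3.9, "pickscore": 21.2},
    "model_b": {"clip_iqa": 0.65, "q_align": 4.2, "pickscore": 21.8},
    "model_c": {"clip_iqa": 0.80, "q_align": 3.1, "pickscore": 20.4},
}
metrics = list(next(iter(scores.values())))

def zscores(metric: str) -> dict[str, float]:
    """Normalize one metric across models so scales become comparable."""
    vals = [scores[m][metric] for m in scores]
    mu, sd = statistics.mean(vals), statistics.pstdev(vals)
    return {m: (scores[m][metric] - mu) / sd for m in scores}

norm = {metric: zscores(metric) for metric in metrics}
# Rank models by mean normalized score, best first.
ranking = sorted(scores, key=lambda m: -statistics.mean(norm[k][m] for k in metrics))
```

A production pipeline would also calibrate such an aggregate against a held-out set of human judgments before trusting it to replace reviews.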

Selecting an inpainting model based solely on the metrics reported in papers is insufficient for production-level services. For features like BPR, it is critical to implement an evaluation pipeline that combines both aesthetic scoring and human preference models to ensure consistent quality across diverse, high-resolution user photos.