model-evaluation

3 posts

naver

Recreating the User's

The development of NSona, an LLM-based multi-agent persona platform, addresses the persistent gap between user research and service implementation by transforming static data into real-time collaborative resources. By recreating user voices through a multi-party dialogue system, the project demonstrates how AI can serve as an active participant in the daily design and development process. Ultimately, the initiative highlights a fundamental shift in cross-functional collaboration, where traditional role boundaries dissolve in favor of a shared starting point centered on AI-driven user empathy.

## Bridging UX Research and Daily Collaboration

* The project was born from the realization that traditional UX research often remains isolated from the actual development cycle, leading to a loss of insight during implementation.
* NSona transforms static user research data into dynamic "persona bots" that can interact with project members in real time.
* The platform aims to turn the user voice into a "live" resource, allowing designers and developers to consult the persona during the decision-making process.

## Agent-Centric Engineering and Multi-Party UX

* The system architecture is built on an agent-centric structure designed to handle the complexities of specific user behaviors and motivations.
* It uses a multi-party dialogue framework, enabling a collaborative environment where multiple AI agents and human stakeholders can converse simultaneously (see the orchestration sketch after this summary).
* Technical implementation focused on bridging the gap between qualitative UX requirements and LLM orchestration, ensuring the persona's responses remained grounded in actual research data.

## Service-Specific Evaluation and Quality Metrics

* The team moved beyond generic LLM benchmarks to establish a service-specific evaluation process tailored to the project's unique UX goals.
* Model quality was measured by how vividly and accurately it recreated the intended persona, focusing on the degree of "immersion" it triggered in human users.
* Insights from these evaluations helped refine the prompt design and agent logic to ensure the AI's output provided genuine value to the product development lifecycle.

## Redefining Cross-Functional Collaboration

* The AI development process reshaped traditional roles and responsibilities (R&R): designers became prompt engineers, while researchers translated qualitative logic into agentic structures.
* Front-end developers evolved their roles to act as critical reviewers of the AI, treating the model as a subject of critique rather than a static asset.
* The workflow shifted from a linear "relay" model to a concentric one, where all team members influence the product's core from the same starting point.

To successfully integrate AI into the product lifecycle, organizations should move beyond using LLMs as simple tools and instead view them as a medium for interdisciplinary collaboration. By building multi-agent systems that reflect real user data, teams can ensure that the "user's voice" is not just a research summary, but a tangible participant in the development process.
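The post describes the multi-party, research-grounded persona agents only at a conceptual level. The following is a minimal sketch of how such a dialogue loop might be orchestrated, assuming persona system prompts are distilled from static UX research. All names here (`PersonaBot`, `MultiPartySession`, `call_llm`) are hypothetical stand-ins, not NSona's actual API.

```python
from dataclasses import dataclass, field


def call_llm(system_prompt: str, messages: list[dict]) -> str:
    """Hypothetical stand-in for an LLM chat-completion call."""
    return f"[reply after {len(messages)} prior turns, grounded in: {system_prompt[:40]}...]"


@dataclass
class PersonaBot:
    """A persona agent whose replies are grounded in static UX research."""
    name: str
    research_summary: str  # distilled interview/survey findings for this persona

    def system_prompt(self) -> str:
        return (
            f"You are '{self.name}', a user persona. Stay strictly consistent "
            f"with the following research findings and never invent traits:\n"
            f"{self.research_summary}"
        )

    def reply(self, transcript: list[dict]) -> dict:
        content = call_llm(self.system_prompt(), transcript)
        return {"speaker": self.name, "content": content}


@dataclass
class MultiPartySession:
    """Shared transcript where several persona bots and human stakeholders take turns."""
    personas: list[PersonaBot]
    transcript: list[dict] = field(default_factory=list)

    def human_says(self, speaker: str, content: str) -> None:
        self.transcript.append({"speaker": speaker, "content": content})

    def round_of_replies(self) -> None:
        # Each persona sees the full multi-party transcript before replying,
        # so agents can react to each other as well as to human stakeholders.
        for persona in self.personas:
            self.transcript.append(persona.reply(list(self.transcript)))


if __name__ == "__main__":
    session = MultiPartySession(personas=[
        PersonaBot("Busy commuter", "Checks the app on mobile, values one-tap flows."),
        PersonaBot("Power user", "Uses advanced filters daily, dislikes UI churn."),
    ])
    session.human_says("Designer", "Would a bottom sheet work better than a modal here?")
    session.round_of_replies()
    for turn in session.transcript:
        print(f"{turn['speaker']}: {turn['content']}")
```

The key design point the post emphasizes, grounding, is reflected only in the system prompt here; a production system would also need retrieval over the research corpus and the service-specific evaluation loop described above.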

line

How should we evaluate AI-generated

To optimize the Background Person Removal (BPR) feature in image editing services, the LY Corporation AMD team evaluated various generative AI inpainting models to determine which automated metrics best align with human judgment. While traditional research benchmarks often fail to reflect performance in high-resolution, real-world scenarios, this study identifies a framework for selecting models that produce the most natural results. The research highlights that as the complexity and size of the masked area increase, the gap between model performance becomes more pronounced, requiring more sophisticated evaluation strategies.

### Background Person Removal Workflow

* **Instance Segmentation:** The process begins by identifying individual pixels to classify objects such as people, buildings, or trees within the input image.
* **Salient Object Detection:** This step distinguishes the main subjects of the photo from background elements to ensure only unwanted figures are targeted for removal.
* **Inpainting Execution:** Once the background figures are removed, inpainting technology reconstructs the empty space so it blends seamlessly with the surrounding environment (a pipeline sketch follows this summary).

### Comparison of Inpainting Technologies

* **Diffusion-based Models:** These models, such as FLUX.1-Fill-dev, restore damaged areas by gradually removing noise. While they excel at restoring complex details, they are generally slower than GANs and can occasionally generate artifacts.
* **GAN-based Models:** Using a generator-discriminator architecture, models like LaMa and HINT offer faster generation speeds and competitive performance for lower-resolution or smaller inpainting tasks.
* **Performance Discrepancy:** Experiments showed that while most models perform well on small areas, high-resolution images with large missing sections reveal significant quality differences that are not always captured in standard academic benchmarks.

### Evaluation Methodology and Metrics

* **BPR Evaluation Dataset:** The team curated a dedicated dataset of 10 images with high quality variance to test 11 different inpainting models released between 2022 and 2024.
* **Single Image Quality Metrics:** The models were evaluated with LAION Aesthetics score v2, CLIP-IQA, and Q-Align to measure the aesthetic quality of individual generated images.
* **Preference and Reward Models:** PickScore, ImageReward, and HPS v2 were used to estimate which generated images human users would prefer (see the aggregation sketch at the end of this summary).
* **Objective:** The goal of these tests was to find an automated evaluation method that minimizes the need for expensive and time-consuming human reviews while maintaining high reliability.

Selecting an inpainting model based solely on paper-reported metrics is insufficient for production-level services. For features like BPR, it is critical to implement an evaluation pipeline that combines both aesthetic scoring and human preference models to ensure consistent quality across diverse, high-resolution user photos.
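The summary describes the three-stage BPR workflow only at a high level; below is a minimal sketch of how such a pipeline could be wired together. The model wrappers (`segment_instances`, `detect_salient`, `inpaint`) and the overlap threshold are illustrative placeholders, not the team's actual components.

```python
import numpy as np


def segment_instances(image: np.ndarray) -> list[dict]:
    """Hypothetical instance-segmentation step: returns per-object masks and labels."""
    h, w, _ = image.shape
    return [{"label": "person", "mask": np.zeros((h, w), dtype=bool)}]


def detect_salient(image: np.ndarray) -> np.ndarray:
    """Hypothetical salient-object detection: mask of the photo's main subjects."""
    h, w, _ = image.shape
    return np.zeros((h, w), dtype=bool)


def inpaint(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Hypothetical inpainting call (e.g., a GAN- or diffusion-based model)."""
    return image.copy()


def remove_background_people(image: np.ndarray) -> np.ndarray:
    """Background Person Removal: erase people who are not the main subject."""
    salient = detect_salient(image)
    removal_mask = np.zeros(image.shape[:2], dtype=bool)

    for obj in segment_instances(image):
        if obj["label"] != "person":
            continue
        # Treat a person as "background" if their mask barely overlaps the salient region.
        overlap = (obj["mask"] & salient).sum()
        area = max(int(obj["mask"].sum()), 1)
        if overlap / area < 0.1:  # threshold is an illustrative assumption
            removal_mask |= obj["mask"]

    # Reconstruct the erased region so it blends with its surroundings.
    return inpaint(image, removal_mask)


if __name__ == "__main__":
    photo = np.zeros((512, 512, 3), dtype=np.uint8)
    result = remove_background_people(photo)
    print(result.shape)
```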
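The post lists the automated metrics but not how their outputs are combined to pick a model. One possible aggregation, assuming each metric is normalized to a comparable range, is a weighted ranking like the sketch below. The scorer stubs stand in for the real implementations (LAION Aesthetics, CLIP-IQA, Q-Align, PickScore, ImageReward, HPS v2), and the weights are assumptions, not values from the study.

```python
# Stub scorers standing in for real metric implementations; each returns a
# higher-is-better float for one generated image (and prompt, where applicable).
# In practice each metric would be normalized to a shared range before mixing.
def aesthetic_score(image_path: str) -> float:  # e.g., LAION Aesthetics / CLIP-IQA / Q-Align
    return 0.5


def preference_score(image_path: str, prompt: str) -> float:  # e.g., PickScore / ImageReward / HPS v2
    return 0.5


def rank_models(
    candidates: dict[str, str],                 # model name -> path of its inpainted output
    prompt: str,
    weights: tuple[float, float] = (0.5, 0.5),  # illustrative weighting, not from the post
) -> list[tuple[str, float]]:
    """Rank inpainting models by a weighted mix of aesthetic and preference scores."""
    w_aes, w_pref = weights
    scored = []
    for model_name, path in candidates.items():
        combined = w_aes * aesthetic_score(path) + w_pref * preference_score(path, prompt)
        scored.append((model_name, combined))
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    outputs = {"LaMa": "lama.png", "HINT": "hint.png", "FLUX.1-Fill-dev": "flux_fill.png"}
    for name, score in rank_models(outputs, prompt="photo with background people removed"):
        print(f"{name}: {score:.3f}")
```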

google

Benchmarking LLMs for global health

Google Research has introduced a benchmarking pipeline and a dataset of over 11,000 synthetic personas to evaluate how Large Language Models (LLMs) handle tropical and infectious diseases (TRINDs). While LLMs excel at standard medical exams like the USMLE, this study reveals significant performance gaps when models encounter the regional context shifts and localized health data common in low-resource settings. The research concludes that integrating specific environmental context and advanced reasoning techniques is essential for making LLMs reliable decision-support tools for global health.

## Development of the TRINDs Synthetic Dataset

* Researchers created a dataset of 11,000+ personas covering 50 tropical and infectious diseases to address the lack of rigorous evaluation data for out-of-distribution medical tasks.
* The process began with "seed" templates based on factual data from the WHO, CDC, and PAHO, which were then reviewed by clinicians for clinical relevance.
* The dataset was expanded using LLM prompting to include diverse demographic, clinical, and consumer-focused augmentations.
* To test linguistic distribution shifts, the seed set was manually translated into French to evaluate how language changes impact diagnostic accuracy.

## Identifying Critical Performance Drivers

* Evaluations of Gemini 1.5 models showed that accuracy on TRINDs is lower than reported performance on standard U.S. medical benchmarks, indicating a struggle with "out-of-distribution" disease types.
* Contextual information is the primary driver of accuracy; the highest performance was achieved only when specific symptoms were combined with location and risk factors.
* The study found that symptoms alone are often insufficient for an accurate diagnosis, emphasizing that LLMs require localized environmental data to differentiate between similar tropical conditions.
* Linguistic shifts pose a significant challenge, as model performance dropped by approximately 10% when processing the French version of the dataset compared to the English version.

## Optimization and Reasoning Strategies

* Implementing Chain-of-Thought (CoT) prompting, where the model is directed to explain its reasoning step by step, led to a significant 10% increase in diagnostic accuracy (see the prompt-builder sketch after this summary).
* Researchers utilized an LLM-based "autorater" to scale the evaluation process, scoring answers as correct if the predicted diagnosis was meaningfully similar to the ground truth (a judging sketch also follows).
* In tests regarding social biases, the study found no statistically significant difference in performance across race or gender identifiers within this specific TRINDs context.
* Performance remained stable even when clinical language was swapped for consumer-style descriptions, suggesting the models are robust to variations in how patients describe their symptoms.

To improve the utility of LLMs for global health, developers should prioritize the inclusion of regional risk factors and location-specific data in prompts. Utilizing reasoning-heavy strategies like Chain-of-Thought and expanding multilingual training sets are critical steps for bridging the performance gap in underserved regions.
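The study's two main accuracy drivers, location/risk-factor context and chain-of-thought prompting, can be illustrated with a simple prompt builder. The persona fields and prompt wording below are assumptions for illustration, not the paper's actual templates, and `call_llm` is a placeholder for whatever model endpoint is being evaluated.

```python
from dataclasses import dataclass


def call_llm(prompt: str) -> str:
    """Placeholder for a chat/completion call to the model under evaluation."""
    return "Reasoning: ... Final diagnosis: dengue fever"


@dataclass
class Persona:
    """Illustrative fields mirroring the kind of context the synthetic personas carry."""
    symptoms: str
    location: str
    risk_factors: str


def build_prompt(persona: Persona, chain_of_thought: bool = True) -> str:
    # Symptoms alone are often insufficient; location and risk factors are the
    # main accuracy drivers, so they are included explicitly in the prompt.
    prompt = (
        f"A patient reports: {persona.symptoms}\n"
        f"Location: {persona.location}\n"
        f"Risk factors: {persona.risk_factors}\n"
        "What is the most likely diagnosis?"
    )
    if chain_of_thought:
        # Chain-of-thought instruction: ask the model to reason step by step
        # before committing to a final answer.
        prompt += "\nThink step by step, then state the final diagnosis on the last line."
    return prompt


if __name__ == "__main__":
    persona = Persona(
        symptoms="high fever, severe headache, joint pain for three days",
        location="urban area in Southeast Asia during rainy season",
        risk_factors="recent mosquito bites, no vaccination history",
    )
    print(call_llm(build_prompt(persona)))
```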
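The post also mentions an LLM-based "autorater" that marks a prediction correct when it is meaningfully similar to the ground-truth diagnosis. The sketch below shows one way such a judge could be framed; the rubric wording and `judge_llm` call are assumptions, not the actual autorater used in the study.

```python
def judge_llm(prompt: str) -> str:
    """Placeholder for a call to the judge model; expected to answer YES or NO."""
    return "YES"


def autorate(predicted: str, ground_truth: str) -> bool:
    """Score a prediction as correct if the judge deems it equivalent to the label."""
    rubric = (
        "You are grading a diagnostic answer. Reply YES if the predicted diagnosis "
        "refers to the same condition as the ground truth (synonyms, abbreviations, "
        "and subtypes count as matches); otherwise reply NO.\n"
        f"Predicted: {predicted}\n"
        f"Ground truth: {ground_truth}"
    )
    return judge_llm(rubric).strip().upper().startswith("YES")


if __name__ == "__main__":
    predictions = [("dengue", "Dengue fever"), ("malaria", "Chikungunya")]
    correct = sum(autorate(pred, truth) for pred, truth in predictions)
    print(f"autorater accuracy: {correct}/{len(predictions)}")
```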