ai-agent

12 posts

aws

AWS Weekly Roundup: Kiro CLI latest features, AWS European Sovereign Cloud, EC2 X8i instances, and more (January 19, 2026) | AWS News Blog

The January 19, 2026, AWS Weekly Roundup highlights significant advancements in sovereign cloud infrastructure and the general availability of high-performance, memory-optimized compute instances. The update also emphasizes the maturing ecosystem of AI agents, focusing on enhanced developer tooling and streamlined deployment workflows for agentic applications. These releases collectively aim to satisfy stringent regulatory requirements in Europe while pushing the boundaries of enterprise performance and automated productivity.

## Developer Tooling and Kiro CLI Enhancements

* New granular controls for web fetch URLs allow developers to use allowlists and blocklists to strictly govern which external resources an agent can access.
* The update introduces custom keyboard shortcuts to facilitate seamless switching between multiple specialized agents within a single session.
* Enhanced diff views provide clearer visibility into changes, improving the debugging and auditing process for automated workflows.

## AWS European Sovereign Cloud General Availability

* Following its initial 2023 announcement, this independent cloud infrastructure is now generally available to all customers.
* The environment is purpose-built to meet the most rigorous sovereignty and data residency requirements for European organizations.
* It offers a comprehensive set of AWS services within a framework that ensures operational independence and localized data handling.

## High-Performance Computing with EC2 X8i Instances

* The memory-optimized X8i instances, powered by custom Intel Xeon 6 processors, have moved from preview to general availability.
* These instances feature a sustained all-core turbo frequency of 3.9 GHz, which is currently exclusive to the AWS platform.
* The hardware is SAP certified and engineered to provide the highest memory bandwidth and performance for memory-intensive enterprise workloads compared to other Intel-based cloud offerings.

## Agentic AI and Productivity Updates

* Amazon Quick Suite continues to expand as a workplace "agentic teammate," designed to synthesize research and execute actions based on organizational insights.
* New technical guidance has been released regarding the deployment of AI agents on Amazon Bedrock AgentCore.
* GitHub Actions integration is now supported to automate the deployment and lifecycle management of these AI agents, bridging the gap between traditional DevOps and agentic AI development.

These updates signal a strategic shift toward highly specialized infrastructure, both in terms of regulatory compliance with the Sovereign Cloud and raw performance with the X8i instances. Organizations looking to scale their AI operations should prioritize the new deployment patterns for Bedrock AgentCore to ensure a robust CI/CD pipeline for their autonomous agents.

toss

Tax Refund Automation: AI

At Toss Income, QA Manager Suho Jung successfully automated complex E2E testing for diverse tax refund services by leveraging AI as specialized virtual team members. By shifting from manual coding to a "human-as-orchestrator" model, a single person achieved the productivity of a four-to-five-person automation team within just five months. This approach overcame the inherent brittleness of testing long, React-based flows that are subject to frequent policy changes and external system dependencies.

### Challenges in Tax Service Automation

The complexity of tax refund services presented unique hurdles that made traditional manual automation unsustainable:

* **Multi-Step Dependencies:** Each refund flow averages 15–20 steps involving internal systems, authentication providers, and HomeTax scraping servers, where a single timing glitch can fail the entire test.
* **Frequent UI and Policy Shifts:** Minor UI updates or new tax laws required total scenario reconfigurations, making hard-coded tests obsolete almost immediately.
* **Environmental Instability:** Issues such as "Target closed" errors during scraping, differing domain environments, and React-specific hydration delays caused constant test flakiness.

### Building an AI-Driven QA Team

Rather than using AI as a simple autocomplete tool, the project assigned specific "personas" to different AI models to handle distinct parts of the lifecycle:

* **SDET Agent (Claude Sonnet 4.5):** Acted as the lead developer, responsible for designing the Page Object Model (POM) architecture, writing test logic, and creating utility functions.
* **Documentation Specialist:** Automatically generated daily retrospectives and updated technical guides by analyzing daily git commits.
* **Git Master:** Managed commit history and PR descriptions to ensure high-quality documentation of the project’s evolution.
* **Pair Programmers (Cursor & Codex):** Handled real-time troubleshooting, type errors, and comparative analysis of different test scripts.

### Technical Solutions for React and Policy Logic

The team implemented several sophisticated technical strategies to ensure test stability:

* **React Interaction Readiness:** To solve "Element is not clickable" errors, they developed a strategy that waits not just for visibility, but for event handlers to bind to the DOM (Hydration).
* **Safe Interaction Fallbacks:** A standard `click` utility was created that attempts a Playwright click, then a native keyboard 'Enter' press, and finally a JS dispatch to ensure interactions succeed even during UI transitions (a minimal sketch follows this entry).
* **Dynamic Consent Flow Utility:** A specialized system was built to automatically detect and handle varying "Terms of Service" agreements across different sub-services (Tax Secretary, Hidden Refund, etc.) through a single unified function.
* **Test Isolation:** Automated scripts were used to prevent `userNo` (test ID) collisions, ensuring 35+ complex scenarios could run in parallel without data interference.

### Integrated Feedback and Reporting

The automation was integrated directly into internal communication channels to create a tight feedback loop:

* **Messenger Notifications:** Every test run sends a report including execution time, test IDs, and environment data to the team's messenger.
* **Automated Failure Analysis:** When a test fails, the AI automatically posts the error log, the specific failed step, a tracking EventID, and a screenshot as a thread reply for immediate debugging.
* **Human-AI Collaboration:** This structure shifted the QA's role from writing code to discussing failures and policy changes within the messenger threads.

The success of this 5-month experiment suggests that for high-complexity environments, the future of QA lies in "AI Orchestration." Instead of focusing on writing selectors, QA engineers should focus on defining problems and managing the AI agents that build the architecture.
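The safe-click fallback described above is easy to picture in code. Below is a minimal sketch using Playwright's Python API (the original project is a React/Playwright setup and is not shown in the post); the helper name and timeouts are illustrative assumptions, and the separate hydration-readiness wait is omitted.

```python
# Minimal sketch of a "safe click" fallback chain: normal click, then keyboard
# Enter, then a JS-dispatched click. Names and timeouts are illustrative only.
from playwright.sync_api import Page, TimeoutError as PlaywrightTimeoutError


def safe_click(page: Page, selector: str, timeout_ms: int = 5000) -> None:
    """Try a standard Playwright click, then a keyboard fallback, then a JS dispatch."""
    locator = page.locator(selector)
    # Wait until the element is at least visible before any interaction attempt.
    locator.wait_for(state="visible", timeout=timeout_ms)
    try:
        locator.click(timeout=timeout_ms)      # 1) standard Playwright click
        return
    except PlaywrightTimeoutError:
        pass
    try:
        locator.focus()
        page.keyboard.press("Enter")           # 2) native keyboard 'Enter' fallback
        return
    except PlaywrightTimeoutError:
        pass
    # 3) last resort: dispatch a click event directly, bypassing actionability checks
    locator.dispatch_event("click")
```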

aws

AWS Weekly Roundup: AWS re:Invent keynote recap, on-demand videos, and more (December 8, 2025) | AWS News Blog

The December 8, 2025, AWS Weekly Roundup recaps the major themes from AWS re:Invent, signaling a significant industry transition from AI assistants to autonomous AI agents. While technical innovation in infrastructure remains a priority, the event underscored that developers remain at the heart of the AWS mission, empowered by new tools to automate complex tasks using natural language. This shift represents a "renaissance" in cloud computing, where purpose-built infrastructure is now designed to support the non-deterministic nature of agentic workloads.

## Community Recognition and the Now Go Build Award

* Raphael Francis Quisumbing (Rafi) from the Philippines was honored with the Now Go Build Award, presented by Werner Vogels.
* A veteran of the ecosystem, Quisumbing has served as an AWS Hero since 2015 and has co-led the AWS User Group Philippines for over a decade.
* The recognition emphasizes AWS's continued focus on community dedication and the role of individual builders in empowering regional developer ecosystems.

## The Evolution from AI Assistants to Agents

* AWS CEO Matt Garman identified AI agents as the next major inflection point for the industry, moving beyond simple chat interfaces to systems that perform tasks and automate workflows.
* Dr. Swami Sivasubramanian highlighted a paradigm shift where natural language serves as the primary interface for describing complex goals.
* These agents are designed to autonomously generate plans, write necessary code, and call various tools to execute complete solutions without constant human intervention.
* AWS is prioritizing the development of production-ready infrastructure that is secure and scalable specifically to handle the "non-deterministic" behavior of these AI agents.

## Core Infrastructure and the Developer Renaissance

* Despite the focus on AI, AWS reaffirmed that its core mission remains the "freedom to invent," keeping developers central to its 20-year strategy.
* Leaders Peter DeSantis and Dave Brown reinforced that the foundational attributes of security, availability, and performance remain the non-negotiable pillars of the AWS cloud.
* The integration of AI agents is framed as a way to finally realize material business returns on AI investments by moving from experimental use cases to automated business logic.

To maximize the value of these updates, organizations should begin evaluating how to transition from simple LLM implementations to agentic frameworks that can execute end-to-end business processes. Reviewing the on-demand keynote sessions from re:Invent 2025 is recommended for technical teams looking to implement the latest secure, agent-ready infrastructure.

woowahan

Test Automation with AI:

This blog post explores how a development team at Woowahan Tech successfully automated the creation of 100 unit tests in just 30 minutes by combining a custom IntelliJ plugin with Amazon Q. The author argues that while full AI automation often fails in complex multi-module environments, a hybrid approach using "compile-guaranteed templates" ensures high success rates and maintains operational stability. This strategy allows developers to bypass repetitive setup tasks while leveraging AI for logic implementation within a strictly defined, valid structure.

### Evaluating AI Assistants for Testing

* The team compared various AI tools including GitHub Copilot, Cursor, and Amazon Q to determine which best fit their existing IntelliJ-based workflow.
* Amazon Q was selected for its superior understanding of the entire project context and its ability to integrate seamlessly as a plugin without requiring a switch to a new IDE.
* Initial manual use of AI assistants highlighted repetitive patterns: developers had to constantly specify team conventions (Kotest FunSpec, MockK) and manually fix build errors in 15% of the generated code.
* On average, it took 10 minutes per class to generate and refine tests manually, prompting the team to seek a more automated solution via a custom plugin.

### The Pitfalls of Full Automation

* The first version of the custom plugin attempted to generate complete test files by gathering class metadata through PSI (Program Structure Interface) and sending it to the Gemini API.
* Pilot tests revealed a 90% compilation failure rate, as the AI frequently generated incorrect imports, hallucinated non-existent fields, or used mismatched data types.
* A critical issue was the "loss of existing tests," where the AI-generated output would completely overwrite previous work rather than appending to it.
* In complex multi-module projects, the AI struggled to identify the correct classes when multiple modules contained identical class names, leading to significant manual correction time.

### Shifting to Compile-Guaranteed Templates

* To overcome the limitations of full automation, the team pivoted to a "template first" approach where the plugin generates a valid, compilable shell for the test (a rough sketch of the idea follows this entry).
* The plugin handles the complex infrastructure of the test file, including correct imports, MockK setups, and empty test stubs for every method in the target class.
* This approach reduces the AI's "hallucination surface" by providing it with a predefined structure, allowing tools like Amazon Q to focus solely on filling in the implementation details.
* By automating the 1-minute setup and letting the AI handle the 2-minute implementation phase, the team achieved a 97% success rate across 100 test cases.

### Practical Conclusion

For teams looking to improve test coverage in large-scale repositories, the most effective strategy is to use IDE plugins to automate context gathering and boilerplate generation. By providing the AI with a structurally sound template, developers can eliminate compilation errors and significantly reduce the time spent on manual refinement, ensuring that even complex edge cases are covered with minimal effort.
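As a rough illustration of the "template first" idea, the sketch below generates a compilable Kotest FunSpec shell from class metadata. The real plugin works inside IntelliJ via PSI and its code is not shown in the post; the metadata fields, naming scheme, and emitted skeleton here are assumptions, and Python is used only so the generator itself stays language-neutral.

```python
# Illustrative sketch of a "compile-guaranteed template" generator. The actual
# plugin is an IntelliJ/PSI plugin emitting Kotlin; all fields here are assumed.
from dataclasses import dataclass


@dataclass
class ClassMetadata:
    package: str
    class_name: str
    dependencies: list[str]   # constructor dependencies to mock with MockK
    methods: list[str]        # public methods that need empty test stubs


def render_test_template(meta: ClassMetadata) -> str:
    """Emit a compilable Kotest FunSpec shell; only the stub bodies are left for the AI."""
    mocks = "\n".join(
        f"    val {dep[0].lower() + dep[1:]} = mockk<{dep}>(relaxed = true)"
        for dep in meta.dependencies
    )
    stubs = "\n\n".join(
        f'    test("{method} behaves as expected") {{\n'
        f"        // TODO: implementation filled in by the AI assistant\n"
        f"    }}"
        for method in meta.methods
    )
    return f"""package {meta.package}

import io.kotest.core.spec.style.FunSpec
import io.mockk.mockk

class {meta.class_name}Test : FunSpec({{
{mocks}

{stubs}
}})
"""
```

A call such as `render_test_template(ClassMetadata("com.example.pay", "CouponService", ["CouponRepository"], ["issueCoupon"]))` would then be handed to the assistant with a prompt to fill only the TODO bodies; the package and class names in that call are, again, made up for illustration.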

aws

Amazon Bedrock AgentCore adds quality evaluations and policy controls for deploying trusted AI agents | AWS News Blog

AWS has introduced several new capabilities to Amazon Bedrock AgentCore designed to remove the trust and quality barriers that often prevent AI agents from moving into production environments. These updates, which include granular policy controls and sophisticated evaluation tools, allow developers to implement strict operational boundaries and monitor real-world performance at scale. By balancing agent autonomy with centralized verification, AgentCore provides a secure framework for deploying highly capable agents across enterprise workflows.

**Governance through Policy in AgentCore**

* This feature establishes clear boundaries for agent actions by intercepting tool calls via the AgentCore Gateway before they are executed (the general pattern is sketched after this entry).
* By operating outside of the agent’s internal reasoning loop, the policy layer acts as an independent verification system that treats the agent as an autonomous actor requiring permission.
* Developers can define fine-grained permissions to ensure agents do not access sensitive data inappropriately or take unauthorized actions within external systems.

**Quality Monitoring with AgentCore Evaluations**

* The new evaluation framework allows teams to monitor the quality of AI agents based on actual behavior rather than theoretical simulations.
* Built-in evaluators provide standardized metrics for critical dimensions such as helpfulness and correctness.
* Organizations can also implement custom evaluators to ensure agents meet specific business-logic requirements and industry-specific compliance standards.

**Enhanced Memory and Communication Features**

* New episodic functionality in AgentCore Memory introduces a long-term strategy that allows agents to learn from past experiences and apply successful solutions to similar future tasks.
* Bidirectional streaming in the AgentCore Runtime supports the deployment of advanced voice agents capable of handling natural, simultaneous conversation flows.
* These enhancements focus on improving consistency and user experience, enabling agents to handle complex, multi-turn interactions with higher reliability.

**Real-World Application and Performance**

* The AgentCore SDK has seen rapid adoption with over 2 million downloads, supporting diverse use cases from content generation at the PGA TOUR to financial data analysis at Workday.
* Case studies highlight significant operational gains, such as a 1,000 percent increase in content writing speed and a 50 percent reduction in problem resolution time through improved observability.
* The platform emphasizes 100 percent traceability of agent decisions, which is critical for organizations transitioning from reactive to proactive AI-driven operations.

To successfully scale AI agents, organizations should transition from simple prompt engineering to a robust agentic architecture. Leveraging these new policy and evaluation tools will allow development teams to maintain the necessary control and visibility required for customer-facing and mission-critical deployments.
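The announcement does not show the Policy API itself, so the sketch below illustrates the general pattern in framework-agnostic Python: a gateway sits between the agent and its tools and verifies every call against a declarative policy before executing it. Class names, fields, and the example tool are assumptions, not the AgentCore SDK.

```python
# Framework-agnostic sketch of gateway-level policy enforcement for agent tool
# calls. Not the AgentCore SDK; all names here are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Any, Callable


class PolicyViolation(Exception):
    """Raised when a tool call falls outside the agent's permitted boundaries."""


@dataclass
class ToolPolicy:
    allowed_tools: set[str]                                               # tools the agent may call
    blocked_arguments: dict[str, set[str]] = field(default_factory=dict)  # per-tool banned arguments

    def check(self, tool_name: str, arguments: dict[str, Any]) -> None:
        if tool_name not in self.allowed_tools:
            raise PolicyViolation(f"tool '{tool_name}' is not permitted for this agent")
        banned = self.blocked_arguments.get(tool_name, set()) & arguments.keys()
        if banned:
            raise PolicyViolation(f"arguments {sorted(banned)} are blocked for '{tool_name}'")


class PolicyGateway:
    """Sits between the agent and its tools, outside the agent's reasoning loop."""

    def __init__(self, policy: ToolPolicy, tools: dict[str, Callable[..., Any]]) -> None:
        self.policy = policy
        self.tools = tools

    def invoke(self, tool_name: str, **arguments: Any) -> Any:
        self.policy.check(tool_name, arguments)    # independent verification before execution
        return self.tools[tool_name](**arguments)


# Example: the agent may search orders but may never request PII fields.
gateway = PolicyGateway(
    policy=ToolPolicy(
        allowed_tools={"search_orders"},
        blocked_arguments={"search_orders": {"include_pii"}},
    ),
    tools={"search_orders": lambda **kwargs: f"orders matching {kwargs}"},
)
print(gateway.invoke("search_orders", customer_id="123"))    # allowed
# gateway.invoke("search_orders", include_pii=True)          # would raise PolicyViolation
```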

naver

[DAN25]

Naver recently released the full video archives from its DAN25 conference, highlighting the company’s strategic roadmap for AI agents, Sovereign AI, and digital transformation. The sessions showcase how Naver is moving beyond general AI applications to implement specialized, real-time systems that integrate large language models (LLMs) directly into core services like search, commerce, and content. By open-sourcing these technical insights, Naver demonstrates its progress in building a cohesive AI ecosystem capable of handling massive scale and complex user intent.

### Naver PersonA and LLM-Based User Memory

* The "PersonA" project focuses on building a "user memory" by treating fragmented logs across various Naver services as indirect conversations with the user.
* By leveraging LLM reasoning, the system transitions from simple data tracking to a sophisticated AI agent that offers context-aware, real-time suggestions.
* Technical hurdles addressed include the stable implementation of real-time log reflection for a massive user base and the selection of optimal LLM architectures for personalized inference.

### Trend Analysis and Search-Optimized Models

* The Place Trend Analysis system utilizes ranking algorithms to distinguish between temporary surges and sustained popularity, providing a balanced view of "hot places."
* LLMs and text mining are employed to move beyond raw data, extracting specific keywords that explain the underlying reasons for a location's trending status.
* To improve search quality, Naver developed search-specific LLMs that outperform general models by using specialized data "recipes" and integrating traditional information retrieval with features like "AI briefing" and "AuthGR" for higher reliability.

### Unified Recommendation and Real-Time CRM

* Naver Webtoon and Series replaced fragmented recommendation and CRM (Customer Relationship Management) models with a single, unified framework to ensure data consistency.
* The architecture shifted from batch-based processing to a real-time, API-based serving system to reduce management complexity and improve the immediacy of personalized user experiences.
* This transition focuses on maintaining a seamless UX by synchronizing different ML models under a unified serving logic.

### Scalable Log Pipelines and Infrastructure Stability

* The "Logiss" pipeline manages up to tens of billions of logs daily, utilizing a Storm and Kafka environment to ensure high availability and performance.
* Engineers implemented a multi-topology approach to allow for seamless, non-disruptive deployments even under heavy loads.
* Intelligent features such as "peak-shaving" (distributing peak traffic to off-peak hours), priority-based processing during failures, and efficient data sampling help balance cost, performance, and stability (the peak-shaving idea is sketched after this entry).

These sessions provide a practical blueprint for organizations aiming to scale LLM-driven services while maintaining infrastructure integrity. For developers and system architects, Naver’s transition toward unified ML frameworks and specialized, real-time data pipelines offers a proven model for moving AI from experimental phases into high-traffic production environments.
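Peak-shaving, as described for Logiss, can be pictured with a small single-process sketch: during peak load only high-priority logs are processed immediately, and the rest are buffered for off-peak replay. The real pipeline runs on Storm and Kafka at far larger scale; the thresholds, priority scheme, and names below are assumptions for illustration.

```python
# Simplified, single-process illustration of "peak-shaving" with priority-based
# handling. Thresholds, priorities, and field names are assumed, not Naver's code.
from collections import deque
from dataclasses import dataclass


@dataclass
class LogEvent:
    priority: int      # 0 = critical, larger = less urgent
    payload: str


class PeakShavingBuffer:
    def __init__(self, peak_threshold_per_sec: int) -> None:
        self.peak_threshold = peak_threshold_per_sec
        self.deferred: deque[LogEvent] = deque()

    def handle(self, event: LogEvent, current_rate_per_sec: int) -> None:
        in_peak = current_rate_per_sec > self.peak_threshold
        if in_peak and event.priority > 0:
            self.deferred.append(event)      # shave: push low-priority work to off-peak
        else:
            self.process_now(event)          # critical events always processed immediately

    def drain_off_peak(self, budget: int) -> None:
        """Replay up to `budget` deferred events once traffic drops below the peak threshold."""
        for _ in range(min(budget, len(self.deferred))):
            self.process_now(self.deferred.popleft())

    def process_now(self, event: LogEvent) -> None:
        print(f"processed priority={event.priority}: {event.payload}")
```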

kakao

[AI_TOP_1

The AI TOP 100 contest was designed to shift the focus from evaluating AI model performance to measuring human proficiency in solving real-world problems through AI collaboration. By prioritizing the "problem-solving process" over mere final output, the organizers sought to identify individuals who can define clear goals and navigate the technical limitations of current AI tools. The conclusion of this initiative suggests that true AI literacy is defined by the ability to maintain a "human-in-the-loop" workflow where human intuition guides AI execution and verification.

### Core Philosophy of Human-AI Collaboration

* **Human-in-the-Loop:** The contest emphasizes a cycle of human analysis, AI problem-solving, and human verification. This ensures that the human remains the "pilot" who directs the AI engine and takes responsibility for the quality of the result.
* **Strategic Intervention:** Participants were encouraged to provide AI with structural context it might struggle to perceive (like complex table relationships) and to perform data pre-processing to improve AI accuracy.
* **Task Delegation:** For complex iterative tasks, such as generating images for a montage, solvers were expected to build automated pipelines using AI agents to handle repetitive feedback loops while focusing human effort on higher-level strategy.

### Designing Against "One-Shot" Solutions

* **Low Barrier, High Ceiling:** Problems were designed to be intuitive enough for anyone to understand but complex enough to prevent "one-shot" solutions (the "click-and-solve" trap).
* **Targeting Technical Weaknesses:** Organizers intentionally embedded technical hurdles that current LLMs struggle with, forcing participants to demonstrate how they bridge the gap between AI limitations and a correct answer.
* **The Difficulty Ladder:** To account for varying domain expertise (e.g., OCR experience), problems utilized a multi-part structure. This included "Easy" starting questions to build momentum and "Medium" hint questions that guided participants toward solving the more difficult "Killer" components.

### The 4-Pattern Problem Framework

* **P1 - Insight (Analysis & Definition):** Identifying meaningful opportunities or problems within complex, unstructured data.
* **P2 - Action (Implementation & Automation):** Developing functional code or workflows to execute a defined solution.
* **P3 - Persuasion (Strategy & Creativity):** Generating logical and creative content to communicate technical solutions to non-technical stakeholders.
* **P4 - Decision (Optimization):** Making optimal choices and simulations to maximize goals under specific constraints.

### Quality Assurance and Score Calibration

* **4-Stage Pipeline:** Problems moved from Ideation to Drafting (testing for one-shot immunity), then to Candidate (analyzing abuse vulnerabilities), and finally to a Final selection based on difficulty balance.
* **Cross-Model Validation:** Internal and alpha testers solved problems using various models including Claude, GPT, and Gemini to ensure that no single tool could bypass the intended human-led process.
* **Effort-Based Scoring:** Instead of uniform points, scores were calibrated based on the "effort cost" and human competency required to solve them. This resulted in varying total points per problem to better reflect the true difficulty of the task.

In the era of rapidly evolving AI, the ability to "use" a tool is becoming less valuable than the ability to "collaborate" with it. This shift requires a move toward building automated pipelines and utilizing a "difficulty ladder" approach to tackle complex, multi-stage problems that AI cannot yet solve in a single iteration.

google

Towards better health conversations: Research insights on a “wayfinding” AI agent based on Gemini

Google Research has developed "Wayfinding AI," a research prototype based on Gemini designed to transform health information seeking from a passive query-response model into a proactive, context-seeking dialogue. By prioritizing clarifying questions and iterative guidance, the agent addresses the common struggle users face when attempting to articulate complex or ambiguous medical concerns. User studies indicate that this proactive approach results in health information that participants find significantly more helpful, relevant, and tailored to their specific needs than traditional AI responses.

### Challenges in Digital Health Navigation

* Formative research involving 33 participants highlighted that users often struggle to articulate health concerns because they lack the clinical background to know which details are medically relevant.
* The study found that users typically "throw words" at a search engine and sift through generic, impersonal results that do not account for their unique context.
* Initial UX testing revealed a strong user preference for a "deferred-answer" approach, where the AI mimics a medical professional by asking clarifying questions before jumping to a conclusion.

### Core Design Principles of Wayfinding AI

* **Proactive Conversational Guidance:** At every turn, the agent asks up to three targeted questions to reduce ambiguity and help users systematically share their "health story."
* **Best-Effort Answers:** To ensure immediate utility, the AI provides the best possible information based on the data available at that moment, while noting that the answer will improve as the user provides more context.
* **Transparent Reasoning:** The system explicitly explains how the user’s most recent answers have helped refine the previous response, making the AI’s internal logic understandable (this per-turn structure is sketched after this entry).

### Split-Stream User Interface

* To prevent clarifying questions from being buried in long paragraphs, the prototype uses a two-column layout.
* The left column is dedicated to the interactive chat and specific follow-up questions to keep the user focused on the dialogue.
* The right column displays the "best information so far" and detailed explanations, allowing users to dive into the technical content only when they feel enough context has been established.

### Comparative Evaluation and Performance

* A randomized study with 130 participants compared the Wayfinding AI against a baseline Gemini 2.5 Flash model.
* Participants interacted with both models for at least three minutes regarding a personal health question and rated them across six dimensions: helpfulness, question relevance, tailoring, goal understanding, ease of use, and efficiency.
* The proactive agent outperformed the baseline significantly, with participants reporting that the context-seeking behavior felt more professional and increased their confidence in the AI's suggestions.

The research suggests that for sensitive and complex topics like health, AI should move beyond being a passive knowledge base. By adopting a "wayfinding" strategy that guides users through their own information needs, AI agents can provide more personalized and empowering experiences that better mirror expert human consultation.
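One way to make the per-turn design principles concrete is to imagine the agent's output as a fixed structure: a best-effort answer, at most three clarifying questions, and a short note on how the latest reply refined the response. The sketch below is an assumed schema for illustration only, not Google's implementation.

```python
# Assumed per-turn response structure for a "wayfinding" agent; purely illustrative.
from dataclasses import dataclass, field


@dataclass
class WayfindingTurn:
    best_effort_answer: str                                          # usable answer given current context
    clarifying_questions: list[str] = field(default_factory=list)    # at most three targeted questions
    reasoning_update: str = ""                                       # how the last reply refined the answer

    def __post_init__(self) -> None:
        if len(self.clarifying_questions) > 3:
            raise ValueError("the agent asks at most three targeted questions per turn")
```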

google

Deep researcher with test-time diffusion

Google Cloud researchers have introduced Test-Time Diffusion Deep Researcher (TTD-DR), a framework that treats long-form research report writing as an iterative diffusion process. By mimicking human research patterns, the system regards initial drafts as "noisy" versions that are gradually polished through retrieval-augmented denoising and self-evolutionary algorithms. This approach achieves state-of-the-art results in generating comprehensive academic-style reports and solving complex multi-hop reasoning tasks.

### The Backbone DR Architecture

The system operates through a three-stage pipeline designed to transition from a broad query to a detailed final document:

* **Research Plan Generation:** Upon receiving a query, the agent produces a structured outline of key areas to guide the subsequent information-gathering process.
* **Iterative Search Agents:** Two sub-agents work in tandem; one formulates specific search questions based on the plan, while the other performs Retrieval-Augmented Generation (RAG) to synthesize precise answers from available sources.
* **Final Report Synthesis:** The agent combines the initial research plan with the accumulated question-answer pairs to produce a coherent, evidence-based final report.

### Component-wise Self-Evolution

To ensure high-quality inputs at every stage, the framework employs a self-evolutionary algorithm that optimizes the performance of individual agents:

* **Diverse Variant Generation:** The system explores multiple diverse answer variants to cover a larger search space and identify the most valuable information.
* **Environmental Feedback:** An "LLM-as-a-judge" assesses these variants using auto-raters for metrics like helpfulness and comprehensiveness, providing specific textual feedback for improvement.
* **Revision and Cross-over:** Variants undergo iterative revisions based on feedback before being merged into a single, high-quality output that consolidates the best information from all evolutionary paths.

### Report-level Refinement via Diffusion

The core innovation of TTD-DR is modeling the writing process as a denoising diffusion mechanism:

* **Messy-to-Polished Transformation:** The framework treats the initial rough draft as a noisy input that requires cleaning through factual verification.
* **Denoising with Retrieval:** The agent identifies missing information or weak arguments in the draft and uses search tools as a "denoising step" to inject new facts and strengthen the content.
* **Continuous Improvement Loop:** This process repeats in cycles, where each iteration uses newly retrieved information to refine the draft into a more accurate and high-quality final version (the loop is sketched after this entry).

TTD-DR demonstrates that shifting AI development from linear generation to iterative, diffusion-based refinement significantly improves the depth and rigor of long-form content. This methodology serves as a powerful blueprint for building autonomous agents capable of handling complex, multi-step knowledge tasks.
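The report-level refinement loop can be summarized in a few lines: repeatedly find gaps in the draft, retrieve evidence for them, and revise. In the sketch below the three callbacks stand in for the LLM and search components; their names and signatures are assumptions, not the paper's interfaces.

```python
# High-level sketch of the test-time "diffusion" loop: the draft is treated as
# noisy and denoised with retrieved evidence each step. Helpers are assumed.
from typing import Callable


def ttd_dr_refine(
    query: str,
    initial_draft: str,
    find_gaps: Callable[[str, str], list[str]],      # (query, draft) -> open questions
    retrieve: Callable[[str], str],                  # search question -> synthesized answer
    revise: Callable[[str, str, list[str]], str],    # (query, draft, evidence) -> new draft
    num_steps: int = 5,
) -> str:
    """Iteratively 'denoise' the draft by injecting newly retrieved facts."""
    draft = initial_draft
    for _ in range(num_steps):
        gaps = find_gaps(query, draft)               # missing facts or weak arguments
        if not gaps:
            break                                    # nothing left to denoise
        evidence = [retrieve(question) for question in gaps]
        draft = revise(query, draft, evidence)       # strengthen the draft with evidence
    return draft
```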

google

MLE-STAR: A state-of-the-art machine learning engineering agent

MLE-STAR is a state-of-the-art machine learning engineering agent designed to automate complex ML tasks by treating them as iterative code optimization challenges. Unlike previous agents that rely solely on an LLM’s internal knowledge, MLE-STAR integrates external web searches and targeted ablation studies to pinpoint and refine specific pipeline components. This approach allows the agent to achieve high-performance results, evidenced by its ability to win medals in 63% of Kaggle competitions within the MLE-Bench-Lite benchmark.

## External Knowledge and Targeted Ablation

The core of MLE-STAR’s effectiveness lies in its ability to move beyond generic machine learning libraries by incorporating external research and specific performance testing.

* The agent uses web search to retrieve task-specific, state-of-the-art models and approaches rather than defaulting to familiar libraries like scikit-learn.
* Instead of modifying an entire script at once, the system conducts an ablation study to evaluate the impact of individual pipeline components, such as feature engineering or model selection.
* By identifying which code blocks have the most significant impact on performance, the agent can focus its reasoning and optimization efforts where they are most needed (this ablation step is sketched after this entry).

## Iterative Refinement and Intelligent Ensembling

Once the critical components are identified, MLE-STAR employs a specialized refinement process to maximize the effectiveness of the generated solution.

* Targeted code blocks undergo iterative refinement based on LLM-suggested plans that incorporate feedback from prior experimental failures and successes.
* The agent features a unique ensembling strategy where it proposes multiple candidate solutions and then designs its own method to merge them.
* Rather than using simple validation-score voting, the agent iteratively improves the ensemble strategy itself, treating the combination of models as a distinct optimization task.

## Robustness and Safety Verification

To ensure the generated code is both functional and reliable for real-world deployment, MLE-STAR incorporates three specialized diagnostic modules.

* **Debugging Agent:** Automatically analyzes tracebacks and execution errors in Python scripts to provide iterative corrections.
* **Data Leakage Checker:** Reviews the solution script prior to execution to ensure the model does not improperly access test dataset information during the training phase.
* **Data Usage Checker:** Analyzes whether the script is utilizing all available data sources, preventing the agent from overlooking complex data formats in favor of simpler files like CSVs.

By combining external grounding with a granular, component-based optimization strategy, MLE-STAR represents a significant shift in automated machine learning. For organizations looking to scale their ML workflows, such an agent suggests a future where the role of the engineer shifts from manual coding to high-level supervision of autonomous agents that can navigate the vast landscape of research and data engineering.
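The ablation step can be sketched as a simple search for the component whose removal hurts the validation score the most, which then becomes the target of focused refinement. The component names and scoring callback below are assumptions for illustration, not the paper's code.

```python
# Sketch of the ablation idea: disable one pipeline component at a time, measure
# the drop in validation score, and refine the most impactful one. Names assumed.
from typing import Callable


def most_impactful_component(
    components: list[str],                           # e.g. ["feature_engineering", "model_selection"]
    evaluate: Callable[[set[str]], float],           # enabled components -> validation score
) -> str:
    """Return the component whose removal hurts the validation score the most."""
    full = set(components)
    baseline = evaluate(full)
    impact = {
        name: baseline - evaluate(full - {name})     # score drop when this block is ablated
        for name in components
    }
    return max(impact, key=impact.get)               # focus refinement on this block
```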

line

LY's Tech Conference, 'Tech

LY Corporation’s Tech-Verse 2025 conference highlighted the company's strategic pivot toward becoming an AI-centric organization through the "Catalyst One Platform" initiative. By integrating the disparate infrastructures of LINE and Yahoo! JAPAN into a unified private cloud, the company aims to achieve massive cost efficiencies while accelerating the deployment of AI agents across its entire service ecosystem. This transformation focuses on empowering engineers with AI-driven development tools to foster rapid innovation and deliver a seamless, "WOW" experience for global users.

### Infrastructure Integration and the Catalyst One Platform

To address the redundancies following the merger of LINE and Yahoo! JAPAN, LY Corporation is consolidating its technical foundations into a single internal ecosystem known as the Catalyst One Platform.

* **Private Cloud Advantage:** The company maintains its own private cloud to achieve a four-fold cost reduction compared to public cloud alternatives, managed by a lean team of 700 people supporting 500,000 servers.
* **Unified Architecture:** The integration spans several layers, including Infrastructure (Project "DC-Hub"), Cloud (Project "Flava"), and specialized Data and AI platforms.
* **Next-Generation Cloud "Flava":** This platform integrates existing services to enhance VM specifications, VPC networking, and high-performance object storage (Ceph and Dragon).
* **Information Security:** A dedicated "SafeOps" framework is being implemented to provide governance and security across all integrated services, ensuring a safer environment for user data.

### AI Strategy and Service Agentization

A core pillar of LY’s strategy is the "AI Agentization" of all its services, moving beyond simple features to proactive, personalized assistance.

* **Scaling GenAI:** Generative AI has already been integrated into 44 different services within the group.
* **Personalized Agents:** The company is developing the capacity to generate millions of specialized agents that can be linked together to support the unique needs of individual users.
* **Agent Ecosystem:** The goal is to move from a standard platform model to one where every user interaction is mediated by an intelligent agent.

### AI-Driven Development Transformation

Beyond user-facing services, LY is fundamentally changing how its engineers work by deploying internal AI development solutions to all staff starting in July.

* **Code and Test Automation:** Proof of Concept (PoC) results showed a 96% accuracy rate for "Code Assist" and a 97% reduction in time for "Auto Test" procedures.
* **RAG Integration:** The system utilizes Retrieval-Augmented Generation (RAG) to leverage internal company knowledge and guidelines, ensuring high-quality, context-aware development support.
* **Efficiency Gains:** By automating repetitive tasks, the company intends for engineers to shift their focus from maintenance to creative service improvement and innovation.

The successful integration of these platforms and the aggressive adoption of AI-driven development tools suggest that LY Corporation is positioning itself to be a leader in the "AI-agent" era. For technical organizations, LY's model serves as a case study in how large-scale mergers can leverage private cloud infrastructure to fund and accelerate a company-wide AI transition.

google

AMIE gains vision: A research AI agent for multimodal diagnostic dialogue

Google Research and DeepMind have introduced multimodal AMIE, an advanced research AI agent designed to conduct diagnostic medical dialogues that integrate text, images, and clinical documents. By building on Gemini 2.0 Flash and a novel state-aware reasoning framework, the system can intelligently request and interpret visual data such as skin photos or ECGs to refine its diagnostic hypotheses. This evolution moves AI diagnostic tools closer to real-world clinical practice, where visual evidence is often essential for accurate patient assessment and management.

### Enhancing AMIE with Multimodal Perception

To move beyond text-only limitations, researchers integrated vision capabilities that allow the agent to process complex medical information during a conversation.

* The system uses Gemini 2.0 Flash as its core component to interpret diverse data types, including dermatology images and laboratory reports.
* By incorporating multimodal perception, the agent can resolve diagnostic ambiguities that cannot be addressed through verbal descriptions alone.
* Preliminary testing with Gemini 2.5 Flash suggests that further scaling the underlying model continues to improve the agent's reasoning and diagnostic accuracy.

### Emulating Clinical Workflows via State-Aware Reasoning

A key technical contribution is the state-aware phase transition framework, which helps the AI mimic the structured yet flexible approach used by experienced clinicians.

* The framework orchestrates the conversation through three distinct phases: History Taking, Diagnosis & Management, and Follow-up.
* The agent maintains a dynamic internal state that tracks known information about the patient and identifies specific "knowledge gaps."
* When the system detects uncertainty, it strategically requests multimodal artifacts, such as a photo of a rash or an image of a lab result, to update its differential diagnosis.
* Transitions between conversation phases are only triggered once the system assesses that the objectives of the current phase have been sufficiently met (a small state-machine sketch follows this entry).

### Evaluation through Simulated OSCEs

To validate the agent’s performance, the researchers developed a robust simulation environment to facilitate rapid iteration and standardized testing.

* The system was tested using patient scenarios grounded in real-world datasets, including the SCIN dataset for dermatology and PTB-XL for ECG measurements.
* Evaluation was conducted using a modified version of Objective Structured Clinical Examinations (OSCEs), the global standard for assessing medical students and professionals.
* In comparative studies, AMIE's performance was measured against primary care physicians (PCPs) to ensure its behavior, accuracy, and tone aligned with clinical standards.

This research demonstrates that multimodal AI agents can effectively navigate the complexities of a medical consultation by combining linguistic empathy with the technical ability to interpret visual clinical evidence. As these systems continue to evolve, they offer a promising path toward high-quality, accessible diagnostic assistance that mirrors the multimodal nature of human medicine.
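The state-aware phase transition framework can be pictured as a small state machine: the agent tracks what it knows and what it still needs, requests a multimodal artifact when a gap can only be closed visually, and advances phases only once the current phase's objectives are met. The phase names come from the post; everything else in the sketch below is an assumption.

```python
# Minimal state-machine sketch of "state-aware phase transitions". The three phase
# names come from the post; the gap-tracking fields and logic are assumptions.
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    HISTORY_TAKING = "history_taking"
    DIAGNOSIS_AND_MANAGEMENT = "diagnosis_and_management"
    FOLLOW_UP = "follow_up"


@dataclass
class ConsultationState:
    phase: Phase = Phase.HISTORY_TAKING
    known_facts: dict[str, str] = field(default_factory=dict)
    knowledge_gaps: set[str] = field(default_factory=set)

    def request_artifact_if_needed(self) -> str | None:
        """When a gap can only be closed visually, ask for a multimodal artifact."""
        if "skin_appearance" in self.knowledge_gaps:
            return "Could you share a photo of the affected skin area?"
        if "ecg_trace" in self.knowledge_gaps:
            return "Could you upload an image of your ECG printout?"
        return None

    def maybe_advance(self) -> None:
        """Move to the next phase only once the current phase's objectives are met."""
        if self.phase is Phase.HISTORY_TAKING and not self.knowledge_gaps:
            self.phase = Phase.DIAGNOSIS_AND_MANAGEMENT
        elif self.phase is Phase.DIAGNOSIS_AND_MANAGEMENT and "plan_confirmed" in self.known_facts:
            self.phase = Phase.FOLLOW_UP
```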