naver

Recreating the User's

The development of NSona, an LLM-based multi-agent persona platform, addresses the persistent gap between user research and service implementation by transforming static data into real-time collaborative resources. By recreating user voices through a multi-party dialogue system, the project demonstrates how AI can serve as an active participant in the daily design and development process. Ultimately, the initiative highlights a fundamental shift in cross-functional collaboration, where traditional role boundaries dissolve in favor of a shared starting point centered on AI-driven user empathy.

## Bridging UX Research and Daily Collaboration

* The project was born from the realization that traditional UX research often remains isolated from the actual development cycle, leading to a loss of insight during implementation.
* NSona transforms static user research data into dynamic "persona bots" that can interact with project members in real-time.
* The platform aims to turn the user voice into a "live" resource, allowing designers and developers to consult the persona during the decision-making process.

## Agent-Centric Engineering and Multi-Party UX

* The system architecture is built on an agent-centric structure designed to handle the complexities of specific user behaviors and motivations.
* It utilizes a multi-party dialogue framework, enabling a collaborative environment where multiple AI agents and human stakeholders can converse simultaneously.
* Technical implementation focused on bridging the gap between qualitative UX requirements and LLM orchestration, ensuring the persona's responses remained grounded in actual research data.

## Service-Specific Evaluation and Quality Metrics

* The team moved beyond generic LLM benchmarks to establish a "service-specific" evaluation process tailored to the project's unique UX goals.
* Model quality was measured by how vividly and accurately the system recreated the intended persona, focusing on the degree of "immersion" it triggered in human users.
* Insights from these evaluations helped refine the prompt design and agent logic to ensure the AI's output provided genuine value to the product development lifecycle.

## Redefining Cross-Functional Collaboration

* The AI development process reshaped traditional roles and responsibilities (R&R): designers became prompt engineers, while researchers translated qualitative logic into agentic structures.
* Front-end developers evolved their roles to act as critical reviewers of the AI, treating the model as a subject of critique rather than a static asset.
* The workflow shifted from a linear "relay" model to a concentric one, where all team members influence the product's core from the same starting point.

To successfully integrate AI into the product lifecycle, organizations should move beyond using LLMs as simple tools and instead view them as a medium for interdisciplinary collaboration. By building multi-agent systems that reflect real user data, teams can ensure that the "user's voice" is not just a research summary, but a tangible participant in the development process.
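
The multi-party pattern described above can be sketched in a few lines. This toy loop is an illustration only, not NSona's implementation: the persona profiles, the `persona_reply` helper, and the canned answers are all invented, and a real system would generate each reply with an LLM conditioned on the underlying research data.

```python
# Hypothetical persona profiles distilled from user research; in the real
# system these would be far richer and would ground an LLM's responses.
personas = {
    "busy_commuter": {"pain_point": "checkout takes too many taps"},
    "power_saver": {"pain_point": "hard to compare prices"},
}

def persona_reply(name: str, question: str) -> str:
    # Placeholder for an LLM call conditioned on the persona's research data;
    # here we simply surface the grounded pain point.
    profile = personas[name]
    return f"[{name}] About '{question}': {profile['pain_point']}."

def multi_party_turn(question: str) -> list[str]:
    """One multi-party turn: every persona in the room answers the same prompt."""
    return [persona_reply(name, question) for name in personas]

for line in multi_party_turn("what frustrates you most?"):
    print(line)
```

The point of the shape is that a designer's question fans out to several research-grounded voices at once, rather than to a single chatbot.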

woowahan

How Woowa Brothers Detects Failures

Woowa Brothers addresses the inevitability of system failures by shifting from traditional resource-based monitoring to a specialized Service Anomaly Detection system. By focusing on high-level service metrics such as order volume and login counts rather than just CPU or memory usage, they can identify incidents that directly impact the user experience. This approach ensures near real-time detection and provides a structured response framework to minimize damage during peak service hours.

### The Shift to Service-Level Monitoring

* Traditional monitoring focuses on infrastructure metrics like CPU and memory, but it is impossible to monitor every system variable, leading to "blind spots" in failure detection.
* Service metrics, such as real-time login counts and payment success rates, are finite and offer a direct reflection of the actual customer experience.
* By monitoring these core indicators, the SRE team can detect anomalies that system-level alerts might overlook, ensuring that no failure goes unnoticed.

### Requirements for Effective Anomaly Detection

* **Real-time Performance:** Alerts must be triggered in near-real-time to allow for immediate intervention before the impact scales.
* **Explainability:** The system favors transparent logic over "black-box" AI models, allowing developers to quickly understand why an alert was triggered and how to improve the detection logic.
* **Integrated Response:** Beyond just detection, the system must provide a clear response process so that any engineer, regardless of experience, can follow a standardized path to resolution.

### Technical Implementation and Logic

* The system leverages the predictable, pattern-based nature of delivery service traffic, which typically peaks during lunch and dinner.
* The team chose a median-based approach to generate "prediction" values from historical data, as it is more robust against outliers and easier to analyze than methods like IQR or 2-sigma.
* Detection is determined by comparing "actual" values against "warning" and "critical" thresholds derived from the predicted median.
* To prevent false positives caused by temporary spikes, the system tracks "threshold reach counts," requiring a metric to stay in an abnormal state for a specific number of consecutive cycles before firing a Slack alert.

### Optimization of Alert Accuracy

* Each service metric requires a tailored "settling period" to find the optimal balance between detection speed and accuracy.
* Setting a high threshold reach count improves accuracy but slows down detection, while a low count accelerates detection at the risk of increased false positives.
* Alerts are delivered via Slack with comprehensive context, including current status and urgency, to facilitate rapid decision-making.

For organizations running high-traffic services, prioritizing service-level indicators (SLIs) over infrastructure metrics can significantly reduce the time to detect critical failures. Implementing simple, explainable statistical models like the median approach allows teams to maintain a reliable monitoring system that evolves alongside the service without the complexity of uninterpretable AI models.
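
The median-plus-reach-count logic described above is simple enough to sketch in full. This is a minimal illustration under assumed parameters: the 70% warning ratio, the five-day history window, and the reach count of 3 are invented for the example; the article does not publish Woowa Brothers' actual thresholds.

```python
import statistics

def predict(history: list[float]) -> float:
    """Median of past values for this time slot; robust to outlier days."""
    return statistics.median(history)

class Detector:
    def __init__(self, required_count: int = 3, warning_ratio: float = 0.7):
        self.required = required_count  # the "threshold reach count"
        self.ratio = warning_ratio      # actual below 70% of prediction -> abnormal
        self.streak = 0

    def observe(self, actual: float, prediction: float) -> bool:
        """Return True when an alert should fire for this cycle."""
        if actual < prediction * self.ratio:
            self.streak += 1
        else:
            self.streak = 0             # a temporary dip resets the counter
        return self.streak >= self.required

history = [1000, 980, 1020, 5000, 990]  # one outlier day barely moves the median
pred = predict(history)
print(pred)  # 1000

d = Detector()
alerts = [d.observe(a, pred) for a in [600, 650, 620, 640]]
print(alerts)  # [False, False, True, True] -> fires on the 3rd abnormal cycle
```

Note how the 5000-sample outlier leaves the median prediction untouched, which is exactly why the team preferred it over mean-based schemes.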

naver

Iceberg Low-Latency Queries with Materialized Views

This technical session from NAVER ENGINEERING DAY 2025 explores the architectural journey of building a low-latency query system for real-time transaction reports. The project focuses on resolving the tension between high data freshness, massive scalability, and rapid response times for complex, multi-dimensional filtering. By leveraging Apache Iceberg in conjunction with StarRocks’ materialized views, the team established a performant data pipeline that meets the demands of modern business intelligence.

### Challenges in Real-Time Transaction Reporting

* **Query Latency vs. Data Freshness:** Traditional architectures often struggle to provide immediate visibility into transaction data while maintaining sub-second query speeds across diverse filter conditions.
* **High-Dimensional Filtering:** Users require the ability to query reports based on numerous variables, necessitating an engine that can handle complex aggregations without pre-defining every possible index.
* **Scalability Requirements:** The system must handle increasing transaction volumes without degrading performance or requiring significant manual intervention in the underlying storage layer.

### Optimized Architecture with Iceberg and StarRocks

* **Apache Iceberg Integration:** Iceberg serves as the open table format, providing a reliable foundation for managing large-scale data snapshots and ensuring consistency during concurrent reads and writes.
* **StarRocks for Query Acceleration:** The team selected StarRocks as the primary OLAP engine to take advantage of its high-speed vectorized execution and native support for Iceberg tables.
* **Spark-Based Processing:** Apache Spark is utilized for the initial data ingestion and transformation phases, preparing the transaction data for efficient storage and downstream consumption.

### Enhancing Performance via Materialized Views

* **Pre-computed Aggregations:** By implementing materialized views, the system pre-calculates intensive transaction summaries, significantly reducing the computational load during active user queries.
* **Automatic Query Rewrite:** The architecture utilizes StarRocks' ability to automatically route queries to the most efficient materialized view, ensuring that even ad-hoc reports benefit from pre-computed results.
* **Balanced Refresh Strategies:** The research focused on optimizing the refresh intervals of these views to maintain high "freshness" while minimizing the overhead on the cluster resources.

The adoption of a modern lakehouse architecture combining Apache Iceberg with a high-performance OLAP engine like StarRocks is a recommended strategy for organizations dealing with high-volume, real-time reporting. This approach effectively decouples storage and compute while providing the low-latency response times necessary for interactive data analysis.
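
The automatic query-rewrite idea can be illustrated with a toy router. StarRocks performs this matching internally over SQL query plans; the dimension tuples and view names below (`mv_merchant_daily`, `iceberg.transactions`, etc.) are hypothetical and only show the "use a covering pre-aggregation when one exists, else fall back to the base table" decision.

```python
# Hypothetical registry: which pre-computed view covers which grouping.
views = {
    ("merchant", "day"): "mv_merchant_daily",
    ("merchant", "hour"): "mv_merchant_hourly",
}

def route(dimensions: tuple) -> str:
    """Return a covering materialized view, or fall back to the base Iceberg table."""
    return views.get(dimensions, "iceberg.transactions")

print(route(("merchant", "day")))  # mv_merchant_daily
print(route(("region", "day")))    # iceberg.transactions (no covering view)
```

The freshness trade-off in the last bullet lives in how often each `mv_*` entry is refreshed: more frequent refreshes mean fresher answers from the fast path, at the cost of cluster overhead.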

naver

Research for the Protection of the Web

Naver Webtoon is proactively developing technical solutions to safeguard its digital creation ecosystem against evolving threats like illegal distribution and unauthorized generative AI training. By integrating advanced AI-based watermarking and protective perturbation technologies, the platform successfully tracks content leaks and disrupts unauthorized model fine-tuning. These efforts ensure a sustainable environment where creators can maintain the integrity and economic value of their intellectual property.

## Challenges in the Digital Creation Ecosystem

- **Illegal Content Leakage**: Unauthorized reproduction and distribution of digital content infringe on creator earnings and damage the platform's business model.
- **Unauthorized Generative AI Training**: The rise of fine-tuning techniques (e.g., LoRA, DreamBooth) allows for the unauthorized mimicry of an artist's unique style, distorting the value of original works.
- **Harmful UGC Uploads**: The presence of violent or suggestive user-generated content increases operational costs and degrades the service experience for readers.

## AI-Based Watermarking for Post-Tracking

- To facilitate tracking in DRM-free environments, Naver Webtoon developed an AI-based watermarking system that embeds invisible signals into the pixels of digital images.
- The system is designed around three conflicting requirements: **Invisibility** (signal remains hidden), **Robustness** (signal survives attacks like cropping or compression), and **Capacity** (sufficient data for tracking).
- The technical pipeline involves three neural modules: an **Embedder** to insert the signal, a differentiable **Attack Layer** to simulate real-world distortions, and an **Extractor** to recover the signal.
- Performance metrics show a high Peak Signal-to-Noise Ratio (PSNR) of over 46 dB, and the system maintains a signal error rate of less than 1% even when subjected to intense signal processing or geometric editing.

## IMPASTO: Disrupting Unauthorized AI Training

- This technology utilizes **protective perturbation**, which adds microscopic changes to images that are invisible to humans but confuse generative AI models during the training phase.
- It targets the way diffusion models (like Stable Diffusion) learn by either manipulating latent representations or disrupting the denoising process, preventing the AI from accurately mimicking an artist's style.
- The research prioritizes overcoming the visual artifacts and slow processing speeds found in existing academic tools like Glaze and PhotoGuard.
- By implementing these perturbations, any attempt to fine-tune a model on protected work will result in distorted or unintended outputs, effectively shielding the artist's original style.

## Integrated Protection Frameworks

- **TOONRADAR**: A comprehensive system deployed since 2017 that uses watermarking for both proactive blocking and retrospective tracking of illegal distributors.
- **XPIDER**: An automated detection tool tailored specifically for the comic domain to identify and block harmful UGC, reducing manual inspection overhead.
- These solutions are being expanded not just for copyright protection, but to establish long-term trust and reliability in the era of AI-generated content.

The deployment of these AI-driven defense mechanisms is essential for maintaining a fair creative economy. By balancing visual quality with robust protection, platforms can empower creators to share their work globally without the constant fear of digital theft or stylistic mimicry.
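
The PSNR figure cited above is a standard fidelity metric, and computing it makes "over 46 dB" concrete. A minimal sketch over flat lists of grey-level pixels (the sample pixel values are invented; real evaluation would run over full images):

```python
import math

def psnr(original, watermarked, max_val=255.0):
    """Peak Signal-to-Noise Ratio between two equally sized pixel sequences.

    Higher is better; above roughly 40 dB, differences are generally
    imperceptible, which is why the reported 46+ dB implies invisibility.
    """
    if len(original) != len(watermarked):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, watermarked)) / len(original)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_val ** 2 / mse)

# A watermark that perturbs each pixel by at most one grey level:
clean = [120, 121, 119, 130, 128, 127, 125, 126]
marked = [p + (1 if i % 2 else -1) for i, p in enumerate(clean)]
print(round(psnr(clean, marked), 1))  # 48.1 dB -> imperceptible
```
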

toss

Toss Income Tax Refund Service: An

Toss Income’s QA team transitioned from traditional manual testing and rigid class-based Page Object Models (POM) to a stateless Functional POM to keep pace with rapid deployment cycles. This shift allowed them to manage complex tax refund logic and frequent UI changes with high reliability and minimal maintenance overhead. By treating automation as a modular assembly of functions, they successfully reduced verification times from four hours to twenty minutes while significantly increasing test coverage.

### Transitioning to Functional POM

* Replaced stateful classes and complex inheritance with stateless functions that receive a `page` object as input and return the updated `page` as output.
* Adopted a clear naming convention (e.g., `gotoLoginPage`, `enterPhonePin`, `verifyRefundAmount`) to ensure that test cases read like human-readable scenarios.
* Centralized UI selectors and interaction logic within these functions, allowing developers to update a single point of truth when UI text or button labels change.

### Modularizing the User Journey

* Segmented the complex tax refund process into four distinct modules: Login/Terms, Deduction Checks, Refund/Payment Info, and Reporting.
* Developed independent, reusable functions for specific data inputs, such as medical or credit card deductions, which can be assembled like "Lego blocks" to create new test scenarios rapidly.
* Decoupled business logic from UI interactions, enabling the team to create diverse test cases by simply varying parameters like amounts or dates.

### Robust Interaction and Page Management

* Implemented a 4-step "Robust Click Strategy" to eliminate flakiness caused by React rendering timings, sequentially trying an Enter key press, a standard click, a forced click, and finally a direct JavaScript execution.
* Created a `waitForNetworkIdleSafely` utility that prevents test failures during polling or background network activity by prioritizing UI anchors over strict network idleness.
* Standardized page transition handling with a `getLatestNonScrapePage` utility, ensuring the `currentPage` object always points to the most recent active tab or redirect window.

### Integration and Performance Outcomes

* Achieved a 600% increase in test coverage, expanding from 5 core scenarios to 35 comprehensive automated flows.
* Reduced the time required to respond to UI changes by 98%, as modifications are now localized to a single POM function rather than dozens of test files.
* Established a 24/7 automated validation system that provides immediate feedback on functional correctness, data integrity (tax amount accuracy), and performance metrics via dedicated communication channels.

For engineering teams operating in high-velocity environments, adopting a stateless, functional approach to test automation is a highly effective way to reduce technical debt. By focusing on modularity and implementing fallback strategies for UI interactions, teams can transform QA from a final bottleneck into a continuous, data-driven validation layer that supports rapid experimentation.
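
The stateless page-in/page-out pattern can be sketched without any browser framework. `FakePage` below is a hypothetical stand-in for a real Playwright-style page object, the function bodies are invented, and the names are snake_case Python versions of the article's camelCase examples; only the shape, plain functions composed like Lego blocks, mirrors the described approach.

```python
from dataclasses import dataclass, field

@dataclass
class FakePage:
    """Stand-in for a browser page object; tracks only what the sketch needs."""
    url: str = "about:blank"
    filled: dict = field(default_factory=dict)

# Each step is a plain function: page in, updated page out. No shared state.
def goto_login_page(page: FakePage) -> FakePage:
    page.url = "/login"        # the route lives in exactly one place
    return page

def enter_phone_pin(page: FakePage, pin: str) -> FakePage:
    page.filled["pin"] = pin   # selector/interaction logic would live here
    return page

def verify_refund_amount(page: FakePage, expected: int) -> FakePage:
    assert page.filled.get("refund") == expected
    return page

# Test cases read like scenarios: assemble the blocks in order.
page = FakePage()
page = goto_login_page(page)
page = enter_phone_pin(page, "123456")
page.filled["refund"] = 50000  # pretend the app computed a refund
page = verify_refund_amount(page, 50000)
print(page.url)  # /login
```

Because each function owns one interaction and one selector set, a UI label change touches a single function body instead of every test that clicks that button.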

toss

Toss Next ML Challenge - Ad

Toss recently hosted the "Toss Next ML Challenge," a large-scale competition focused on predicting advertisement Click-Through Rates (CTR) using real-world, anonymized data from the Toss app. By tasking over 2,600 participants with developing high-performance models under real-time serving constraints, the event successfully identified innovative technical approaches to feature engineering and model ensembling.

### Designing a Real-World CTR Prediction Task

* The competition required participants to predict the probability of a user clicking a display ad based on a dataset of 10.7 million training samples.
* Data included anonymized features such as age, gender, ad inventory IDs, and historical user behavior.
* A primary technical requirement was "real-time navigability," meaning models had to be optimized for fast inference to function within a live service environment.

### Overcoming Anonymization with Sequence Engineering

* To maintain data privacy while allowing external access, Toss provided anonymized features in a single flattened table, which limited the ability of participants to perform traditional data joins.
* A complex, raw "Sequence" feature was intentionally left unprocessed to serve as a differentiator for high-performing teams.
* Top-tier participants demonstrated extreme persistence by deriving up to 37 unique variables from this single sequence, including transition probabilities, unique token counts, and sequence lengths.

### Winning Strategies and Technical Trends

* All of the top 30 teams utilized boosting-tree-based models (such as XGBoost or LightGBM), while deep learning was used only by a subset of participants.
* One standout solution utilized a massive ensemble of 260 different models, providing a fresh perspective on the limits of ensemble learning for predictive accuracy.
* Performance was largely driven by the ability to extract meaningful signals from anonymized data through rigorous cross-validation and creative feature interactions.

The results of the Toss Next ML Challenge suggest that even in the absence of domain-specific context due to anonymization, meticulous feature engineering and robust tree-based architectures remain the gold standard for tabular data. For ML engineers, the competition underscores that the key to production-ready models lies in balancing complex ensembling with the strict latency requirements of real-time serving.
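
The kind of sequence engineering credited to the top teams can be sketched as follows. The three derived variables match categories named above (sequence lengths, unique token counts, transition probabilities), but the comma tokenization, the sample sequence, and the exact variable definitions are invented; the actual 37 variables are not public.

```python
from collections import Counter

def sequence_features(seq: str) -> dict:
    """Derive simple tabular features from a raw behaviour sequence."""
    tokens = seq.split(",")
    # Count adjacent (from, to) pairs to estimate transition probabilities.
    transitions = Counter(zip(tokens, tokens[1:]))
    total = max(sum(transitions.values()), 1)
    return {
        "seq_len": len(tokens),
        "unique_tokens": len(set(tokens)),
        # probability mass of the single most common adjacent transition
        "top_transition_prob": max(transitions.values(), default=0) / total,
    }

feats = sequence_features("view,view,click,view,click,view,view")
print(feats)  # seq_len=7, unique_tokens=2, top_transition_prob=1/3
```

Each derived scalar then becomes one more column for a boosting-tree model, which is how an opaque sequence turns into tabular signal.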

kakao

Y is Watching – The Story of Kak

Kakao developed YEYE, a dedicated Attack Surface Management (ASM) system, to proactively identify and manage the organization's vast digital footprint, including IPs, domains, and open ports. By integrating automated scanning with a human-led Daily Security Review (DSR) process, the platform transforms raw asset data into actionable security intelligence. This holistic approach ensures that potential entry points are identified and secured before they can be exploited by external threats.

## The YEYE Asset Management Framework

* Defines attack surfaces broadly to include every external-facing digital asset, such as subdomains, API endpoints, and mobile APKs.
* Categorizes assets using a standardized taxonomy based on scope (In/Out/Undefined), type (Domain/IP/Service), and identification status (Known/Unknown/3rd Party).
* Implements a labeling system that converts diverse data formats from multiple sources into a simplified, unified structure for better visibility.
* Establishes multi-dimensional relationships between assets, CVEs, certificates, and departments, allowing teams to instantly identify which business unit is responsible for a newly discovered vulnerability.

## Daily Security Review (DSR)

* Operates on the principle that "security is a process, not a product," bridging the gap between automated detection and manual remediation.
* Utilizes a rotating group system where security engineers review external feeds, public vulnerability news, and YEYE alerts every morning.
* Focuses on detecting "shadow IT" or assets deployed without formal security reviews to ensure all external touchpoints are accounted for.

## Scalable and Efficient Scanning Architecture

* Resolved internal network bandwidth bottlenecks by adopting a hybrid infrastructure that leverages public cloud resources for high-concurrency scanning tasks.
* Developed a custom distributed scanning structure using schedulers and queues to manage multiple independent workers, overcoming the limitations of single-process open-source scanners.
* Optimized infrastructure costs by identifying the "sweet spot" in server specifications, favoring the horizontal expansion of medium-spec servers over expensive, high-performance hardware.
* Mitigates service impact and false alarms by using fixed IPs and custom User-Agent (UA) strings, allowing service owners to distinguish YEYE’s security probes from actual malicious traffic.

To effectively manage a growing attack surface, organizations should combine automated asset discovery with a structured manual review process. Prioritizing data standardization and relationship mapping between assets and vulnerabilities is essential for rapid incident response and long-term infrastructure hardening.
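
The scheduler/queue/worker split can be sketched with an in-process queue. This is a toy: a real deployment like the one described would use a network message broker and separate scan hosts, and the `scan` function and hostnames here are placeholders, not an actual probe.

```python
import queue
import threading

tasks: "queue.Queue" = queue.Queue()
results: list = []
lock = threading.Lock()

def scan(target: str) -> str:
    # Placeholder for a real port/service probe against the target.
    return "open" if target.endswith(":443") else "unknown"

def worker() -> None:
    while True:
        target = tasks.get()
        if target is None:          # poison pill: scheduler tells worker to stop
            tasks.task_done()
            return
        with lock:
            results.append((target, scan(target)))
        tasks.task_done()

# Scheduler: enqueue assets, then fan out to independent workers.
for t in ["a.example.com:443", "b.example.com:8080"]:
    tasks.put(t)
workers = [threading.Thread(target=worker) for _ in range(2)]
for w in workers:
    w.start()
for _ in workers:
    tasks.put(None)
for w in workers:
    w.join()
print(sorted(results))
```

Because workers are independent consumers of one queue, throughput scales by adding medium-spec workers horizontally, which is the cost "sweet spot" the team describes.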

toss

From Legacy Payment Ledger to Scalable

Toss Payments successfully modernized a 20-year-old legacy payment ledger by transitioning to a decoupled, MySQL-based architecture designed for high scalability and consistency. By implementing strategies like INSERT-only immutability and event-driven domain isolation, they overcame structural limitations such as the inability to handle split payments. Ultimately, the project demonstrates that robust system design must be paired with resilient operational recovery mechanisms to manage the complexities of large-scale financial migrations.

### Legacy Ledger Challenges

* **Inconsistent Schemas:** Different payment methods used entirely different table structures; for instance, a table named `REFUND` unexpectedly contained only account transfer data rather than all refund types.
* **Domain Coupling:** Multiple domains (settlement, accounting, and payments) shared the same tables and columns, meaning a single schema change required impact analysis across several teams.
* **Structural Limits:** A rigid 1:1 relationship between a payment and its method prevented the implementation of modern features like split payments or "Dutch pay" models.

### New Ledger Architecture

* **Data Immutability:** The system shifted from updating existing rows to an **INSERT-only** principle, ensuring a reliable audit trail and preventing database deadlocks.
* **Event-Driven Decoupling:** Instead of direct database access, the system uses Kafka to publish payment events, allowing independent domains to consume data without tight coupling.
* **Payment-Approval Separation:** By separating the "Payment" (the transaction intent) from the "Approval" (the specific financial method), the system now supports multiple payment methods per transaction.

### Safe Migration and Data Integrity

* **Asynchronous Mirroring:** To maintain zero downtime, data was initially written to the legacy system and then asynchronously loaded into the new MySQL ledger.
* **Resource Tuning:** Developers used dedicated migration servers within the same AWS Availability Zone to minimize latency and implemented **Bulk Inserts** to handle hundreds of millions of rows efficiently.
* **Verification Batches:** A separate batch process ran every five minutes against a Read-Only (RO) database to identify and correct any data gaps caused by asynchronous processing failures.

### Operational Resilience and Incident Response

* **Query Optimization:** During a load spike, the MySQL optimizer chose full scans over indexes; the team resolved this by implementing SQL hints and utilizing a 5-version Docker image history for rapid rollbacks.
* **Network Cancellation:** To handle timeouts between Toss and external card issuers, the system uses specific logic to automatically send cancellation requests and synchronize states.
* **Timeout Standardization:** Discrepancies between microservices were resolved by calculating the maximum processing time of approval servers and aligning all upstream timeout settings to prevent merchant response mismatches.
* **Reliable Event Delivery:** While using the **Outbox pattern** for events, the team added log-based recovery (Elasticsearch and local disk) and idempotency keys in event headers to handle both missing and duplicate messages.

For organizations tackling significant technical debt, this transition highlights that initial design is only half the battle. True system reliability comes from building "self-healing" structures, such as automated correction batches and standardized timeout chains, that can survive the unpredictable nature of live production environments.
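
The duplicate-handling in the last bullet can be sketched as a consumer-side guard: the idempotency key travels in the event header, and a processed-key store makes redelivered events harmless. The event shape, header name, and in-memory stores below are assumptions for illustration; a production consumer would persist processed keys transactionally alongside the ledger row.

```python
processed_keys: set = set()
ledger: list = []  # INSERT-only: rows are appended, never updated in place

def handle_event(event: dict) -> bool:
    """Apply a payment event exactly once; return False for duplicates."""
    key = event["headers"]["idempotency-key"]
    if key in processed_keys:
        return False           # redelivered duplicate: safe to ignore
    ledger.append({"type": event["type"], "amount": event["amount"]})
    processed_keys.add(key)
    return True

evt = {"headers": {"idempotency-key": "pay-001"},
       "type": "APPROVED", "amount": 10000}
print(handle_event(evt))  # True  -> one new immutable ledger row
print(handle_event(evt))  # False -> redelivery, no double-append
print(len(ledger))        # 1
```

Paired with the outbox pattern on the producer side (missing messages get re-published), this gives effectively-once processing: re-publishing covers losses, and the key check covers the duplicates that re-publishing creates.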

toss

In search of Toss’s brand

Toss, a leading Korean fintech platform, embarked on a UX research journey to define its visual identity as it expanded from digital services into offline environments like Toss Pay payment stations. The study revealed that while users strongly associate the brand with seamless "usability," they lacked a single, clear mental image of a visual symbol. By analyzing user perceptions of fonts, colors, and shapes, Toss identified a specific visual formula, combining the app icon shape with a white, blue, and black palette, to ensure the brand remains instantly recognizable in the physical world.

## The Challenge of Offline Brand Recognition

* The project began with the need to design "danglers" (small signage at payment counters) to signal that Toss Pay is accepted at offline merchants.
* While Toss had successfully used various logo iterations online, the team realized that "Toss-ness" learned within the app might not automatically translate to unfamiliar offline environments.
* Initial internal debates focused on superficial visual tweaks, such as background colors or language choices, rather than understanding the core assets that trigger brand recognition.

## Identifying Usability as the Core Brand Image

* In-depth interviews were conducted with participants selected for their ability to articulate abstract brand impressions.
* Research showed that users primarily associate Toss with keywords like "clean," "practical," and "convenient," rather than specific aesthetic elements.
* One participant described Toss as a "program made by a genius engineer in Excel," highlighting that the brand’s value was rooted in its utility rather than a distinct visual symbol.
* This presented a challenge: since the "app experience" cannot be felt through a static offline sign, the team had to find a visual surrogate for that functional reliability.

## Deconstructing the Toss Symbol: Font, Color, and Shape

* **Font:** Testing revealed that the most recognizable font was the black English "toss" wordmark, primarily because users see it most often in external media and news rather than inside the app.
* **Color:** Surprisingly, users did not associate Toss with a single shade of blue. Instead, they recognized the specific combination of a "blue logo on a white background."
* **Logo:** When asked to draw the logo from memory, users consistently included a square border. This indicated that users perceive the brand’s "face" specifically as the smartphone app icon (the blue logo inside a rounded square) rather than the standalone logo mark.

## Implementing the "Toss Formula" in Design

* The research led to a refined brand identity formula: **White background + Black bold English font + Blue app-icon-shaped logo.**
* In the "10 to 100" 10th-anniversary campaign, the company shifted away from all-blue backgrounds in favor of this white-based combination to maximize recognition.
* Toss Pay payment screens were updated to remove blue backgrounds, adopting the white-and-black layout to align with how users intuitively identify the service.

For UX researchers and designers, this case demonstrates that brand identity is often a composite of environmental cues rather than a single graphic. When moving a digital-first brand into the physical world, it is essential to look beyond the logo and identify the specific "visual formula" that triggers the user's memory of the product experience.

naver

[DAN25]

Naver recently released the full video archives from its DAN25 conference, highlighting the company’s strategic roadmap for AI agents, Sovereign AI, and digital transformation. The sessions showcase how Naver is moving beyond general AI applications to implement specialized, real-time systems that integrate large language models (LLMs) directly into core services like search, commerce, and content. By publicly sharing these technical insights, Naver demonstrates its progress in building a cohesive AI ecosystem capable of handling massive scale and complex user intent.

### Naver PersonA and LLM-Based User Memory

* The "PersonA" project focuses on building a "user memory" by treating fragmented logs across various Naver services as indirect conversations with the user.
* By leveraging LLM reasoning, the system transitions from simple data tracking to a sophisticated AI agent that offers context-aware, real-time suggestions.
* Technical hurdles addressed include the stable implementation of real-time log reflection for a massive user base and the selection of optimal LLM architectures for personalized inference.

### Trend Analysis and Search-Optimized Models

* The Place Trend Analysis system utilizes ranking algorithms to distinguish between temporary surges and sustained popularity, providing a balanced view of "hot places."
* LLMs and text mining are employed to move beyond raw data, extracting specific keywords that explain the underlying reasons for a location's trending status.
* To improve search quality, Naver developed search-specific LLMs that outperform general models by using specialized data "recipes" and integrating traditional information retrieval with features like "AI briefing" and "AuthGR" for higher reliability.

### Unified Recommendation and Real-Time CRM

* Naver Webtoon and Series replaced fragmented recommendation and CRM (Customer Relationship Management) models with a single, unified framework to ensure data consistency.
* The architecture shifted from batch-based processing to a real-time, API-based serving system to reduce management complexity and improve the immediacy of personalized user experiences.
* This transition focuses on maintaining a seamless UX by synchronizing different ML models under a unified serving logic.

### Scalable Log Pipelines and Infrastructure Stability

* The "Logiss" pipeline manages up to tens of billions of logs daily, utilizing a Storm and Kafka environment to ensure high availability and performance.
* Engineers implemented a multi-topology approach to allow for seamless, non-disruptive deployments even under heavy loads.
* Intelligent features such as "peak-shaving" (distributing peak traffic to off-peak hours), priority-based processing during failures, and efficient data sampling help balance cost, performance, and stability.

These sessions provide a practical blueprint for organizations aiming to scale LLM-driven services while maintaining infrastructure integrity. For developers and system architects, Naver’s transition toward unified ML frameworks and specialized, real-time data pipelines offers a proven model for moving AI from experimental phases into high-traffic production environments.
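
The "peak-shaving" idea mentioned for the Logiss pipeline can be sketched as carrying overflow load into later, quieter slots rather than dropping it. The capacity figure and load profile below are invented; this only illustrates the deferral mechanic, not Logiss's actual scheduling.

```python
def peak_shave(load_per_slot: list, capacity: int) -> list:
    """Carry overflow forward so every slot stays at or below capacity."""
    shaved, carry = [], 0
    for load in load_per_slot:
        total = load + carry
        emit = min(total, capacity)   # process up to capacity this slot
        carry = total - emit          # defer the rest to the next slot
        shaved.append(emit)
    # any leftover carry would spill into the next off-peak window
    return shaved

print(peak_shave([120, 40, 20, 10], capacity=60))  # [60, 60, 60, 10]
```

The lunch-hour spike (120) is spread over three slots instead of overwhelming one, trading a little latency for a much smaller provisioned capacity.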

naver

Naver Integrated Search LLM DevOps

Naver’s Integrated Search team is transitioning from manual fault response to an automated system using LLM Agents to manage the increasing complexity of search infrastructure. By integrating Large Language Models into the DevOps pipeline, the system evolves through accumulated experience, moving beyond simple alert monitoring to intelligent diagnostic analysis and action recommendation.

### Limitations of Traditional Fault Response

* **Complex Search Flows:** Naver’s search architecture involves multiple interdependent layers, which makes manual root cause analysis slow and prone to human error.
* **Fragmented Context:** Existing monitoring requires developers to manually synthesize logs and metrics from disparate telemetry sources, leading to high cognitive load during outages.
* **Delayed Intervention:** Human-led responses often suffer from a "detection-to-action" lag, especially during high-traffic periods or subtle service regressions.

### Architecture of DevOps Agent v1

* **Initial Design:** Focused on automating basic data gathering and providing preliminary textual reports to engineers.
* **Infrastructure Integration:** Built using a specialized software stack designed to bridge frontend (FE) and backend (BE) telemetry within the search infrastructure.
* **Standardized Logic:** The v1 agent operated on a fixed set of instructions to perform predefined diagnostic tasks when triggered by specific system alarms.

### Evolution to DevOps Agent v2

* **Overcoming v1 Limitations:** The first iteration struggled with maintaining deep context and providing diverse actionable insights, necessitating a more robust agentic structure.
* **Enhanced Memory and Learning:** v2 incorporates a more sophisticated architecture that allows the agent to reference historical failure data and learn from past incident resolutions.
* **Advanced Tool Interaction:** The system was upgraded to handle more complex tool-calling capabilities, allowing the agent to interact more deeply with internal infrastructure APIs.

### System Operations and Evaluation

* **Trigger Queue Management:** Implements a queuing system to efficiently process and prioritize multiple concurrent system alerts without overwhelming the diagnostic pipeline.
* **Anomaly Detection:** Utilizes advanced detection methods to distinguish between routine traffic fluctuations and genuine service anomalies that require LLM intervention.
* **Rigorous Evaluation:** The agent’s performance is measured through a dedicated evaluation framework that assesses the accuracy of its diagnoses against known ground-truth incidents.

### Scaling and Future Challenges

* **Context Expansion:** Efforts are focused on integrating a wider range of metadata and environmental context to provide a holistic view of system health.
* **Action Recommendation:** The system is moving toward suggesting specific recovery actions, such as rollbacks or traffic rerouting, rather than just identifying the problem.
* **Sustainability:** Ensuring the DevOps Agent remains maintainable and cost-effective as the underlying search infrastructure and LLM models continue to evolve.

Organizations managing high-scale search traffic should consider LLM-based agents as integrated infrastructure components rather than standalone tools. Moving from reactive monitoring to a proactive, experience-based agent system is essential for reducing the mean time to recovery (MTTR) in complex distributed environments.
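The trigger-queue idea can be illustrated with a severity-ordered queue, so that concurrent alarms are diagnosed by urgency rather than arrival order. The `Alert` type, severity scheme, and names below are assumptions for illustration, not the DevOps Agent's actual internals.

```kotlin
import java.util.PriorityQueue

// Illustrative alert trigger queue (names and severity scheme are
// assumptions, not the actual DevOps Agent). Higher-severity alerts are
// handed to the diagnostic pipeline first, so a burst of concurrent
// alarms does not overwhelm it in arrival order.
data class Alert(val service: String, val severity: Int, val message: String)

class TriggerQueue {
    // Highest severity polled first.
    private val queue = PriorityQueue<Alert>(compareByDescending { it.severity })

    fun enqueue(alert: Alert) {
        queue.add(alert)
    }

    // The diagnostic worker drains one alert at a time.
    fun nextForDiagnosis(): Alert? = queue.poll()
}

fun main() {
    val q = TriggerQueue()
    q.enqueue(Alert("ranking", severity = 2, message = "latency p99 up"))
    q.enqueue(Alert("frontend", severity = 5, message = "5xx spike"))
    q.enqueue(Alert("indexer", severity = 3, message = "consumer lag growing"))
    println(q.nextForDiagnosis()?.service)  // frontend
    println(q.nextForDiagnosis()?.service)  // indexer
}
```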

toss

Toss Payments' Open API Ecosystem

Toss Payments treats its Open API not just as a communication tool, but as a long-term infrastructure designed to support over 200,000 merchants for decades. By focusing on resource-oriented design and developer experience, the platform ensures that its interfaces remain intuitive, consistent, and easy to maintain. This strategic approach prioritizes structural stability and clear communication over mere functionality, fostering a reliable ecosystem for both developers and businesses.

### Resource-Oriented Interface Design

* The API follows a predictable path structure (e.g., `/v1/payments/{id}`) where the root indicates the version, followed by the domain and a unique identifier.
* Request and response bodies utilize structured JSON with nested objects (like `card` or `cashReceipt`) to modularize data and reduce redundancy.
* Consistency is maintained by reusing the same domain objects across different APIs, such as payment approval, inquiry, and cancellation, which minimizes the learning curve for external developers.
* Data representation shifts from cryptic legacy codes (e.g., SC0010) to human-readable strings, supporting localization into multiple languages via the `Accept-Language` HTTP header.
* Standardized error handling utilizes HTTP status codes paired with a JSON error object containing specific `code` and `message` fields, allowing developers to either display messages directly or implement custom logic.

### Asynchronous Communication via Webhooks

* Webhooks are provided alongside standard APIs to handle asynchronous events where immediate responses are not possible, such as status changes in complex payment flows.
* Event types are clearly categorized (e.g., `PAYMENT_STATUS_CHANGED`), and the payloads mirror the exact resource structures used in the REST APIs to simplify parsing.
* The system ensures reliability by implementing an Exponential Backoff strategy for retries, preventing network congestion during recipient service outages.
* A dedicated developer center allows merchants to register custom endpoints, monitor transmission history, and perform manual retries if automated attempts fail.

### External Ecosystem and Documentation Automation

* Developer Experience (DX) is treated as the core metric for API quality, focusing on how quickly and efficiently a developer can integrate and operate the service.
* To prevent the common issue of outdated manuals, Toss Payments uses a documentation automation system based on the OpenAPI Specification (OAS).
* By utilizing libraries like `springdoc`, the platform automatically syncs the technical documentation with the actual server code, ensuring that parameters, schemas, and endpoints are always up-to-date and trustworthy.

To ensure the longevity of a high-traffic Open API, organizations should prioritize automated documentation and resource-based consistency. Moving away from cryptic codes toward human-readable, localized data and providing robust asynchronous notification tools like webhooks are essential steps for building a developer-friendly infrastructure.
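The exponential-backoff retry strategy for webhook redelivery can be sketched as a simple delay schedule: each failed attempt doubles the wait before the next one, up to a cap. The base delay, cap, and attempt count here are illustrative assumptions, not Toss Payments' actual values.

```kotlin
// Illustrative exponential-backoff schedule for webhook redelivery
// (base delay, cap, and retry count are assumptions, not Toss Payments'
// actual configuration). Doubling the interval after each failure spreads
// retries out so a recovering merchant endpoint is not flooded.
fun backoffDelaysSeconds(
    maxRetries: Int,
    baseSeconds: Long = 1L,
    capSeconds: Long = 3600L,
): List<Long> =
    (0 until maxRetries).map { attempt ->
        // base * 2^attempt, capped so the interval never grows unbounded
        minOf(capSeconds, baseSeconds shl attempt)
    }

fun main() {
    // Six attempts with a 1-second base: 1s, 2s, 4s, 8s, 16s, 32s
    println(backoffDelaysSeconds(maxRetries = 6))
}
```

Real schedulers usually add random jitter on top of this schedule so that many failed deliveries do not retry in lockstep.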

line

Code Quality Improvement Techniques Part

Effective code review communication relies on a "conclusion-first" approach to minimize cognitive load and ensure clarity for the developer. By stating proposed changes or specific requests before providing the underlying rationale, reviewers help authors understand the primary goal of the feedback immediately. This practice improves development productivity by making review comments easier to parse and act upon without repeated reading.

### Optimizing Review Comment Structure

* Place the core suggestion or requested code change at the very beginning of the comment to establish immediate context.
* Follow the initial request with a structured explanation, utilizing headers or numbered lists to organize multiple supporting arguments.
* Clearly distinguish between the "what" (the requested change) and the "why" (the technical justification) to prevent the intended action from being buried in a long technical discussion.
* Use visual formatting to help the developer quickly validate the logic behind the suggestion once they understand the proposed change.

### Immutability and Data Class Design

* Prefer `val` over `var` in Kotlin `data class` structures to ensure object immutability.
* Using immutable properties prevents bugs associated with unintended side effects that occur when mutable objects are shared across different parts of an application.
* Instead of reassigning values to a mutable property, utilize the `copy()` function to create a new instance with updated state, which results in more robust and predictable code.
* Avoid mixing `var` properties with `data class` features, as this can lead to confusion regarding whether to modify the existing instance or create a copy.

### Property Separation by Lifecycle

* Analyze the update frequency of different properties within a class to identify those with different lifecycles.
* Decouple frequently updated status fields (such as `onlineStatus` or `statusMessage`) from more stable attributes (such as `userId` or `accountName`) by moving them into separate classes.
* Grouping properties by their lifecycle prevents unnecessary updates to stable data and makes the data model easier to maintain as the application scales.

To maintain high development velocity, reviewers should prioritize brevity and structure in their feedback. Leading with a clear recommendation and supporting it with organized technical reasoning ensures that code reviews remain a tool for progress rather than a source of confusion.
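The two Kotlin suggestions above can be combined in one small sketch: stable identity attributes and frequently updated presence fields live in separate immutable `data class`es, and updates go through `copy()` rather than `var` reassignment. The property names follow the article; the exact class split is illustrative.

```kotlin
// Stable identity attributes: rarely change after creation.
data class Account(
    val userId: String,
    val accountName: String,
)

// Frequently updated presence fields, decoupled into their own class
// so status churn never touches the stable data.
data class Presence(
    val onlineStatus: Boolean,
    val statusMessage: String,
)

fun main() {
    val account = Account(userId = "u-1", accountName = "alice")
    var presence = Presence(onlineStatus = false, statusMessage = "")

    // Instead of mutating a `var` property on a shared object,
    // create a new instance via copy() with the updated state.
    presence = presence.copy(onlineStatus = true, statusMessage = "at lunch")

    println(account)   // unchanged: the stable data never needed an update
    println(presence)  // Presence(onlineStatus=true, statusMessage=at lunch)
}
```

Because both classes use only `val` properties, any code holding a reference to the old `Presence` instance keeps seeing a consistent snapshot; the update is visible only through the new instance.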

toss

In an era where everyone does research,

In an era where AI moderators and non-researchers handle the bulk of data collection, the role of the UX researcher has shifted from a technical specialist to a strategic guide. The core value of the researcher now lies in "UX Leadership"—the ability to frame problems, align team perspectives, and define the fundamental identity of a product. By bridging the gap between business goals and user needs, researchers ensure that products solve real problems rather than just chasing metrics or technical feasibility.

### Setting the Framework in the Idea Phase

When starting a new project, a researcher’s primary task is to establish the "boundaries of the puzzle" by shifting the team’s focus from business impact to user value.

* **Case - AI Signal:** For a service that interprets stock market events using AI, the team initially focused on business metrics like retention and news consumption.
* **Avoiding "Metric Traps":** A researcher intervenes to prevent fatigue-inducing UX (e.g., excessive notifications to boost CTR) by defining the "North Star" as the specific problem the user is trying to solve.
* **The Checklist:** Once the user problem and value are defined, they serve as a persistent checklist for every design iteration and action item.

### Aligning Team Direction for Product Improvements

When a product already exists but needs improvement, different team members often have scattered, subjective opinions on what to fix. The researcher structures these thoughts into a cohesive direction.

* **Case - Stock Market Calendar:** While the team suggested UI changes like "it doesn't look like a calendar," the researcher refocused the effort on the user's ultimate goal: making better investment decisions.
* **Defining Success Criteria:** The team agreed on a "Good Usage" standard based on three stages: Awareness (recognizing issues) → Understanding (why it matters) → Preparation (adjusting investment plans).
* **Identifying Obstacles:** By identifying specific friction points—such as the lack of information hierarchy or the difficulty of interpreting complex indicators—the researcher moves the project from "simple UI cleanup" to "essential tool development."

### Redefining Product Identity During Stagnation

When a product's growth stalls, the issue often isn't a specific UI bug but a fundamental mismatch between the product's identity and its environment.

* **Case - Toss Securities PC:** Despite being functional, the PC version struggled because it initially tried to copy the "mobile simplicity" of the app.
* **Contextual Analysis:** Research revealed that while mobile users value speed and portability, PC users require an environment for deep analysis, multi-window comparisons, and deliberate decision-making.
* **Consensus through Synthesis:** The researcher integrates data, user interviews, and market trends into workshops to help the team decide where the product should "live" in the market. This process creates team-wide alignment on a new strategic direction rather than just fixing features.

The modern UX researcher must move beyond "crafting the tool" (interviewing and data gathering) and toward "UX Leadership." True expertise involves maintaining a broad view of the industry and product ecosystem, structuring team discussions to reach a consensus, and ensuring that every product decision is rooted in a clear understanding of the user's context and goals.

line

Practical Security Knowledge Growing with

LINE CTF 2025 serves as a collaborative platform for global security experts to exchange technical knowledge and tackle real-world cybersecurity challenges through a competitive framework. Under the newly integrated LY Corporation, the event evolved to prioritize anti-AI problem design and enhanced privacy protections, reinforcing its position as a top-tier competition in the Asian security community. The event successfully demonstrated that high-quality problem engineering and community-focused operations can drive both individual growth and organizational security excellence.

## Strategic Shift and AI-Resilient Design

* **Multisite Collaboration:** While previous years were led primarily by the Japanese team, 2025 saw a shift where the Korean security team led preparations and the Vietnamese team contributed the highest volume of technical challenges.
* **Counter-AI Engineering:** To maintain fairness in an era of LLMs, problems were specifically designed to mislead automated AI analysis, requiring human logic and deep conceptual understanding to arrive at the correct "flag."
* **Systemic Integration:** This was the first year applying the unified LY Corporation administrative and approval processes, resulting in a more refined timeline for problem verification and quality control.

## Competition Format and Problem Engineering

* **Jeopardy-Style Challenges:** The event featured 13 independent challenges—6 Web, 4 Pwnable, and 3 Reverse Engineering—where teams earned points based on difficulty.
* **Three-Stage Validation:** Every problem underwent a rigorous cycle of idea conception, technical environment isolation/testing, and internal peer review to eliminate unintended "cheese" solutions or bugs.
* **Technical Philosophy:** Problems were modeled after real-world service vulnerabilities and the latest security trends, targeting a difficulty level that requires several hours of dedicated analysis by a skilled researcher.

## Platform Evolution and Performance

* **Privacy-First Infrastructure:** The team customized the open-source CTFd framework to remove email-based registration, instead using a recovery-code system to ensure participant anonymity and data security.
* **Growing Technical Prestige:** The competition’s rating on CTFtime (a global community platform) has climbed steadily over three years, reaching a weight of 66.5 in 2025, reflecting its high quality and difficulty.
* **Competitive Results:** The Korean team "The Duck" maintained dominance with a third consecutive win, while the battle for second place was decided by a dramatic last-minute solve by the Japanese team "GMO Ierae."

Participating in CTFs like LINE CTF offers an invaluable practical learning environment for security engineers to master vulnerability analysis and exploit development. Aspiring and professional researchers are encouraged to engage with these challenges to sharpen their analytical skills and contribute to a more robust, collaborative global security culture.