woowahan

How Woowa Brothers Detects

Woowa Brothers addresses the inevitability of system failures by shifting from traditional resource-based monitoring to a specialized Service Anomaly Detection system. By focusing on high-level service metrics such as order volume and login counts rather than just CPU or memory usage, they can identify incidents that directly impact the user experience. This approach ensures near real-time detection and provides a structured response framework to minimize damage during peak service hours.

### The Shift to Service-Level Monitoring

* Traditional monitoring focuses on infrastructure metrics like CPU and memory, but it is impossible to monitor every system variable, leading to "blind spots" in failure detection.
* Service metrics, such as real-time login counts and payment success rates, are finite and offer a direct reflection of the actual customer experience.
* By monitoring these core indicators, the SRE team can detect anomalies that system-level alerts might overlook, ensuring that no failure goes unnoticed.

### Requirements for Effective Anomaly Detection

* **Real-time Performance:** Alerts must be triggered in near real-time to allow for immediate intervention before the impact scales.
* **Explainability:** The system favors transparent logic over "black-box" AI models, allowing developers to quickly understand why an alert was triggered and how to improve the detection logic.
* **Integrated Response:** Beyond just detection, the system must provide a clear response process so that any engineer, regardless of experience, can follow a standardized path to resolution.

### Technical Implementation and Logic

* The system leverages the predictable, pattern-based nature of delivery service traffic, which typically peaks during lunch and dinner.
* The team chose a Median-based approach to generate "Prediction" values from historical data, as it is more robust against outliers and easier to analyze than complex methods like IQR or 2-sigma.
* Detection is determined by comparing "Actual" values against "Warning" and "Critical" thresholds derived from the predicted median.
* To prevent false positives caused by temporary spikes, the system tracks "threshold reach counts," requiring a metric to stay in an abnormal state for a specific number of consecutive cycles before firing a Slack alert.

### Optimization of Alert Accuracy

* Each service metric requires a tailored "settling period" to find the optimal balance between detection speed and accuracy.
* Setting a high threshold reach count improves accuracy but slows down detection, while a low count accelerates detection at the risk of increased false positives.
* Alerts are delivered via Slack with comprehensive context, including current status and urgency, to facilitate rapid decision-making.

For organizations running high-traffic services, prioritizing service-level indicators (SLIs) over infrastructure metrics can significantly reduce the time to detect critical failures. Implementing simple, explainable statistical models like the Median approach allows teams to maintain a reliable monitoring system that evolves alongside the service without the complexity of uninterpretable AI models.
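The median-prediction, tiered-threshold, and reach-count logic described in this entry can be sketched in a few lines. This is a minimal illustration, not Woowa Brothers' implementation; the ratios, the reach count of 3, and all names are assumed values for the sketch:

```python
from statistics import median

WARNING_RATIO = 0.8   # assumed: warn when actual falls below 80% of prediction
CRITICAL_RATIO = 0.6  # assumed: critical below 60% of prediction
REACH_COUNT = 3       # cycles a metric must stay abnormal before alerting

def predict(history):
    """Prediction for the current cycle: the median of the same time
    slot on previous days, which is robust against one-off outlier days."""
    return median(history)

def evaluate(actual, history):
    """Classify the current cycle against thresholds derived from the median."""
    prediction = predict(history)
    if actual < prediction * CRITICAL_RATIO:
        return "CRITICAL"
    if actual < prediction * WARNING_RATIO:
        return "WARNING"
    return "NORMAL"

class ReachCounter:
    """Fire an alert only after REACH_COUNT consecutive abnormal cycles,
    filtering out momentary dips that would cause false positives."""
    def __init__(self):
        self.streak = 0

    def update(self, state):
        self.streak = self.streak + 1 if state != "NORMAL" else 0
        return self.streak >= REACH_COUNT
```

With order counts of [100, 105, 98, 102] for this time slot on previous days, an actual value of 55 evaluates as CRITICAL, but a Slack alert would only fire once the counter sees three abnormal cycles in a row, trading a little detection latency for accuracy.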

naver

Naver TV

This technical session from NAVER ENGINEERING DAY 2025 explores the architectural journey of building a low-latency query system for real-time transaction reports. The project focuses on resolving the tension between high data freshness, massive scalability, and rapid response times for complex, multi-dimensional filtering. By leveraging Apache Iceberg in conjunction with StarRocks’ materialized views, the team established a performant data pipeline that meets the demands of modern business intelligence.

### Challenges in Real-Time Transaction Reporting

* **Query Latency vs. Data Freshness:** Traditional architectures often struggle to provide immediate visibility into transaction data while maintaining sub-second query speeds across diverse filter conditions.
* **High-Dimensional Filtering:** Users require the ability to query reports based on numerous variables, necessitating an engine that can handle complex aggregations without pre-defining every possible index.
* **Scalability Requirements:** The system must handle increasing transaction volumes without degrading performance or requiring significant manual intervention in the underlying storage layer.

### Optimized Architecture with Iceberg and StarRocks

* **Apache Iceberg Integration:** Iceberg serves as the open table format, providing a reliable foundation for managing large-scale data snapshots and ensuring consistency during concurrent reads and writes.
* **StarRocks for Query Acceleration:** The team selected StarRocks as the primary OLAP engine to take advantage of its high-speed vectorized execution and native support for Iceberg tables.
* **Spark-Based Processing:** Apache Spark is utilized for the initial data ingestion and transformation phases, preparing the transaction data for efficient storage and downstream consumption.

### Enhancing Performance via Materialized Views

* **Pre-computed Aggregations:** By implementing Materialized Views, the system pre-calculates intensive transaction summaries, significantly reducing the computational load during active user queries.
* **Automatic Query Rewrite:** The architecture utilizes StarRocks' ability to automatically route queries to the most efficient materialized view, ensuring that even ad-hoc reports benefit from pre-computed results.
* **Balanced Refresh Strategies:** The research focused on optimizing the refresh intervals of these views to maintain high "freshness" while minimizing the overhead on the cluster resources.

The adoption of a modern lakehouse architecture combining Apache Iceberg with a high-performance OLAP engine like StarRocks is a recommended strategy for organizations dealing with high-volume, real-time reporting. This approach effectively decouples storage and compute while providing the low-latency response times necessary for interactive data analysis.
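The payoff of pre-computed aggregations can be shown with a toy rollup. This is a conceptual Python sketch, not StarRocks syntax; the schema, names, and data are invented for illustration:

```python
from collections import defaultdict

# Raw transaction rows: (merchant_id, day, amount) — illustrative schema.
rows = [
    ("m1", "2025-06-01", 120),
    ("m1", "2025-06-01", 80),
    ("m2", "2025-06-01", 50),
    ("m1", "2025-06-02", 40),
]

# The "materialized view": a pre-computed (merchant, day) -> [total, count]
# rollup, refreshed incrementally as rows arrive instead of on every query.
mv = defaultdict(lambda: [0, 0])
for merchant, day, amount in rows:
    mv[(merchant, day)][0] += amount
    mv[(merchant, day)][1] += 1

def daily_report(merchant, day):
    """Answer an aggregate report query from the rollup — a constant-time
    lookup instead of a full scan over the raw transaction rows."""
    total, count = mv[(merchant, day)]
    return {"total": total, "count": count}
```

In the real system the engine's query rewrite plays the role of `daily_report`, transparently routing a matching ad-hoc query to the pre-computed view.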

toss

Toss Next ML Challenge

Toss recently hosted the "Toss Next ML Challenge," a large-scale competition focused on predicting advertisement Click-Through Rates (CTR) using real-world, anonymized data from the Toss app. By tasking over 2,600 participants with developing high-performance models under real-time serving constraints, the event successfully identified innovative technical approaches to feature engineering and model ensembling.

### Designing a Real-World CTR Prediction Task

* The competition required participants to predict the probability of a user clicking a display ad based on a dataset of 10.7 million training samples.
* Data included anonymized features such as age, gender, ad inventory IDs, and historical user behavior.
* A primary technical requirement was "real-time navigability," meaning models had to be optimized for fast inference to function within a live service environment.

### Overcoming Anonymization with Sequence Engineering

* To maintain data privacy while allowing external access, Toss provided anonymized features in a single flattened table, which limited the ability of participants to perform traditional data joins.
* A complex, raw "Sequence" feature was intentionally left unprocessed to serve as a differentiator for high-performing teams.
* Top-tier participants demonstrated extreme persistence by deriving up to 37 unique variables from this single sequence, including transition probabilities, unique token counts, and sequence lengths.

### Winning Strategies and Technical Trends

* All of the top 30 teams utilized Boosting Tree-based models (such as XGBoost or LightGBM), while Deep Learning was used only by a subset of participants.
* One standout solution utilized a massive ensemble of 260 different models, providing a fresh perspective on the limits of ensemble learning for predictive accuracy.
* Performance was largely driven by the ability to extract meaningful signals from anonymized data through rigorous cross-validation and creative feature interactions.

The results of the Toss Next ML Challenge suggest that even in the absence of domain-specific context due to anonymization, meticulous feature engineering and robust tree-based architectures remain the gold standard for tabular data. For ML engineers, the competition underscores that the key to production-ready models lies in balancing complex ensembling with the strict latency requirements of real-time serving.
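The kind of hand-crafted sequence features described in this entry can be illustrated generically. The feature names, separator, and derivations below are assumptions for the sketch, not the competition's actual schema:

```python
from collections import Counter

def sequence_features(seq, sep=","):
    """Derive simple tabular features from a raw event-sequence string,
    in the spirit of the hand-engineered variables described above."""
    tokens = seq.split(sep)
    pairs = list(zip(tokens, tokens[1:]))  # consecutive event transitions
    transitions = Counter(pairs)
    return {
        "seq_length": len(tokens),
        "unique_tokens": len(set(tokens)),
        # probability that two consecutive events repeat the same token
        "repeat_prob": (sum(c for (a, b), c in transitions.items() if a == b)
                        / len(pairs)) if pairs else 0.0,
        "most_common_transition": transitions.most_common(1)[0][0] if pairs else None,
    }
```

Each derived column can then be fed to a boosting-tree model alongside the flattened table's anonymized features.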

discord

Bringing In-Game Commerce to Discord Communities

Discord is expanding its suite of developer tools by introducing direct commerce capabilities, allowing users to purchase and gift in-game items without leaving the platform. This evolution follows previous initiatives like the Quests ad format and the Social SDK, aiming to turn Discord from a communication hub into a comprehensive ecosystem for game discovery and monetization. By integrating transactions into official servers and friend interactions, the platform seeks to capitalize on high player engagement and community density.

### Evolution of Social and Engagement Tools

* Launched the Quests ad format in 2024 to help developers reach highly engaged player audiences through targeted discovery.
* Introduced the Discord Social SDK at GDC 2025, providing a social layer integration and account linking features.
* Reported that Social SDK integrations have resulted in up to a 48% increase in player session lengths.

### In-App Commerce and Gifting

* Enabled commerce features that allow players to browse, buy, and gift in-game items directly within chat windows and official game servers.
* Integrated "Friends’ Wishlists" to facilitate social gifting and peer-to-peer item discovery.
* Designed the experience to reduce friction by placing the storefront where players already spend their time communicating.

### Marvel Rivals Launch Partnership

* Debuted the commerce experience through a partnership with *Marvel Rivals*, leveraging its existing community of over 4 million members.
* Utilized the game's massive launch momentum—10 million players in the first 72 hours—to test the scalability of the new transaction system.
* Implemented the feature directly into the official Marvel Rivals server, allowing for immediate community-based browsing and purchasing.

For developers looking to maximize player lifetime value and engagement, these tools provide a streamlined path to conversion by leveraging existing social graphs. Integrating the Social SDK alongside these commerce features offers a dual benefit of longer playtimes and lower-friction monetization channels.

naver

Research on Protecting the Webtoon

Naver Webtoon is proactively developing technical solutions to safeguard its digital creation ecosystem against evolving threats like illegal distribution and unauthorized generative AI training. By integrating advanced AI-based watermarking and protective perturbation technologies, the platform successfully tracks content leaks and disrupts unauthorized model fine-tuning. These efforts ensure a sustainable environment where creators can maintain the integrity and economic value of their intellectual property.

## Challenges in the Digital Creation Ecosystem

- **Illegal Content Leakage**: Unauthorized reproduction and distribution of digital content infringe on creator earnings and damage the platform's business model.
- **Unauthorized Generative AI Training**: The rise of fine-tuning techniques (e.g., LoRA, Dreambooth) allows for the unauthorized mimicry of an artist's unique style, distorting the value of original works.
- **Harmful UGC Uploads**: The presence of violent or suggestive user-generated content increases operational costs and degrades the service experience for readers.

## AI-Based Watermarking for Post-Tracking

- To facilitate tracking in DRM-free environments, Naver Webtoon developed an AI-based watermarking system that embeds invisible signals into the pixels of digital images.
- The system is designed around three conflicting requirements: **Invisibility** (signal remains hidden), **Robustness** (signal survives attacks like cropping or compression), and **Capacity** (sufficient data for tracking).
- The technical pipeline involves three neural modules: an **Embedder** to insert the signal, a differentiable **Attack Layer** to simulate real-world distortions, and an **Extractor** to recover the signal.
- Performance metrics show a high Peak Signal-to-Noise Ratio (PSNR) of over 46 dB, and the system maintains a signal error rate of less than 1% even when subjected to intense signal processing or geometric editing.
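The PSNR figure cited above measures how invisible the embedded signal is. A minimal definition, operating on flat lists of pixel values for simplicity:

```python
import math

def psnr(original, watermarked, max_val=255.0):
    """Peak Signal-to-Noise Ratio between an original image and its
    watermarked copy; higher means the embedded signal is less visible."""
    mse = sum((o - w) ** 2 for o, w in zip(original, watermarked)) / len(original)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_val ** 2 / mse)
```

For 8-bit images, shifting every pixel by ±1 gives an MSE of 1 and a PSNR of about 48.1 dB, which puts the reported 46+ dB in context: the watermark perturbs pixels by roughly one intensity level on average.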
## IMPASTO: Disrupting Unauthorized AI Training

- This technology utilizes **protective perturbation**, which adds microscopic changes to images that are invisible to humans but confuse generative AI models during the training phase.
- It targets the way diffusion models (like Stable Diffusion) learn by either manipulating latent representations or disrupting the denoising process, preventing the AI from accurately mimicking an artist's style.
- The research prioritizes overcoming the visual artifacts and slow processing speeds found in existing academic tools like Glaze and PhotoGuard.
- By implementing these perturbations, any attempts to fine-tune a model on protected work will result in distorted or unintended outputs, effectively shielding the artist's original style.

## Integrated Protection Frameworks

- **TOONRADAR**: A comprehensive system deployed since 2017 that uses watermarking for both proactive blocking and retrospective tracking of illegal distributors.
- **XPIDER**: An automated detection tool tailored specifically for the comic domain to identify and block harmful UGC, reducing manual inspection overhead.
- These solutions are being expanded not just for copyright protection, but to establish long-term trust and reliability in the era of AI-generated content.

The deployment of these AI-driven defense mechanisms is essential for maintaining a fair creative economy. By balancing visual quality with robust protection, platforms can empower creators to share their work globally without the constant fear of digital theft or stylistic mimicry.

toss

Toss Income Tax Refund Service:

Toss Income’s QA team transitioned from traditional manual testing and rigid class-based Page Object Models (POM) to a stateless Functional POM to keep pace with rapid deployment cycles. This shift allowed them to manage complex tax refund logic and frequent UI changes with high reliability and minimal maintenance overhead. By treating automation as a modular assembly of functions, they successfully reduced verification times from four hours to twenty minutes while significantly increasing test coverage.

### Transitioning to Functional POM

* Replaced stateful classes and complex inheritance with stateless functions that receive a `page` object as input and return the updated `page` as output.
* Adopted a clear naming convention (e.g., `gotoLoginPage`, `enterPhonePin`, `verifyRefundAmount`) to ensure that test cases read like human-readable scenarios.
* Centralized UI selectors and interaction logic within these functions, allowing developers to update a single point of truth when UI text or button labels change.

### Modularizing the User Journey

* Segmented the complex tax refund process into four distinct modules: Login/Terms, Deduction Checks, Refund/Payment Info, and Reporting.
* Developed independent, reusable functions for specific data inputs—such as medical or credit card deductions—which can be assembled like "Lego blocks" to create new test scenarios rapidly.
* Decoupled business logic from UI interactions, enabling the team to create diverse test cases by simply varying parameters like amounts or dates.

### Robust Interaction and Page Management

* Implemented a 4-step "Robust Click Strategy" to eliminate flakiness caused by React rendering timings, sequentially trying an Enter key press, a standard click, a forced click, and finally a direct JavaScript execution.
* Created a `waitForNetworkIdleSafely` utility that prevents test failures during polling or background network activity by prioritizing UI anchors over strict network idleness.
* Standardized page transition handling with a `getLatestNonScrapePage` utility, ensuring the `currentPage` object always points to the most recent active tab or redirect window.

### Integration and Performance Outcomes

* Achieved a 600% increase in test coverage, expanding from 5 core scenarios to 35 comprehensive automated flows.
* Reduced the time required to respond to UI changes by 98%, as modifications are now localized to a single POM function rather than dozens of test files.
* Established a 24/7 automated validation system that provides immediate feedback on functional correctness, data integrity (tax amount accuracy), and performance metrics via dedicated communication channels.

For engineering teams operating in high-velocity environments, adopting a stateless, functional approach to test automation is a highly effective way to reduce technical debt. By focusing on modularity and implementing fallback strategies for UI interactions, teams can transform QA from a final bottleneck into a continuous, data-driven validation layer that supports rapid experimentation.
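The 4-step click fallback described in this entry can be sketched as a chain of progressively more forceful attempts. The sketch assumes a Playwright-style `page` object (its method names mirror Playwright's Python API), but it is an illustration of the pattern rather than Toss's actual utility:

```python
def robust_click(page, selector):
    """Try an Enter key press, a standard click, a forced click, and
    finally direct JavaScript execution, stopping at the first success.
    `page` is assumed to expose Playwright-like methods."""
    attempts = [
        lambda: page.press(selector, "Enter"),
        lambda: page.click(selector),
        lambda: page.click(selector, force=True),
        lambda: page.eval_on_selector(selector, "el => el.click()"),
    ]
    last_error = None
    for attempt in attempts:
        try:
            attempt()
            return True
        except Exception as exc:  # an attempt failed; escalate to the next
            last_error = exc
    raise RuntimeError(f"all click strategies failed for {selector}") from last_error
```

Because each fallback only runs when the gentler one fails, transient React rendering states degrade gracefully instead of flaking the whole test.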

toss

From Legacy Payment Ledger to Scalable System

Toss Payments successfully modernized a 20-year-old legacy payment ledger by transitioning to a decoupled, MySQL-based architecture designed for high scalability and consistency. By implementing strategies like INSERT-only immutability and event-driven domain isolation, they overcame structural limitations such as the inability to handle split payments. Ultimately, the project demonstrates that robust system design must be paired with resilient operational recovery mechanisms to manage the complexities of large-scale financial migrations.

### Legacy Ledger Challenges

* **Inconsistent Schemas:** Different payment methods used entirely different table structures; for instance, a table named `REFUND` unexpectedly contained only account transfer data rather than all refund types.
* **Domain Coupling:** Multiple domains (settlement, accounting, and payments) shared the same tables and columns, meaning a single schema change required impact analysis across several teams.
* **Structural Limits:** A rigid 1:1 relationship between a payment and its method prevented the implementation of modern features like split payments or "Dutch pay" models.

### New Ledger Architecture

* **Data Immutability:** The system shifted from updating existing rows to an **INSERT-only** principle, ensuring a reliable audit trail and preventing database deadlocks.
* **Event-Driven Decoupling:** Instead of direct database access, the system uses Kafka to publish payment events, allowing independent domains to consume data without tight coupling.
* **Payment-Approval Separation:** By separating the "Payment" (the transaction intent) from the "Approval" (the specific financial method), the system now supports multiple payment methods per transaction.

### Safe Migration and Data Integrity

* **Asynchronous Mirroring:** To maintain zero downtime, data was initially written to the legacy system and then asynchronously loaded into the new MySQL ledger.
* **Resource Tuning:** Developers used dedicated migration servers within the same AWS Availability Zone to minimize latency and implemented **Bulk Inserts** to handle hundreds of millions of rows efficiently.
* **Verification Batches:** A separate batch process ran every five minutes against a Read-Only (RO) database to identify and correct any data gaps caused by asynchronous processing failures.

### Operational Resilience and Incident Response

* **Query Optimization:** During a load spike, the MySQL optimizer chose "Full Scans" over indexes; the team resolved this by implementing SQL hints and utilizing a 5-version Docker image history for rapid rollbacks.
* **Network Cancellation:** To handle timeouts between Toss and external card issuers, the system uses specific logic to automatically send cancellation requests and synchronize states.
* **Timeout Standardization:** Discrepancies between microservices were resolved by calculating the maximum processing time of approval servers and aligning all upstream timeout settings to prevent merchant response mismatches.
* **Reliable Event Delivery:** While using the **Outbox pattern** for events, the team added log-based recovery (Elasticsearch and local disk) and idempotency keys in event headers to handle both missing and duplicate messages.

For organizations tackling significant technical debt, this transition highlights that initial design is only half the battle. True system reliability comes from building "self-healing" structures—such as automated correction batches and standardized timeout chains—that can survive the unpredictable nature of live production environments.
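The duplicate-message half of the reliable-delivery story comes down to an idempotency-key check on the consumer side. A minimal sketch — the header name and the in-memory set are assumptions (a real consumer would persist seen keys in durable storage):

```python
processed = set()  # assumed stand-in for durable deduplication storage

def handle_event(event, apply_fn):
    """Consume an at-least-once payment event safely: a duplicate
    (same idempotency key) is skipped, so retries and log-based
    re-publishing cannot double-apply the same ledger change."""
    key = event["headers"]["idempotency-key"]
    if key in processed:
        return False  # duplicate — already applied
    apply_fn(event["payload"])
    processed.add(key)
    return True
```

Paired with the Outbox pattern on the producer side, this makes delivery effectively exactly-once from the ledger's point of view: missing messages are re-sent, and re-sent messages are harmless.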

toss

In Search of the Toss Brand Symbol

Toss, a leading Korean fintech platform, embarked on a UX research journey to define its visual identity as it expanded from digital services into offline environments like Toss Pay payment stations. The study revealed that while users strongly associate the brand with seamless "usability," they lacked a single, clear mental image of a visual symbol. By analyzing user perceptions of fonts, colors, and shapes, Toss identified a specific visual formula—combining the app icon shape with a white, blue, and black palette—to ensure the brand remains instantly recognizable in the physical world.

## The Challenge of Offline Brand Recognition

* The project began with the need to design "danglers" (small signage at payment counters) to signal that Toss Pay is accepted at offline merchants.
* While Toss had successfully used various logo iterations online, the team realized that "Toss-ness" learned within the app might not automatically translate to unfamiliar offline environments.
* Initial internal debates focused on superficial visual tweaks, such as background colors or language choices, rather than understanding the core assets that trigger brand recognition.

## Identifying Usability as the Core Brand Image

* In-depth interviews were conducted with participants selected for their ability to articulate abstract brand impressions.
* Research showed that users primarily associate Toss with keywords like "clean," "practical," and "convenient," rather than specific aesthetic elements.
* One participant described Toss as a "program made by a genius engineer in Excel," highlighting that the brand’s value was rooted in its utility rather than a distinct visual symbol.
* This presented a challenge: since the "app experience" cannot be felt through a static offline sign, the team had to find a visual surrogate for that functional reliability.

## Deconstructing the Toss Symbol: Font, Color, and Shape

* **Font:** Testing revealed that the most recognizable font was the black English "toss" wordmark, primarily because users see it most often in external media and news rather than inside the app.
* **Color:** Surprisingly, users did not associate Toss with a single shade of blue. Instead, they recognized the specific combination of a "blue logo on a white background."
* **Logo:** When asked to draw the logo from memory, users consistently included a square border. This indicated that users perceive the brand’s "face" specifically as the smartphone app icon (the blue logo inside a rounded square) rather than the standalone logo mark.

## Implementing the "Toss Formula" in Design

* The research led to a refined brand identity formula: **White background + Black bold English font + Blue app-icon-shaped logo.**
* In the "10 to 100" 10th-anniversary campaign, the company shifted away from all-blue backgrounds in favor of this white-based combination to maximize recognition.
* Toss Pay payment screens were updated to remove blue backgrounds, adopting the white-and-black layout to align with how users intuitively identify the service.

For UX researchers and designers, this case demonstrates that brand identity is often a composite of environmental cues rather than a single graphic. When moving a digital-first brand into the physical world, it is essential to look beyond the logo and identify the specific "visual formula" that triggers the user's memory of the product experience.

kakao

YEYE is Watching –

Kakao developed YEYE, a dedicated Attack Surface Management (ASM) system, to proactively identify and manage the organization's vast digital footprint, including IPs, domains, and open ports. By integrating automated scanning with a human-led Daily Security Review (DSR) process, the platform transforms raw asset data into actionable security intelligence. This holistic approach ensures that potential entry points are identified and secured before they can be exploited by external threats.

## The YEYE Asset Management Framework

* Defines attack surfaces broadly to include every external-facing digital asset, such as subdomains, API endpoints, and mobile APKs.
* Categorizes assets using a standardized taxonomy based on scope (In/Out/Undefined), type (Domain/IP/Service), and identification status (Known/Unknown/3rd Party).
* Implements a labeling system that converts diverse data formats from multiple sources into a simplified, unified structure for better visibility.
* Establishes multi-dimensional relationships between assets, CVEs, certificates, and departments, allowing teams to instantly identify which business unit is responsible for a newly discovered vulnerability.

## Daily Security Review (DSR)

* Operates on the principle that "security is a process, not a product," bridging the gap between automated detection and manual remediation.
* Utilizes a rotating group system where security engineers review external feeds, public vulnerability news, and YEYE alerts every morning.
* Focuses on detecting "shadow IT" or assets deployed without formal security reviews to ensure all external touchpoints are accounted for.

## Scalable and Efficient Scanning Architecture

* Resolved internal network bandwidth bottlenecks by adopting a hybrid infrastructure that leverages public cloud resources for high-concurrency scanning tasks.
* Developed a custom distributed scanning structure using schedulers and queues to manage multiple independent workers, overcoming the limitations of single-process open-source scanners.
* Optimized infrastructure costs by identifying the "sweet spot" in server specifications, favoring the horizontal expansion of medium-spec servers over expensive, high-performance hardware.
* Mitigates service impact and false alarms by using fixed IPs and custom User-Agent (UA) strings, allowing service owners to distinguish YEYE’s security probes from actual malicious traffic.

To effectively manage a growing attack surface, organizations should combine automated asset discovery with a structured manual review process. Prioritizing data standardization and relationship mapping between assets and vulnerabilities is essential for rapid incident response and long-term infrastructure hardening.
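The scheduler-and-queue worker structure can be sketched with the standard library. This is a simplified single-process illustration under assumed names — real workers would be separate servers pulling targets from a shared queue:

```python
import queue
import threading

def run_scans(targets, scan_fn, workers=4):
    """Fan scan targets out to independent workers through a queue.
    The point of the structure is horizontal scaling: throughput grows
    by adding workers, not by buying a bigger single scanner."""
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    for target in targets:
        tasks.put(target)

    def worker():
        while True:
            try:
                target = tasks.get_nowait()
            except queue.Empty:
                return  # no more targets — this worker is done
            outcome = scan_fn(target)
            with lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In practice `scan_fn` would wrap the actual port/service probe, sent from fixed IPs with a custom User-Agent so owners can recognize the traffic.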

toss

Toss Payments' Open API

Toss Payments treats its Open API not just as a communication tool, but as a long-term infrastructure designed to support over 200,000 merchants for decades. By focusing on resource-oriented design and developer experience, the platform ensures that its interfaces remain intuitive, consistent, and easy to maintain. This strategic approach prioritizes structural stability and clear communication over mere functionality, fostering a reliable ecosystem for both developers and businesses.

### Resource-Oriented Interface Design

* The API follows a predictable path structure (e.g., `/v1/payments/{id}`) where the root indicates the version, followed by the domain and a unique identifier.
* Request and response bodies utilize structured JSON with nested objects (like `card` or `cashReceipt`) to modularize data and reduce redundancy.
* Consistency is maintained by reusing the same domain objects across different APIs, such as payment approval, inquiry, and cancellation, which minimizes the learning curve for external developers.
* Data representation shifts from cryptic legacy codes (e.g., SC0010) to human-readable strings, supporting localization into multiple languages via the `Accept-Language` HTTP header.
* Standardized error handling utilizes HTTP status codes paired with a JSON error object containing specific `code` and `message` fields, allowing developers to either display messages directly or implement custom logic.

### Asynchronous Communication via Webhooks

* Webhooks are provided alongside standard APIs to handle asynchronous events where immediate responses are not possible, such as status changes in complex payment flows.
* Event types are clearly categorized (e.g., `PAYMENT_STATUS_CHANGED`), and the payloads mirror the exact resource structures used in the REST APIs to simplify parsing.
* The system ensures reliability by implementing an Exponential Backoff strategy for retries, preventing network congestion during recipient service outages.
* A dedicated developer center allows merchants to register custom endpoints, monitor transmission history, and perform manual retries if automated attempts fail.

### External Ecosystem and Documentation Automation

* Developer Experience (DX) is treated as the core metric for API quality, focusing on how quickly and efficiently a developer can integrate and operate the service.
* To prevent the common issue of outdated manuals, Toss Payments uses a documentation automation system based on the OpenAPI Specification (OAS).
* By utilizing libraries like `springdoc`, the platform automatically syncs the technical documentation with the actual server code, ensuring that parameters, schemas, and endpoints are always up-to-date and trustworthy.

To ensure the longevity of a high-traffic Open API, organizations should prioritize automated documentation and resource-based consistency. Moving away from cryptic codes toward human-readable, localized data and providing robust asynchronous notification tools like webhooks are essential steps for building a developer-friendly infrastructure.
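The exponential-backoff retry for webhook delivery might look like the following sketch. The base delay, cap, and attempt limit are illustrative assumptions, not Toss Payments' actual settings:

```python
import time

def deliver_with_backoff(send, payload, base=1.0, cap=60.0,
                         max_attempts=5, sleep=time.sleep):
    """Retry a webhook delivery, doubling the wait after each failure
    (up to a cap) so an unavailable merchant endpoint is not hammered.
    `send` should return True on a successful (2xx) response."""
    delay = base
    for attempt in range(1, max_attempts + 1):
        if send(payload):
            return attempt  # number of attempts it took
        if attempt < max_attempts:
            sleep(delay)
            delay = min(cap, delay * 2)
    return None  # exhausted — left for manual retry in the developer center
```

Returning `None` rather than raising mirrors the described flow: automated retries give up quietly, and the failed transmission surfaces in the developer center for a manual retry.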

naver

Naver TV

Naver’s Integrated Search team is transitioning from manual fault response to an automated system using LLM Agents to manage the increasing complexity of search infrastructure. By integrating Large Language Models into the DevOps pipeline, the system evolves through accumulated experience, moving beyond simple alert monitoring to intelligent diagnostic analysis and action recommendation. ### Limitations of Traditional Fault Response * **Complex Search Flows:** Naver’s search architecture involves multiple interdependent layers, which makes manual root cause analysis slow and prone to human error. * **Fragmented Context:** Existing monitoring requires developers to manually synthesize logs and metrics from disparate telemetry sources, leading to high cognitive load during outages. * **Delayed Intervention:** Human-led responses often suffer from a "detection-to-action" lag, especially during high-traffic periods or subtle service regressions. ### Architecture of DevOps Agent v1 * **Initial Design:** Focused on automating basic data gathering and providing preliminary textual reports to engineers. * **Infrastructure Integration:** Built using a specialized software stack designed to bridge frontend (FE) and backend (BE) telemetry within the search infrastructure. * **Standardized Logic:** The v1 agent operated on a fixed set of instructions to perform predefined diagnostic tasks when triggered by specific system alarms. ### Evolution to DevOps Agent v2 * **Overcoming V1 Limitations:** The first iteration struggled with maintaining deep context and providing diverse actionable insights, necessitating a more robust agentic structure. * **Enhanced Memory and Learning:** V2 incorporates a more sophisticated architecture that allows the agent to reference historical failure data and learn from past incident resolutions. 
* **Advanced Tool Interaction:** The system was upgraded to handle more complex tool-calling capabilities, allowing the agent to interact more deeply with internal infrastructure APIs.

### System Operations and Evaluation

* **Trigger Queue Management:** Implements a queuing system to efficiently process and prioritize multiple concurrent system alerts without overwhelming the diagnostic pipeline.
* **Anomaly Detection:** Utilizes advanced detection methods to distinguish between routine traffic fluctuations and genuine service anomalies that require LLM intervention.
* **Rigorous Evaluation:** The agent’s performance is measured through a dedicated evaluation framework that assesses the accuracy of its diagnoses against known ground-truth incidents.

### Scaling and Future Challenges

* **Context Expansion:** Efforts are focused on integrating a wider range of metadata and environmental context to provide a holistic view of system health.
* **Action Recommendation:** The system is moving toward suggesting specific recovery actions, such as rollbacks or traffic rerouting, rather than just identifying the problem.
* **Sustainability:** Ensuring the DevOps Agent remains maintainable and cost-effective as the underlying search infrastructure and LLM models continue to evolve.

Organizations managing high-scale search traffic should consider LLM-based agents as integrated infrastructure components rather than standalone tools. Moving from reactive monitoring to a proactive, experience-based agent system is essential for reducing the mean time to recovery (MTTR) in complex distributed environments.
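The trigger-queue idea above can be sketched as a severity-ordered queue that drains the most urgent alerts first. This is an assumption-laden sketch: the source does not describe Naver's prioritization criteria, and `Alert` and `TriggerQueue` are illustrative names.

```kotlin
import java.util.PriorityQueue

// An incoming system alert; higher severity means more urgent (assumed scale).
data class Alert(val source: String, val severity: Int)

class TriggerQueue {
    // Order alerts by descending severity so critical incidents surface first.
    private val queue = PriorityQueue<Alert>(compareByDescending<Alert> { it.severity })

    fun enqueue(alert: Alert) {
        queue.add(alert)
    }

    // Hand alerts to the diagnostic pipeline most-urgent-first, so routine
    // fluctuations never starve a genuine incident of LLM attention.
    fun drain(): List<Alert> = buildList {
        while (queue.isNotEmpty()) add(queue.poll())
    }
}
```

In practice such a queue would also deduplicate repeated alarms and rate-limit hand-offs to the agent, but ordering by urgency is the core of keeping concurrent alerts from overwhelming the pipeline.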

line

Code Quality Improvement Techniques Part 25

Effective code review communication relies on a "conclusion-first" approach to minimize cognitive load and ensure clarity for the developer. By stating proposed changes or specific requests before providing the underlying rationale, reviewers help authors understand the primary goal of the feedback immediately. This practice improves development productivity by making review comments easier to parse and act upon without repeated reading.

### Optimizing Review Comment Structure

* Place the core suggestion or requested code change at the very beginning of the comment to establish immediate context.
* Follow the initial request with a structured explanation, utilizing headers or numbered lists to organize multiple supporting arguments.
* Clearly distinguish between the "what" (the requested change) and the "why" (the technical justification) to prevent the intended action from being buried in a long technical discussion.
* Use visual formatting to help the developer quickly validate the logic behind the suggestion once they understand the proposed change.

### Immutability and Data Class Design

* Prefer the use of `val` over `var` in Kotlin `data class` structures to ensure object immutability.
* Using immutable properties prevents bugs associated with unintended side effects that occur when mutable objects are shared across different parts of an application.
* Instead of reassigning values to a mutable property, utilize the `copy()` function to create a new instance with updated state, which results in more robust and predictable code.
* Avoid mixing `var` properties with `data class` features, as this can lead to confusion regarding whether to modify the existing instance or create a copy.

### Property Separation by Lifecycle

* Analyze the update frequency of different properties within a class to identify those with different lifecycles.
* Decouple frequently updated status fields (such as `onlineStatus` or `statusMessage`) from more stable attributes (such as `userId` or `accountName`) by moving them into separate classes.
* Grouping properties by their lifecycle prevents unnecessary updates to stable data and makes the data model easier to maintain as the application scales.

To maintain high development velocity, reviewers should prioritize brevity and structure in their feedback. Leading with a clear recommendation and supporting it with organized technical reasoning ensures that code reviews remain a tool for progress rather than a source of confusion.
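A minimal Kotlin sketch of the two patterns above: immutable `val` properties updated via `copy()`, and lifecycle-based separation of volatile status from stable identity. The property names follow the examples in the text; the class names and `goOffline` helper are illustrative.

```kotlin
// Stable attributes: rarely change over the object's lifetime.
data class Account(val userId: String, val accountName: String)

// Frequently updated status lives in its own class, still with immutable
// `val` properties; every update goes through copy() instead of mutation.
data class Presence(val onlineStatus: Boolean, val statusMessage: String)

// Updating state produces a new instance; the original is untouched,
// so sharing a Presence across the app cannot cause hidden side effects.
fun goOffline(p: Presence): Presence =
    p.copy(onlineStatus = false)
```

Because `Presence` changes far more often than `Account`, callers that only care about identity never see churn from status updates, and there is no ambiguity about whether to mutate in place or copy.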

toss

The era when everyone does research

In an era where AI moderators and non-researchers handle the bulk of data collection, the role of the UX researcher has shifted from a technical specialist to a strategic guide. The core value of the researcher now lies in "UX Leadership"—the ability to frame problems, align team perspectives, and define the fundamental identity of a product. By bridging the gap between business goals and user needs, researchers ensure that products solve real problems rather than just chasing metrics or technical feasibility.

### Setting the Framework in the Idea Phase

When starting a new project, a researcher’s primary task is to establish the "boundaries of the puzzle" by shifting the team’s focus from business impact to user value.

* **Case - AI Signal:** For a service that interprets stock market events using AI, the team initially focused on business metrics like retention and news consumption.
* **Avoiding "Metric Traps":** A researcher intervenes to prevent fatigue-inducing UX (e.g., excessive notifications to boost CTR) by defining the "North Star" as the specific problem the user is trying to solve.
* **The Checklist:** Once the user problem and value are defined, they serve as a persistent checklist for every design iteration and action item.

### Aligning Team Direction for Product Improvements

When a product already exists but needs improvement, different team members often have scattered, subjective opinions on what to fix. The researcher structures these thoughts into a cohesive direction.

* **Case - Stock Market Calendar:** While the team suggested UI changes like "it doesn't look like a calendar," the researcher refocused the effort on the user's ultimate goal: making better investment decisions.
* **Defining Success Criteria:** The team agreed on a "Good Usage" standard based on three stages: Awareness (recognizing issues) → Understanding (why it matters) → Preparation (adjusting investment plans).
* **Identifying Obstacles:** By identifying specific friction points—such as the lack of information hierarchy or the difficulty of interpreting complex indicators—the researcher moves the project from "simple UI cleanup" to "essential tool development."

### Redefining Product Identity During Stagnation

When a product's growth stalls, the issue often isn't a specific UI bug but a fundamental mismatch between the product's identity and its environment.

* **Case - Toss Securities PC:** Despite being functional, the PC version struggled because it initially tried to copy the "mobile simplicity" of the app.
* **Contextual Analysis:** Research revealed that while mobile users value speed and portability, PC users require an environment for deep analysis, multi-window comparisons, and deliberate decision-making.
* **Consensus through Synthesis:** The researcher integrates data, user interviews, and market trends into workshops to help the team decide where the product should "live" in the market. This process creates team-wide alignment on a new strategic direction rather than just fixing features.

The modern UX researcher must move beyond "crafting the tool" (interviewing and data gathering) and toward "UX Leadership." True expertise involves maintaining a broad view of the industry and product ecosystem, structuring team discussions to reach a consensus, and ensuring that every product decision is rooted in a clear understanding of the user's context and goals.