당근

10 posts

medium.com/daangn

daangn

Redux for the Server: Developing a

Traditional CRUD-based architectures often struggle to meet complex backend requirements such as audit logging, version history, and state rollbacks. To address these challenges, Daangn’s Frontend Core team developed **Ventyd**, an open-source TypeScript library that implements event sourcing on the server using patterns familiar to Redux users. By shifting the focus from storing "current state" to storing a "history of events," developers can build more traceable and resilient systems.

### Limitations of Traditional CRUD

* Standard CRUD (Create, Read, Update, Delete) patterns only record the final state of data, losing the context of "why" or "how" a change occurred.
* Implementing complex features like approval workflows or history tracking usually requires manual table management, such as adding `status` columns or creating separate history tables.
* Rollback logic in CRUD is often fragile and requires complex custom code to revert data to a previous specific state.

### The Event Sourcing Philosophy

* Instead of overwriting rows in a database, event sourcing records every discrete action (e.g., "Post Created," "Post Approved," "Profile Updated") as an immutable sequence.
* The system provides a built-in audit log, ensuring every change is attributed to a specific user, time, and reason.
* State can be reconstructed for any point in time by "replaying" events, enabling seamless "time travel" and easier debugging.
* It allows for deeper business insights by providing a full narrative of data changes rather than just a snapshot.

### Redux as a Server-Side Blueprint

* The library leverages the familiarity of Redux to bridge the gap between frontend and backend engineering.
* Just as Redux uses **Actions** and **Reducers** to manage state in the browser, event sourcing uses **Events** and **Reducers** to manage state in the database.
* The primary difference is persistence: Redux manages state in memory, while Ventyd persists the event stream to a database for permanent storage.

### Technical Implementation with Ventyd

* **Type-Safe Schemas**: Developers use `defineSchema` to define the shape of both the events and the resulting state, ensuring strict TypeScript validation.
* **Validation Library Support**: Ventyd is flexible, supporting various validation libraries including Valibot, Zod, TypeBox, and ArkType.
* **Reducer Logic**: The `defineReducer` function centralizes how the state evolves based on incoming events, making state transitions predictable and easy to test.
* **Database Agnostic**: The library is designed to be flexible regarding the underlying storage, allowing it to integrate with different database systems.

Ventyd offers a robust path for teams needing more than what basic CRUD can provide, particularly for internal tools requiring high accountability. By adopting this event-driven approach, developers can simplify the implementation of complex business logic while maintaining a clear, type-safe history of every action within their system.

The Journey of Karrot Pay

Daangn Pay’s backend evolution demonstrates how software architecture must shift from a focus on development speed to a focus on long-term sustainability as a service grows. Over four years, the platform transitioned from a simple layered structure to a complex monorepo powered by Hexagonal and Clean Architecture principles to manage increasing domain complexity. This journey highlights that technical debt is often the price of early success, but structural refactoring is essential to support organizational scaling and maintain code quality.

## Early Speed with Layered Architecture

* The initial system was built using a standard Controller-Service-Repository pattern to meet the urgent deadline for obtaining an electronic financial business license.
* This simple structure allowed for rapid development and the successful launch of core remittance and wallet features.
* As the service expanded to include promotions, billing, and points, the "Service" layer became overloaded with cross-cutting concerns like validation and permissions.
* The lack of strict boundaries led to circular dependencies and "spaghetti code," making the system fragile and difficult to test or refactor.

## Decoupling Logic via Hexagonal Architecture

* To address the tight coupling between business logic and infrastructure, the team adopted a Hexagonal (Ports and Adapters) approach.
* The system was divided into three distinct modules: `domain` (pure POJO rules), `usecase` (orchestration of scenarios), and `adapter` (external implementations like DBs and APIs).
* This separation ensured that core business logic remained independent of the Spring Framework or specific database technologies.
* While this solved dependency issues and improved reusability across REST APIs and batch jobs, it introduced significant boilerplate code and the complexity of mapping between different data models (e.g., domain entities vs. persistence entities).

## Scaling to a Monorepo and Clean Architecture

* As Daangn Pay grew from a single project into dozens of services handled by multiple teams, a Monorepo structure was implemented using Gradle multi-projects.
* The architecture evolved to separate "Domain" modules (pure business logic) from "Service" modules (the actual runnable applications like API servers or workers).
* An "Internal-First" policy was adopted, where modules are private by default and can only be accessed through explicitly defined public APIs to prevent accidental cross-domain contamination.
* This setup currently manages over 30 services, providing a balance between code sharing and strict boundary enforcement between domains like Money, Billing, and Points.

The evolution of Daangn Pay’s architecture serves as a practical reminder that there is no "perfect" architecture from the start; rather, the best design is one that adapts to the current size of the organization and the complexity of the business. Engineers should prioritize flexibility and structural constraints that guide developers toward correct patterns, ensuring the codebase remains manageable even as the team and service scale.
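The `domain` / `usecase` / `adapter` split can be made concrete with a minimal ports-and-adapters sketch. Daangn Pay's actual stack is Kotlin/Java on Spring; this Python version, with hypothetical `Wallet`, `WalletPort`, and `transfer` names, only illustrates the dependency direction: the use case depends on a port, never on a concrete backend.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Wallet:                          # `domain`: pure rules, no framework imports
    owner: str
    balance: int

    def withdraw(self, amount: int) -> None:
        if amount > self.balance:
            raise ValueError("insufficient balance")
        self.balance -= amount

class WalletPort(ABC):                 # port: what the use case needs, not how
    @abstractmethod
    def load(self, owner: str) -> Wallet: ...
    @abstractmethod
    def save(self, wallet: Wallet) -> None: ...

def transfer(port: WalletPort, src: str, dst: str, amount: int) -> None:
    """`usecase`: orchestrates one scenario against the port interface only."""
    a, b = port.load(src), port.load(dst)
    a.withdraw(amount)
    b.balance += amount
    port.save(a)
    port.save(b)

class InMemoryWalletAdapter(WalletPort):   # `adapter`: one concrete backend
    def __init__(self, wallets: dict[str, Wallet]):
        self.wallets = wallets
    def load(self, owner: str) -> Wallet:
        return self.wallets[owner]
    def save(self, wallet: Wallet) -> None:
        self.wallets[wallet.owner] = wallet

store = InMemoryWalletAdapter({"kim": Wallet("kim", 100), "lee": Wallet("lee", 0)})
transfer(store, "kim", "lee", 40)
print(store.wallets["kim"].balance, store.wallets["lee"].balance)  # 60 40
```

Swapping `InMemoryWalletAdapter` for a database-backed adapter leaves `Wallet` and `transfer` untouched — which is exactly the independence from Spring and specific databases the post describes, at the cost of the extra mapping boilerplate it also mentions.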

Daangn's User Behavior

Daangn transitioned its user behavior log management from a manual, code-based Git workflow to a centralized UI platform called Event Center to improve data consistency and operational efficiency. By automating schema creation and enforcing standardized naming conventions, the platform reduced the technical barriers for developers and analysts while ensuring high data quality for downstream analysis. This transition has streamlined the entire data lifecycle, from collection in the mobile app to structured storage in BigQuery.

### Challenges of Code-Based Schema Management

Prior to Event Center, Daangn managed its event schemas—definitions that describe the ownership, domain, and custom parameters of a log—using Git and manual JSON files. This approach created several bottlenecks for the engineering team:

* **High Entry Barrier**: Users were required to write complex Spark `StructType` JSON files, which involved managing nested structures and specific metadata fields like `nullable` and `type`.
* **Inconsistent Naming**: Without a central enforcement mechanism, event names followed different patterns (e.g., `item_click` vs. `click_item`), making it difficult for analysts to discover relevant data.
* **Operational Friction**: Every schema change required a Pull Request (PR), manual review by the data team, and a series of CI checks, leading to slow iteration cycles and frequent communication overhead.

### The User Behavior Log Pipeline

To support data-driven decision-making, Daangn employs a robust pipeline that processes millions of events daily through several critical stages:

* **Collection and Validation**: Events are sent from the mobile SDK to an event server, which performs initial validation before passing data to GCP Pub/Sub.
* **Streaming Processing**: GCP Dataflow handles real-time deduplication, field validation, and data transformation (flattening) to prepare logs for storage.
* **Storage and Accessibility**: Data is stored in Google Cloud Storage and BigQuery, where custom parameters defined in the schema are automatically expanded into searchable columns, removing the need for complex JSON parsing in SQL.

### Standardizing Discovery via Event Center

The Event Center platform was designed to transform log management into a user-friendly, UI-driven experience while maintaining technical rigor.

* **Standardized Naming Conventions**: The platform enforces a strict "Action-Object-Service" naming rule, ensuring that all events are categorized logically across the entire organization.
* **Recursive Schema Builder**: To handle the complexity of nested JSON data, the team built a UI component that uses a recursive tree structure, allowing users to define deep data hierarchies without writing code.
* **Centralized Dictionary**: The platform serves as a "single source of truth" where any employee can search for events, view their descriptions, and identify the team responsible for specific data points.

### Technical Implementation and Integration

The system architecture was built to bridge the gap between a modern web UI and the existing Git-based infrastructure.

* **Tech Stack**: The backend is powered by Go (Gin framework) and PostgreSQL (GORM), while the frontend utilizes React, TypeScript, and TanStack Query for state management.
* **Automated Git Sync**: When a user saves a schema in Event Center, the system automatically triggers a GitHub Action that generates the necessary JSON files and pushes them to the repository, maintaining the codebase as the ultimate source of truth while abstracting the complexity.
* **Real-time Validation**: The UI provides immediate feedback on data types and naming errors, preventing invalid schemas from reaching the production pipeline.

Implementing a dedicated log management platform like Event Center is highly recommended for organizations scaling their data operations. Moving away from manual file management to a UI-based system not only reduces the risk of human error but also democratizes data access by allowing non-engineers to define and discover the logs they need for analysis.
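The recursive expansion that Event Center automates — turning nested parameter definitions into Spark `StructType`-style JSON with `name`, `type`, and `nullable` fields — can be sketched as below. The exact field layout is an assumption for illustration, not Daangn's real schema format (and the production backend is Go, not Python).

```python
import json

def to_struct_field(name: str, spec) -> dict:
    """Recursively turn {"field": "type" | nested dict} into StructType-style JSON."""
    if isinstance(spec, dict):                 # nested object -> struct
        return {
            "name": name,
            "type": {
                "type": "struct",
                "fields": [to_struct_field(k, v) for k, v in spec.items()],
            },
            "nullable": True,
        }
    return {"name": name, "type": spec, "nullable": True}

# A user-defined custom-parameter tree, as the UI's recursive builder might emit it.
params = {"item_id": "long", "price": {"amount": "long", "currency": "string"}}
schema = {
    "type": "struct",
    "fields": [to_struct_field(k, v) for k, v in params.items()],
}
print(json.dumps(schema, indent=2))
```

Because the same function handles a leaf and a subtree, arbitrarily deep hierarchies come out correct without the user ever hand-writing the nested JSON that made the old Git workflow error-prone.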

Why did Karrot make

Daangn transitioned from manually calculating user activation metrics to a centralized "Activation Layer" built on DBT to solve inconsistencies and high operational overhead. By standardizing the definitions of user states and transitions, the team provides a reliable foundation for analyzing why active user counts fluctuate rather than just reporting the final numbers. This common data layer improves data reliability and cost-efficiency while allowing various teams to reuse the same logic for different core user behaviors.

### The Role of User Activation Analysis

* While Active User counts show "what" happened, User Activation explains "why" by breaking users down into specific categories.
* The system tracks **Activation States**, classifying users as New, Retained, Reactivated, or Inactive at any given time.
* It monitors **State Transitions** to identify how users move between categories, such as "New to Retained" or "Reactivated to Inactive."
* The layer provides granular behavioral metadata, including continuous activity streaks, the interval between visits, and the duration of churned periods.

### Ensuring Reliability via Fact Models

* Raw event logs are often tied to specific UI elements and contain "noise" that makes them unreliable for direct activation analysis.
* To ensure consistency, the Activation Layer uses **Fact Models** as its primary input, which are refined datasets where business logic and core behaviors are already defined.
* A strict naming convention (`fact_name_activation_time_grain`) is enforced so that users can immediately identify which specific behavior is being analyzed.
* This structure ensures that "Active" status is interpreted identically across the entire organization, regardless of which team is performing the analysis.

### Incremental Processing for Cost Efficiency

* Calculating the entire history of user activity every day is computationally expensive and leads to high cloud infrastructure costs.
* The architecture utilizes a **FirstLast model** to store only the essential metadata for each user: the date of their very first activity and their most recent activity.
* By joining daily activity logs with this lightweight FirstLast table, the system can calculate new states and transitions incrementally.
* This approach maintains data idempotency and ensures high performance even as the volume of user interaction data grows.

### Scaling with DBT Macros

* To support various metrics—such as app visits, item sales, or community posts—the team encapsulated the complex transition logic into **DBT Macros**.
* This abstraction allows data engineers to generate a new activation model by simply specifying the source Fact model and the desired time grain (daily, weekly, or monthly).
* Centralizing the logic in macros ensures that any bug fixes or improvements to the activation calculation are automatically reflected across all related data models.
* The standardized output format allows for the creation of universal dashboards and analysis templates that work for any tracked behavior.

Centralizing User Activation logic into a common data layer allows organizations to move beyond surface-level vanity metrics and gain deep, actionable behavioral insights. By combining DBT’s macro capabilities with incremental modeling, teams can maintain high data quality and operational efficiency even as the variety of tracked user behaviors expands.

The Journey to Karrot Pay’

Daangn Pay has evolved its Fraud Detection System (FDS) from a traditional rule-based architecture to a sophisticated AI-powered framework to better protect user assets and combat evolving financial scams. By implementing a modular rule engine and integrating Large Language Models (LLMs), the platform has significantly reduced manual review times and improved its response to emerging fraud trends. This transition allows for consistent, context-aware risk assessment while maintaining compliance with strict financial regulations.

### Modular Rule Engine Architecture

* The system is built on a "Lego-like" structure consisting of three components: Conditions (basic units like account age or transfer frequency), Rules (logical combinations of conditions), and Policies (groups of rules with specific sanction levels).
* This modularity allows non-developers to adjust thresholds—such as changing a "30-day membership" requirement to "70 days"—in real-time to respond to sudden shifts in fraud patterns.
* Data flows through two distinct paths: a Synchronous API for immediate blocking decisions (e.g., during a live transfer) and an Asynchronous Stream for high-volume, real-time monitoring where slight latency is acceptable.

### Risk Evaluation and Post-Processing

* Events undergo a structured pipeline beginning with ingestion, followed by multi-layered evaluation through the rule engine to determine the final risk score.
* The post-processing phase incorporates LLM analysis to evaluate behavioral context, which is then used to trigger alerts for human operators or apply automated user sanctions.
* Implementation of this engine led to a measurable decrease in information requests from financial and investigative authorities, indicating a higher rate of internal prevention.

### LLM Integration for Contextual Analysis

* To solve the inconsistency and time lag of manual reviews—which previously took between 5 and 20 minutes per case—Daangn Pay integrated Claude 3.5 Sonnet via AWS Bedrock.
* The system overcomes strict financial "network isolation" regulations by utilizing an "Innovative Financial Service" designation, allowing the use of cloud-based generative AI within a regulated environment.
* The technical implementation uses a specialized data collector that pulls fraud history from BigQuery into a Redis cache to build structured, multi-step prompts for the LLM.
* The AI provides evaluations in a structured JSON format, assessing whether a transaction is fraudulent based on specific criteria and providing the reasoning behind the decision.

The combination of a flexible, rule-based foundation and context-aware LLM analysis demonstrates how fintech companies can scale security operations. For organizations facing high-volume fraud, the modular approach ensures immediate technical agility, while AI integration provides the nuanced judgment necessary to handle complex social engineering tactics.
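The Condition → Rule → Policy layering can be sketched in a few lines. All names and thresholds here are illustrative stand-ins, not Daangn Pay's real fraud rules; the point is that a threshold change (the "30 days to 70 days" adjustment mentioned above) is just a new Condition value, with no engine code touched.

```python
from dataclasses import dataclass
from typing import Callable

Condition = Callable[[dict], bool]   # basic unit, evaluated against one event

def account_younger_than(days: int) -> Condition:
    return lambda e: e["account_age_days"] < days

def transfers_over(count: int) -> Condition:
    return lambda e: e["transfers_last_hour"] > count

@dataclass
class Rule:                          # logical AND of its conditions
    name: str
    conditions: list
    def matches(self, event: dict) -> bool:
        return all(c(event) for c in self.conditions)

@dataclass
class Policy:                        # rules grouped under one sanction level
    sanction: str                    # e.g. "block", "review"
    rules: list
    def evaluate(self, event: dict):
        if any(r.matches(event) for r in self.rules):
            return self.sanction
        return None                  # no rule fired -> no sanction

policy = Policy("review", [Rule("new-account-burst",
                                [account_younger_than(30), transfers_over(5)])])
print(policy.evaluate({"account_age_days": 10, "transfers_last_hour": 8}))  # review
print(policy.evaluate({"account_age_days": 90, "transfers_last_hour": 8}))  # None
```

The same policy objects can back both paths the post describes: the synchronous API evaluates one event inline before a transfer completes, while the asynchronous stream runs the identical `evaluate` over high-volume monitoring traffic.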

Karrot's Gen

Daangn has scaled its Generative AI capabilities from a few initial experiments to hundreds of diverse use cases by building a robust, centralized internal infrastructure. By abstracting model complexity and empowering non-technical stakeholders, the company has optimized API management, cost tracking, and rapid product iteration. The resulting platform ecosystem allows the organization to focus on delivering product value while minimizing the operational overhead of managing fragmented AI services.

### Centralized API Management via LLM Router

Initially, Daangn faced challenges with fragmented API keys, inconsistent rate limits across teams, and the inability to track total costs across multiple providers like OpenAI, Anthropic, and Google. The LLM Router was developed as an "AI Gateway" to consolidate these resources into a single point of access.

* **Unified Authentication:** Service teams no longer manage individual API keys; they use a unique Service ID to access models through the router.
* **Standardized Interface:** The router uses the OpenAI SDK as a standard interface, allowing developers to switch between models (e.g., from Claude to GPT) by simply changing the model name in the code without rewriting implementation logic.
* **Observability and Cost Control:** Every request is tracked by service ID, enabling the infrastructure team to monitor usage limits and integrate costs directly into the company’s internal billing platform.

### Empowering Non-Engineers with Prompt Studio

To remove the bottleneck of needing an engineer for every prompt adjustment, Daangn built Prompt Studio, a web-based platform for prompt engineering and testing. This tool enables PMs and other non-developers to iterate on AI features independently.

* **No-Code Experimentation:** Users can write prompts, select models (including internally served vLLM models), and compare outputs side-by-side in a browser-based UI.
* **Batch Evaluation:** The platform includes an Evaluation feature that allows users to upload thousands of test cases to quantitatively measure how prompt changes impact output quality across different scenarios.
* **Direct Deployment:** Once a prompt is finalized, it can be deployed via API with a single click. Engineers only need to integrate the Prompt Studio API once, after which non-engineers can update the prompt or model version without further code changes.

### Ensuring Service Reliability and Stability

Because third-party AI APIs can be unstable or subject to regional outages, the platform incorporates several safety mechanisms to ensure that user-facing features remain functional even during provider downtime.

* **Automated Retries:** The system automatically identifies retry-able errors and re-executes requests to mitigate temporary API failures.
* **Region Fallback:** To bypass localized outages or rate limits, the platform can automatically route requests to different geographic regions or alternative providers to maintain service continuity.

### Recommendation

For organizations scaling AI adoption, the Daangn model suggests that investing early in a centralized gateway and a no-code prompt management environment is essential. This approach not only secures API management and controls costs but also democratizes AI development, allowing product teams to experiment at a pace that is impossible when tied to traditional software release cycles.
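The gateway behaviors above — one entry point, per-service tracking, retry on transient errors, fallback across regions — can be sketched minimally. The backend functions below are stand-ins, not real provider SDKs, and the retry/fallback policy is a simplified assumption about how such a router might behave.

```python
class TransientError(Exception):
    """A retry-able failure, e.g. a rate limit or a regional blip."""

def route_request(service_id: str, prompt: str, backends, max_retries=2,
                  usage_log=None):
    """Try each backend in order; retry transient failures before failing over."""
    for backend in backends:
        for _attempt in range(max_retries + 1):
            try:
                reply = backend(prompt)
                if usage_log is not None:        # per-service cost/usage tracking
                    usage_log.append((service_id, backend.__name__))
                return reply
            except TransientError:
                continue                         # retry-able: try this backend again
    raise RuntimeError("all regions/providers failed")

def us_region(prompt):       # stand-in provider client that is rate-limited
    raise TransientError("rate limited")

def eu_region(prompt):       # healthy stand-in in another region
    return f"echo: {prompt}"

log = []
print(route_request("search-team", "hi", [us_region, eu_region], usage_log=log))
print(log)
```

Because every call is attributed to a `service_id` at the single choke point, usage limits and billing integration come almost for free — the property the post credits to routing all traffic through one gateway.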

Easily Operating Karrot

This blog post by the Daangn (Karrot) search platform team details their journey in optimizing Elasticsearch operations on Kubernetes (ECK). While their initial migration to ECK reduced deployment times, the team faced critical latency spikes during rolling restarts due to "cold caches" and high traffic volumes. To achieve a "deploy anytime" environment, they developed a data node warm-up system to ensure nodes are performance-ready before they begin handling live search requests.

## Scaling Challenges and Operational Constraints

- Over two years, Daangn's search infrastructure expanded from a single cluster to four specialized clusters, with peak traffic jumping from 1,000 to over 10,000 QPS.
- The initial strategy of "avoiding peak hours" for deployments became a bottleneck, as the window for safe updates narrowed while total deployment time across all clusters exceeded six hours.
- Manual monitoring became a necessity rather than an option, as engineers had to verify traffic conditions and latency graphs before and during every ArgoCD sync.

## The Hazards of Rolling Restarts in Elasticsearch

- Standard Kubernetes rolling restarts are problematic for stateful systems because a "Ready" Pod does not equate to a "Performant" Pod; Elasticsearch relies heavily on memory-resident caches (page cache, query cache, field data cache).
- A version update in the Elastic Operator once triggered an unintended rolling restart that caused a 60% error rate and 3-second latency spikes because new nodes had to fetch all data from disk.
- When a node restarts, the cluster enters a "Yellow" state where remaining replicas must handle 100% of the traffic, creating a single point of failure and increasing the load on the surviving nodes.

## Strategy for Reliable Node Warm-up

- The primary goal was to reach a state where p99 latency remains stable during restarts, regardless of whether the deployment occurs during peak traffic hours.
- The solution involves a "Warm-up System" designed to pre-load frequently accessed data into the filesystem and Elasticsearch internal caches before the node is allowed to join the load balancer.
- By executing representative search queries against a newly started node, the system ensures that the necessary segments are already in the page cache, preventing the disk I/O thrashing that typically follows a cold start.

## Implementation Goals

- Automate the validation of node readiness beyond simple health checks to include performance readiness.
- Eliminate the need for human "eyes-on-glass" monitoring during the 90-minute deployment cycles.
- Maintain high availability and consistent user experience even when shards are being reallocated and replicas are temporarily unassigned.

To maintain a truly resilient search platform on Kubernetes, it is critical to recognize that for stateful applications, "available" is not the same as "ready." Implementing a customized warm-up controller or logic is a recommended practice for any high-traffic Elasticsearch environment to decouple deployment schedules from traffic patterns.
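The warm-up idea — replay representative queries against a fresh node so its caches are hot before it joins the load balancer — can be sketched as below. `FakeNode` is a toy stand-in for a restarted Elasticsearch data node that counts disk reads so the effect is visible; a production check would instead compare observed latency against a p99 target before marking the node ready.

```python
class FakeNode:
    """Stand-in for a restarted data node: cold queries hit 'disk' once."""
    def __init__(self):
        self.page_cache = set()
        self.disk_reads = 0

    def search(self, query: str) -> None:
        if query not in self.page_cache:   # cold: segment fetched from disk
            self.disk_reads += 1
            self.page_cache.add(query)

def warm_up(node: FakeNode, representative_queries: list[str]) -> None:
    """Pre-load frequently accessed data before the node serves live traffic."""
    for q in representative_queries:
        node.search(q)

node = FakeNode()
warm_up(node, ["bike", "stroller", "bike"])   # warm-up absorbs the cold reads
cold_reads = node.disk_reads
node.search("bike")                            # live traffic: pure cache hits now
node.search("stroller")
print(cold_reads, node.disk_reads)             # disk reads unchanged after warm-up
```

This is exactly the "available vs. ready" distinction from the post: the node could answer queries before warm-up, but only after the replay does it answer them without the disk I/O thrashing that caused the latency spikes.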

won Park": Author. *

Daangn’s data governance team addressed the lack of transparency in their data pipelines by building a column-level lineage system using SQL parsing. By analyzing BigQuery query logs with specialized parsing tools, they successfully mapped intricate data dependencies that standard table-level tracking could not capture. This system now enables precise impact analysis and significantly improves data reliability and troubleshooting speed across the organization.

**The Necessity of Column-Level Visibility**

* Table-level lineage, while easily accessible via BigQuery’s `JOBS` view, fails to identify how specific fields—such as PII or calculated metrics—propagate through downstream systems.
* Without granular lineage, the team faced "cascading failures" where a single pipeline error triggered a chain of broken tables that were difficult to trace manually.
* Schema migrations, such as modifying a source MySQL column, were historically high-risk because the impact on derivative BigQuery tables and columns was unknown.

**Evaluating Extraction Strategies**

* BigQuery’s native `INFORMATION_SCHEMA` was found to be insufficient because it does not support column-level detail and often obscures original source tables when Views are involved.
* Frameworks like OpenLineage were considered but rejected due to high operational costs; requiring every team to instrument their own Airflow jobs or notebooks was deemed impractical for a central governance team.
* The team chose a centralized SQL parsing approach, leveraging the fact that nearly all data transformations within the company are executed as SQL queries within BigQuery.

**Technical Implementation and Tech Stack**

* **sqlglot:** This library serves as the core engine, parsing SQL strings into Abstract Syntax Trees (AST) to programmatically identify source and destination columns.
* **Data Collection:** The system pulls raw query text from `INFORMATION_SCHEMA.JOBS` across all Google Cloud projects to ensure comprehensive coverage.
* **Processing and Orchestration:** Spark is utilized to handle the parallel processing of massive query logs, while Airflow schedules regular updates to the lineage data.
* **Storage:** The resulting mappings are stored in a centralized BigQuery table (`data_catalog.lineage`), making the dependency map easily accessible for impact analysis and data cataloging.

By centralizing lineage extraction through SQL parsing rather than per-job instrumentation, organizations can achieve comprehensive visibility without placing an integration burden on individual developers. This approach is highly effective for BigQuery-centric environments where SQL is the primary language for data movement and transformation.
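To show what the output of column-level lineage extraction looks like, here is a toy regex-based sketch that handles only the flat `SELECT src AS dst, ... FROM table` shape. The real system uses sqlglot to build a proper AST and resolve far more SQL than this; the sketch only illustrates the destination-column-to-source mapping the pipeline stores.

```python
import re

def toy_column_lineage(sql: str) -> dict:
    """Map destination column -> source expression for one flat SELECT."""
    m = re.search(r"select\s+(.+?)\s+from\s+(\w+)", sql, re.I | re.S)
    select_list, source_table = m.group(1), m.group(2)
    lineage = {}
    for item in select_list.split(","):
        parts = re.split(r"\s+as\s+", item.strip(), flags=re.I)
        src = parts[0].strip()       # source column or expression
        dst = parts[-1].strip()      # alias written to the destination table
        lineage[dst] = f"{source_table}.{src}"
    return lineage

sql = "SELECT user_id AS id, price * 0.9 AS discounted FROM orders"
print(toy_column_lineage(sql))
# {'id': 'orders.user_id', 'discounted': 'orders.price * 0.9'}
```

Running this over every query text pulled from `INFORMATION_SCHEMA.JOBS` and unioning the mappings yields the dependency map; the hard parts an AST parser solves — subqueries, `SELECT *`, CTEs, Views — are exactly what a regex cannot.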

You don't need to fetch it

As Daangn’s data volume grew, their traditional full-dump approach using Spark for MongoDB began causing significant CPU spikes and failing to meet the two-hour data delivery Service Level Objective (SLO). To resolve this, the team implemented a Change Data Capture (CDC) pipeline using Flink CDC to synchronize data efficiently without the need for resource-intensive full table scans. This transition successfully stabilized database performance and ensured timely data availability in BigQuery by focusing on incremental change logs rather than repeated bulk extracts.

### Limitations of Traditional Dump Methods

* The previous Spark Connector method required full table scans, creating a direct conflict between service stability and data freshness.
* Attempts to lower DB load resulted in missing the 2-hour SLO, while meeting the SLO pushed CPU usage to dangerous levels.
* Standard incremental loading was ruled out because it relied on `updated_at` fields, which were not consistently updated across all business logic or schemas.
* The team targeted the top five largest and most frequently updated collections for the initial CDC transition to maximize performance gains.

### Advantages of Flink CDC

* Flink CDC provides native support for MongoDB Change Streams, allowing the system to use resume tokens and Flink checkpoints for seamless recovery after failures.
* It guarantees "Exactly-Once" processing by periodically saving the pipeline state to distributed storage, ensuring data integrity during restarts.
* Unlike tools like Debezium that require separate systems for data processing, Flink handles the entire "Extract-Transform-Load" (ETL) lifecycle within a single job.
* The architecture is horizontally scalable; increasing the number of TaskManagers allows the pipeline to handle surges in event volume with linear performance improvements.

### Pipeline Architecture and Implementation

* The system utilizes the MongoDB Oplog to capture real-time write operations (inserts, updates, and deletes) which are then processed by Flink.
* The backend pipeline operates on an hourly batch cycle to extract the latest change events, deduplicate them, and merge them into raw JSON tables in BigQuery.
* A "Schema Evolution" step automatically detects and adds missing fields to BigQuery tables, bridging the gap between NoSQL flexibility and SQL structure.
* While Flink captures data in real-time, the team opted for hourly materialization to maintain idempotency, simplify error recovery, and meet existing business requirements without unnecessary architectural complexity.

For organizations managing large-scale MongoDB instances, moving from bulk extracts to a CDC-based model is a critical step in balancing database health with analytical needs. Implementing a unified framework like Flink CDC not only reduces the load on operational databases but also simplifies the management of complex data transformations and schema changes.
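The hourly merge step — deduplicate one hour of change events keeping the latest per document, then apply inserts/updates/deletes to the snapshot — can be sketched as below. The event shape (`op`, `id`, `ts`, `doc`) is an assumption for illustration; the real pipeline does this in Flink and BigQuery MERGE statements, not in-memory Python.

```python
def merge_changes(table: dict, events: list) -> dict:
    """Apply one hour's CDC events to the current snapshot, idempotently."""
    latest = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        latest[e["id"]] = e                  # later events win per document id
    out = dict(table)
    for e in latest.values():
        if e["op"] == "delete":
            out.pop(e["id"], None)           # hard deletes, which dumps miss
        else:                                # insert/update both upsert the doc
            out[e["id"]] = e["doc"]
    return out

events = [
    {"op": "insert", "id": 1, "ts": 1, "doc": {"title": "bike"}},
    {"op": "update", "id": 1, "ts": 2, "doc": {"title": "bike (sold)"}},
    {"op": "delete", "id": 2, "ts": 3, "doc": None},
]
snapshot = merge_changes({2: {"title": "chair"}}, events)
print(snapshot)
```

Note that replaying the same hour of events over the result leaves it unchanged — the idempotency that makes error recovery a simple re-run rather than a careful repair.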

No Need to Fetch Everything Every Time

To optimize data synchronization and ensure production stability, Daangn’s data engineering team transitioned their MongoDB data pipeline from a resource-intensive full-dump method to a Change Data Capture (CDC) architecture. By leveraging Flink CDC, the team successfully reduced database CPU usage to under 60% while consistently meeting a two-hour data delivery Service Level Objective (SLO). This shift enables efficient, schema-agnostic data replication to BigQuery, facilitating high-scale analysis without compromising the performance of live services.

### Limitations of Traditional Dump Methods

* The previous Spark Connector-based approach required full table scans, leading to a direct trade-off between hitting delivery deadlines and maintaining database health.
* Increasing data volumes caused significant CPU spikes, threatening the stability of transaction processing in production environments.
* Standard incremental loads were unreliable because many collections lacked consistent `updated_at` fields or required the tracking of hard deletes, which full dumps handle poorly at scale.

### Advantages of Flink CDC for MongoDB

* Flink CDC provides native support for MongoDB Change Streams, allowing the system to read the Oplog directly and use resume tokens to restart from specific failure points.
* The framework’s checkpointing mechanism ensures "Exactly-Once" processing by periodically saving the pipeline state to distributed storage like GCS or S3.
* Unlike standalone tools like Debezium, Flink allows for an integrated "Extract-Transform-Load" (ETL) flow within a single job, reducing operational complexity and the need for intermediate message queues.
* The architecture is horizontally scalable, meaning TaskManagers can be increased to handle sudden bursts in event volume without re-architecting the pipeline.

### Pipeline Architecture and Processing Logic

* The core engine monitors MongoDB write operations (Insert, Update, Delete) in real-time via Change Streams and transmits them to BigQuery.
* An hourly batch process is utilized rather than pure real-time streaming to prioritize operational stability, idempotency, and easier recovery from failures.
* The downstream pipeline includes a Schema Evolution step that automatically detects and adds new fields to BigQuery tables, ensuring the NoSQL-to-SQL transition is seamless.
* Data processing involves deduplicating recent change events and merging them into a raw JSON table before materializing them into a final structured table for end-users.

For organizations managing large-scale MongoDB clusters, implementing Flink CDC serves as a powerful solution to balance analytical requirements with database performance. Prioritizing a robust, batch-integrated CDC flow allows teams to meet strict delivery targets and maintain data integrity without the infrastructure overhead of a fully real-time streaming system.
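The Schema Evolution step — compare an incoming document's fields against the known table columns and add whatever is missing — can be sketched as follows. The type mapping here is a deliberate simplification (real BigQuery types are richer, and the actual step would issue ALTER TABLE / schema-update calls); it only illustrates the detection logic.

```python
def missing_columns(known: dict, doc: dict) -> dict:
    """Return {column: inferred_type} for fields the table does not have yet."""
    inferred = {
        str: "STRING", int: "INT64", float: "FLOAT64", bool: "BOOL",
    }
    return {
        k: inferred.get(type(v), "JSON")   # unknown shapes stay as raw JSON
        for k, v in doc.items()
        if k not in known
    }

# Known BigQuery columns vs. a MongoDB document that grew two new fields.
table = {"_id": "STRING", "price": "INT64"}
doc = {"_id": "a1", "price": 900, "negotiable": True, "tags": ["bike"]}
print(missing_columns(table, doc))
# {'negotiable': 'BOOL', 'tags': 'JSON'}
```

Running this check before each hourly merge is what lets schemaless MongoDB collections land in a structured SQL table without the pipeline breaking every time a service team adds a field.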