AWS Weekly Roundup: Anthropic & Meta partnership, AWS Lambda S3 Files, Amazon Bedrock AgentCore CLI, and more (April 27, 2026): Late March took me to Seattle for the Specialist Tech Conference, one of the most energizing gatherings of AWS specialists from around the world. It was…
What is changing at LY Corporation right now: At LY Corporation, AI-assisted development and workflow improvement are spreading far faster than before. Beyond using generative AI to generate code and streamline testing, engineers across the company are putting AI to practical use, combining non-generative AI to optimize analysis and operations. Against this backdrop, as an answer to the question of how the knowledge gained in each field can be shared inside and outside the company and connected to the next challenge, LY C…
Introduction: more services, new infrastructure, and endless inquiries. How many times a day does your team answer the same questions and repeat the same tasks? The LINE Home DevOps team recently gained members, yet grew even busier as stable operation of the VOOM service overlapped with preparing the new HomeTab service and migrating to Flava, a new cloud infrastructure platform. None of these could be dropped, so we looked for ways to improve the situation, and it struck us that the team's energy was going not into the important work but into…
Let’s Talk Agentic Development: Spotify x Anthropic Live AI agents are transforming the way we build — and even how we think of ourselves as software developers. Both Spotify and Anthropic have been exploring what this shift means in practice, and on March 30, we got together at…
This session from NAVER Engineering Day 2025 explores how developers can transition AI from a simple assistant into a functional project collaborator through local automation. By leveraging local Large Language Models (LLMs) and the Model Context Protocol (MCP), development teams can automate high-friction tasks such as build failure diagnostics and crash log analysis. The presentation demonstrates that integrating these tools directly into the development pipeline significantly reduces the manual overhead required for routine troubleshooting and reporting.
### Integrating LLMs with Local Environments
* Utilizing **Ollama** allows teams to run LLMs locally, ensuring data privacy and reducing latency compared to cloud-based alternatives.
* The **mcp-agent** framework, built on the Model Context Protocol (MCP), serves as the critical bridge, connecting the LLM to local file systems, tools, and project-specific data.
* This infrastructure enables the AI to act as an "agent" that can autonomously navigate the codebase rather than just processing static text prompts.
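The wiring above can be sketched in a few lines. This is a hypothetical minimal example, not the presenter's actual setup: it gathers local project files as context (standing in for what an MCP file-system server would expose) and sends the resulting prompt to a locally running Ollama server over its REST API. The model name, file filter, and file-size cap are assumptions.

```python
import json
import urllib.request
from pathlib import Path

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def build_context_prompt(project_root: str, question: str, max_files: int = 5) -> str:
    """Gather a few project files as context, the way an MCP file-system
    server would expose them, and fold them into a single prompt."""
    snippets = []
    for path in sorted(Path(project_root).rglob("*.py"))[:max_files]:
        snippets.append(f"### {path}\n{path.read_text(errors='ignore')[:1000]}")
    context = "\n\n".join(snippets)
    return f"Project context:\n{context}\n\nTask: {question}"

def ask_local_model(prompt: str, model: str = "llama3") -> str:
    """Send the prompt to a locally running Ollama instance (no cloud round trip)."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example usage (requires a running Ollama server with the model pulled):
#   prompt = build_context_prompt(".", "Summarize what this project does.")
#   print(ask_local_model(prompt))
```

Because both the model and the files stay on the local machine, nothing in this loop leaves the developer's environment, which is the privacy property the talk emphasizes.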
### Build Failure and Crash Monitoring Automation
* When a build fails, the AI agent automatically parses the logs to identify the root cause, providing a concise summary instead of requiring a developer to sift through thousands of lines of terminal output.
* For crash monitoring, the system goes beyond simple summarization by analyzing stack traces and identifying the specific developer or team responsible for the affected code segment.
* By automating the initial diagnostic phase, the time between an error occurring and a developer beginning the fix is dramatically shortened.
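The deterministic half of that pipeline, extracting candidate root-cause lines from a noisy build log before handing them to the model, might look like the following sketch. The error patterns and log format here are illustrative assumptions, not the team's actual rules:

```python
import re

# Patterns that typically mark the root cause in compiler/build output;
# illustrative, not an exhaustive set.
ERROR_PATTERNS = [
    re.compile(r"\berror\b[: ]", re.IGNORECASE),
    re.compile(r"FAILED:"),
    re.compile(r"Traceback \(most recent call last\)"),
]

def triage_build_log(log_text: str, max_lines: int = 10) -> list[str]:
    """Return the most likely root-cause lines from a noisy build log,
    so the LLM prompt stays small and focused."""
    hits = []
    for line in log_text.splitlines():
        if any(p.search(line) for p in ERROR_PATTERNS):
            hits.append(line.strip())
        if len(hits) >= max_lines:
            break
    return hits

def to_prompt(hits: list[str]) -> str:
    """Wrap the extracted lines in an instruction for the local model."""
    joined = "\n".join(hits) or "(no obvious error lines found)"
    return f"Explain the likely root cause of this build failure:\n{joined}"
```

Pre-filtering like this keeps the prompt short, which both speeds up local inference and reduces the chance the model summarizes irrelevant log noise.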
### Intelligent Reporting via Slack
* The system integrates with **Slack** to deliver automated, context-aware reports that categorize issues by severity and impact.
* These reports include actionable insights, such as suggested fixes or links to relevant documentation, directly within the communication channel used by the team.
* This ensures that project stakeholders remain informed of the system's health without requiring manual status updates from engineers.
### Considerations for LLM and MCP Implementation
* While powerful, the combination of LLMs and MCP agents is not a "silver bullet"; it requires careful prompt engineering and boundary setting to prevent hallucination in technical diagnostics.
* Effective automation depends on the quality of the local context provided to the agent; the more structured the logs and metadata, the more accurate the AI's conclusions.
* Organizations should evaluate the balance between the computational cost of running local models and the productivity gains achieved through automation.
To successfully implement AI-driven automation, developers should start by targeting specific, repetitive bottlenecks—such as triaging build errors—before expanding the agent's scope to more complex architectural tasks. Focusing on the integration between Ollama and mcp-agent provides a secure, extensible foundation for building a truly "smart" development workflow.
Tran Le, Till Pieper (Director, Product Management), Gillian McGarvey: Writing a postmortem is an essential learning process after an incident is resolved. But documenting important details comprehensively can be cumbersome, especially when responders have already moved on to the nex…
Datadog’s first global outage on March 8, 2023, served as a rigorous stress test for their established incident response framework and "you build it, you own it" philosophy. While the outage was triggered by a systemic failure during a routine systemd upgrade, the company's commitment to blameless culture and decentralized engineering autonomy allowed hundreds of responders to coordinate a complex recovery across multiple regions. Ultimately, the event validated their investment in out-of-band monitoring and rigorous, bi-annual incident training as essential components for managing high-scale system disasters.
## Incident Response Structure and Philosophy
* Datadog employs a decentralized "you build it, you own it" model where individual engineering teams are responsible for the 24/7 health and monitoring of the services they build.
* For high-severity incidents, a specialized rotation is paged, consisting of an Incident Commander to lead the response, a communications lead, and a customer liaison to manage external messaging.
* The organization prioritizes "people over process," empowering engineers to use their judgment to find creative solutions rather than following rigid, pre-written playbooks that may not apply to unprecedented failures.
* A blameless culture is strictly maintained across all levels of the company, ensuring that post-incident investigations focus on systemic improvements rather than assigning fault to individuals.
## Multi-Layered Monitoring Strategy
* Standard telemetry provides internal visibility, but Datadog also maintains "out-of-band" monitoring that operates completely outside its own infrastructure.
* This out-of-band system interacts with Datadog APIs exactly like a customer would, ensuring that engineers are alerted even if the internal monitoring platform itself becomes unavailable.
* Communication is streamlined through a dedicated Slack incident app that automatically generates coordination channels, providing situational awareness to any engineer who joins the effort.
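The out-of-band idea, probing the public API exactly as a customer would from infrastructure the platform does not depend on, can be sketched as follows. This is an illustrative approximation, not Datadog's actual implementation; the endpoint, thresholds, and alert hook are assumptions:

```python
import time
import urllib.request
from dataclasses import dataclass

@dataclass
class ProbeResult:
    ok: bool          # did the request complete at all?
    status: int       # HTTP status code (0 if the request failed)
    latency_s: float  # observed round-trip time in seconds

def probe(url: str, timeout_s: float = 5.0) -> ProbeResult:
    """Hit a public API endpoint the way a customer would."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return ProbeResult(True, resp.status, time.monotonic() - start)
    except Exception:
        return ProbeResult(False, 0, time.monotonic() - start)

def should_alert(result: ProbeResult, max_latency_s: float = 2.0) -> bool:
    """Alert if the probe failed, returned a server error, or was too slow.
    Crucially, this decision runs outside the monitored platform itself,
    so it still fires when internal monitoring is down."""
    return (not result.ok) or result.status >= 500 or result.latency_s > max_latency_s
```

The essential design choice is that the prober shares nothing with the platform it watches: no common compute, no common telemetry path, so a total internal outage cannot silence the alert.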
## Anatomy of the March 8 Outage
* The outage began at 06:00 UTC, triggered by a systemd upgrade that caused widespread Kubernetes failures and prevented pods from restarting correctly.
* The global nature of the outage was diagnosed within 32 minutes of the initial monitoring alerts, leading to the activation of executive on-calls and the customer support management team.
* Responders identified "unattended upgrades" as the incident trigger approximately five and a half hours after the initial failure.
* Recovery was executed in stages: compute capacity was restored first in the EU1 region, followed by the US1 region, with full infrastructure restoration completed by 19:00 UTC.
Organizations should treat incident response as a perishable skill that requires constant practice through a low threshold for declaring incidents and regular training. By combining out-of-band monitoring with a culture that empowers individual engineers to act autonomously during a crisis, teams can more effectively navigate the "not if, but when" reality of large-scale system failures.
Jason Satti, Jeremy Baker: Employees at all modern software companies use dozens and sometimes hundreds of outside pieces of software to do their jobs and to develop their product. From your company email to the services that you host your product on, to everything in between, mos…
Jules Denardou, Doug DePerry: Datadog maintains multiple compliance and security layers and employs a number of controls to prevent and detect unauthorized access. This post highlights some recent work to improve our cloud-based monitoring and alerting pipeline. The Datadog securi…