cloud-infrastructure

10 posts

line

Building an SRE Bot That Cut the SRE Team's Repetitive Work to One-Tenth

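Introduction: growing services, new infrastructure, endless inquiries. How many times a day does your team answer the same questions and repeat the same tasks? The LINE Home DevOps team recently gained new members, yet it became even busier as three efforts overlapped: operating the VOOM service reliably, preparing the new HomeTab service, and migrating to Flava, the new cloud infrastructure platform. We could not give up any of them, so we looked for a way to improve the situation, and it suddenly struck us that the team's energy was going not to important work but to…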

gitlab

DevSecOps-as-a-Service on Oracle Cloud Infrastructure by Data Intensity

Data Intensity’s DevSecOps-as-a-Service provides a solution for organizations that require the granular control of GitLab Self-Managed but wish to eliminate the operational burden of infrastructure maintenance. By hosting dedicated GitLab instances on Oracle Cloud Infrastructure (OCI), the service combines the security and customization of a self-managed environment with the convenience of a fully managed platform. This partnership enables teams to focus on software delivery while leveraging expert management for high availability and disaster recovery.

### The Benefits of GitLab Self-Managed

* Offers complete ownership of data residency and instance configuration to meet strict regulatory and compliance requirements.
* Enables deep customization and integration possibilities that are often restricted in standard SaaS environments.
* Addresses the challenges of manual server management, upgrades, and high-availability scaling by offloading these tasks to a managed provider.

### Managed Service Features and Support

* Provides 24/7 monitoring, alarming, and expert technical support for standalone GitLab instances.
* Includes scheduled quarterly patching performed during customer-specified maintenance windows to minimize disruption.
* Ensures business continuity through automated backups and professional disaster recovery protection.
* Utilizes tiered architectures designed to scale based on specific user capacities and recovery time objectives.

### Infrastructure Optimization via OCI

* Delivers significant cost efficiency, with organizations typically realizing 40-50% reductions in infrastructure spending compared to other hyperscalers.
* Supports diverse deployment models, including Public Cloud, Government Cloud, EU Sovereign Clouds, and dedicated infrastructure behind a corporate firewall.
* Maintains consistent pricing and operational tooling across hybrid, global, and regulated environments.

### Implementation and Migration

* Data Intensity offers optional migration services to transition existing code repositories and configurations to the OCI environment seamlessly.
* The service is specifically designed for organizations with predictable cost requirements and those lacking in-house infrastructure expertise.
* Deployment planning involves tailored consultations to match specific compliance and data residency needs with OCI’s global region availability.

This managed service is a recommended path for enterprise teams that need to prioritize data sovereignty and flexibility without sacrificing the speed of a turnkey solution. Organizations currently using or planning to adopt OCI can leverage this service to standardize their DevSecOps workflows while achieving significant infrastructure savings.
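
To make the quoted 40-50% reduction concrete, the arithmetic can be sketched as below. This is a minimal illustration, not a pricing model: the function name `projected_spend_range` and the 10,000-per-month baseline are hypothetical, and only the 40% and 50% bounds come from the summary above.

```python
def projected_spend_range(
    baseline_monthly: float,
    min_reduction: float = 0.40,  # lower bound quoted in the summary
    max_reduction: float = 0.50,  # upper bound quoted in the summary
) -> tuple[float, float]:
    """Return (lowest, highest) projected monthly spend after the reduction."""
    if baseline_monthly < 0:
        raise ValueError("baseline_monthly must be non-negative")
    if not 0 <= min_reduction <= max_reduction <= 1:
        raise ValueError("reductions must satisfy 0 <= min <= max <= 1")
    lowest = baseline_monthly * (1 - max_reduction)   # best case: largest cut
    highest = baseline_monthly * (1 - min_reduction)  # worst case: smallest cut
    return lowest, highest


# Hypothetical baseline: 10,000 per month on another provider.
low, high = projected_spend_range(10_000.0)
print(f"Projected monthly spend: {low:,.0f} to {high:,.0f}")
```

Actual savings depend on workload shape and contract terms, so the range is best treated as a starting point for a comparison rather than a forecast.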

aws

Amazon EC2 C8id, M8id, and R8id instances with up to 22.8 TB local NVMe storage are generally available | Amazon Web Services

Last year, we launched the Amazon Elastic Compute Cloud (Amazon EC2) C8i instances, M8i instances, and R8i instances powered by custom Intel Xeon 6 processors available only o…

aws

Opening the AWS European Sovereign Cloud

AWS has officially launched the AWS European Sovereign Cloud, a specialized infrastructure designed to meet the rigorous data residency and operational autonomy requirements of European public sector organizations and highly regulated industries. This new offering provides a fully featured cloud environment that is physically and logically separate from existing AWS Regions, ensuring all data and metadata remain entirely within the European Union. By bridging the gap between legacy on-premises security and modern cloud innovation, AWS enables sensitive workloads to operate under strict European jurisdiction and independent governance.

**Strategic Independence and Operational Control**

Organizations in the EU often face complex regulatory hurdles that prevent them from using standard public cloud offerings, frequently forcing them to remain on aging on-premises hardware. The AWS European Sovereign Cloud addresses these challenges through:

* **Independent Operations:** The infrastructure is operated independently from other AWS Regions, providing a distinct management layer specific to the EU.
* **Enhanced Sovereignty Controls:** Robust technical controls and legal protections are integrated to ensure that data remains under European jurisdiction.
* **Governance Autonomy:** The cloud is built to provide European entities with full control over their data residency and operational transparency.

**Independent Infrastructure and Regional Presence**

The architecture is designed for high availability and resilience, ensuring that mission-critical services remain functional regardless of external connectivity.

* **Initial Region:** The first region is now generally available in Brandenburg, Germany, serving as the primary hub for the sovereign infrastructure.
* **Redundancy:** The infrastructure utilizes multiple Availability Zones with redundant power and networking to maintain continuous operation.
* **Isolated Connectivity:** The design allows the cloud to continue operating even if connectivity to the rest of the global AWS network is interrupted.

**Expansion and Hybrid Deployment Options**

To support the diverse needs of EU member states, AWS is expanding the footprint of this sovereign infrastructure through localized hardware and edge services.

* **Sovereign Local Zones:** Future expansion plans include new Local Zones in Belgium, the Netherlands, and Portugal to provide low-latency access within specific borders.
* **Hybrid Integration:** Customers can extend sovereign infrastructure to their own data centers using AWS Outposts or AWS Dedicated Local Zones.
* **Advanced Capabilities:** The platform supports specialized workloads through AWS AI Factories, allowing regulated industries to leverage artificial intelligence within a sovereign boundary.

For European organizations navigating strict compliance landscapes, the AWS European Sovereign Cloud provides a viable path to digital transformation. Decision-makers should evaluate their current on-premises or restricted cloud environments to determine how these new sovereign regions and local zones can fulfill upcoming data residency mandates while providing access to advanced cloud-native services.
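
The evaluation step recommended above could start with a simple inventory check like the following sketch. It flags resources that sit outside an allow-list of sovereign regions. Everything here is illustrative: the `Resource` type, the resource names, and the region identifiers (`eusc-de-east-1`, `eu-central-1`) are assumptions for the example and should be verified against AWS documentation before use.

```python
from dataclasses import dataclass

# Assumption: illustrative region identifier for the sovereign region;
# confirm the real identifier in AWS documentation.
ALLOWED_REGIONS = frozenset({"eusc-de-east-1"})


@dataclass(frozen=True)
class Resource:
    """Hypothetical inventory record: a resource name and where it runs."""
    name: str
    region: str


def residency_violations(resources, allowed=ALLOWED_REGIONS):
    """Return the resources deployed outside the allowed regions."""
    return [r for r in resources if r.region not in allowed]


# Hypothetical inventory, e.g. exported from an asset database.
inventory = [
    Resource("citizen-records-db", "eusc-de-east-1"),
    Resource("legacy-report-bucket", "eu-central-1"),
]

for r in residency_violations(inventory):
    print(f"{r.name} runs in {r.region}, outside the sovereign boundary")
```

Note that a region located in the EU is not automatically inside the sovereign boundary, which is why the check uses an explicit allow-list rather than a geographic prefix match.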

aws

AWS Weekly Roundup: AWS re:Invent keynote recap, on-demand videos, and more (December 8, 2025)

The December 8, 2025, AWS Weekly Roundup recaps the major themes from AWS re:Invent, signaling a significant industry transition from AI assistants to autonomous AI agents. While technical innovation in infrastructure remains a priority, the event underscored that developers remain at the heart of the AWS mission, empowered by new tools to automate complex tasks using natural language. This shift represents a "renaissance" in cloud computing, where purpose-built infrastructure is now designed to support the non-deterministic nature of agentic workloads.

## Community Recognition and the Now Go Build Award

* Raphael Francis Quisumbing (Rafi) from the Philippines was honored with the Now Go Build Award, presented by Werner Vogels.
* A veteran of the ecosystem, Quisumbing has served as an AWS Hero since 2015 and has co-led the AWS User Group Philippines for over a decade.
* The recognition emphasizes AWS's continued focus on community dedication and the role of individual builders in empowering regional developer ecosystems.

## The Evolution from AI Assistants to Agents

* AWS CEO Matt Garman identified AI agents as the next major inflection point for the industry, moving beyond simple chat interfaces to systems that perform tasks and automate workflows.
* Dr. Swami Sivasubramanian highlighted a paradigm shift where natural language serves as the primary interface for describing complex goals.
* These agents are designed to autonomously generate plans, write necessary code, and call various tools to execute complete solutions without constant human intervention.
* AWS is prioritizing the development of production-ready infrastructure that is secure and scalable specifically to handle the "non-deterministic" behavior of these AI agents.

## Core Infrastructure and the Developer Renaissance

* Despite the focus on AI, AWS reaffirmed that its core mission remains the "freedom to invent," keeping developers central to its 20-year strategy.
* Leaders Peter DeSantis and Dave Brown reinforced that foundational attributes (security, availability, and performance) remain the non-negotiable pillars of the AWS cloud.
* The integration of AI agents is framed as a way to finally realize material business returns on AI investments by moving from experimental use cases to automated business logic.

To maximize the value of these updates, organizations should begin evaluating how to transition from simple LLM implementations to agentic frameworks that can execute end-to-end business processes. Reviewing the on-demand keynote sessions from re:Invent 2025 is recommended for technical teams looking to implement the latest secure, agent-ready infrastructure.
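
The plan-then-call-tools pattern described above can be sketched in a few lines. This is a toy illustration and not an AWS API: the tools, the `plan` stub (which stands in for a model-generated plan), and the order ID are all hypothetical. The step cap and the tool allow-list hint at the kind of guardrails that non-deterministic agents tend to need.

```python
from typing import Callable


# Hypothetical tools the agent is allowed to call.
def lookup_order(order_id: str) -> str:
    return f"order {order_id}: shipped"


def send_email(message: str) -> str:
    return f"email sent: {message}"


TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_order": lookup_order,
    "send_email": send_email,
}


def plan(goal: str) -> list[tuple[str, str]]:
    """Stand-in for a model-generated plan; a real agent would ask an LLM.

    Each step is (tool_name, argument); "{previous}" is replaced with the
    previous step's output.
    """
    return [("lookup_order", "A-42"), ("send_email", "{previous}")]


def run_agent(goal: str, max_steps: int = 5) -> list[str]:
    """Execute the plan step by step, stopping on unknown tools."""
    transcript: list[str] = []
    previous = ""
    for tool_name, arg in plan(goal)[:max_steps]:  # cap the number of steps
        tool = TOOLS.get(tool_name)
        if tool is None:  # the plan asked for a tool outside the allow-list
            transcript.append(f"unknown tool: {tool_name}")
            break
        previous = tool(arg.replace("{previous}", previous))
        transcript.append(previous)
    return transcript


for line in run_agent("Tell the customer the status of order A-42"):
    print(line)
```

In a production setting the planner's output cannot be trusted to be stable from run to run, so validation of each step before execution matters more than the loop itself.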

datadog

2023-03-08 incident: A deep dive into our incident response | Datadog

Datadog’s first global outage on March 8, 2023, served as a rigorous stress test for their established incident response framework and "you build it, you own it" philosophy. While the outage was triggered by a systemic failure during a routine systemd upgrade, the company's commitment to blameless culture and decentralized engineering autonomy allowed hundreds of responders to coordinate a complex recovery across multiple regions. Ultimately, the event validated their investment in out-of-band monitoring and rigorous, bi-annual incident training as essential components for managing high-scale system disasters.

## Incident Response Structure and Philosophy

* Datadog employs a decentralized "you build it, you own it" model where individual engineering teams are responsible for the 24/7 health and monitoring of the services they build.
* For high-severity incidents, a specialized rotation is paged, consisting of an Incident Commander to lead the response, a communications lead, and a customer liaison to manage external messaging.
* The organization prioritizes "people over process," empowering engineers to use their judgment to find creative solutions rather than following rigid, pre-written playbooks that may not apply to unprecedented failures.
* A blameless culture is strictly maintained across all levels of the company, ensuring that post-incident investigations focus on systemic improvements rather than assigning fault to individuals.

## Multi-Layered Monitoring Strategy

* Standard telemetry provides internal visibility, but Datadog also maintains "out-of-band" monitoring that operates completely outside its own infrastructure.
* This out-of-band system interacts with Datadog APIs exactly like a customer would, ensuring that engineers are alerted even if the internal monitoring platform itself becomes unavailable.
* Communication is streamlined through a dedicated Slack incident app that automatically generates coordination channels, providing situational awareness to any engineer who joins the effort.

## Anatomy of the March 8 Outage

* The outage began at 06:00 UTC, triggered by a systemd upgrade that caused widespread Kubernetes failures and prevented pods from restarting correctly.
* The global nature of the outage was diagnosed within 32 minutes of the initial monitoring alerts, leading to the activation of executive on-calls and the customer support management team.
* Responders identified "unattended upgrades" as the incident trigger approximately five and a half hours after the initial failure.
* Recovery was executed in stages: compute capacity was restored first in the EU1 region, followed by the US1 region, with full infrastructure restoration completed by 19:00 UTC.

Organizations should treat incident response as a perishable skill that requires constant practice through a low threshold for declaring incidents and regular training. By combining out-of-band monitoring with a culture that empowers individual engineers to act autonomously during a crisis, teams can more effectively navigate the "not if, but when" reality of large-scale system failures.
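
The out-of-band monitoring idea described above, probing public endpoints the way a customer would from outside the monitored platform, can be sketched with the Python standard library. This is a minimal illustration and not Datadog's implementation: the endpoint names and `example.com` URLs are hypothetical, and the demo swaps in a fake fetcher so it runs without network access.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0) -> int:
    """Fetch url like an external client; return the HTTP status, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:  # server answered with 4xx/5xx
        return err.code
    except OSError:  # DNS failure, refused connection, timeout, etc.
        return 0


def failing_endpoints(endpoints: dict[str, str], fetch=http_status) -> list[str]:
    """Return the names of endpoints that did not answer with a 2xx status."""
    return [
        name for name, url in endpoints.items()
        if not 200 <= fetch(url) < 300
    ]


# Hypothetical public endpoints; a real probe would target the customer-facing API.
ENDPOINTS = {
    "api": "https://api.example.com/health",
    "intake": "https://intake.example.com/health",
}


def fake_fetch(url: str) -> int:
    """Demo stand-in for http_status so the example needs no network."""
    return 503 if "intake" in url else 200


failing = failing_endpoints(ENDPOINTS, fetch=fake_fetch)
print("page on-call for:", failing)
```

The key design point is where the probe runs, not how it is written: it only provides independent signal if it is hosted and alerted on through systems that do not share failure modes with the platform being watched.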