How Squad runs coordinated AI agents inside your repository
Principal PM Architect in CoreAI Apps & Agents at Microsoft, where I build stuff to make it easier for people to party with Copilot in the cloud.
Our First 2026 Heroes Cohort Is Here! We're thrilled to celebrate three exceptional developer community leaders as AWS Heroes. These individuals represent the heart of what makes the AWS community so vibrant.
Improving breast cancer screening workflows with machine learning (March 17, 2026). Lihong Xi, Senior Technical Program Manager, and Daniel Golden, Engineering Manager, Google Research.
GitLab 18.9 introduces critical updates designed to provide regulated enterprises with governed, agentic AI capabilities through self-hosted infrastructure and model flexibility. By combining the Duo Agent Platform with Bring Your Own Model (BYOM) support, organizations in sectors like finance and government can now automate complex DevSecOps workflows while maintaining total control over data residency. This release transforms GitLab into a high-security AI control plane that balances the need for advanced automation with the rigid sovereignty requirements of high-compliance environments.

## Self-Hosted Duo Agent Platform for Online Cloud Licenses

The Duo Agent Platform allows engineering teams to automate sequences of tasks, such as hardening CI/CD pipelines and triaging vulnerabilities, but was previously difficult to deploy for customers under strict online cloud licensing. This update makes the platform generally available for these environments, bridging the gap between cloud-based licensing and self-hosted security needs.

* **Usage-Based Billing:** The platform now utilizes GitLab Credits to provide transparent, per-request metering, which is essential for internal chargeback and regulatory reporting.
* **Infrastructure Control:** Enterprises can host models on their own internal infrastructure or within approved cloud environments, ensuring that inference traffic is routed according to internal security policies.
* **Deployment Readiness:** By removing the requirement to route data through external AI vendors, the platform is now a viable option for critical infrastructure and government agencies.

## Bring Your Own Model (BYOM) Integration

Recognizing that many enterprises have already invested in domain-tuned LLMs or air-gapped deployments, GitLab now allows customers to integrate their existing models directly into the Duo Agent Platform.
This ensures that organizations are not locked into a specific vendor and can leverage models that have already passed internal risk assessments.

* **AI Gateway Connectivity:** Administrators can connect third-party or internal models via the GitLab AI Gateway, allowing these models to function as enterprise-ready options within the GitLab ecosystem.
* **Granular Model Mapping:** The system provides the ability to map specific models to individual Duo Agent Platform flows or features, giving admins fine-grained control over which agent uses which model.
* **Administrative Ownership:** While GitLab provides the orchestration layer, administrators retain full responsibility for model validation, performance tuning, and risk evaluation for the models they choose to bring.

For organizations operating in high-compliance sectors, these updates offer a path to consolidate fragmented AI tools into a single, governed platform. Engineering leaders should evaluate their current model investments and leverage the GitLab AI Gateway to unify their automation workflows under one secure DevSecOps umbrella.
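Granular model mapping can be pictured as a routing table from agent flows to model endpoints. The sketch below is a minimal Python illustration of that idea; the flow names, model identifiers, and `ModelRoute` structure are hypothetical, not GitLab's actual configuration schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelRoute:
    """One mapping from an agent flow to a model endpoint (illustrative only)."""
    endpoint: str   # where inference traffic is forwarded
    model_id: str   # identifier of the self-hosted or third-party model

# Hypothetical routing table: each agent flow is pinned to a model that has
# already passed the organization's internal risk assessment.
ROUTES = {
    "vulnerability_triage": ModelRoute("https://llm.internal.example/v1", "sec-tuned-70b"),
    "pipeline_hardening":   ModelRoute("https://llm.internal.example/v1", "devops-13b"),
}

# Flows without an explicit mapping fall back to a vetted default model.
DEFAULT = ModelRoute("https://llm.internal.example/v1", "general-7b")

def route(flow: str) -> ModelRoute:
    """Resolve which model a given agent flow should use."""
    return ROUTES.get(flow, DEFAULT)
```

The fallback route mirrors the administrative-ownership point above: whatever the default is, it is still a model the admin has explicitly chosen and validated.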
AWS Weekly Roundup: Amazon EC2 M8azn instances, new open weights models in Amazon Bedrock, and more (February 16, 2026). I joined AWS in 2021, and since then I've watched the Amazon Elastic Compute Cloud (Amazon EC2) instance family grow at a pace that still surprises me.
Scaling LLM Post-Training at Netflix. Baolin Li, Lingyi Liu, Binh Tang, Shaojing Li.
Scheduling in a changing world: Maximizing throughput with time-varying capacity (February 11, 2026). Manish Purohit, Research Scientist, Google Research.
Figma achieves C5 accreditation, strengthening cloud security for customers across the DACH region. Figma is giving customers greater confidence in cloud security and compliance. Today, Figma announced that it has achieved C5 accreditation.
Amazon has announced the general availability of EC2 X8i instances, specifically engineered for memory-intensive workloads such as SAP HANA, large-scale databases, and data analytics. Powered by custom Intel Xeon 6 processors with a 3.9 GHz all-core turbo frequency, these instances provide a significant performance leap over the previous X2i generation. By offering up to 6 TB of memory and substantial improvements in throughput, X8i instances represent the highest-performing Intel-based memory-optimized option in the AWS cloud.

### Performance Enhancements and Processor Architecture

* **Custom Silicon:** The instances utilize custom Intel Xeon 6 processors available exclusively on AWS, delivering the fastest memory bandwidth among comparable Intel cloud processors.
* **Memory and Bandwidth:** X8i provides 1.5 times more memory capacity (up to 6 TB) and 3.4 times more memory bandwidth compared to previous-generation X2i instances.
* **Workload Benchmarks:** Real-world performance gains include a 50% increase in SAP Application Performance Standard (SAPS), 47% faster PostgreSQL performance, 88% faster Memcached performance, and a 46% boost in AI inference.

### Scalable Instance Sizes and Throughput

* **Flexible Sizing:** The instances are available in 14 sizes, including new larger formats such as the 48xlarge, 64xlarge, and 96xlarge.
* **Bare Metal Options:** Two bare metal sizes (metal-48xl and metal-96xl) are available for workloads requiring direct access to physical hardware resources.
* **Networking and Storage:** The architecture supports up to 100 Gbps of network bandwidth with Elastic Fabric Adapter (EFA) support and up to 80 Gbps of Amazon EBS throughput.
* **Bandwidth Control:** Support for Instance Bandwidth Configuration (IBC) allows users to customize the allocation of performance between networking and EBS to suit specific application needs.
### Cost Efficiency and Use Cases

* **Licensing Optimization:** In preview testing, customers like Orion reduced SQL Server licensing costs by 50% by maintaining performance thresholds with fewer active cores compared to older instance types.
* **Enterprise Applications:** The instances are SAP-certified, making them ideal for RISE with SAP and other high-demand ERP environments.
* **Broad Utility:** Beyond databases, the instances are optimized for Electronic Design Automation (EDA) and complex data analytics that require massive memory footprints.

For organizations managing massive datasets or expensive licensed database software, migrating to X8i instances offers a clear path to both performance optimization and infrastructure cost reduction. These instances are currently available in the US East (N. Virginia), US West (Oregon), and Europe (Ireland) regions through On-Demand, Spot, and Reserved purchasing models.
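The licensing example above reduces to simple per-core arithmetic: if a newer instance sustains the same performance threshold with half the active cores, per-core licensing spend falls by half. A quick sketch, where the dollar rate and core counts are illustrative placeholders, not quoted SQL Server prices or Orion's actual configuration:

```python
def license_cost(cores: int, rate_per_core: float) -> float:
    """Annual cost for a core-licensed database (rate is a made-up placeholder)."""
    return cores * rate_per_core

RATE = 7_000.0  # hypothetical $/core/year, not an actual SQL Server price

old = license_cost(cores=32, rate_per_core=RATE)  # older instance type
new = license_cost(cores=16, rate_per_core=RATE)  # fewer cores at the same threshold

savings = 1 - new / old  # 0.5, i.e. the 50% reduction cited in the preview
```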
The AWS Weekly Roundup for January 2026 highlights a significant push toward modernization, headlined by the introduction of .NET 10 support for AWS Lambda and Apache Airflow 2.11 for Amazon MWAA. To encourage exploration of these and other emerging technologies, AWS has revamped its Free Tier to offer new users up to $200 in credits and six months of risk-free experimentation. These updates collectively aim to streamline serverless development, enhance container storage efficiency, and provide more robust authentication options for messaging services.

### Modernized Runtimes and Orchestration

* AWS Lambda now supports .NET 10 as both a managed runtime and a container base image, with AWS providing automatic updates to these environments as they become available.
* Amazon Managed Workflows for Apache Airflow (MWAA) has added support for version 2.11, which serves as a critical stepping stone for users preparing to migrate to Apache Airflow 3.

### Infrastructure and Resource Management

* Amazon ECS has extended support for `tmpfs` mounts to Linux tasks running on AWS Fargate and Managed Instances; this allows developers to utilize memory-backed file systems for containerized workloads to avoid writing sensitive or temporary data to task storage.
* AWS Config has expanded its monitoring capabilities to discover, assess, and audit new resource types across Amazon EC2, Amazon SageMaker, and Amazon S3 Tables.
* A new AWS Client VPN quickstart was released, providing a CloudFormation template and a step-by-step guide to automate the deployment of secure client-to-site VPN connections.

### Security and Messaging Enhancements

* Amazon MQ for RabbitMQ brokers now supports HTTP-based authentication, which can be enabled and managed through the broker's configuration file.
* RabbitMQ brokers on Amazon MQ also now support certificate-based authentication using mutual TLS (mTLS) to improve the security posture of messaging applications.
### Educational Initiatives and Community Events

* New AWS Free Tier accounts now include a 6-month trial period featuring $200 in credits and access to over 30 always-free services, specifically targeting developers interested in AI/ML and compute experimentation.
* AWS published a curated "Best of re:Invent 2025" playlist, featuring high-impact sessions and keynotes for those who missed the live event.
* The 2026 AWS Summit season begins shortly, with upcoming events scheduled for Dubai on February 10 and Paris on March 10.

Developers should take immediate advantage of the new .NET 10 Lambda runtime for serverless applications and review the updated ECS `tmpfs` documentation to optimize container performance. For those new to the platform, the expanded Free Tier credits provide an excellent opportunity to prototype AI/ML workloads with minimal financial risk.
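The `tmpfs` support mentioned above is declared under a container definition's `linuxParameters`. The fragment below sketches just that slice of a task definition as a Python dict; the container name, mount path, and size are illustrative, and the full schema should be checked against the ECS task definition reference.

```python
# Illustrative slice of an ECS task definition: a memory-backed /scratch mount
# keeps temporary or sensitive files out of task storage.
container_definition = {
    "name": "app",  # hypothetical container name
    "linuxParameters": {
        "tmpfs": [
            {
                "containerPath": "/scratch",  # where the tmpfs appears in-container
                "size": 256,                  # size in MiB
                "mountOptions": ["noexec", "nosuid"],
            }
        ]
    },
}
```

Because the mount is memory-backed, its size counts against the task's memory budget, so it should be sized deliberately rather than generously.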
The AWS Weekly Roundup for mid-December 2025 highlights a series of updates designed to streamline developer workflows and enhance security across the cloud ecosystem. Following the momentum of re:Invent 2025, these releases focus on reducing operational friction through faster database provisioning, more granular container control, and AI-assisted development tools. These advancements collectively aim to simplify infrastructure management while providing deeper cost visibility and improved performance for enterprise applications.

## Database and Developer Productivity

* **Amazon Aurora DSQL** now supports near-instant cluster creation, reducing provisioning time from minutes to seconds to facilitate rapid prototyping and AI-powered development via the Model Context Protocol (MCP) server.
* **Amazon Aurora PostgreSQL** has integrated with **Kiro**, allowing developers to use AI-assisted coding for schema management and database queries through pre-packaged MCP servers.
* **Amazon CloudWatch SDK** introduced support for optimized JSON and CBOR protocols, improving the efficiency of data transmission and processing within the monitoring suite.
* **Amazon Cognito** simplified user communications by enabling automated email delivery through Amazon SES using verified identities, removing the need for manual SES configuration.

## Compute and Networking Optimizations

* **Amazon ECS on AWS Fargate** now honors custom container stop signals, such as SIGQUIT or SIGINT, allowing for graceful shutdowns of applications that do not use the default SIGTERM signal.
* **Application Load Balancer (ALB)** received performance enhancements that reduce latency for establishing new connections and lower resource consumption during traffic processing.
* **AWS Fargate** cost optimization strategies were highlighted in new technical guides, focusing on leveraging Graviton processors and Fargate Spot to maximize compute efficiency.
## Security and Cost Management

* **Amazon WorkSpaces Secure Browser** introduced Web Content Filtering, providing category-based access control across 25+ predefined categories and granular URL policies at no additional cost.
* **AWS Cost Management** tools now feature **Tag Inheritance**, which automatically applies tags from resources to cost data, allowing for more precise tracking in Cost Explorer and AWS Budgets.
* **AWS Step Functions** integration with Amazon Bedrock was further detailed in community resources, showcasing how to build resilient, long-running AI workflows with integrated error handling.

To take full advantage of these updates, organizations should review their Fargate task definitions to implement custom stop signals for better application stability and enable Tag Inheritance to improve the accuracy of year-end cloud financial reporting.
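Conceptually, tag inheritance is a join: each cost line item picks up the tags of the resource that produced it. A toy Python version, with invented resource IDs and tag keys purely for illustration:

```python
# Hypothetical resource tags, keyed by resource ID.
resource_tags = {
    "i-0abc": {"team": "payments", "env": "prod"},
    "i-0def": {"team": "search", "env": "dev"},
}

# Untagged cost line items as they might arrive from a billing feed.
cost_items = [
    {"resource": "i-0abc", "usd": 12.40},
    {"resource": "i-0def", "usd": 3.10},
]

def inherit_tags(items, tags):
    """Copy each resource's tags onto its cost records for per-team reporting."""
    return [{**item, "tags": tags.get(item["resource"], {})} for item in items]

enriched = inherit_tags(cost_items, resource_tags)
```

With tags attached at the line-item level, per-team or per-environment totals become simple group-bys instead of manual allocation exercises.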
The December 8, 2025, AWS Weekly Roundup recaps the major themes from AWS re:Invent, signaling a significant industry transition from AI assistants to autonomous AI agents. While technical innovation in infrastructure remains a priority, the event underscored that developers remain at the heart of the AWS mission, empowered by new tools to automate complex tasks using natural language. This shift represents a "renaissance" in cloud computing, where purpose-built infrastructure is now designed to support the non-deterministic nature of agentic workloads.

## Community Recognition and the Now Go Build Award

* Raphael Francis Quisumbing (Rafi) from the Philippines was honored with the Now Go Build Award, presented by Werner Vogels.
* A veteran of the ecosystem, Quisumbing has served as an AWS Hero since 2015 and has co-led the AWS User Group Philippines for over a decade.
* The recognition emphasizes AWS's continued focus on community dedication and the role of individual builders in empowering regional developer ecosystems.

## The Evolution from AI Assistants to Agents

* AWS CEO Matt Garman identified AI agents as the next major inflection point for the industry, moving beyond simple chat interfaces to systems that perform tasks and automate workflows.
* Dr. Swami Sivasubramanian highlighted a paradigm shift where natural language serves as the primary interface for describing complex goals.
* These agents are designed to autonomously generate plans, write necessary code, and call various tools to execute complete solutions without constant human intervention.
* AWS is prioritizing the development of production-ready infrastructure that is secure and scalable specifically to handle the "non-deterministic" behavior of these AI agents.

## Core Infrastructure and the Developer Renaissance

* Despite the focus on AI, AWS reaffirmed that its core mission remains the "freedom to invent," keeping developers central to its 20-year strategy.
* Leaders Peter DeSantis and Dave Brown reinforced that foundational attributes—security, availability, and performance—remain the non-negotiable pillars of the AWS cloud.
* The integration of AI agents is framed as a way to finally realize material business returns on AI investments by moving from experimental use cases to automated business logic.

To maximize the value of these updates, organizations should begin evaluating how to transition from simple LLM implementations to agentic frameworks that can execute end-to-end business processes. Reviewing the on-demand keynote sessions from re:Invent 2025 is recommended for technical teams looking to implement the latest secure, agent-ready infrastructure.
Amazon SageMaker HyperPod has introduced checkpointless and elastic training features to accelerate AI model development by minimizing infrastructure-related downtime. These advancements replace traditional, slow checkpoint-restart cycles with peer-to-peer state recovery and enable training workloads to scale dynamically based on available compute capacity. By decoupling training progress from static hardware configurations, organizations can significantly reduce model time-to-market while maximizing cluster utilization.

**Checkpointless Training and Rapid State Recovery**

* Replaces the traditional five-stage recovery process—including job termination, network setup, and checkpoint retrieval—which can often take up to an hour on self-managed clusters.
* Utilizes peer-to-peer state replication and in-process recovery to allow healthy nodes to restore the model state instantly without restarting the entire job.
* Incorporates technical optimizations such as collective communications initialization and memory-mapped data loading to enable efficient data caching.
* Reduces recovery downtime by over 80% based on internal studies of clusters with up to 2,000 GPUs, and was a core technology used in the development of Amazon Nova models.

**Elastic Training and Automated Cluster Scaling**

* Allows AI workloads to automatically expand to use idle cluster capacity as it becomes available and contract when resources are needed for higher-priority tasks.
* Reduces the need for manual intervention, saving hours of engineering time previously spent reconfiguring training jobs to match fluctuating compute availability.
* Optimizes total cost of ownership by ensuring that training momentum continues even as inference volumes peak and pull resources away from the training pool.
* Orchestrates these transitions seamlessly through the HyperPod training operator, ensuring that model development is not disrupted by infrastructure changes.
For teams managing large-scale AI workloads, adopting these features can reclaim significant development time and lower operational costs by preventing idle cluster periods. Organizations scaling to thousands of accelerators should prioritize checkpointless training to mitigate the impact of hardware faults and maintain continuous training momentum.
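The contrast with checkpoint-restart can be shown with a toy model: in a data-parallel job every healthy rank already holds a full replica of the model state, so a failed rank's replacement can copy state from a peer instead of reloading a checkpoint from storage. This is a conceptual sketch, not HyperPod's actual recovery protocol:

```python
import copy

# Toy data-parallel job: every rank holds a full replica of the model state.
state = {rank: {"step": 100, "weights": [0.5, -0.25]} for rank in range(4)}

def recover(failed_rank: int) -> None:
    """Restore a failed rank from any healthy peer, skipping checkpoint retrieval."""
    peer = next(r for r in state if r != failed_rank and state[r] is not None)
    state[failed_rank] = copy.deepcopy(state[peer])

# Rank 2 fails and loses its state; a peer copy brings it back at the same step.
state[2] = None
recover(2)
```

The point of the sketch is the latency difference: a peer-to-peer copy avoids job teardown, network re-setup, and checkpoint retrieval entirely, which is where the bulk of the hour-long recovery cycle on self-managed clusters goes.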
Google researchers have developed LAVA, a scheduling framework designed to optimize virtual machine (VM) allocation in large-scale data centers by accurately predicting and adapting to VM lifespans. By moving beyond static, one-time predictions toward a "continuous re-prediction" model based on survival analysis, the system significantly improves resource efficiency and reduces fragmentation. This approach allows cloud providers to solve the complex "bin packing" problem more effectively, leading to better capacity utilization and easier system maintenance.

### The Challenge of Long-Tailed VM Distributions

* Cloud workloads exhibit an extremely long-tailed distribution: while 88% of VMs live for less than an hour, these short-lived jobs consume only 2% of total resources.
* The rare VMs that run for 30 days or longer account for a massive fraction of compute resources, meaning their placement has a disproportionate impact on host availability.
* Poor allocation leads to "resource stranding," where a server's remaining capacity is too small or unbalanced to host new VMs, effectively wasting expensive hardware.
* Traditional machine learning models that provide only a single prediction at VM creation are often fragile, as a single misprediction can block a physical host from being cleared for maintenance or new tasks.

### Continuous Re-prediction via Survival Analysis

* Instead of predicting a single average lifetime, LAVA uses an ML model to generate a probability distribution of a VM's expected duration.
* The system employs "continuous re-prediction," asking how much longer a VM is expected to run given how long it has already survived (e.g., a VM that has run for five days is assigned a different remaining lifespan than a brand-new one).
* This adaptive approach allows the scheduling logic to automatically correct for initial mispredictions as more data about the VM's actual behavior becomes available over time.
### Novel Scheduling and Rescheduling Algorithms

* **Non-Invasive Lifetime Aware Scheduling (NILAS):** Currently deployed on Google's Borg cluster manager, this algorithm ranks potential hosts by grouping VMs with similar expected exit times to increase the frequency of "empty hosts" available for maintenance.
* **Lifetime-Aware VM Allocation (LAVA):** This algorithm fills resource gaps on hosts containing long-lived VMs with jobs that are at least an order of magnitude shorter. This ensures the short-lived VMs exit quickly without extending the host's overall occupation time.
* **Lifetime-Aware Rescheduling (LARS):** To minimize disruptions during defragmentation, LARS identifies and migrates the longest-lived VMs first while allowing short-lived VMs to finish their tasks naturally on the original host.

By integrating survival-analysis-based predictions into the core logic of data center management, cloud providers can transition from reactive scheduling to a proactive model. This system not only maximizes resource density but also ensures that the physical infrastructure remains flexible enough to handle large, resource-intensive provisioning requests and essential system updates.
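The continuous re-prediction step is a conditional expectation over the predicted lifetime distribution: given that a VM has already survived time t, its expected remaining lifetime is E[L - t | L > t]. A small sketch over a discrete distribution, where the probabilities are invented for illustration rather than taken from Google's model:

```python
# Hypothetical predicted lifetime distribution: lifetime (hours) -> probability.
# Long-tailed on purpose: most VMs are short-lived, a few run for a month.
lifetime_dist = {1: 0.70, 24: 0.20, 720: 0.10}

def expected_remaining(dist, survived_hours):
    """E[L - t | L > t]: expected remaining lifetime given survival so far."""
    alive = {life: p for life, p in dist.items() if life > survived_hours}
    total = sum(alive.values())
    if total == 0:
        return 0.0
    return sum((life - survived_hours) * p for life, p in alive.items()) / total

fresh = expected_remaining(lifetime_dist, 0)      # brand-new VM
veteran = expected_remaining(lifetime_dist, 120)  # has already run five days
```

The veteran VM's expected remaining lifetime is far larger than the fresh VM's, even though both come from the same distribution: surviving five days rules out the short-lived modes entirely, which is exactly why re-predicting as the VM ages corrects early mispredictions.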
Following a massive system-wide outage in March 2023, Datadog successfully restored its EU1 region by identifying that a simple node reboot could resolve network connectivity issues caused by a faulty system patch. While the team managed to restore 100 percent of compute capacity within hours, the recovery effort was subsequently hindered by cloud provider infrastructure limits and IP address exhaustion. This post-mortem highlights the complexities of scaling hierarchical Kubernetes environments under extreme pressure and the importance of accounting for "black swan" capacity requirements.

## Hierarchical Kubernetes Recovery

Datadog utilizes a strict hierarchy of Kubernetes clusters to manage its infrastructure, which necessitated a granular, three-tiered recovery approach. Because the outage affected network connectivity via `systemd-networkd`, the team had to restore components in a specific order to regain control of the environment.

* **Parent Control Planes:** Engineers first rebooted the virtual machines hosting the parent clusters, which manage the control planes for all other clusters.
* **Child Control Planes:** Once parent clusters were stable, the team restored the control planes for application clusters, which run as pods within the parent infrastructure.
* **Application Worker Nodes:** Thousands of worker nodes across dozens of clusters were restarted progressively to avoid overwhelming the control planes, reaching full capacity by 12:05 UTC.

## Scaling Bottlenecks and Cloud Quotas

Once the infrastructure was online, the team attempted to scale out rapidly to process a massive backlog of buffered data. This surge in demand triggered previously unencountered limitations within the Google Cloud environment.

* **VPC Peering Limits:** At 14:18 UTC, the platform hit a documented but overlooked limit of 15,500 VM instances within a single network peering group, blocking all further scaling.
* **Provider Intervention:** Datadog worked directly with Google Cloud support to manually raise the peering group limit, which allowed scaling to resume after a nearly four-hour delay.

## IP Address and Subnet Capacity

Even after cloud-level instance quotas were lifted, specific high-traffic clusters processing logs and traces hit a secondary bottleneck related to internal networking.

* **Subnet Exhaustion:** These clusters attempted to scale to more than twice their normal size, quickly exhausting all available IP addresses in their assigned subnets.
* **Capacity Planning Gaps:** While Datadog typically targets a 66% maximum IP usage to allow for a 50% scale-out, the extreme demands of the recovery backlog exceeded these safety margins.
* **Impact on Backlog:** For six hours, the lack of available IPs forced these clusters to process data significantly slower than the rest of the recovered infrastructure.

## Recovery Summary

The EU1 recovery demonstrates that even when hardware is functional, software-defined limits can create cascading delays. Organizations should not only monitor their own resource usage but also maintain visibility into cloud provider quotas and ensure that subnet allocations account for extreme recovery scenarios where workloads may need to double or triple in size momentarily.
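The 66% target follows directly from headroom arithmetic: a subnet at 66% utilization can absorb roughly a 1.5x scale-out (that is, +50%) before exhausting its addresses, but not the 2x or more that a recovery backlog can demand. A quick sketch with an illustrative subnet size:

```python
def max_scale_out(used_ips: int, subnet_size: int) -> float:
    """Largest multiplicative scale-out a subnet can absorb before IP exhaustion."""
    return subnet_size / used_ips

# An illustrative pool of 4096 addresses at the 66% utilization target.
subnet = 4096
used = int(subnet * 0.66)
ceiling = max_scale_out(used, subnet)  # roughly 1.5x, i.e. about +50% headroom

# A recovery surge that needs to double capacity blows past that ceiling.
surge_ok = 2.0 <= ceiling
```

This is why the post-mortem recommends sizing subnets for the black-swan case: the steady-state target that comfortably covers routine scale-outs leaves no room for a backlog-driven doubling.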