cloud-computing

5 posts

aws

Amazon EC2 X8i instances powered by custom Intel Xeon 6 processors are generally available for memory-intensive workloads | AWS News Blog

Amazon has announced the general availability of EC2 X8i instances, specifically engineered for memory-intensive workloads such as SAP HANA, large-scale databases, and data analytics. Powered by custom Intel Xeon 6 processors with a 3.9 GHz all-core turbo frequency, these instances provide a significant performance leap over the previous X2i generation. By offering up to 6 TB of memory and substantial improvements in throughput, X8i instances represent the highest-performing Intel-based memory-optimized option in the AWS cloud.

### Performance Enhancements and Processor Architecture

* **Custom Silicon:** The instances utilize custom Intel Xeon 6 processors available exclusively on AWS, delivering the fastest memory bandwidth among comparable Intel cloud processors.
* **Memory and Bandwidth:** X8i provides 1.5 times more memory capacity (up to 6 TB) and 3.4 times more memory bandwidth compared to previous-generation X2i instances.
* **Workload Benchmarks:** Real-world performance gains include a 50% increase in SAP Application Performance Standard (SAPS), 47% faster PostgreSQL performance, 88% faster Memcached performance, and a 46% boost in AI inference.

### Scalable Instance Sizes and Throughput

* **Flexible Sizing:** The instances are available in 14 sizes, including new larger formats such as the 48xlarge, 64xlarge, and 96xlarge.
* **Bare Metal Options:** Two bare metal sizes (metal-48xl and metal-96xl) are available for workloads requiring direct access to physical hardware resources.
* **Networking and Storage:** The architecture supports up to 100 Gbps of network bandwidth with Elastic Fabric Adapter (EFA) support and up to 80 Gbps of Amazon EBS throughput.
* **Bandwidth Control:** Support for Instance Bandwidth Configuration (IBC) allows users to customize the allocation of performance between networking and EBS to suit specific application needs.
### Cost Efficiency and Use Cases

* **Licensing Optimization:** In preview testing, customers like Orion reduced SQL Server licensing costs by 50% by maintaining performance thresholds with fewer active cores compared to older instance types.
* **Enterprise Applications:** The instances are SAP-certified, making them ideal for RISE with SAP and other high-demand ERP environments.
* **Broad Utility:** Beyond databases, the instances are optimized for Electronic Design Automation (EDA) and complex data analytics that require massive memory footprints.

For organizations managing massive datasets or expensive licensed database software, migrating to X8i instances offers a clear path to both performance optimization and infrastructure cost reduction. These instances are currently available in the US East (N. Virginia), US West (Oregon), and Europe (Ireland) regions through On-Demand, Spot, and Reserved purchasing models.

aws

AWS Weekly Roundup: AWS Lambda for .NET 10, AWS Client VPN quickstart, Best of AWS re:Invent, and more (January 12, 2026) | AWS News Blog

The AWS Weekly Roundup for January 2026 highlights a significant push toward modernization, headlined by the introduction of .NET 10 support for AWS Lambda and Apache Airflow 2.11 for Amazon MWAA. To encourage exploration of these and other emerging technologies, AWS has revamped its Free Tier to offer new users up to $200 in credits and six months of risk-free experimentation. These updates collectively aim to streamline serverless development, enhance container storage efficiency, and provide more robust authentication options for messaging services.

### Modernized Runtimes and Orchestration

* AWS Lambda now supports .NET 10 as both a managed runtime and a container base image, with AWS providing automatic updates to these environments as they become available.
* Amazon Managed Workflows for Apache Airflow (MWAA) has added support for version 2.11, which serves as a critical stepping stone for users preparing to migrate to Apache Airflow 3.

### Infrastructure and Resource Management

* Amazon ECS has extended support for `tmpfs` mounts to Linux tasks running on AWS Fargate and Managed Instances; this allows developers to utilize memory-backed file systems for containerized workloads to avoid writing sensitive or temporary data to task storage.
* AWS Config has expanded its monitoring capabilities to discover, assess, and audit new resource types across Amazon EC2, Amazon SageMaker, and Amazon S3 Tables.
* A new AWS Client VPN quickstart was released, providing a CloudFormation template and a step-by-step guide to automate the deployment of secure client-to-site VPN connections.

### Security and Messaging Enhancements

* Amazon MQ for RabbitMQ brokers now supports HTTP-based authentication, which can be enabled and managed through the broker's configuration file.
* RabbitMQ brokers on Amazon MQ also now support certificate-based authentication using mutual TLS (mTLS) to improve the security posture of messaging applications.
### Educational Initiatives and Community Events

* New AWS Free Tier accounts now include a 6-month trial period featuring $200 in credits and access to over 30 always-free services, specifically targeting developers interested in AI/ML and compute experimentation.
* AWS published a curated "Best of re:Invent 2025" playlist, featuring high-impact sessions and keynotes for those who missed the live event.
* The 2026 AWS Summit season begins shortly, with upcoming events scheduled for Dubai on February 10 and Paris on March 10.

Developers should take immediate advantage of the new .NET 10 Lambda runtime for serverless applications and review the updated ECS `tmpfs` documentation to optimize container performance. For those new to the platform, the expanded Free Tier credits provide an excellent opportunity to prototype AI/ML workloads with minimal financial risk.
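To make the ECS `tmpfs` change concrete, here is a minimal container-definition fragment (expressed as a Python dict) showing where a memory-backed mount is declared. The image name, path, and size are illustrative assumptions; the field names follow the ECS task-definition schema's `linuxParameters.tmpfs` shape, which you should verify against the ECS reference before use.

```python
# Sketch of an ECS container definition with a tmpfs mount, so temporary
# or sensitive files live in memory rather than on task storage.
# Values (image, path, size) are illustrative, not from the roundup.
container_definition = {
    "name": "app",
    "image": "public.ecr.aws/docker/library/busybox:latest",  # example image
    "linuxParameters": {
        "tmpfs": [
            {
                "containerPath": "/scratch",          # memory-backed path
                "size": 256,                          # MiB of RAM to back it
                "mountOptions": ["noexec", "nosuid"],  # harden the mount
            }
        ]
    },
}
```

Registered as part of a Fargate task definition, anything the container writes under `/scratch` disappears with the task instead of touching ephemeral storage.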

aws

AWS Weekly Roundup: AWS re:Invent keynote recap, on-demand videos, and more (December 8, 2025) | AWS News Blog

The December 8, 2025, AWS Weekly Roundup recaps the major themes from AWS re:Invent, signaling a significant industry transition from AI assistants to autonomous AI agents. While technical innovation in infrastructure remains a priority, the event underscored that developers remain at the heart of the AWS mission, empowered by new tools to automate complex tasks using natural language. This shift represents a "renaissance" in cloud computing, where purpose-built infrastructure is now designed to support the non-deterministic nature of agentic workloads.

## Community Recognition and the Now Go Build Award

* Raphael Francis Quisumbing (Rafi) from the Philippines was honored with the Now Go Build Award, presented by Werner Vogels.
* A veteran of the ecosystem, Quisumbing has served as an AWS Hero since 2015 and has co-led the AWS User Group Philippines for over a decade.
* The recognition emphasizes AWS's continued focus on community dedication and the role of individual builders in empowering regional developer ecosystems.

## The Evolution from AI Assistants to Agents

* AWS CEO Matt Garman identified AI agents as the next major inflection point for the industry, moving beyond simple chat interfaces to systems that perform tasks and automate workflows.
* Dr. Swami Sivasubramanian highlighted a paradigm shift where natural language serves as the primary interface for describing complex goals.
* These agents are designed to autonomously generate plans, write necessary code, and call various tools to execute complete solutions without constant human intervention.
* AWS is prioritizing the development of production-ready infrastructure that is secure and scalable specifically to handle the "non-deterministic" behavior of these AI agents.

## Core Infrastructure and the Developer Renaissance

* Despite the focus on AI, AWS reaffirmed that its core mission remains the "freedom to invent," keeping developers central to its 20-year strategy.
* Leaders Peter DeSantis and Dave Brown reinforced that foundational attributes—security, availability, and performance—remain the non-negotiable pillars of the AWS cloud.
* The integration of AI agents is framed as a way to finally realize material business returns on AI investments by moving from experimental use cases to automated business logic.

To maximize the value of these updates, organizations should begin evaluating how to transition from simple LLM implementations to agentic frameworks that can execute end-to-end business processes. Reviewing the on-demand keynote sessions from re:Invent 2025 is recommended for technical teams looking to implement the latest secure, agent-ready infrastructure.

aws

Introducing checkpointless and elastic training on Amazon SageMaker HyperPod | AWS News Blog

Amazon SageMaker HyperPod has introduced checkpointless and elastic training features to accelerate AI model development by minimizing infrastructure-related downtime. These advancements replace traditional, slow checkpoint-restart cycles with peer-to-peer state recovery and enable training workloads to scale dynamically based on available compute capacity. By decoupling training progress from static hardware configurations, organizations can significantly reduce model time-to-market while maximizing cluster utilization.

**Checkpointless Training and Rapid State Recovery**

* Replaces the traditional five-stage recovery process—including job termination, network setup, and checkpoint retrieval—which can often take up to an hour on self-managed clusters.
* Utilizes peer-to-peer state replication and in-process recovery to allow healthy nodes to restore the model state instantly without restarting the entire job.
* Incorporates technical optimizations such as collective communications initialization and memory-mapped data loading to enable efficient data caching.
* Reduces recovery downtime by over 80% based on internal studies of clusters with up to 2,000 GPUs, and was a core technology used in the development of Amazon Nova models.

**Elastic Training and Automated Cluster Scaling**

* Allows AI workloads to automatically expand to use idle cluster capacity as it becomes available and contract when resources are needed for higher-priority tasks.
* Reduces the need for manual intervention, saving hours of engineering time previously spent reconfiguring training jobs to match fluctuating compute availability.
* Optimizes total cost of ownership by ensuring that training momentum continues even as inference volumes peak and pull resources away from the training pool.
* Orchestrates these transitions seamlessly through the HyperPod training operator, ensuring that model development is not disrupted by infrastructure changes.
For teams managing large-scale AI workloads, adopting these features can reclaim significant development time and lower operational costs by preventing idle cluster periods. Organizations scaling to thousands of accelerators should prioritize checkpointless training to mitigate the impact of hardware faults and maintain continuous training momentum.
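The peer-to-peer recovery idea can be illustrated with a deliberately tiny sketch: each worker keeps a replica of a neighbor's state, so a failed worker is rebuilt from its peer instead of from a checkpoint in storage. This is a toy model of the concept, not the HyperPod API; `Worker`, `replicate`, and `recover` are hypothetical names.

```python
# Toy sketch of peer-to-peer state recovery (not the HyperPod API):
# each worker replicates its state to the next worker in a ring, so a
# failed worker can be restored from its peer's replica without
# restarting the whole job or reading a checkpoint from storage.
class Worker:
    def __init__(self, rank: int):
        self.rank = rank
        self.state = {"step": 0}          # stand-in for model/optimizer state
        self.peer_replica: dict | None = None  # copy of left neighbor's state

def replicate(workers: list[Worker]) -> None:
    """Each worker stores a copy of its left neighbor's state."""
    n = len(workers)
    for w in workers:
        w.peer_replica = dict(workers[(w.rank - 1) % n].state)

def recover(workers: list[Worker], failed_rank: int) -> Worker:
    """Rebuild a failed worker from the replica held by its right neighbor."""
    n = len(workers)
    replica = workers[(failed_rank + 1) % n].peer_replica
    fresh = Worker(failed_rank)
    fresh.state = dict(replica)
    return fresh
```

In the real system the replicated state is large tensor data moved over fast interconnects, but the control flow is the same: no job restart, no round trip to external checkpoint storage.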

google

Solving virtual machine puzzles: How AI is optimizing cloud computing

Google researchers have developed LAVA, a scheduling framework designed to optimize virtual machine (VM) allocation in large-scale data centers by accurately predicting and adapting to VM lifespans. By moving beyond static, one-time predictions toward a "continuous re-prediction" model based on survival analysis, the system significantly improves resource efficiency and reduces fragmentation. This approach allows cloud providers to solve the complex "bin packing" problem more effectively, leading to better capacity utilization and easier system maintenance.

### The Challenge of Long-Tailed VM Distributions

* Cloud workloads exhibit an extreme long-tailed distribution: while 88% of VMs live for less than an hour, these short-lived jobs consume only 2% of total resources.
* The rare VMs that run for 30 days or longer account for a massive fraction of compute resources, meaning their placement has a disproportionate impact on host availability.
* Poor allocation leads to "resource stranding," where a server's remaining capacity is too small or unbalanced to host new VMs, effectively wasting expensive hardware.
* Traditional machine learning models that provide only a single prediction at VM creation are often fragile, as a single misprediction can block a physical host from being cleared for maintenance or new tasks.

### Continuous Re-prediction via Survival Analysis

* Instead of predicting a single average lifetime, LAVA uses an ML model to generate a probability distribution of a VM's expected duration.
* The system employs "continuous re-prediction," asking how much longer a VM is expected to run given how long it has already survived (e.g., a VM that has run for five days is assigned a different remaining lifespan than a brand-new one).
* This adaptive approach allows the scheduling logic to automatically correct for initial mispredictions as more data about the VM's actual behavior becomes available over time.
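The re-prediction step above is just a conditional expectation over the predicted lifetime distribution: given that a VM has already survived to `age`, its expected remaining lifetime is E[T − age | T > age]. The sketch below computes this for a discrete distribution; the distribution values are illustrative stand-ins for a model's output, not numbers from the paper.

```python
# Sketch of continuous re-prediction: expected remaining lifetime of a VM
# that has already survived `age`, i.e. E[T - age | T > age], computed
# from a discrete lifetime distribution (lifetime -> probability).
def expected_remaining(dist: dict[float, float], age: float) -> float:
    """Condition the lifetime distribution on survival past `age`."""
    survivors = {t: p for t, p in dist.items() if t > age}
    total = sum(survivors.values())
    if total == 0:
        return 0.0  # no mass beyond this age in the predicted distribution
    return sum((t - age) * p for t, p in survivors.items()) / total

# Illustrative long-tailed mix (hours): most VMs die fast, a few run ~30 days.
lifetimes_hours = {1: 0.88, 24: 0.10, 720: 0.02}
```

With these illustrative numbers, a brand-new VM has an expected lifetime of about 17.7 hours, dominated by the short-lived mass; but a VM that has already survived a full day must belong to the 30-day tail, so its re-predicted remaining lifetime jumps to 696 hours. That automatic correction is exactly the adaptive behavior described above.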
### Novel Scheduling and Rescheduling Algorithms

* **Non-Invasive Lifetime Aware Scheduling (NILAS):** Currently deployed on Google’s Borg cluster manager, this algorithm ranks potential hosts by grouping VMs with similar expected exit times to increase the frequency of "empty hosts" available for maintenance.
* **Lifetime-Aware VM Allocation (LAVA):** This algorithm fills resource gaps on hosts containing long-lived VMs with jobs that are at least an order of magnitude shorter. This ensures the short-lived VMs exit quickly without extending the host's overall occupation time.
* **Lifetime-Aware Rescheduling (LARS):** To minimize disruptions during defragmentation, LARS identifies and migrates the longest-lived VMs first while allowing short-lived VMs to finish their tasks naturally on the original host.

By integrating survival-analysis-based predictions into the core logic of data center management, cloud providers can transition from reactive scheduling to a proactive model. This system not only maximizes resource density but also ensures that the physical infrastructure remains flexible enough to handle large, resource-intensive provisioning requests and essential system updates.
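One way to make the NILAS ranking idea concrete is a toy scorer that prefers hosts whose existing VMs will exit around the same time as the incoming VM, while leaving already-empty hosts untouched for maintenance and large requests. This is an interpretation of the published description, not Google's implementation; `rank_hosts` and its scoring rule are hypothetical.

```python
# Toy sketch of lifetime-aware host ranking (an interpretation of NILAS,
# not Google's code): place a VM on the host whose current VMs have the
# closest expected exit times, so whole hosts drain together and become
# empty for maintenance; keep already-empty hosts in reserve.
def rank_hosts(hosts: dict[str, list[float]], vm_exit: float) -> list[str]:
    """hosts maps host name -> expected exit times of its current VMs.
    Returns host names ordered best-candidate-first."""
    def mismatch(name: str) -> float:
        exits = hosts[name]
        if not exits:
            return float("inf")  # avoid breaking into empty hosts
        # worst-case gap between this VM's exit and the host's existing exits
        return max(abs(vm_exit - t) for t in exits)
    return sorted(hosts, key=mismatch)
```

Under this scoring, a VM expected to exit at hour 11 lands on a host whose VMs exit around hours 10–12 rather than on a host occupied by a month-long job, so neither host's drain time is extended.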