gpu-acceleration

3 posts

aws

Announcing Amazon EC2 G7e instances accelerated by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs | AWS News Blog

Amazon has announced the general availability of EC2 G7e instances, a new hardware tier powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs designed for generative AI and high-end graphics. These instances deliver up to 2.3 times the inference performance of their G6e predecessors while providing significant upgrades to memory and bandwidth. The launch aims to provide a cost-effective option for running medium-sized AI models and complex spatial computing workloads at scale.

**Blackwell GPU and Memory Advancements**

* The G7e instances feature NVIDIA RTX PRO 6000 Blackwell GPUs, which provide twice the memory and 1.85 times the memory bandwidth of the G6e generation.
* Each GPU provides 96 GB of memory, allowing users to run medium-sized models, such as those with up to 70 billion parameters, on a single GPU using FP8 precision.
* The architecture is optimized for both spatial computing and scientific workloads, offering the highest graphics performance currently available in the EC2 portfolio.

**High-Speed Connectivity and Multi-GPU Scaling**

* To support larger models, G7e instances use NVIDIA GPUDirect P2P, enabling direct GPU-to-GPU communication over PCIe with minimal latency.
* These instances offer four times the inter-GPU bandwidth of the L40S GPUs found in G6e instances, making data transfer in multi-GPU configurations more efficient.
* Total GPU memory scales up to 768 GB within a single node, supporting massive inference tasks across eight interconnected GPUs.

**Networking and Storage Performance**

* G7e instances provide up to 1,600 Gbps of network bandwidth, a four-fold increase over the previous generation, making them suitable for small multi-node clusters.
* Support for NVIDIA GPUDirect RDMA (Remote Direct Memory Access) via Elastic Fabric Adapter (EFA) reduces latency for remote GPU-to-GPU communication.
* The instances support GPUDirect Storage with Amazon FSx for Lustre at up to 1.2 Tbps of throughput, ensuring rapid model loading and data processing.

**System Specifications and Configurations**

* Under the hood, G7e instances are powered by Intel Emerald Rapids processors and support up to 192 vCPUs and 2,048 GiB of system memory.
* Local storage options include up to 15.2 TB of NVMe SSD capacity for high-speed data caching and local processing.
* The instance family ranges from the g7e.2xlarge (1 GPU, 8 vCPUs) to the g7e.48xlarge (8 GPUs, 192 vCPUs).

For developers ready to transition to the Blackwell architecture, these instances are accessible through AWS Deep Learning AMIs (DLAMI). They represent a major step forward for organizations that need to balance the high memory requirements of modern LLMs against the cost efficiencies of the G-series instance family.
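As a quick illustration of the single-GPU claim above, the sketch below checks the FP8 memory arithmetic for a 70-billion-parameter model and then launches a g7e.2xlarge with boto3. This is a minimal sketch, not the official launch procedure: the AMI ID is a placeholder, and it assumes G7e capacity is available in your region.

```python
import boto3

# Back-of-the-envelope check from the announcement: at FP8 (1 byte per
# parameter), a 70B-parameter model needs roughly 70 GB of weights, which
# fits on a single 96 GB RTX PRO 6000 Blackwell GPU with headroom left
# for the KV cache and activations.
params_b = 70          # model size in billions of parameters
bytes_per_param = 1    # FP8 precision
weights_gb = params_b * bytes_per_param
print(f"~{weights_gb} GB of weights vs. 96 GB per GPU")

# Launch a single-GPU g7e.2xlarge. The ImageId below is a placeholder;
# substitute the Deep Learning AMI ID for your region.
ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder DLAMI ID
    InstanceType="g7e.2xlarge",
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```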

aws

Amazon OpenSearch Service improves vector database performance and cost with GPU acceleration and auto-optimization | AWS News Blog

Amazon OpenSearch Service has introduced serverless GPU acceleration and auto-optimization features designed to improve the performance and cost-efficiency of large-scale vector databases. These updates allow users to build vector indexes up to ten times faster at roughly a quarter of the traditional indexing cost, enabling billion-scale databases to be built in under an hour. By automating complex tuning, OpenSearch Service simplifies the deployment of generative AI and high-speed search applications.

### GPU Acceleration for Rapid Indexing

The new serverless GPU acceleration speeds up the creation of vector data structures by offloading intensive workloads to specialized hardware.

* **Performance Gains:** Indexing is up to 10x faster than non-GPU configurations, significantly reducing time-to-market for data-heavy applications.
* **Cost Efficiency:** Indexing costs drop to approximately 25% of standard costs, and users pay only for active processing through OpenSearch Compute Units (OCUs) rather than for idle instance time.
* **Serverless Management:** There is no need to provision or manage GPU instances manually; OpenSearch Service automatically detects acceleration opportunities and isolates workloads within the user's Amazon VPC.
* **Operational Scope:** Acceleration applies automatically to both initial indexing and subsequent force-merge operations.

### Automated Vector Index Optimization

Auto-optimization removes the need for deep vector expertise by automatically balancing competing performance metrics.

* **Simplified Tuning:** The system replaces manual index tuning, which can traditionally take weeks, with automated configuration.
* **Resource Balancing:** The tool finds the optimal trade-off between search latency, search quality (recall), and memory requirements.
* **Improved Accuracy:** Users can achieve higher recall and better cost savings than with default, unoptimized index configurations.

### Configuration and Integration

These features can be enabled on new or existing OpenSearch Service domains and Serverless collections through the AWS Console or CLI; a sketch of the index settings follows this summary.

* **CLI Activation:** Users can enable acceleration on existing domains using the `update-domain-config` command with the `--aiml-options` flag set to enable `ServerlessVectorAcceleration`.
* **Index Settings:** To leverage GPU processing, users must create a vector index with specific settings, notably setting `index.knn.remote_index_build.enabled` to `true`.
* **Supported Workloads:** The service supports standard OpenSearch operations, including the Bulk API for adding vector data and text embeddings.

For organizations managing large-scale vector workloads for RAG (Retrieval-Augmented Generation) or semantic search, enabling GPU acceleration is a highly recommended step to reduce operational overhead. Developers should transition existing indexes to include the `remote_index_build` setting to take immediate advantage of the improved speed and reduced OCU pricing.
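Building on the index setting named above, here is a minimal opensearch-py sketch that creates a vector index eligible for remote (GPU-accelerated) builds. The endpoint, credentials, field name, and dimension are placeholders, and the `hnsw`/`faiss` method block is an assumption based on standard OpenSearch k-NN usage rather than anything the announcement specifies.

```python
from opensearchpy import OpenSearch

# Placeholder endpoint and credentials; substitute your domain's values.
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("user", "password"),
    use_ssl=True,
)

# Create a vector index with remote (GPU-accelerated) index builds enabled,
# per the index.knn.remote_index_build.enabled setting described above.
client.indices.create(
    index="docs-embeddings",
    body={
        "settings": {
            "index.knn": True,
            "index.knn.remote_index_build.enabled": True,
        },
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 768,  # match your embedding model
                    # hnsw/faiss is an assumption; tune to your workload
                    "method": {"name": "hnsw", "engine": "faiss", "space_type": "l2"},
                }
            }
        },
    },
)
```

From here, vectors can be added with the standard Bulk API mentioned above, and the service handles GPU offload automatically during indexing and force merges.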

google

Introducing interactive on-device segmentation in Snapseed

Google has introduced a new "Object Brush" feature in Snapseed that enables intuitive, real-time selective photo editing through a novel on-device segmentation technology. Leveraging a high-performance interactive AI model, users can isolate complex subjects with simple touch gestures in under 20 milliseconds, bridging the gap between professional-grade editing and mobile convenience. This breakthrough is achieved through a teacher-student training architecture that prioritizes both pixel-perfect accuracy and low-latency performance on consumer hardware.

### High-Performance On-Device Inference

* The system is powered by the Interactive Segmenter model, which is integrated directly into the Snapseed "Adjust" tool to enable immediate object-based modifications.
* To keep the experience fluid, the model uses the MediaPipe framework and LiteRT's GPU acceleration to process selections in under 20 ms.
* The interface supports dynamic refinement, letting users provide real-time feedback by tracing lines or tapping to add or subtract specific areas of an image.

### Teacher-Student Model Distillation

* The development team first created "Interactive Segmenter: Teacher," a large-scale model fine-tuned on 30,000 high-quality, pixel-perfect manual annotations spanning more than 350 object categories.
* Because the Teacher model's size and computational requirements are prohibitive for mobile use, researchers developed "Interactive Segmenter: Edge" through knowledge distillation.
* The distillation used a dataset of over 2 million weakly annotated images, allowing the smaller Edge model to inherit the Teacher model's generalization capabilities while keeping a footprint suitable for mobile devices.

### Training via Synthetic User Prompts

* To make the model universally capable across object types, the training process uses a class-agnostic approach based on the Big Transfer (BiT) strategy.
* The model learns to interpret user intent through "prompt generation," which simulates real-world interactions such as random scribbles, taps, and lasso (box) selections.
* During training, both the Teacher and Edge models receive identical prompts, such as red foreground scribbles and blue background scribbles, so the student learns to produce high-quality masks even from imprecise user input (a sketch of this setup follows below).

This advancement significantly lowers the barrier to entry for complex photo manipulation by moving heavy-duty AI processing directly onto the mobile device. Users can expect a more responsive and precise editing experience that handles everything from fine-tuning a subject's lighting to isolating specific environmental elements like clouds or clothing.
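To make the distillation setup concrete, here is a minimal PyTorch sketch of one training step: a frozen teacher produces soft masks on weakly annotated images, and the small student learns to reproduce them from the identical image-plus-prompt input. The toy architectures, the BCE loss, and the rasterized scribble channels are all illustrative assumptions; the post does not disclose the actual model design or training objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the real models: both take an RGB image plus two prompt
# channels (foreground and background scribbles rasterized as masks) and
# output a single-channel segmentation logit map.
def make_segmenter(width: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(5, width, 3, padding=1),  # 3 RGB + 2 prompt channels
        nn.ReLU(),
        nn.Conv2d(width, 1, 3, padding=1),
    )

teacher = make_segmenter(width=64).eval()   # large model, frozen
student = make_segmenter(width=8)           # small "Edge" model, trained
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

def random_prompts(batch: int, hw: int) -> torch.Tensor:
    """Simulate user interactions: sparse random 'scribble' pixels,
    one channel for foreground and one for background."""
    return (torch.rand(batch, 2, hw, hw) > 0.995).float()

images = torch.rand(4, 3, 128, 128)     # stand-in for weakly annotated images
prompts = random_prompts(4, 128)        # identical prompts fed to both models
inputs = torch.cat([images, prompts], dim=1)

with torch.no_grad():
    teacher_mask = torch.sigmoid(teacher(inputs))  # soft pseudo-labels

# One distillation step: the student matches the teacher's soft masks.
student_logits = student(inputs)
loss = F.binary_cross_entropy_with_logits(student_logits, teacher_mask)
loss.backward()
optimizer.step()
```

The key idea this illustrates is that the weakly annotated images need no ground-truth masks: the frozen teacher supplies the supervision, while the shared synthetic prompts teach the student to respond to imprecise user input the same way the teacher does.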