Amazon has announced the general availability of EC2 G7e instances, a new hardware tier powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs designed for generative AI and high-end graphics. These instances deliver up to 2.3 times the inference performance of their G6e predecessors while providing significant upgrades to memory and bandwidth. This launch aims to provide a cost-effective solution for running medium-sized AI models and complex spatial computing workloads at scale.
### Blackwell GPU and Memory Advancements
* The G7e instances feature NVIDIA RTX PRO 6000 Blackwell GPUs, which provide twice the memory and 1.85 times the memory bandwidth of the G6e generation.
* Each GPU provides 96 GB of memory, enough to run medium-sized models of roughly 70 billion parameters on a single GPU at FP8 precision (see the sizing sketch after this list).
* The architecture is optimized for both spatial computing and scientific workloads, offering the highest graphics performance currently available in the EC2 portfolio.
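As a back-of-the-envelope check of the 70-billion-parameter claim (my own arithmetic, not from the AWS announcement): FP8 stores one byte per parameter, so the weights alone occupy about 70 GB, leaving headroom on a 96 GB GPU for the KV cache and activations.

```python
# Rough sizing check (illustrative, not from the AWS announcement):
# FP8 stores 1 byte per parameter, so a 70B-parameter model's weights
# need ~70 GB, leaving ~26 GB of the 96 GB for KV cache and activations.
params = 70e9
weights_gb = params * 1 / 1e9   # 1 byte per parameter at FP8
headroom_gb = 96 - weights_gb   # per-GPU memory on G7e
print(f"weights: {weights_gb:.0f} GB, headroom: {headroom_gb:.0f} GB")
```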
### High-Speed Connectivity and Multi-GPU Scaling
* To support large-scale models, G7e instances utilize NVIDIA GPUDirect P2P, enabling direct GPU-to-GPU communication over PCIe with minimal latency (a quick availability check is sketched after this list).
* These instances offer four times the inter-GPU bandwidth of the L40S GPUs found in G6e instances, facilitating more efficient data transfer in multi-GPU configurations.
* Total GPU memory can scale up to 768 GB within a single node, supporting massive inference tasks across eight interconnected GPUs.
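A minimal PyTorch sketch for confirming that peer-to-peer paths are visible on a multi-GPU instance (this checks generic CUDA P2P availability, which is the mechanism GPUDirect P2P exposes, and is not G7e-specific):

```python
import torch

# Enumerate every GPU pair and report whether CUDA peer-to-peer access
# is available between them.
count = torch.cuda.device_count()
for src in range(count):
    for dst in range(count):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: P2P {'available' if ok else 'unavailable'}")
```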
### Networking and Storage Performance
* G7e instances provide up to 1,600 Gbps of network bandwidth, a four-fold increase over previous generations, making them suitable for small-scale multi-node clusters.
* Support for NVIDIA GPUDirect Remote Direct Memory Access (RDMA) via Elastic Fabric Adapter (EFA) reduces latency for remote GPU-to-GPU communication (a minimal multi-node sketch follows this list).
* The instances support GPUDirect Storage with Amazon FSx for Lustre, achieving throughput of up to 1.2 Tbps for rapid model loading and data processing.
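On AWS, EFA with GPUDirect RDMA is typically consumed through NCCL (via the aws-ofi-nccl plugin, which the GPU DLAMIs generally ship with). A minimal torch.distributed sketch, assuming a launcher such as torchrun supplies the usual rank environment variables:

```python
import os
import torch
import torch.distributed as dist

# Minimal NCCL all-reduce; with EFA and the aws-ofi-nccl plugin in place,
# inter-node GPU traffic can use GPUDirect RDMA transparently.
dist.init_process_group(backend="nccl")  # launcher supplies RANK/WORLD_SIZE/MASTER_ADDR
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
t = torch.ones(1, device="cuda")
dist.all_reduce(t)  # sums the tensor across every rank in the cluster
print(f"rank {dist.get_rank()}: {t.item()}")
```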
### System Specifications and Configurations
* Under the hood, G7e instances are powered by Intel Xeon Scalable (Emerald Rapids) processors and support up to 192 vCPUs and 2,048 GiB of system memory.
* Local storage options include up to 15.2 TB of NVMe SSD capacity to handle high-speed data caching and local processing.
* The instance family ranges from the g7e.2xlarge (1 GPU, 8 vCPUs) to the g7e.48xlarge (8 GPUs, 192 vCPUs).
For developers ready to transition to Blackwell-based architecture, these instances are accessible through AWS Deep Learning AMIs (DLAMI). They represent a major step forward for organizations needing to balance the high memory requirements of modern LLMs with the cost efficiencies of the G-series instance family.
Amazon OpenSearch Service has introduced serverless GPU acceleration and auto-optimization features designed to enhance the performance and cost-efficiency of large-scale vector databases. These updates allow users to build vector indexes up to ten times faster at a quarter of the traditional indexing cost, enabling the creation of billion-scale databases in under an hour. By automating complex tuning processes, OpenSearch Service simplifies the deployment of generative AI and high-speed search applications.
### GPU Acceleration for Rapid Indexing
The new serverless GPU acceleration streamlines the creation of vector data structures by offloading intensive workloads to specialized hardware.
* **Performance Gains:** Indexing speed is increased by 10x compared to non-GPU configurations, significantly reducing the time-to-market for data-heavy applications.
* **Cost Efficiency:** Indexing costs are reduced to approximately 25% of standard costs, and users only pay for active processing through OpenSearch Compute Units (OCU) rather than idle instance time.
* **Serverless Management:** There is no need to provision or manage GPU instances manually; OpenSearch Service automatically detects acceleration opportunities and isolates workloads within the user's Amazon VPC.
* **Operational Scope:** Acceleration is automatically applied to both initial indexing and subsequent force-merge operations.
### Automated Vector Index Optimization
Auto-optimization removes the requirement for deep vector expertise by automatically balancing competing performance metrics.
* **Simplified Tuning:** The system replaces manual index tuning—which can traditionally take weeks—with automated configurations.
* **Resource Balancing:** The tool finds the optimal trade-off between search latency, search quality (recall), and memory requirements; the manual knobs it replaces are sketched after this list.
* **Improved Accuracy:** Users can achieve higher recall rates and better cost savings compared to using default, unoptimized index configurations.
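To make the trade-off concrete, these are the kinds of parameters auto-optimization now chooses on the user's behalf (a hand-written OpenSearch k-NN `method` block with illustrative values; raising `m` and `ef_construction` generally improves recall at the cost of memory and index build time):

```python
# The "method" object of a knn_vector field that users previously tuned by hand.
# Values are illustrative, not recommendations.
manual_method = {
    "name": "hnsw",           # graph-based ANN algorithm
    "engine": "faiss",
    "space_type": "l2",
    "parameters": {"m": 16, "ef_construction": 128},
}
```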
### Configuration and Integration
These features can be integrated into new or existing OpenSearch Service domains and Serverless collections through the AWS Console or CLI.
* **CLI Activation:** Users can enable acceleration on existing domains using the `update-domain-config` command with the `--aiml-options` flag set to enable `ServerlessVectorAcceleration`.
* **Index Settings:** To leverage GPU processing, users must create a vector index with specific settings, notably setting `index.knn.remote_index_build.enabled` to `true` (see the example after this list).
* **Supported Workloads:** The service supports standard OpenSearch operations, including the Bulk API for adding vector data and text embeddings.
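A minimal sketch using the opensearch-py client (the endpoint, index name, field name, and dimension are placeholders; only the `remote_index_build` setting comes from the announcement):

```python
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=["https://my-domain.us-east-1.es.amazonaws.com"])  # placeholder endpoint

# Create a k-NN index that opts in to the GPU-accelerated remote index build.
client.indices.create(
    index="docs-vectors",  # placeholder name
    body={
        "settings": {
            "index.knn": True,
            "index.knn.remote_index_build.enabled": True,  # enables GPU-accelerated builds
        },
        "mappings": {
            "properties": {
                "embedding": {"type": "knn_vector", "dimension": 768}  # match your embedding model
            }
        },
    },
)

# Ingest vectors through the standard Bulk API, as noted above.
docs = ({"_index": "docs-vectors", "_source": {"embedding": [0.1] * 768}} for _ in range(1000))
helpers.bulk(client, docs)
```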
For organizations managing large-scale vector workloads for RAG (Retrieval-Augmented Generation) or semantic search, enabling GPU acceleration is a highly recommended step to reduce operational overhead. Developers should transition existing indexes to include the `remote_index_build` setting to take immediate advantage of the improved speed and reduced OCU pricing.
Google has introduced a new "Object Brush" feature in Snapseed that enables intuitive, real-time selective photo editing through a novel on-device segmentation technology. By leveraging a high-performance interactive AI model, users can isolate complex subjects with simple touch gestures in under 20 milliseconds, bridging the gap between professional-grade editing and mobile convenience. This breakthrough is achieved through a sophisticated teacher-student training architecture that prioritizes both pixel-perfect accuracy and low-latency performance on consumer hardware.
### High-Performance On-Device Inference
* The system is powered by the Interactive Segmenter model, which is integrated directly into the Snapseed "Adjust" tool to facilitate immediate object-based modifications.
* To ensure a fluid user experience, the model uses the MediaPipe framework and LiteRT's GPU acceleration to process selections in under 20 ms (a similar public API is sketched after this list).
* The interface supports dynamic refinement, allowing users to provide real-time feedback by tracing lines or tapping to add or subtract specific areas of an image.
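Snapseed's model itself is not public, but MediaPipe ships an interactive segmentation task with the same tap-to-select interaction. A minimal sketch of its Python API (the model file and image path are placeholders, and details may vary across MediaPipe versions):

```python
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision
from mediapipe.tasks.python.components import containers

RegionOfInterest = vision.InteractiveSegmenterRegionOfInterest
NormalizedKeypoint = containers.keypoint.NormalizedKeypoint

options = vision.InteractiveSegmenterOptions(
    base_options=python.BaseOptions(model_asset_path="magic_touch.tflite"),  # placeholder model
    output_category_mask=True,
)
with vision.InteractiveSegmenter.create_from_options(options) as segmenter:
    image = mp.Image.create_from_file("photo.jpg")  # placeholder image
    # A single tap at normalized coordinates selects the object under it.
    roi = RegionOfInterest(format=RegionOfInterest.Format.KEYPOINT,
                           keypoint=NormalizedKeypoint(x=0.5, y=0.5))
    result = segmenter.segment(image, roi)
    mask = result.category_mask  # per-pixel selection mask
```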
### Teacher-Student Model Distillation
* The development team first created "Interactive Segmenter: Teacher," a large-scale model fine-tuned on 30,000 high-quality, pixel-perfect manual annotations across more than 350 object categories.
* Because the Teacher model’s size and computational requirements are prohibitive for mobile use, researchers developed "Interactive Segmenter: Edge" through knowledge distillation.
* This distillation process utilized a dataset of over 2 million weakly annotated images, allowing the smaller Edge model to inherit the generalization capabilities of the Teacher model while maintaining a footprint suitable for mobile devices.
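A generic PyTorch sketch of the distillation step described above (my own illustration; `teacher` and `student` stand in for the Teacher and Edge networks, and the loss choice is an assumption):

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, images, prompts, optimizer):
    """One knowledge-distillation step: the frozen Teacher produces soft
    target masks on weakly annotated images, and the Edge student is
    trained to reproduce them from the same inputs and prompts."""
    with torch.no_grad():
        teacher_masks = torch.sigmoid(teacher(images, prompts))  # soft targets
    student_logits = student(images, prompts)
    loss = F.binary_cross_entropy_with_logits(student_logits, teacher_masks)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```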
### Training via Synthetic User Prompts
* To make the model universally capable across all object types, the training process uses a class-agnostic approach based on the Big Transfer (BiT) strategy.
* The model learns to interpret user intent through "prompt generation," which simulates real-world interactions such as random scribbles, taps, and lasso (box) selections.
* During training, both the Teacher and Edge models receive identical prompts—such as red foreground scribbles and blue background scribbles—to ensure the student model learns to produce high-quality masks even from imprecise user input.
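A simplified illustration of that prompt generation (my own sketch, not the published pipeline): a simulated tap or scribble is rasterized into foreground/background channels that both models receive alongside the image.

```python
import torch

def synthetic_prompt(height, width, fg_points, bg_points, radius=5):
    """Rasterize simulated user input into a 2-channel prompt map:
    channel 0 marks foreground hints ("red scribbles"), channel 1
    marks background hints ("blue scribbles")."""
    prompt = torch.zeros(2, height, width)
    ys, xs = torch.meshgrid(torch.arange(height), torch.arange(width), indexing="ij")
    for channel, points in ((0, fg_points), (1, bg_points)):
        for (py, px) in points:
            prompt[channel][(ys - py) ** 2 + (xs - px) ** 2 <= radius ** 2] = 1.0
    return prompt

# e.g. one tap on the subject plus one background tap:
p = synthetic_prompt(256, 256, fg_points=[(128, 128)], bg_points=[(20, 20)])
```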
This advancement significantly lowers the barrier to entry for complex photo manipulation by moving heavy-duty AI processing directly onto the mobile device. Users can expect a more responsive and precise editing experience that handles everything from fine-tuning a subject's lighting to isolating specific environmental elements like clouds or clothing.