google

Simulating large systems with Regression Language Models

Researchers from Google have introduced Regression Language Models (RLMs) as a universal solution for numeric prediction tasks by framing regression as a text-to-text problem. By converting complex, unstructured system data into strings, RLMs can predict performance metrics without the need for manual feature engineering or data normalization. This approach allows large language models to move beyond subjective human feedback and directly model raw operational data for large-scale software and industrial infrastructures.

## Conceptualizing Text-to-Text Regression

* Traditional regression methods rely on tabular data (fixed-length numeric vectors), which is laborious to maintain for evolving systems like software logs or hardware patterns.
* RLMs represent the input state ($x$) as a structured text string (such as JSON or YAML) and the numerical output ($y$) as a text string.
* The model is trained using standard next-token prediction and cross-entropy loss, allowing it to function as a universal approximator for complex data types.
* This paradigm eliminates the need for manual feature engineering, as the model learns directly from the raw textual representation of the system state.

## Architecture and Training for Large Systems

* The research utilizes a compact RLM consisting of a two-layer encoder-decoder architecture with 60 million parameters.
* To manage inputs that can reach up to 1 million tokens, the system places the most important features at the beginning of the string so that critical data survives truncation to the model's 8k token limit.
* Pre-training the RLM on diverse regression tasks enables few-shot adaptation, allowing the model to adjust to new data types with minimal gradient updates.
* Numerical values are processed as-is within the text, removing the scaling or normalization step common in standard machine learning pipelines.
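The input-side handling described above (structured text, important features first, truncation to a token budget) can be sketched as follows. This is a minimal illustration, not the `regress-lm` API: `serialize_state`, the `importance` scores, and the use of whitespace-separated chunks as a stand-in for a real tokenizer are all assumptions made for the example.

```python
import json


def serialize_state(state: dict, importance: dict, max_tokens: int) -> str:
    """Serialize a system state as text, most important features first,
    then truncate to a token budget.

    Whitespace-separated chunks stand in for real tokenizer tokens
    (an assumption made only for this sketch).
    """
    # Order keys by descending importance; keys without a score go last.
    ordered_keys = sorted(state, key=lambda k: -importance.get(k, 0.0))
    ordered = {k: state[k] for k in ordered_keys}
    # Numbers are written as-is: no scaling or normalization.
    text = json.dumps(ordered)
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


# Hypothetical state of a job; the low-importance log tail is cut first.
state = {"log_tail": "ok ok ok", "cpu_request": 4.5, "job": "indexer"}
importance = {"cpu_request": 1.0, "job": 0.5, "log_tail": 0.1}
model_input = serialize_state(state, importance, max_tokens=4)
print(model_input)
```

The truncated string is no longer valid JSON, which is acceptable here because the model consumes it as plain text rather than parsing it.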
## Optimizing Google's Borg Infrastructure

* The method was specifically applied to Google’s Borg system to predict MIPS per GCU (Millions of Instructions Per Second per Google Compute Unit), a vital efficiency metric.
* The RLM simulates the outcomes of complex bin-packing algorithms within a "digital twin" framework to optimize resource allocation across CPUs and TPUs.
* By analyzing execution traces and textual metadata, the model provides high-accuracy forecasting for diverse workloads including Gmail, YouTube, and Maps.

## Density Capture and Uncertainty Modeling

* Unlike traditional regressors that provide a single point estimate, RLMs can capture full probability distributions by sampling the decoded output multiple times.
* This density estimation is critical for modeling aleatoric uncertainty, which represents the inherent randomness and stochastic load demands of large-scale compute environments.
* The ability to visualize these distributions helps engineers identify the range of possible outcomes and the inherent variability of the system's performance over time.

This research demonstrates that small, specialized language models can effectively replace traditional regression methods in highly dynamic environments. For practitioners looking to implement these capabilities, the open-source `regress-lm` library provides a framework for simulating large systems and predicting performance across varied industrial and scientific use cases.
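The idea of recovering a distribution by decoding several times can be sketched as below. The function `sample_decoder` is a hypothetical stand-in that draws from a Gaussian; in a real setup it would be one stochastic decode of a trained RLM, and none of the names here come from the `regress-lm` library.

```python
import random
import statistics


def sample_decoder(rng: random.Random) -> str:
    """Hypothetical stand-in for one stochastic decode of an RLM.

    Returns the predicted number as text, as a text-to-text model would.
    """
    return f"{rng.gauss(100.0, 5.0):.2f}"


def estimate_density(n_samples: int, seed: int = 0) -> dict:
    """Decode repeatedly and summarize the empirical distribution."""
    rng = random.Random(seed)
    # Parse each decoded string back into a float.
    values = [float(sample_decoder(rng)) for _ in range(n_samples)]
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(values, n=20)
    return {
        "n": len(values),
        "median": statistics.median(values),
        "p05": cuts[0],
        "p95": cuts[-1],
    }


summary = estimate_density(n_samples=200)
print(summary)
```

The spread between `p05` and `p95` is one simple way to report the variability that a single point estimate would hide.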

line

Introducing a case of utilizing DDD in

LY Corporation’s ABC Studio developed a specialized retail Merchant system by leveraging Domain-Driven Design (DDD) to overcome the functional limitations of a legacy food-delivery infrastructure. The project demonstrates that the primary value of DDD lies not just in technical implementation, but in aligning organizational structures and team responsibilities with domain boundaries. By focusing on the roles and responsibilities of the system rather than just the code, the team created a scalable platform capable of supporting diverse consumer interfaces.

### Redefining the Retail Domain

* The legacy system treated retail items like restaurant entries, creating friction for specialized retail services; the new system was built to be a standalone platform.
* The team narrowed the domain focus to five core areas: Shop, Item, Category, Inventory, and Order.
* Sales-specific logic, such as coupons and promotions, was delegated to external "Consumer Platforms," allowing the Merchant system to serve as a high-performance information provider.

### Clean Architecture and Modular Composition

* The system utilizes Clean Architecture to ensure domain entities remain independent of external frameworks, which also provided a manageable learning curve for new team members.
* Services are split into two distinct modules: "API" modules for receiving external requests and "Engine" modules for processing business logic.
* Communication between these modules is handled asynchronously via gRPC and Apache Kafka, using the Decaton library to increase throughput while maintaining a low partition count.
* The architecture prioritizes eventual consistency, allowing for high responsiveness and scalability across the platform.

### Global Collaboration and Conway’s Law

* Development was split between teams in Korea (Core Domain) and Japan (System Integration and BFF), requiring a shared understanding of domain boundaries.
* Architectural Decision Records (ADR) were implemented to document critical decisions and prevent "knowledge drift" during long-term collaboration.
* The organizational structure was intentionally designed to mirror the system architecture, with specific teams (Core, Link, BFF, and Merchant Link) assigned to distinct domain layers.
* This alignment, reflecting Conway’s Law, ensures that changes to external consumer platforms have minimal impact on the stable core domain logic.

Successful DDD adoption requires moving beyond technical patterns like hexagonal architecture and focusing on establishing a shared understanding of roles across the organization. By structuring teams to match domain boundaries, companies can build resilient systems where the core business logic remains protected even as the external service ecosystem evolves.
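The API/Engine split with asynchronous hand-off described in the architecture section can be illustrated with a small sketch. It is not the team's implementation: an in-process `queue.Queue` stands in for a Kafka topic, and `StockChanged`, `ApiModule`, and `EngineModule` are hypothetical names chosen for the example.

```python
import queue
from dataclasses import dataclass


@dataclass
class StockChanged:
    """Hypothetical domain event for the Inventory area."""
    item_id: str
    delta: int


class ApiModule:
    """Receives external requests and only publishes events."""

    def __init__(self, topic: queue.Queue):
        self.topic = topic

    def adjust_stock(self, item_id: str, delta: int) -> str:
        self.topic.put(StockChanged(item_id, delta))
        return "accepted"  # acknowledged before the change is applied


class EngineModule:
    """Consumes events and applies the business logic."""

    def __init__(self, topic: queue.Queue):
        self.topic = topic
        self.inventory: dict[str, int] = {}

    def drain(self) -> int:
        """Process all pending events; return how many were handled."""
        handled = 0
        while True:
            try:
                event = self.topic.get_nowait()
            except queue.Empty:
                return handled
            current = self.inventory.get(event.item_id, 0)
            self.inventory[event.item_id] = current + event.delta
            handled += 1


topic = queue.Queue()  # stand-in for a message broker topic
api, engine = ApiModule(topic), EngineModule(topic)
status = api.adjust_stock("sku-1", 5)
stock_before = dict(engine.inventory)  # still empty: not yet consistent
handled = engine.drain()
```

Between the `accepted` response and the `drain()` call the inventory is stale, which is the window that eventual consistency accepts in exchange for responsiveness.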

google

SensorLM: Learning the language of wearable sensors

SensorLM is a new family of foundation models designed to bridge the gap between high-dimensional wearable sensor data and natural language descriptions. By training on a massive dataset of nearly 60 million hours of de-identified health data, the models learn to interpret complex physiological signals to provide meaningful context for human activities. This research demonstrates that integrating multimodal sensor signals with language models enables sophisticated health insights, such as zero-shot activity recognition and automated health captioning, that significantly outperform general-purpose large language models.

## Dataset Scale and Automated Annotation

* The models were pre-trained on an unprecedented 59.7 million hours of multimodal sensor data collected from over 103,000 individuals across 127 countries.
* To overcome the high cost of manual annotation, researchers developed a hierarchical pipeline that automatically generates text descriptions by calculating statistics and identifying trends within the raw sensor streams.
* Data was sourced from Fitbit and Pixel Watch devices, representing nearly 2.5 million person-days of activity and health information.

## Hybrid Training Architecture

* SensorLM unifies two primary multimodal strategies: contrastive learning and generative pre-training.
* Through contrastive learning, the model learns to discriminate between different states, such as a "light swim" versus a "strength workout," by matching sensor segments to corresponding text descriptions.
* The generative component allows the model to "speak" for the sensors, producing nuanced, context-aware natural language captions directly from high-dimensional biometric signals.

## Activity Recognition and Cross-Modal Capabilities

* The model demonstrates state-of-the-art performance in zero-shot human activity recognition, accurately classifying 20 different activities without any specific fine-tuning.
* Its few-shot learning capabilities allow the model to adapt to new tasks or individual user patterns with only a handful of examples.
* SensorLM facilitates cross-modal retrieval, enabling users or experts to find specific sensor patterns using natural language queries or to generate descriptions based on specific sensor inputs.

## Generative Health Captioning

* Beyond simple classification, the model can generate hierarchical captions that describe the statistical, structural, and semantic dimensions of a user’s data.
* Experimental results using metrics like BERTScore show that SensorLM produces captions that are more factually correct and coherent than those created by powerful non-specialist LLMs.
* This capability allows for the translation of abstract data points, such as heart rate variability or step counts, into readable summaries that explain the "why" behind physiological changes.

By providing a framework where wearable data can be understood through the lens of human language, SensorLM paves the way for more intuitive and personalized health monitoring. This technology holds the potential to transform raw biometric streams into actionable insights, helping users better understand the relationship between their activities and their overall physical well-being.
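The automated annotation step mentioned earlier (deriving text from statistics and trends in a raw sensor stream) can be sketched in a few lines. This is an illustrative toy, not the SensorLM pipeline: `describe_stream`, the least-squares slope as the trend measure, and the `flat_tol` threshold are all assumptions.

```python
import statistics


def describe_stream(name: str, unit: str, samples: list[float],
                    flat_tol: float = 0.05) -> str:
    """Turn a raw sensor stream into a short statistical caption."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean = statistics.fmean(samples)
    # Least-squares slope of value against sample index.
    x_mean = (n - 1) / 2
    num = sum((i - x_mean) * (v - mean) for i, v in enumerate(samples))
    den = sum((i - x_mean) ** 2 for i in range(n))
    slope = num / den
    if slope > flat_tol:
        trend = "rising"
    elif slope < -flat_tol:
        trend = "falling"
    else:
        trend = "stable"
    return (f"{name} averaged {mean:.1f} {unit} "
            f"(min {min(samples):.0f}, max {max(samples):.0f}) "
            f"with a {trend} trend.")


caption = describe_stream("Heart rate", "bpm", [60, 62, 65, 70, 78])
print(caption)
# Heart rate averaged 67.0 bpm (min 60, max 78) with a rising trend.
```

Captions like this could serve as the text side of sensor-text training pairs; the structural and semantic caption levels described above would need richer logic than this statistical level.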

google

Synthetic and federated: Privacy-preserving domain adaptation with LLMs for mobile applications

Researchers at Google have developed a framework for improving both small and large language models (LMs) in mobile applications like Gboard by utilizing privacy-preserving synthetic data and federated learning. This approach combines differential privacy (DP) with large language model (LLM) generation to minimize data memorization risks while achieving significant gains in production metrics like next-word prediction and proofreading. The result is a robust pipeline that allows models to adapt to specific user domains without compromising individual privacy or requiring centralized data storage.

### Strengthening Privacy with DP-FL

* Gboard has transitioned all production LMs trained on user data to a Federated Learning with Differential Privacy (DP-FL) framework, which keeps data on-device and limits memorization through formal DP guarantees.
* The deployment utilizes the **BLT-DP-FTRL** algorithm, which offers an optimized trade-off between privacy guarantees and model utility while being easier to deploy in production.
* Engineers adopted the **SI-CIFG** model architecture to facilitate efficient on-device training, ensuring the hardware can handle local updates while maintaining compatibility with DP constraints.

### Synthetic Data Generation via Public LLMs

* Powerful LLMs trained on public web data are prompted to synthesize high-quality text that mimics mobile user interactions without ever accessing actual private user data.
* The process involves a two-step prompting strategy: first, filtering public datasets to identify topics common in mobile communication, and second, generating new, domain-specific text based on those patterns.
* This synthetic data serves as a bridge for pre-training small LMs, which are then refined through private post-training on-device to capture the nuances of user behavior.
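To make the DP-FL idea from the first section concrete, the sketch below shows the generic clip-then-add-noise aggregation used in differentially private federated averaging. It is deliberately not BLT-DP-FTRL, whose correlated-noise mechanism is more involved; `clip_update`, `dp_aggregate`, and the parameter names are assumptions for illustration.

```python
import math
import random


def clip_update(update: list[float], clip_norm: float) -> list[float]:
    """Scale a client update so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]


def dp_aggregate(updates: list[list[float]], clip_norm: float,
                 noise_multiplier: float, rng: random.Random) -> list[float]:
    """Average clipped client updates after adding Gaussian noise to the sum.

    Clipping bounds any single client's influence; the noise standard
    deviation is noise_multiplier * clip_norm, as in standard DP-SGD-style
    mechanisms. This is plain independent noise, not BLT-DP-FTRL.
    """
    clipped = [clip_update(u, clip_norm) for u in updates]
    dim = len(clipped[0])
    sigma = noise_multiplier * clip_norm
    noisy_sum = [
        sum(u[i] for u in clipped) + rng.gauss(0.0, sigma)
        for i in range(dim)
    ]
    return [s / len(updates) for s in noisy_sum]


rng = random.Random(0)
client_updates = [[3.0, 4.0], [0.3, 0.4]]  # hypothetical on-device deltas
clipped_first = clip_update(client_updates[0], clip_norm=1.0)
no_noise = dp_aggregate(client_updates, 1.0, noise_multiplier=0.0, rng=rng)
private = dp_aggregate(client_updates, 1.0, noise_multiplier=1.0, rng=rng)
```

Only the aggregated, noised vector leaves the aggregation step, so the server never needs an individual client's raw update in the clear in this simplified picture.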
### Adapting LLMs for Mobile Proofreading

* To support advanced features like Gboard's "Proofread," researchers developed a "Synthesize-then-Adapt" pipeline specifically for error correction.
* LLMs generate synthetic "corrupted" text to simulate common mobile typing errors, providing the necessary training pairs (error/correction) that are difficult to find in public datasets.
* Federated learning is then used to adapt these error-correction models to specific app domains (such as messaging or email) using on-device signals, ensuring the model understands the specific context of the user's typing.

The success of these techniques in Gboard demonstrates that synthetic data can effectively replace or augment private data throughout the machine learning lifecycle. For developers working with sensitive user information, adopting a "synthetic-first" approach combined with federated learning provides a scalable path to model improvement that adheres to the core principles of data minimization and anonymization.

line

Milvus: Building a Large-Scale

LINE VOOM transitioned its recommendation system from a batch-based offline process to a real-time infrastructure to solve critical content freshness issues. By adopting Milvus, an open-source vector database, the team enabled the immediate indexing and searching of new video content as soon as it is uploaded. This implementation ensures that time-sensitive posts are recommended to users without the previous 24-hour delay, significantly enhancing user engagement.

### Limitations of the Legacy Recommendation System

* The original system relied on daily offline batch processing for embedding generation and similarity searches.
* New content, such as holiday greetings or trending sports clips, suffered from a "lack of immediacy," often taking up to a full day to appear in user feeds.
* To improve user experience, the team needed to shift from offline candidate pools to an online system capable of real-time Approximate Nearest Neighbor (ANN) searches.

### Selecting Milvus as the Vector Database

* The team evaluated Milvus and Qdrant based on performance, open-source status, and on-premise compatibility.
* Milvus was selected due to its superior performance, handling 2,406 requests per second compared to Qdrant's 326, with lower query latency (1ms vs 4ms).
* Key architectural advantages of Milvus included the separation of storage and computing, support for both stream and batch inserts, and a diverse range of supported in-memory index types.

### Reliability Verification via Chaos Testing

* Given the complexity of Milvus clusters, the team performed chaos testing by intentionally injecting failures like pod kills and scaling events.
* Tests revealed critical vulnerabilities: killing the `Querycoord` led to collection release and search failure, while losing the `Etcd` quorum caused total metadata loss.
* These findings highlighted the need for robust high-availability (HA) configurations to prevent service interruptions during component failures.
### High Availability (HA) Implementation Strategies

* **Collection-Level HA:** To prevent search failures during coordinator issues, the team implemented a dual-writing system where embeddings are recorded in two separate collections simultaneously.
* **Alias Switching:** Client applications use an "alias" to reference collections; if the primary collection becomes unavailable, the system instantly switches the alias to the backup collection to minimize downtime.
* **Coordinator-Level HA:** To eliminate single points of failure, coordinators (such as `Indexcoord`) were configured in an Active-Standby mode, ensuring a backup is always ready to take over management tasks.

To successfully deploy a large-scale real-time recommendation engine, it is critical to select a vector database that decouples storage from compute and to implement multi-layered high-availability strategies, such as dual-collection writing and active-standby coordinators, to ensure production stability.
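The dual-writing and alias-switching strategies above can be sketched with a toy in-memory store. This does not use the Milvus client API; `InMemoryVectorStore`, `DualWriter`, and the collection names are hypothetical stand-ins that only mirror the control flow.

```python
class InMemoryVectorStore:
    """Toy stand-in for a vector database (not the Milvus client API)."""

    def __init__(self):
        self.collections: dict[str, dict[str, list[float]]] = {}
        self.aliases: dict[str, str] = {}
        self.unavailable: set[str] = set()  # simulates released collections

    def create_collection(self, name: str) -> None:
        self.collections[name] = {}

    def set_alias(self, alias: str, name: str) -> None:
        self.aliases[alias] = name

    def insert(self, name: str, key: str, vector: list[float]) -> None:
        self.collections[name][key] = vector

    def get(self, alias: str, key: str) -> list[float]:
        name = self.aliases[alias]
        if name in self.unavailable:
            raise RuntimeError(f"collection {name} is unavailable")
        return self.collections[name][key]


class DualWriter:
    """Writes to two collections; reads through an alias with failover."""

    def __init__(self, store, alias: str, primary: str, backup: str):
        self.store, self.alias = store, alias
        self.primary, self.backup = primary, backup
        for name in (primary, backup):
            store.create_collection(name)
        store.set_alias(alias, primary)

    def write(self, key: str, vector: list[float]) -> None:
        # Collection-level HA: every embedding lands in both collections.
        self.store.insert(self.primary, key, vector)
        self.store.insert(self.backup, key, vector)

    def read(self, key: str) -> list[float]:
        try:
            return self.store.get(self.alias, key)
        except RuntimeError:
            # Alias switching: repoint the alias, then retry once.
            self.store.set_alias(self.alias, self.backup)
            return self.store.get(self.alias, key)


store = InMemoryVectorStore()
writer = DualWriter(store, alias="voom", primary="videos_a", backup="videos_b")
writer.write("v1", [0.1, 0.2])
store.unavailable.add("videos_a")  # simulate losing the primary
result = writer.read("v1")
```

Because clients only ever name the alias, the failover is invisible to them; the cost is doubled write traffic and storage.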

google

LSM-2: Learning from incomplete wearable sensor data

LSM-2 introduces a paradigm shift in processing wearable sensor data by treating naturally occurring data gaps as inherent features rather than errors to be corrected. By utilizing the Adaptive and Inherited Masking (AIM) framework, the model learns directly from fragmented, real-world data streams without the need for biased imputation or data-discarding filters. This approach allows LSM-2 to achieve state-of-the-art performance in health-related classification and regression tasks, maintaining robustness even when sensors fail or data is highly interrupted.

## The Challenge of Pervasive Missingness

* Real-world wearable data is almost never continuous; factors such as device charging, motion artifacts, and battery-saving modes create frequent "missingness."
* Traditional self-supervised learning models require complete data, forcing researchers to use imputation (which can introduce artificial bias) or aggressive filtering that discards over 90% of potentially useful samples.
* In a dataset of 1.6 million day-long windows, research found that not a single sample had 0% missingness, highlighting the impracticality of training only on complete datasets.

## Adaptive and Inherited Masking (AIM)

* AIM extends the Masked Autoencoder (MAE) framework by treating "inherited" masks (naturally occurring gaps) and "artificial" masks (training objectives) as equivalent.
* The framework utilizes a dual masking strategy: it employs token dropout on a fixed ratio of tokens to ensure computational efficiency during encoding.
* To handle the unpredictable and variable nature of real-world gaps, AIM uses attention masking within the transformer blocks for any remaining masked tokens.
* During evaluation and fine-tuning, the model relies solely on attention masking to navigate naturally occurring gaps, allowing for accurate physiological modeling without filling in missing values.
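The dual masking strategy described above can be illustrated at the level of token indices. This is a simplified sketch, not the LSM-2 code: `aim_masks`, its count-based parameters, and the return layout are assumptions chosen to show how inherited and artificial masks are merged, how a fixed number of masked tokens is dropped, and how the rest are attention-masked.

```python
import random


def aim_masks(inherited_missing: set[int], n_tokens: int,
              n_artificial: int, n_drop: int, rng: random.Random):
    """Build token-dropout and attention masks for one sequence.

    inherited_missing: indices that are naturally missing in the data.
    n_artificial: how many observed tokens to mask as a training target.
    n_drop: fixed number of masked tokens removed before the encoder.
    Returns (kept_indices, attend_flags, artificial_indices).
    """
    observed = [i for i in range(n_tokens) if i not in inherited_missing]
    # Artificial masks are drawn only from tokens that were observed.
    artificial = set(rng.sample(observed, min(n_artificial, len(observed))))
    # Inherited and artificial masks are treated as equivalent.
    masked = sorted(inherited_missing | artificial)
    # Token dropout: remove a fixed count of masked tokens for efficiency.
    dropped = set(rng.sample(masked, min(n_drop, len(masked))))
    kept = [i for i in range(n_tokens) if i not in dropped]
    # Remaining masked tokens stay in the sequence but are not attended to.
    masked_set = set(masked)
    attend = [i not in masked_set for i in kept]
    return kept, attend, artificial


rng = random.Random(0)
kept, attend, artificial = aim_masks(
    inherited_missing={1, 2, 7}, n_tokens=10,
    n_artificial=3, n_drop=4, rng=rng,
)
```

With 3 inherited and 3 artificial masks, 4 masked tokens are dropped and the remaining 2 are handled by the attention mask, so the encoder sees a sequence of 6 tokens regardless of where the natural gaps fall. At evaluation time, the equivalent call would use `n_artificial=0` and `n_drop=0`, leaving only attention masking.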
## Scale and Training Architecture

* LSM-2 was trained on a massive dataset comprising 40 million hours of de-identified wearable data from more than 60,000 participants using Fitbit and Google Pixel devices.
* The model learns to understand underlying physiological structures by reconstructing masked segments across multimodal inputs, including heart signals, sleep patterns, and activity levels.
* Because it is trained on fragmented data, the resulting foundation model is significantly more resilient to sensor dropouts in downstream tasks like hypertension prediction or stress monitoring.

LSM-2 demonstrates that foundation models for health should be built to embrace the messiness of real-world environments. By integrating missingness directly into the self-supervised learning objective, developers can bypass the computational and statistical overhead of imputation while building more reliable diagnostic and monitoring tools.

discord

*FLAILS AROUND* SUMMER SPECIAL! JOIN NITRO, GET AN EXTRA MONTH OF NITRO ON US!

Discord is launching a limited-time "Subscriber Speedway" promotion to incentivize new users to join its premium Nitro service. From now until July 15th, 2025, eligible individuals who sign up for a monthly Nitro membership will receive a second month at no additional cost. This "buy one, get one free" offer effectively provides 60 days of premium features for the price of a standard 30-day subscription.

### Discord Nitro Feature Set

Discord Nitro offers a suite of cosmetic and functional upgrades designed to enhance the standard user experience. Key technical and social benefits include:

* **Enhanced Expression:** Access to a wider array of custom emojis and stickers across all servers.
* **Profile Personalization:** Additional tools and assets for customizing user profiles.
* **Performance Upgrades:** Higher-quality gameplay streaming capabilities for sharing screens with friends.
* **Increased Data Limits:** Expanded file-sharing capacities, allowing for the transmission of larger assets and media.

### Summer Promotion Terms

The current "Subscriber Speedway" deal is structured to attract users who are not currently enrolled in a Nitro plan. Specific details include:

* **Duration:** The promotion is active through July 15th, 2025.
* **Eligibility:** The offer is targeted at users who do not have an active Nitro membership at the time of purchase.
* **Subscription Model:** The deal applies specifically to those starting a new monthly Nitro membership, granting a full second month as a bonus.

Users interested in testing Discord’s premium features should initiate their monthly membership before the July 15th deadline to maximize the value of the 60-day promotional window.