predictive-modeling

6 posts

google

Hard-braking events as indicators of road segment crash risk

Google Research has established a statistically significant correlation between hard-braking events (HBEs) collected via Android Auto and actual road crash rates. By using HBEs as a leading indicator rather than relying on sparse, lagging historical crash data, researchers can proactively identify high-risk road segments with much greater speed and spatial granularity. This validation suggests that connected vehicle data can serve as a scalable proxy for traditional safety assessments.

### Data Density and Scalability

* HBEs, defined as a forward deceleration exceeding 3 m/s², provide a signal that is 18 times denser than reported crash data.
* While crashes are statistically rare and a specific road segment can take years to accumulate a valid safety profile, HBEs offer a continuous stream of information.
* This high density allows for the creation of a comprehensive "safety map" that includes local and arterial roads where crash reporting is often inconsistent or sparse.

### Statistical Validation of HBEs

* Researchers employed negative binomial regression models to analyze 10 years of public crash data from California and Virginia alongside anonymized HBE data (see the modeling sketch after this summary).
* The models controlled for confounding factors such as traffic volume, segment length, road type (local, arterial, highway), and infrastructure characteristics like slope and lane changes.
* The results showed a consistent positive association between HBE frequency and crash rates across all road types, indicating that HBEs are a reliable surrogate for risk regardless of geography.

### High-Risk Identification Case Study

* An analysis of a freeway merge connecting Highway 101 and Highway 880 in California served as a practical validation of the metric.
* This segment's HBE rate was 70 times the state average, consistent with its historical record of roughly one crash every six weeks.
* The HBE signal flagged this location as being in the top 1% of high-risk segments without needing years of collision reports to confirm the danger, demonstrating its utility in identifying "black spots" early.

### Real-World Application and Road Management

* Validating HBEs transforms raw sensor data into a trusted tool for urban planners and road authorities performing network-wide safety assessments.
* This approach allows for proactive infrastructure interventions, such as adjusting signage or merge patterns, before fatalities or injuries occur.
* The findings support integrating connected vehicle insights into platforms like Google Maps to help authorities manage road safety more dynamically.
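To make the statistical approach concrete, here is a minimal sketch of a negative binomial crash-frequency regression on synthetic per-segment data, using `statsmodels`. All variable names, coefficient values, and the dispersion setting are illustrative assumptions, not details from the Google study.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_segments = 500

# Synthetic per-segment covariates (all values illustrative).
hbe_rate = rng.gamma(2.0, 1.0, n_segments)      # hard-braking events per 1,000 vehicle-km
log_aadt = rng.normal(8.0, 1.0, n_segments)     # log annual average daily traffic
seg_len_km = rng.uniform(0.2, 2.0, n_segments)  # segment length, used as exposure

# Simulate overdispersed crash counts with a positive HBE association.
mu = np.exp(-6.0 + 0.3 * hbe_rate + 0.5 * log_aadt) * seg_len_km
k = 2.0  # NB shape: smaller k -> more overdispersion
crashes = rng.negative_binomial(k, k / (k + mu))

# Negative binomial GLM: crash count ~ HBE rate + traffic, with length as exposure.
X = sm.add_constant(np.column_stack([hbe_rate, log_aadt]))
model = sm.GLM(crashes, X,
               family=sm.families.NegativeBinomial(alpha=1.0 / k),
               exposure=seg_len_km)
result = model.fit()
print(result.summary())  # the HBE-rate coefficient (x1) should be significantly positive
```

A positive, significant coefficient on the HBE-rate term after controlling for traffic and exposure is exactly the kind of association the post describes.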

google

Reducing EV range anxiety: How a simple AI model predicts port availability

Google Research has developed a lightweight AI model that predicts the probability of EV charging port availability at specific future intervals, directly addressing the "range anxiety" experienced by electric vehicle drivers. By co-designing the model with its deployment infrastructure, researchers found that a simple linear regression approach outperformed more complex architectures such as neural networks and decision trees. The resulting system effectively predicts availability changes during high-turnover periods, providing more reliable navigation and planning data than traditional "no-change" assumptions.

### Model Architecture and Feature Selection

* The development team prioritized a minimal feature set to ensure low-latency deployment and high speed in real-world navigational applications.
* After testing various architectures, a straightforward linear regression model was selected for its robustness and superior performance on this specific predictive task.
* The model was trained on real-time availability data from diverse geographical regions, specifically California and Germany, with an emphasis on larger charging stations that reflect high-traffic usage patterns.

### Temporal Feature Weights and Occupancy Trends

* The model uses the hour of the day as a primary feature, treating each hour as an independent variable to capture daily cycles (a minimal sketch of this encoding follows this summary).
* Learned numerical weights dictate the predicted rate of occupancy change: positive weights indicate ports becoming occupied (e.g., during the morning rush), while negative weights indicate ports being freed up (e.g., during evening hours).
* The system deviates from the current occupancy state only when the predicted change rate is statistically significant or when a station's large size amplifies the likelihood of a status change.

### Performance Benchmarking and Validation

* The model was evaluated against a "Keep Current State" baseline, which assumes future availability will be identical to the present status. This is a difficult baseline to beat, since port status remains unchanged roughly 90% of the time over 30-minute windows.
* Accuracy was measured using mean squared error (MSE) and mean absolute error (MAE) over 30-minute and 60-minute time horizons across 100 randomly selected stations.
* Testing confirmed that the linear regression model provides its greatest value during infrequent but critical moments of high turnover, successfully identifying when a station is likely to become full or available.

The success of this model demonstrates that sophisticated deep learning is not always the optimal solution for infrastructure challenges. By combining intuitive real-world logic, such as driver schedules and station capacity, with simple machine learning techniques, developers can create highly efficient tools that significantly improve the EV user experience without requiring massive computational overhead.
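As an illustration of the hour-of-day encoding described above, here is a minimal sketch that fits a linear regression on synthetic occupancy data with one indicator feature per hour. The feature set, drift pattern, and station sizes are assumptions for demonstration, not the production model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 10_000

hour = rng.integers(0, 24, n)        # hour of day for each observation
n_ports = rng.integers(2, 24, n)     # station size (number of ports)
occ_now = rng.uniform(0.0, 1.0, n)   # current occupied fraction

# Synthetic daily cycle: ports tend to fill mid-day and free up overnight.
drift = 0.05 * np.sin(2 * np.pi * (hour - 8) / 24)
occ_next = np.clip(occ_now + drift + rng.normal(0.0, 0.02, n), 0.0, 1.0)

# One indicator feature per hour, plus current occupancy and station size.
hour_onehot = np.eye(24)[hour]
X = np.column_stack([hour_onehot, occ_now, n_ports])

model = LinearRegression(fit_intercept=False).fit(X, occ_next)

# Learned per-hour weights: positive -> filling up, negative -> freeing up.
for h, w in enumerate(model.coef_[:24]):
    print(f"hour {h:02d}: weight {w:+.4f}")
```

Because the current occupancy enters as its own feature (with a coefficient near 1), the per-hour weights capture the expected change, which matches the post's description of positive weights during filling periods and negative weights when ports free up.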

google

Forecasting the future of forests with AI: From counting losses to predicting risk

Research from Google DeepMind and Google Research introduces ForestCast, a deep-learning framework designed to shift forest management from retrospective loss monitoring to proactive risk forecasting. Using vision transformers and pure satellite data, the team has developed a scalable method for predicting future deforestation that matches or exceeds the accuracy of traditional models dependent on inconsistent manual inputs. This approach provides a repeatable, future-proof benchmark for protecting biodiversity and mitigating climate change on a global scale.

### Limitations of Traditional Forecasting

* Existing state-of-the-art models rely on specialized geospatial maps, such as infrastructure development, road networks, and regional economic indicators.
* These traditional inputs are often patchy and inconsistent across countries, requiring manual assembly that is difficult to replicate globally.
* Manual data sources are not future-proof; they go out of date quickly, with no guarantee of regular updates, unlike continuous satellite streams.

### A Scalable Pure-Satellite Architecture

* The ForestCast model adopts a "pure satellite" approach, using only raw inputs from Landsat and Sentinel-2 satellites.
* The architecture is built on vision transformers (ViTs) that process an entire tile of pixels in a single pass to capture critical spatial context and landscape-level trends (see the sketch after this summary).
* The model incorporates a satellite-derived "change history" layer, which identifies previously deforested pixels and the specific year the loss occurred.
* By avoiding socio-political and infrastructure maps, the method can be applied consistently to any region on Earth, allowing meaningful cross-regional comparisons.

### Key Findings and Benchmark Release

* The research indicates that change history is the most information-dense input: a model trained on this data alone performs almost as well as models using raw multi-spectral data.
* The model successfully predicts tile-to-tile variation in deforestation amounts and identifies the specific pixels most likely to be cleared next.
* Google has released the training and evaluation data as a public benchmark dataset, focusing initially on Southeast Asia, so the machine learning community can verify and improve upon the results.

The release of ForestCast provides a template for scaling predictive modeling to Latin America, Africa, and boreal latitudes. Conservationists and policymakers should use these forecasting tools to move beyond counting historical losses and instead direct resources toward "frontline" areas where the model identifies imminent risk of habitat conversion.
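To illustrate the tile-level ViT idea, here is a minimal PyTorch sketch of a patch-based transformer that maps a multi-band satellite tile plus one change-history channel to per-pixel clearing probabilities. Every dimension, layer count, and the channel layout are illustrative assumptions; this is not the ForestCast architecture.

```python
import torch
import torch.nn as nn

class TinyForestViT(nn.Module):
    """Toy vision-transformer sketch for per-pixel deforestation risk.

    Input is a stack of satellite bands plus a 'change history' channel
    (e.g., years since observed loss, 0 if intact). All sizes are
    illustrative, not the ForestCast configuration.
    """
    def __init__(self, in_ch=11, patch=16, dim=128, depth=4, heads=4, tile=256):
        super().__init__()
        self.patch = patch
        n_patches = (tile // patch) ** 2
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Per-patch logits, later rearranged back to per-pixel risk.
        self.head = nn.Linear(dim, patch * patch)

    def forward(self, x):                                  # x: (B, in_ch, tile, tile)
        B, side = x.shape[0], x.shape[-1] // self.patch
        tokens = self.embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = self.encoder(tokens + self.pos)           # full-tile self-attention
        logits = self.head(tokens)                         # (B, N, patch*patch)
        logits = logits.view(B, side, side, self.patch, self.patch)
        logits = logits.permute(0, 1, 3, 2, 4).reshape(B, x.shape[-1], x.shape[-1])
        return torch.sigmoid(logits)                       # per-pixel clearing probability

# 10 spectral bands + 1 change-history channel on a 256x256 tile.
model = TinyForestViT()
tile = torch.randn(2, 11, 256, 256)
print(model(tile).shape)  # torch.Size([2, 256, 256])
```

The key design point the post emphasizes is visible here: because self-attention spans all patches in the tile, each pixel's prediction can draw on landscape-level context such as nearby historical clearing.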

google

Insulin resistance prediction from wearables and routine blood biomarkers

Researchers at Google have developed a novel machine learning approach to predict insulin resistance (IR) by integrating wearable device data with routine blood biomarkers. The method aims to provide a scalable, less invasive alternative to gold-standard tests such as the euglycemic insulin clamp, or to HOMA-IR assessments that require fasting insulin measurements. The study demonstrates that combining digital biomarkers with common laboratory results can effectively identify individuals at risk for type 2 diabetes, particularly within high-risk populations.

### Barriers to Early Diabetes Screening

* Insulin resistance is a primary precursor to approximately 70% of type 2 diabetes cases, yet it often remains undetected until the disease has progressed.
* Current diagnostic standards are frequently omitted from routine check-ups due to high costs, invasiveness, and the requirement for insulin blood tests that are not standard practice.
* Early detection is vital because insulin resistance is often reversible through lifestyle modifications, making accessible screening tools a high priority for preventative medicine.

### The WEAR-ME Multimodal Dataset

* The research utilized the "WEAR-ME" study, which collected data from 1,165 remote participants across the U.S. via the Google Health Studies app.
* Digital biomarkers were gathered from Fitbit and Google Pixel Watch devices, tracking metrics such as resting heart rate, step counts, and sleep patterns.
* Clinical data was provided through a partnership with Quest Diagnostics, focusing on routine blood biomarkers like fasting glucose and lipid panels, supplemented by participant surveys on diet, fitness, and demographics.

### Predictive Modeling and Performance

* Deep neural network models were trained to estimate HOMA-IR scores from different combinations of the collected data streams (a toy version of this setup is sketched after this summary).
* While models using only wearables and demographics achieved an area under the receiver operating characteristic curve (auROC) of 0.70, adding fasting glucose data boosted the auROC to 0.78.
* The most comprehensive models, which combined wearables, demographics, and full routine blood panels, achieved the highest accuracy across the study population.
* Performance was notably strong in high-risk sub-groups, specifically individuals with obesity or sedentary lifestyles.

### AI-Driven Interpretation and Literacy

* To assist with data translation, the researchers developed a prototype "Insulin Resistance Literacy and Understanding Agent" built on the Gemini family of large language models.
* The agent is designed to help users interpret their IR risk predictions and to provide personalized, research-backed educational content.
* This AI integration aims to bridge the gap between raw data and actionable health strategies, though it is currently intended for informational and research purposes only.

By leveraging ubiquitous wearable technology and existing clinical infrastructure, this approach offers a path toward proactive metabolic health monitoring. Integrating these models into consumer or clinical platforms could lower the barrier to early diabetes intervention and enable more personalized preventative care.
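Below is a toy sketch of the multimodal classification setup: a small neural network trained on synthetic stand-ins for wearable, demographic, and blood-panel features, scored with auROC. The feature names, the risk formula generating the labels, and the network size are all assumptions for illustration, not the study's models or data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2_000

# Hypothetical stand-ins for the three data streams.
resting_hr = rng.normal(65, 8, n)          # wearable
daily_steps = rng.normal(7_000, 3_000, n)  # wearable
age = rng.uniform(20, 70, n)               # demographics
bmi = rng.normal(27, 5, n)                 # demographics
glucose = rng.normal(95, 12, n)            # routine blood biomarker

# Synthetic label: "insulin resistant" if a noisy risk score crosses a threshold.
risk = (0.08 * (bmi - 27) + 0.05 * (glucose - 95)
        + 0.02 * (resting_hr - 65) - 1e-4 * (daily_steps - 7_000))
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-risk))).astype(int)

X = np.column_stack([resting_hr, daily_steps, age, bmi, glucose])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
)
clf.fit(X_tr, y_tr)
print("auROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```

Dropping the `glucose` column from `X` and refitting mimics the study's ablation comparing wearables-plus-demographics models against those that also see fasting glucose.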

coupang

Optimizing Logistics Inbound Process Using Machine Learning

Coupang has implemented a machine learning prediction system that optimizes its logistics inbound process by forecasting the number of trucks required for product deliveries. By analyzing historical logistics data and vendor characteristics, the system minimizes resource waste at fulfillment center docks and prevents operational delays caused by slot shortages. This data-driven approach ensures that limited dock slots are allocated efficiently, improving overall supply chain speed and reliability.

### Challenges in Inbound Logistics

* Fulfillment centers operate with a fixed number of docks for unloading and specific time slots assigned to each truck.
* Inaccurate predictions create a resource dilemma: under-estimating slots causes unloading delays and backlogs, while over-estimating leads to idle docks and wasted capacity.
* The goal was to move beyond manual estimation to an automated system that balances vendor requirements against actual facility throughput.

### Feature Engineering and Data Collection

* The team performed exploratory data analysis (EDA) on approximately 800,000 inbound records collected over two years.
* In-depth interviews with domain experts and logistics managers were conducted to identify hidden patterns and qualitative factors that influence truck requirements.
* Final feature sets were refined through feature engineering, focusing on vendor-specific behaviors and the physical characteristics of the products being delivered.

### LightGBM Implementation and Optimization

* The LightGBM algorithm was selected for its high performance on large datasets and its efficiency in handling categorical features (a minimal sketch follows this summary).
* The model uses a leaf-wise tree-growth strategy, which allows faster training and lower loss than traditional level-wise growth algorithms.
* Hyperparameters were optimized using Bayesian optimization, which finds effective model configurations more efficiently than traditional grid search.
* The trained model is integrated directly into the booking system, providing real-time truck-quantity recommendations to vendors during the application process.

### Operational Trade-offs and Results

* The system must navigate the trade-off between under-prediction (which risks logistical bottlenecks) and over-prediction (which risks resource waste).
* By automating the prediction of necessary slots, Coupang has reduced the manual workload for vendors and improved the accuracy of fulfillment center scheduling.
* This optimization allows more products to be processed in a shorter time frame, directly contributing to faster delivery times for the end customer.

By replacing manual estimates with a LightGBM-based predictive model, Coupang has synchronized vendor deliveries with fulfillment center capacity. This shift not only maximizes dock utilization but also builds a more resilient and scalable inbound supply chain.
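As a minimal sketch of the modeling step, here is a LightGBM regressor trained on synthetic inbound records with native categorical handling. The column names, distributions, and the rule generating the truck counts are illustrative assumptions, not Coupang's features or data.

```python
import lightgbm as lgb
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical inbound records; names and distributions are illustrative.
df = pd.DataFrame({
    "vendor_id": rng.integers(0, 500, n),
    "product_category": rng.integers(0, 30, n),
    "total_volume_m3": rng.gamma(3.0, 20.0, n),
    "total_weight_kg": rng.gamma(3.0, 400.0, n),
    "sku_count": rng.integers(1, 200, n),
})
# Synthetic target: trucks needed, driven mostly by shipment volume.
df["trucks"] = np.ceil(df["total_volume_m3"] / 60.0
                       + rng.normal(0.0, 0.3, n)).clip(lower=1)

# LightGBM consumes pandas 'category' columns natively; no one-hot encoding needed.
for col in ["vendor_id", "product_category"]:
    df[col] = df[col].astype("category")

X, y = df.drop(columns="trucks"), df["trucks"]
model = lgb.LGBMRegressor(num_leaves=63, learning_rate=0.05, n_estimators=300)
model.fit(X, y)
print(model.predict(X.head()))
```

The `num_leaves` parameter controls LightGBM's leaf-wise growth, the strategy the post credits for its speed and loss advantage over level-wise algorithms.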

coupang

Optimizing the inbound process with a machine learning model

Coupang optimized its fulfillment center inbound process by implementing a machine learning model that predicts the number of delivery trucks and dock slots required for vendor shipments. By moving away from manual estimates, the system minimizes resource waste from over-allocation while preventing processing delays caused by under-prediction. This automated approach ensures that the limited capacity of fulfillment center docks is utilized with maximum efficiency.

### The Challenges of Dock Slot Allocation

* Fulfillment centers operate with a fixed number of hourly slots, each representing the time and space a single truck occupies at a dock to unload goods.
* Inaccurate slot forecasting creates a two-sided risk: under-prediction leads to logistical bottlenecks and delivery delays, while over-prediction results in idle docks and wasted operational overhead.
* The diversity of vendor behaviors and product types makes manual estimation of truck requirements highly inconsistent across the supply chain.

### Predictive Modeling and Feature Engineering

* Coupang used years of historical logistics data to extract features that influence truck counts, including product dimensions, categories, and vendor-specific shipment patterns.
* The system employs the LightGBM algorithm, a gradient-boosting framework selected for its high performance and ability to handle large-scale tabular logistics data.
* Hyperparameter tuning is managed via Bayesian optimization, which efficiently searches the parameter space to minimize prediction error (a sketch using a sequential model-based tuner follows this summary).
* The model accounts for the inherent trade-off between under-prediction and over-prediction, prioritizing a balance that maintains high throughput without straining labor resources.

### System Integration and Real-time Processing

* The trained ML model is integrated directly into the inbound reservation system, providing vendors with an immediate prediction of required slots during the request process.
* By automating the truck-count calculation, the system removes the burden of estimation from vendors and ensures consistency across fulfillment centers.
* This integration allows Coupang to adjust its dock capacity planning dynamically, based on real-time data rather than static historical averages.

To maximize logistics efficiency, organizations should leverage granular product data and historical vendor behavior to automate capacity planning. Integrating predictive models directly into the reservation workflow ensures that data-driven insights are applied at the point of action, reducing human error and resource waste.
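The post does not say which Bayesian optimization library Coupang used; the sketch below substitutes Optuna's default TPE sampler, a sequential model-based method in the same family, to tune a LightGBM regressor on synthetic data. The search space and stand-in features are assumptions for illustration.

```python
import lightgbm as lgb
import numpy as np
import optuna
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 8))                             # stand-in shipment features
y = np.ceil(np.exp(X[:, 0]) + rng.gamma(2.0, 1.0, 5_000))   # stand-in truck counts

def objective(trial):
    # Search space is illustrative; tune the ranges to your own data.
    params = {
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "n_estimators": 200,
    }
    model = lgb.LGBMRegressor(**params)
    # Maximize negative MSE, i.e., minimize squared prediction error.
    return cross_val_score(model, X, y, cv=3,
                           scoring="neg_mean_squared_error").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```

Unlike grid search, each trial's parameters are proposed from a model of past results, which is what lets Bayesian-style tuners reach good configurations in far fewer evaluations.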