SensorLM: Learning the language of wearable sensors

SensorLM is a new family of foundation models designed to bridge the gap between high-dimensional wearable sensor data and natural language descriptions. By training on a massive dataset of nearly 60 million hours of de-identified health data, the models learn to interpret complex physiological signals and provide meaningful context for human activities. The research demonstrates that pairing multimodal sensor signals with language models enables capabilities such as zero-shot activity recognition and automated health captioning, on which SensorLM significantly outperforms general-purpose large language models.

Dataset Scale and Automated Annotation

  • The models were pre-trained on an unprecedented 59.7 million hours of multimodal sensor data collected from over 103,000 individuals across 127 countries.
  • To overcome the high cost of manual annotation, researchers developed a hierarchical pipeline that automatically generates text descriptions by calculating statistics and identifying trends within the raw sensor streams (a simplified sketch follows this list).
  • Data was sourced from Fitbit and Pixel Watch devices, representing nearly 2.5 million person-days of activity and health information.
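
The paper's annotation pipeline is not published here, so the following is a minimal sketch of the general idea: compute simple statistics and trends over a raw sensor window, then fill a text template with them. All function and variable names are illustrative assumptions, not the actual SensorLM code.

```python
import numpy as np

def caption_sensor_window(heart_rate: np.ndarray, steps: np.ndarray, minutes: int = 60) -> str:
    """Illustrative rule-based captioning: summarize a sensor window with
    basic statistics and trends, then render them as a text description."""
    hr_mean = float(np.mean(heart_rate))
    hr_trend = "rising" if heart_rate[-1] > heart_rate[0] else "falling or flat"
    total_steps = int(np.sum(steps))
    activity = "active" if total_steps > 1000 else "mostly sedentary"
    return (
        f"Over the last {minutes} minutes the user was {activity} ({total_steps} steps); "
        f"mean heart rate was {hr_mean:.0f} bpm and {hr_trend}."
    )

# Example with synthetic, per-minute data.
hr = np.random.normal(95, 8, size=60)
st = np.random.poisson(30, size=60)
print(caption_sensor_window(hr, st))
```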

Hybrid Training Architecture

  • SensorLM unifies two primary multimodal strategies: contrastive learning and generative pre-training.
  • Through contrastive learning, the model learns to discriminate between different states—such as a "light swim" versus a "strength workout"—by matching sensor segments to corresponding text descriptions.
  • The generative component allows the model to "speak" for the sensors, producing nuanced, context-aware natural language captions directly from high-dimensional biometric signals (a toy sketch combining both objectives follows this list).
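
As a rough illustration of how these two objectives can be combined, here is a toy PyTorch sketch of a CLIP-style contrastive term plus a generative next-token captioning term. The shapes, loss weighting, and function names are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(sensor_emb, text_emb, caption_logits, caption_tokens,
                temperature=0.07, alpha=0.5):
    """Toy combined objective: a CLIP-style contrastive term aligning sensor and
    text embeddings, plus a generative next-token term for the caption."""
    # Contrastive term: matched sensor/text pairs lie on the diagonal of the
    # similarity matrix; both directions (sensor->text, text->sensor) are scored.
    s = F.normalize(sensor_emb, dim=-1)              # (batch, dim)
    t = F.normalize(text_emb, dim=-1)                # (batch, dim)
    logits = s @ t.T / temperature                   # (batch, batch)
    targets = torch.arange(s.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.T, targets)) / 2

    # Generative term: cross-entropy over caption tokens, with logits assumed
    # to come from a decoder conditioned on the sensor encoding.
    generative = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),  # (batch*seq, vocab)
        caption_tokens.reshape(-1),                           # (batch*seq,)
    )
    return alpha * contrastive + (1 - alpha) * generative

# Example with random tensors (batch of 4, dim 128, 16-token captions, vocab 1000).
sensor_emb = torch.randn(4, 128)
text_emb = torch.randn(4, 128)
caption_logits = torch.randn(4, 16, 1000)
caption_tokens = torch.randint(0, 1000, (4, 16))
print(hybrid_loss(sensor_emb, text_emb, caption_logits, caption_tokens))
```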

Activity Recognition and Cross-Modal Capabilities

  • The model demonstrates state-of-the-art performance in zero-shot human activity recognition, accurately classifying 20 different activities without any task-specific fine-tuning (see the sketch after this list).
  • Its few-shot learning capabilities allow the model to adapt to new tasks or individual user patterns with only a handful of examples.
  • SensorLM facilitates cross-modal retrieval, enabling users or experts to find specific sensor patterns using natural language queries or to generate descriptions based on specific sensor inputs.
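
Zero-shot recognition in this style typically works by embedding candidate activity descriptions with the text encoder and picking the one closest to the sensor embedding. A minimal sketch, assuming pre-computed embeddings; the prompts and helper below are hypothetical, not from the paper.

```python
import torch
import torch.nn.functional as F

# Hypothetical candidate activity descriptions, assumed to be embedded
# with the model's text encoder.
ACTIVITY_PROMPTS = [
    "a sensor recording of a light swim",
    "a sensor recording of a strength workout",
    "a sensor recording of an outdoor run",
]

def zero_shot_classify(sensor_emb: torch.Tensor, text_embs: torch.Tensor) -> int:
    """Pick the activity whose text embedding has the highest cosine
    similarity to the sensor embedding."""
    s = F.normalize(sensor_emb, dim=-1)    # (dim,)
    t = F.normalize(text_embs, dim=-1)     # (num_activities, dim)
    return int((t @ s).argmax())           # index into ACTIVITY_PROMPTS

# Example with random embeddings standing in for real encoder outputs.
sensor_emb = torch.randn(128)
text_embs = torch.randn(len(ACTIVITY_PROMPTS), 128)
print(ACTIVITY_PROMPTS[zero_shot_classify(sensor_emb, text_embs)])
```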

Generative Health Captioning

  • Beyond simple classification, the model can generate hierarchical captions that describe the statistical, structural, and semantic dimensions of a user’s data.
  • Experimental results using metrics like BERTScore show that SensorLM produces captions that are more factually correct and coherent than those created by powerful non-specialist LLMs (a small BERTScore example follows this list).
  • This capability allows for the translation of abstract data points, such as heart rate variability or step counts, into readable summaries that explain the "why" behind physiological changes.
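
BERTScore compares generated and reference captions using contextual embeddings rather than exact word overlap. A small example using the open-source bert-score package; the caption pair below is made up for illustration.

```python
# pip install bert-score
from bert_score import score

# Made-up caption pair for illustration only.
generated = ["The user completed a 30-minute outdoor run with an elevated heart rate."]
reference = ["A half-hour run outdoors; heart rate stayed in the cardio zone."]

# score() returns precision, recall, and F1 tensors, one entry per caption pair.
P, R, F1 = score(generated, reference, lang="en")
print(f"BERTScore F1: {F1.mean():.3f}")
```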

By providing a framework where wearable data can be understood through the lens of human language, SensorLM paves the way for more intuitive and personalized health monitoring. This technology holds the potential to transform raw biometric streams into actionable insights, helping users better understand the relationship between their activities and their overall physical well-being.