google

Teaching machines the language of biology: Scaling large language models for next-generation single-cell analysis

Cell2Sentence-Scale (C2S-Scale) is a new family of open-source large language models designed to work with complex single-cell transcriptomic data once it has been converted into text. By representing gene expression profiles as "cell sentences," the framework allows researchers to use general-purpose LLM architectures to "read" and "write" biological information. This approach simplifies single-cell analysis: conversational queries and automated data interpretation can stand in for workflows that previously required specialized tools and expert users.

The Cell2Sentence Mapping Method

  • Translates single-cell RNA sequencing (scRNA-seq) measurements into sequences of text by listing gene names in order of descending expression level, so the most highly expressed genes come first.
  • Enables the integration of cellular data with text-based biological context, such as cell types, experimental metadata, and scientific literature.
  • Leverages the existing vocabulary of biology (gene names and functions) to make high-dimensional data interpretable by standard language model tokenizers.
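
The mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the C2S-Scale implementation: the function name `cell_to_sentence`, the `top_k` cutoff, and the choice to drop genes with zero counts are all assumptions made for the example.

```python
def cell_to_sentence(expression, gene_names, top_k=None):
    """Turn one cell's expression vector into a space-separated 'cell sentence'.

    Hypothetical helper: genes are listed from highest to lowest expression,
    and genes with no detected expression are left out (an assumption).
    """
    if len(expression) != len(gene_names):
        raise ValueError("expression and gene_names must have the same length")
    # sorted() is stable, so genes with equal expression keep their input order.
    order = sorted(range(len(expression)), key=lambda i: -expression[i])
    ranked = [gene_names[i] for i in order if expression[i] > 0]
    if top_k is not None:
        ranked = ranked[:top_k]
    return " ".join(ranked)


# Toy counts for a single cell (illustrative values, not real data).
genes = ["CD3E", "GAPDH", "MS4A1", "ACTB"]
counts = [5, 40, 0, 12]
sentence = cell_to_sentence(counts, genes)
print(sentence)  # GAPDH ACTB CD3E
```

Because the output is ordinary text made of gene symbols, it can be concatenated with metadata or free-text context before tokenization.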

C2S-Scale Model Architecture and Training

  • Built upon Google’s Gemma open model family, maintaining the original architecture to benefit from existing scalability and infrastructure.
  • Trained on a dataset exceeding 1 billion tokens derived from real-world transcriptomic data and biological metadata.
  • Features a range of model sizes from 410 million to 27 billion parameters, allowing researchers to choose between computational efficiency for exploratory work and high performance for complex tasks.
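
To make the efficiency trade-off between model sizes concrete, the following back-of-the-envelope sketch estimates the memory needed just to hold the weights at the two ends of the size range. The 16-bit (2 bytes per parameter) storage format is an assumption, and the estimate ignores activations, caches, and optimizer state, so real requirements will be higher.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=2):
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes).

    bytes_per_parameter=2 assumes 16-bit weights; this is an illustrative
    default, not a statement about how C2S-Scale checkpoints are stored.
    """
    return num_parameters * bytes_per_parameter / 1e9


# Smallest and largest sizes mentioned in the text.
sizes = {"410M": 410e6, "27B": 27e9}
for name, n_params in sizes.items():
    print(f"{name}: ~{weight_memory_gb(n_params):.2f} GB of weights")
```

Under these assumptions the smallest model needs under 1 GB for its weights while the largest needs roughly 54 GB, which is why the smaller variants suit exploratory work on modest hardware.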

Functional Applications in Biology

  • Conversational Querying: Researchers can interact with data through natural language to ask specific questions, such as predicting how a T cell might respond to a particular cancer therapy.
  • Automated Interpretation: The models can generate biological summaries of experiments, describing everything from individual cell types to the characteristics of entire tissues.
  • Predictive Tasks: The framework handles diverse tasks including cell type annotation and the generation of synthetic cells or tissues for research simulations.
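
As a rough illustration of how a task such as cell type annotation might be posed in natural language, the sketch below wraps a cell sentence and optional metadata in a text prompt. The template wording and the helper name `build_annotation_prompt` are hypothetical; the source does not specify the prompt format the models were trained on.

```python
def build_annotation_prompt(cell_sentence, organism="Homo sapiens", tissue=None):
    """Compose a plain-text prompt asking for the cell type of one cell.

    Hypothetical template for illustration only; a real workflow should
    follow the prompt format documented with the released models.
    """
    context = f"Organism: {organism}."
    if tissue:
        context += f" Tissue: {tissue}."
    return (
        f"{context}\n"
        "The following genes are listed from highest to lowest expression "
        "in one cell:\n"
        f"{cell_sentence}\n"
        "Question: What is the most likely cell type?\n"
        "Answer:"
    )


prompt = build_annotation_prompt("GAPDH ACTB CD3E", tissue="blood")
print(prompt)
```

The resulting string could be passed to any text-generation interface; the model's continuation after "Answer:" would be read as the predicted label.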

Performance and Biological Scaling Laws

  • Research demonstrates that biological language models follow predictable scaling laws, where performance in tasks like cell type annotation improves as model size increases.
  • Larger models show superior gene overlap and semantic similarity scores when interpreting datasets compared to smaller versions.
  • Smaller models remain highly effective for parameter-efficient fine-tuning in resource-constrained environments.
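
One of the scores mentioned above, gene overlap, can be illustrated with a simple set comparison between a generated cell sentence and a reference one. The source does not define the metric, so the definition used here (the fraction of reference genes that also appear in the generated sentence) is an assumption, as is the function name `gene_overlap`.

```python
def gene_overlap(generated_sentence, reference_sentence, top_k=None):
    """Fraction of reference genes that also appear in the generated sentence.

    Assumed definition for illustration; optionally restricts both sentences
    to their first top_k genes before comparing.
    """
    generated = generated_sentence.split()
    reference = reference_sentence.split()
    if top_k is not None:
        generated, reference = generated[:top_k], reference[:top_k]
    reference_set = set(reference)
    if not reference_set:
        return 0.0
    return len(set(generated) & reference_set) / len(reference_set)


# Toy sentences (illustrative, not model output).
score = gene_overlap("GAPDH ACTB CD3E", "GAPDH CD3E CD2 ACTB")
print(score)  # 0.75
```

Comparing such a score across model sizes on a held-out set is one way the scaling trend described above could be checked.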

C2S-Scale is available as an open-source resource on GitHub and HuggingFace, offering a flexible toolkit for the research community to apply large language models to next-generation genomic discovery.