cell2sentence

1 posts

google

Teaching machines the language of biology: Scaling large language models for next-generation single-cell analysis (opens in new tab)

Cell2Sentence-Scale (C2S-Scale) is a new family of open-source large language models designed to transform complex single-cell transcriptomic data into a text-based format accessible to natural language processing. By representing gene expression profiles as "cell sentences," the framework allows researchers to use general-purpose LLM architectures to "read" and "write" biological information. This approach simplifies single-cell analysis, enabling conversational queries and automated data interpretation that were previously limited to specialized tools and expert users. ### The Cell2Sentence Mapping Method * Translates single-cell RNA sequencing (scRNA-seq) measurements into sequences of text by ordering gene names according to their expression levels. * Enables the integration of cellular data with text-based biological context, such as cell types, experimental metadata, and scientific literature. * Leverages the existing vocabulary of biology—gene names and functions—to make high-dimensional data interpretable by standard language model tokenizers. ### C2S-Scale Model Architecture and Training * Built upon Google’s Gemma open model family, maintaining the original architecture to benefit from existing scalability and infrastructure. * Trained on a dataset exceeding 1 billion tokens derived from real-world transcriptomic data and biological metadata. * Features a range of model sizes from 410 million to 27 billion parameters, allowing researchers to choose between computational efficiency for exploratory work and high performance for complex tasks. ### Functional Applications in Biology * **Conversational Querying:** Researchers can interact with data through natural language to ask specific questions, such as predicting how a T cell might respond to a particular cancer therapy. * **Automated Interpretation:** The models can generate biological summaries of experiments, describing everything from individual cell types to the characteristics of entire tissues. * **Predictive Tasks:** The framework handles diverse tasks including cell type annotation and the generation of synthetic cells or tissues for research simulations. ### Performance and Biological Scaling Laws * Research demonstrates that biological language models follow predictable scaling laws, where performance in tasks like cell type annotation improves as model size increases. * Larger models show superior gene overlap and semantic similarity scores when interpreting datasets compared to smaller versions. * Smaller models remain highly effective for parameter-efficient fine-tuning in resource-constrained environments. C2S-Scale is available as an open-source resource on GitHub and HuggingFace, offering a flexible toolkit for the research community to apply large language models to next-generation genomic discovery.