How the Lotfollahi lab created Tripso to interpret single-cell data

analysis tools Apr 13, 2026

Marie Moullet is a fourth-year PhD student working with Mo Lotfollahi and Roser Vento at the Wellcome Sanger Institute as part of an Open Targets project to develop single-cell foundation models and algorithms for analysing single-cell perturbation data.

We chatted to her about Tripso (Transformers for learning Representations of Interpretable gene Programs in Single-cell transcriptOmics), a method for single-cell data analysis that she co-developed with Amirhossein Vahidi, Tomoya Isobe and Carlo Leonardi. It goes beyond canonical cell type markers and annotations to understand cell states.

Who is Tripso for?

This project started from a question: how do you compare data from in vitro experiments with data from in vivo experiments?

In vitro models are the foundation of modern biology, but we know that they are incomplete compared to in vivo models. We initially developed Tripso to measure how similar in vitro cells are to the in vivo reference state and to find differences that we could follow up experimentally. Using this information, we can make in vitro experimental methods more representative of in vivo conditions.

Beyond that, Tripso can be applied to any experiment in which you have heterogeneous conditions that you want to compare, for example health and disease. It allows you to look at one biological axis of variation and see how much cell populations differ along it, and even identify whether a population present in one condition exists in the other.

What approach did you take?

We wanted a method that would analyse single-cell data through the lens of gene programmes. A gene programme is any set of biologically related genes, for example a set of genes involved in the response to TGFβ signalling. Gene programmes are a useful way to analyse complex, high-dimensional data: instead of interpreting signals from hundreds of individual genes one at a time, we work with the set of genes in a programme as a single unit.

However, existing methods for modelling single-cell RNA sequencing data learn a single representation for the cell’s whole transcriptome, while methods for modelling gene programme activity compress the information into a single number per programme. Both limit our ability to capture different axes of variation in complex experimental designs.

What if we could have multiple representations per cell, each corresponding to a specific, pre-defined gene programme?

By analogy, current large-scale single-cell foundation models analyse one sentence’s worth of gene ‘words’ per cell. We wanted to have multiple sentences per cell, where all the words in a sentence belong to a pre-defined gene programme. One sentence might represent TGFβ signalling, while another might represent stress response.
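As a rough illustration of the "multiple sentences per cell" idea, the sketch below splits one cell's expression vector into one gene list per programme. It is not Tripso's code: the function name `cell_to_sentences`, the placeholder gene names and the toy counts are all assumptions made for the example.

```python
import numpy as np

def cell_to_sentences(expression, gene_names, programmes):
    """Split one cell's expression vector into one 'sentence' per programme.

    Each sentence is a list of (gene, value) pairs restricted to the genes
    that belong to that programme and are measured in the dataset.
    """
    index = {gene: i for i, gene in enumerate(gene_names)}
    sentences = {}
    for name, members in programmes.items():
        present = [g for g in members if g in index]
        sentences[name] = [(g, float(expression[index[g]])) for g in present]
    return sentences

# Hypothetical toy data: gene names, memberships and counts are invented.
gene_names = ["geneA", "geneB", "geneC", "geneD", "geneE"]
programmes = {
    "tgfb_signalling": ["geneA", "geneB", "geneX"],  # geneX is not measured
    "stress_response": ["geneD", "geneE"],
}
cell = np.array([3.0, 0.0, 7.0, 1.0, 2.0])

sentences = cell_to_sentences(cell, gene_names, programmes)
```

Each entry of `sentences` could then be embedded separately, giving one representation per programme rather than one per cell.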

How did you solve this computationally?

When I first started this project, there was a lot of buzz about transformers. They are a useful architecture to model gene programmes: genes can be weighted differently depending on the other genes that are present in the cell. That flexibility makes a lot of sense with gene programmes, which tend to be context-dependent.
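To make the context-dependence concrete, here is a minimal NumPy sketch of generic scaled dot-product self-attention over a small set of gene tokens. It is a textbook formulation, not Tripso's architecture, and the embeddings and weight matrices are random placeholders. Changing one gene's embedding changes the attention weights that the other genes assign.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over gene tokens (rows of X)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 4  # embedding size (arbitrary for this sketch)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Two hypothetical cells with three gene tokens each; only the third differs.
cell_a = rng.normal(size=(3, d))
cell_b = cell_a.copy()
cell_b[2] = rng.normal(size=d)

out_a, w_a = self_attention(cell_a, Wq, Wk, Wv)
out_b, w_b = self_attention(cell_b, Wq, Wk, Wv)
```

The first gene's embedding is identical in both cells, yet its row of attention weights (`w_a[0]` versus `w_b[0]`) differs because the third gene changed.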

Additionally, we implemented a module to discover novel, data-driven gene programmes. We achieved this by training the model to reconstruct the original gene expression counts, and then identifying groups of genes that share similar attention patterns. This approach is important because it combines the interpretability of biologically informed processes guided by prior knowledge with the ability to capture data-specific patterns. Such flexibility is particularly valuable in complex tissues, where context-dependent effects may not be fully represented in existing gene programme databases.
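The interview does not say how genes with similar attention patterns are grouped, so the sketch below is only one plausible approach. It assumes a gene-by-gene attention matrix is available (for example, averaged over cells), compares rows by cosine similarity and links genes above a threshold. The function name, the threshold and the toy matrix are all assumptions.

```python
import numpy as np

def group_by_attention(attn, threshold=0.9):
    """Group genes whose attention profiles (rows of attn) are similar.

    Genes are linked when the cosine similarity of their rows is at least
    `threshold`; groups are the connected components of that graph.
    """
    norms = np.linalg.norm(attn, axis=1, keepdims=True)
    unit = attn / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    n = attn.shape[0]
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first walk over linked genes
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i, j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Hypothetical averaged attention matrix for four genes (values invented).
attn = np.array([
    [0.45, 0.45, 0.05, 0.05],
    [0.40, 0.50, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
    [0.05, 0.05, 0.45, 0.45],
])
labels = group_by_attention(attn)
```

In this toy matrix the first two genes attend mostly to each other, as do the last two, so they fall into two candidate data-driven programmes.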

Was there anything particularly challenging in this project?

Deep learning comes with a few practical challenges — you need to keep the computational cost and runtime manageable, while also making sure the model is actually learning what it’s supposed to. We spent a lot of time carefully evaluating the model from different angles, combining quantitative benchmarks with more qualitative checks to ensure the patterns it captured were biologically meaningful.

What is the impact of this work?

We applied Tripso to a number of use cases, but for me the most interesting application came about through our collaboration with the Göttgens lab at the Cambridge Stem Cell Institute, showing how we could improve the maintenance of hematopoietic stem cells (HSCs) in vitro.

HSCs are very useful but quite rare, so maintaining them in vitro is very valuable. We built on a previously published paper in which they compared single cells cultured in three different media to determine the best method for maintaining the cells in a stem-like state. Leveraging public and newly generated data, we trained the model on a million hematopoietic cells, and focused our comparison on in vitro and in vivo conditions.

We first established which gene programmes were active in in vivo HSCs, and which genes were important for stem cell identity. But my wet lab colleagues tell me it’s easier to inhibit something than to activate it, so we looked for genes that are upregulated as cells exit the stem cell state. Our intuition was that if we could inhibit these, then maybe we could keep cells in a stem-like state.

We identified a gene programme that was important in the in vivo long-term HSCs. Within that programme, we looked for genes that increased with differentiation across lineages. We then analysed the data from the three in vitro media that had been tested, looking for genes within this programme with higher activity in the other two media than in the “best” of the three. We found one gene, SSR1, that was consistent in both comparisons.
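The selection logic described here amounts to intersecting two filters within one programme. The sketch below shows that reasoning on invented data: the gene names, the log fold changes and the `min_lfc` cutoff are hypothetical, and the real analysis may have used different statistics.

```python
def consistent_candidates(programme_genes, invivo_lfc, media_lfc, min_lfc=0.5):
    """Return programme genes that are up in both comparisons.

    invivo_lfc: log fold change, differentiating cells vs in vivo HSCs.
    media_lfc:  log fold change, other media vs the best-performing medium.
    Genes missing from a comparison are treated as unchanged (0.0).
    """
    up_invivo = {g for g in programme_genes if invivo_lfc.get(g, 0.0) >= min_lfc}
    up_media = {g for g in programme_genes if media_lfc.get(g, 0.0) >= min_lfc}
    return sorted(up_invivo & up_media)

# Hypothetical toy values, invented for illustration only.
programme_genes = ["geneA", "geneB", "geneC", "geneD"]
invivo_lfc = {"geneA": 1.2, "geneB": 0.1, "geneC": 0.9}
media_lfc = {"geneA": 0.8, "geneB": 1.0, "geneC": -0.3, "geneD": 2.0}

candidates = consistent_candidates(programme_genes, invivo_lfc, media_lfc)
```

Only genes that pass both filters remain as candidates for inhibition, which is how a shortlist can shrink to a single gene.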

When we applied this finding, we showed that we could increase the proportion of HSCs in the culture media when we added the inhibitor. It was amazing to be able to show that we could use gene programmes to compare cells from heterogeneous conditions in a biologically meaningful space, and get hypotheses from this that we then validated experimentally.

Read the full research in the preprint: Moullet, M., Isobe, T., Vahidi, A., Leonardi, C., et al. Self-supervised learning for a gene program-centric view of cell states (2026) bioRxiv