Single-cell RNA-sequencing analysis without double-dipping


Event details

Date 30.11.2023
Hour 15:15 - 16:30
Speaker Prof. Daniela M. Witten
Dorothy Gilford Endowed Chair
University of Washington
Location Online
Category Conferences - Seminars
Event Language English

When analyzing single-cell RNA-sequencing data, we often wish to learn some latent structure among the cells, and then validate this structure on the same set of cells. For example, we might cluster the cells into cell types, and then test whether gene expression differs between the clusters. Or we might estimate a low-dimensional subspace representing a cellular developmental trajectory, and then test whether gene expression is correlated with this trajectory. However, a classical statistical test to validate the latent structure will not control the Type 1 error, since the latent structure was estimated on the same data used for hypothesis testing. Furthermore, a straightforward sample splitting approach does not fix the problem.

In this talk, I will present "count splitting", a simple variant of sample splitting that does control the Type 1 error. The idea is simple but powerful: rather than splitting the n cells in the data matrix into a separate training set of m<n cells and a test set of n-m cells, we instead split the n cells into a training set of n cells and a test set of n cells, in a very particular way such that the training and test sets are independent and follow the same distribution as the original n cells. This allows us to, for instance, define cell types on the training cells and validate them on the test cells, without the pitfalls that arise due to double dipping. 
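The abstract does not spell out the splitting mechanism, so the following is only a minimal sketch under a stated assumption: that each entry of the count matrix is Poisson distributed. Under that assumption, binomial thinning of every count yields a training matrix and a test matrix that both keep all n cells, are independent of each other, and are again Poisson (with means scaled by eps and 1 - eps). The function name `count_split`, the parameter `eps`, and the simulated data are hypothetical and chosen for illustration; this is not the speaker's implementation.

```python
import numpy as np

def count_split(X, eps=0.5, rng=None):
    """Thin a cells-by-genes count matrix into train and test matrices.

    Assumption (not stated in the abstract): entries of X are Poisson.
    Each count X[i, j] is split as X_train[i, j] ~ Binomial(X[i, j], eps)
    and X_test[i, j] = X[i, j] - X_train[i, j].
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X)
    X_train = rng.binomial(X, eps)   # elementwise binomial thinning
    X_test = X - X_train             # the remaining counts form the test matrix
    return X_train, X_test

# Simulated data for illustration only: 200 cells, 50 genes, no true structure.
sim_rng = np.random.default_rng(0)
X = sim_rng.poisson(lam=5.0, size=(200, 50))

X_train, X_test = count_split(X, eps=0.5, rng=1)

# Both matrices keep every cell, unlike ordinary sample splitting.
print(X_train.shape, X_test.shape)
```

In a workflow of this kind, latent structure such as clusters would be estimated from `X_train` only, and the hypothesis tests would then be run on `X_test`.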

This is joint work with PhD alumni Anna Neufeld (now at Fred Hutch) and Lucy Gao (now at U. British Columbia), and with collaborators Jacob Bien (USC), and Alexis Battle and Joshua Popp (Johns Hopkins).

Practical information

  • Informed public
  • Free


  • Prof. Gioele La Manno
