Question

Removal of batch effects before non-linear clustering for scRNAseq

Andrew_McDavid

I have a dataset of 443 cells from 6 different healthy donors (136, 92, 117, 44, 28 and 26 cells per donor, respectively). We want to look at gene-gene correlations across the 443 cells. Before looking at the correlations, we clustered the cells to check whether cell-to-cell variation is larger than the donor differences. To do that, we first tried to remove the variation arising from CDR (cellular detection rate) and donor differences by taking the deviance residuals from MAST. I used a model like this:

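library(MAST)
# CDR: number of genes detected per cell, centered and scaled
# (this assumes testData has one row per cell)
colData(scaRaw)$cngeneson <- scale(rowSums(testData != 0))
colData(scaRaw)$donor <- as.factor(subjectArrayH)
# Fit the model and keep the deviance residuals of each gene via the hook
zlmResidDE <- zlm.SingleCellAssay(~cngeneson+donor, scaRaw, hook=deviance_residuals_hook)
residDE <- zlmResidDE@hookOut
# Stack the per-gene residuals, then transpose to cells x genes
resMatrix <- t(do.call(rbind, residDE))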

We used the SIMLR algorithm with the 1000 most variable genes like this:

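library(SIMLR)
# First column of the dispersion slot, one value per gene
disp = zlmResidDE@dispersion[,1]
cut1 = 1000
noClusters = 6
# Keep the cut1 genes with the largest dispersion
# (<= rather than <, which would keep only 999 genes)
test2 = resMatrix[,rank(-disp) <= cut1]
# Transpose back to genes x cells before calling SIMLR
res = SIMLR(t(test2), c = noClusters, cores.ratio = 0.5)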

The result was this plot:

[Figure: Clusters of cells, with donor carrying much of the weight]

As you can see, cells still cluster largely by donor, although at least the donors do not fall into completely separate clusters. This suggests that donor differences still mask variation due to the biological state of the cell. Do you have any idea where this problem with the first dataset could come from, or any suggestions for how we could improve on the deviance residuals?

ED: This question was originally posted here as an issue.

score 0 · Answer 1 · 2017-04-28

SIMLR and related methods that work from the cell-to-cell distance matrix, such as t-SNE, capture non-linear structure in the data. Regressing out variables and using the deviance residuals makes each gene (roughly speaking) orthogonal to the nuisance covariates. But that does not make the distance matrix orthogonal to the nuisance covariates, e.g., if cells are still more similar to each other within a donor than they are between donors.
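A toy illustration of this point, as a sketch only: the data are simulated, the two "genes", the hidden `state` and the donor labels are all made up, and a plain `lm()` per gene stands in for the MAST deviance residuals. Each gene ends up with zero mean in every donor, yet every cell's nearest neighbour in the distance matrix can still come from its own donor.

```r
set.seed(1)
n_per_state <- 25
donor <- factor(rep(c("d1", "d2"), each = 2 * n_per_state))
# Hypothetical hidden cell state (+1 / -1), balanced within each donor
state <- rep(rep(c(1, -1), each = n_per_state), times = 2)
# gene1 follows the state in both donors; gene2 follows it in d1 but is flipped in d2
flip <- ifelse(donor == "d1", 1, -1)
expr <- cbind(
  gene1 = 3 * state + rnorm(length(state), sd = 0.5),
  gene2 = 3 * state * flip + rnorm(length(state), sd = 0.5)
)

# Regress donor out of each gene separately (linear stand-in for the MAST residuals)
resid_expr <- apply(expr, 2, function(g) resid(lm(g ~ donor)))

# Gene by gene, the residuals carry no donor signal: every donor mean is zero
donor_means <- apply(resid_expr, 2, function(g) tapply(g, donor, mean))
max_abs_donor_mean <- max(abs(donor_means))

# In the cell-to-cell distance matrix, look up each cell's nearest neighbour
d <- as.matrix(dist(resid_expr))
diag(d) <- Inf
nn <- apply(d, 1, which.min)
# Fraction of cells whose nearest neighbour comes from the same donor
same_donor_nn <- mean(donor[nn] == donor)
```

Comparing `max_abs_donor_mean` with `same_donor_nn` shows the gap between per-gene orthogonality and the neighbourhood structure that distance-based methods actually use.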

One thing to try would be to explicitly regress the nuisance covariates out of the distance matrix (rather than the expression matrix), and then run your favorite dimensionality reduction algorithm on the residuals, e.g., resid(lm(distance_matrix ~ covariates)). There is also this recent paper that might be of interest:

Removal of Batch Effects using Distribution-Matching Residual Networks. Uri Shaham, Kelly P. Stanton, Jun Zhao, Huamin Li, Khadir Raddassi, Ruth Montgomery, Yuval Kluger. Bioinformatics 2017
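Returning to the suggestion of regressing the nuisance covariates out of the distance matrix, here is one possible reading of resid(lm(distance_matrix ~ covariates)), as a rough sketch rather than a tested recipe. It assumes Euclidean distances, treats each unordered pair of cells as one observation, and uses the unordered pair of donors as the covariate. The toy data and all object names (`resid_expr`, `donor_pair`, `d_adj`) are hypothetical.

```r
set.seed(2)
n_cells <- 60
donor <- factor(rep(c("d1", "d2", "d3"), each = 20))
# Toy residual matrix (cells x genes): gene means do not depend on donor,
# but the spread does, so the distances still carry donor information
spread <- c(d1 = 1, d2 = 2, d3 = 3)[as.character(donor)]
resid_expr <- matrix(rnorm(n_cells * 10), nrow = n_cells) * spread

d <- as.matrix(dist(resid_expr))
# Use each unordered pair of cells once (upper triangle)
pairs <- which(upper.tri(d), arr.ind = TRUE)
d_vec <- d[pairs]

# Pair-level covariate: which two donors the pair of cells comes from
da <- as.character(donor[pairs[, "row"]])
db <- as.character(donor[pairs[, "col"]])
donor_pair <- factor(paste(pmin(da, db), pmax(da, db), sep = "_"))

# Regress the distances on the donor pair and keep the residuals,
# shifted back to the overall mean distance
fit <- lm(d_vec ~ donor_pair)
d_adj_vec <- resid(fit) + mean(d_vec)

# Rebuild a symmetric matrix with a zero diagonal
d_adj <- matrix(0, n_cells, n_cells)
d_adj[pairs] <- d_adj_vec
d_adj <- d_adj + t(d_adj)
```

The adjusted matrix could then go into a method that accepts precomputed distances, for example `cmdscale(as.dist(d_adj))`. Note that residuals are not guaranteed to be non-negative or to satisfy the triangle inequality, so `d_adj` is a dissimilarity rather than a true distance.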