Question: Removal of batch effects before non-linear clustering for scRNAseq
2.3 years ago by
Andrew_McDavid190 wrote:

I have a dataset of 443 cells from 6 healthy donors (136, 92, 117, 44, 28 and 26 cells per donor, respectively). We want to look at gene-gene correlations across the 443 cells. Before looking at the correlations, we wanted to check whether the cell-to-cell variation is larger than the donor differences. To do that, we first tried to remove the variation arising from CDR (cellular detection rate) and donor differences by taking the deviance residuals from MAST, and then clustered the cells on those residuals. I used a model like this:

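library(MAST)
# CDR: scaled number of detected (non-zero) genes per cell
colData(scaRaw)$cngeneson <- scale(rowSums(testData != 0))
colData(scaRaw)$donor <- as.factor(subjectArrayH)
# fit the hurdle model and keep the deviance residuals through the hook
zlmResidDE <- zlm.SingleCellAssay(~cngeneson+donor, scaRaw, hook=deviance_residuals_hook)
residDE <- zlmResidDE@hookOut
# stack the per-gene residuals and transpose to cells x genes
resMatrix <- t(do.call(rbind, residDE))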

We used the SIMLR algorithm with the 1000 most variable genes like this:

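library(SIMLR)
disp <- zlmResidDE@dispersion[,1]  # first column of the model's dispersion slot
cut1 <- 1000                       # number of most variable genes to keep
noClusters <- 6
# keep the cut1 genes with the largest dispersion
# (<= rather than <, otherwise only cut1 - 1 genes are selected)
test2 <- resMatrix[, rank(-disp) <= cut1]
# t() turns the cells x genes matrix into genes x cells before calling SIMLR
res <- SIMLR(t(test2), c = noClusters, cores.ratio = 0.5)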

As you can see, there is still a lot of clustering by donor, although at least the donors do not fall into completely separate clusters. This suggests that donor differences still mask the variation due to the biological state of the cells. Do you have any idea where this problem could come from? Or any suggestion for how we could improve on the deviance residuals?
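
To put a number on "clustering by donor", one option is to cross-tabulate cluster labels against donor. The following is only a sketch: the donor sizes are the ones stated above, but the `clusters` vector is a random placeholder standing in for the labels returned by SIMLR (assumed here to be available as `res$y$cluster`), and the names `tab`, `prop_by_donor` and `assoc` are made up for illustration.

```r
set.seed(3)
# Donor sizes as stated in the question
donor <- factor(rep(paste0("donor", 1:6), times = c(136, 92, 117, 44, 28, 26)))
# Placeholder labels: replace with the real cluster assignment
clusters <- sample(1:6, length(donor), replace = TRUE)

# Cells per cluster and donor
tab <- table(cluster = clusters, donor = donor)
# For each donor, the fraction of its cells that falls into each cluster
prop_by_donor <- prop.table(tab, margin = 2)
# Permutation-based chi-squared test of cluster/donor association
assoc <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)

print(round(prop_by_donor, 2))
print(assoc$p.value)
```

If donor no longer drove the clustering, the columns of `prop_by_donor` would look similar to each other; columns dominated by one cluster point to residual donor structure.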

ED: This question was originally posted here as an issue.

Answer: Removal of batch effects before non-linear clustering for scRNAseq
2.3 years ago by
Andrew_McDavid190 wrote:

SIMLR, and related methods that work on the cell-to-cell distance matrix, such as t-SNE, will capture non-linear structure in the data. Regressing out variables and using the deviance residuals makes each gene (roughly speaking) orthogonal to the nuisance covariates. But that does not make the distance matrix orthogonal to the nuisance covariates (e.g. if cells are more similar to each other within a donor than they are between donors).
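
A small simulation can illustrate the point. This is a toy sketch, not the MAST model: it uses ordinary `lm()` residuals instead of deviance residuals, and it assumes a made-up donor effect in which donor B has both a shifted mean and a larger spread in every gene.

```r
set.seed(1)
n_per <- 50
n_genes <- 200
donor <- factor(rep(c("A", "B"), each = n_per))

# Toy assumption: donor B has a shifted mean and a three-fold larger spread
cell_sd <- ifelse(donor == "A", 1, 3)
expr <- matrix(rnorm(2 * n_per * n_genes), nrow = 2 * n_per) * cell_sd
expr[donor == "B", ] <- expr[donor == "B", ] + 2

# lm() accepts a matrix response, so this regresses donor out of every gene
res <- resid(lm(expr ~ donor))

# Per gene, the residuals are uncorrelated with the donor indicator
max_abs_cor <- max(abs(cor(res, as.numeric(donor))))

# The cell-to-cell distances, however, still depend on donor
d <- as.matrix(dist(res))
is_A <- donor == "A"
d_AA <- d[is_A, is_A]
d_BB <- d[!is_A, !is_A]
within_A <- mean(d_AA[upper.tri(d_AA)])
within_B <- mean(d_BB[upper.tri(d_BB)])
between <- mean(d[is_A, !is_A])

print(c(max_abs_cor = max_abs_cor, within_A = within_A,
        between = between, within_B = within_B))
```

Every gene is linearly cleaned of donor, yet cells from donor A remain closer to each other than to cells from donor B, which is exactly the kind of structure a distance-based method picks up.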

One thing to try would be to explicitly regress the nuisance covariates out of the distance matrix (rather than the expression matrix) and then run your favorite dimensionality reduction algorithm on the result, e.g. resid(lm(distance_matrix ~ covariates)). There is also this recent paper that might be of interest:

Removal of Batch Effects using Distribution-Matching Residual Networks. Uri Shaham, Kelly P. Stanton, Jun Zhao, Huamin Li, Khadir Raddassi, Ruth Montgomery, Yuval Kluger. Bioinformatics 2017
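
The distance-matrix regression suggested above can be read in more than one way. Here is a minimal sketch of one possible reading, on simulated data: each pair of cells becomes one observation, the pairwise distance is regressed on pair-level covariates (which donors the two cells come from, and how far apart their CDR values are), and the residuals are put back into a symmetric matrix. The pair-level covariates, the shift to non-negative values and the use of `cmdscale()` as the downstream step are all assumptions made for illustration.

```r
set.seed(2)
n_cells <- 60
n_genes <- 100
donor <- factor(rep(c("d1", "d2", "d3"), each = 20))
cngeneson <- as.numeric(scale(rnorm(n_cells)))
# Stand-in for a cells x genes residual matrix with donor-specific spread
expr <- matrix(rnorm(n_cells * n_genes), nrow = n_cells) * c(1, 2, 3)[donor]

d <- as.matrix(dist(expr))

# One row per unordered pair of cells (upper triangle of the matrix)
pairs <- which(upper.tri(d), arr.ind = TRUE)
i <- pairs[, 1]
j <- pairs[, 2]
code <- as.integer(donor)
pair_df <- data.frame(
  dist = d[pairs],
  # unordered donor combination of the two cells, e.g. "1-3"
  donor_pair = factor(paste(pmin(code[i], code[j]), pmax(code[i], code[j]), sep = "-")),
  # absolute CDR difference between the two cells
  cdr_diff = abs(cngeneson[i] - cngeneson[j])
)

fit <- lm(dist ~ donor_pair + cdr_diff, data = pair_df)

# Shift residuals so they can be used as non-negative dissimilarities
r <- resid(fit) - min(resid(fit))
d_adj <- matrix(0, n_cells, n_cells)
d_adj[pairs] <- r
d_adj <- d_adj + t(d_adj)

# Any method that accepts a precomputed dissimilarity could follow; MDS shown here
embedding <- cmdscale(as.dist(d_adj), k = 2)
```

The adjusted matrix is no longer guaranteed to be a proper metric, so it is worth checking how sensitive the downstream embedding or clustering is to this step.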