Removal of batch effects before non-linear clustering for scRNAseq

I have a dataset of 443 cells from 6 healthy donors (136, 92, 117, 44, 28 and 26 cells per donor, respectively). We want to look at gene-gene correlations across the 443 cells. Before computing the correlations, we wanted to cluster the cells to check whether, after correction, the cell-to-cell variation is larger than the donor differences. To do that, we tried to remove the variation arising from CDR (cellular detection rate) and donor differences by taking the deviance residuals from MAST. I used a model like this:

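# CDR: scaled number of detected genes per cell (assumes testData is cells x genes)
colData(scaRaw)$cngeneson <- scale(rowSums(testData != 0))
colData(scaRaw)$donor <- as.factor(subjectArrayH)
# Fit the hurdle model and keep the deviance residuals via the hook
zlmResidDE <- zlm.SingleCellAssay(~cngeneson+donor, scaRaw, hook=deviance_residuals_hook)
residDE <- zlmResidDE@hookOut
# Stack the per-gene residuals, then transpose to a cells x genes matrix
resMatrix <- t(do.call(rbind, residDE))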

We used the SIMLR algorithm with the 1000 most variable genes like this:

disp = zlmResidDE@dispersion[,1]
cut1 = 1000
noClusters = 6
test2 = resMatrix[,rank(-disp) <= cut1] # keep the cut1 genes with the largest dispersion
res = SIMLR(t(test2), c = noClusters, cores.ratio = 0.5) # SIMLR takes a genes x cells matrix

The result was this plot:

[Figure: Clusters of cells, with donor carrying much of the weight]

As you can see, there is still a lot of clustering by donor, although at least the donors do not fall into entirely separate clusters. This suggests that donor differences still mask the variation due to the biological state of the cell. Do you have any idea where this problem could come from in this dataset? Or any suggestions for how we could improve on the deviance residuals?
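
One way to put a number on how strongly the clusters track donor is to cross-tabulate cluster labels against donor labels. The following is only a minimal sketch: the donor sizes are the ones given above, but the cluster labels are simulated placeholders and the names `donor`, `cluster`, `tab` and `cramers_v` are made up for illustration. In practice the cluster labels would come from the clustering output.

```r
# Hypothetical inputs: one donor label and one cluster label per cell.
# Donor sizes match the question; cluster labels are random placeholders.
set.seed(1)
donor   <- factor(rep(paste0("D", 1:6), times = c(136, 92, 117, 44, 28, 26)))
cluster <- factor(sample(1:6, length(donor), replace = TRUE))

# Contingency table: rows are clusters, columns are donors.
tab <- table(cluster = cluster, donor = donor)
print(tab)

# Fraction of each cluster contributed by each donor (each row sums to 1).
print(round(prop.table(tab, margin = 1), 2))

# Cramer's V: 0 means no association, 1 means perfect association
# between cluster membership and donor.
chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
cramers_v <- unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
print(cramers_v)
```

With random labels, as here, the association should be weak; a value close to 1 on real labels would indicate that the clusters largely reproduce the donors.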

ED: This question was originally posted here as an issue.

scrnaseq MAST • 643 views

SIMLR, and related methods that work from the cell-to-cell distance matrix, such as t-SNE, capture non-linear structure in the data. Regressing out variables and using the deviance residuals makes (roughly speaking) each gene orthogonal to the nuisance covariates. But that does not make the distance matrix orthogonal to the nuisance covariates (e.g. if cells are still more similar to each other within a donor than they are between donors).
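
A small simulation may make this point concrete. It is a toy sketch under stated assumptions, not MAST itself: it uses ordinary `lm()` residuals in place of deviance residuals, two simulated donors, and made-up names (`expr`, `resid_expr`, `mean_upper`). Donor B gets both a mean shift and a larger spread; the per-gene regression removes the shift but not the difference in spread, so the distances still depend on donor.

```r
set.seed(42)
n_per_donor <- 100
n_genes <- 50
donor <- factor(rep(c("A", "B"), each = n_per_donor))

# Simulated expression (cells x genes): donor B is shifted and noisier.
expr <- rbind(
  matrix(rnorm(n_per_donor * n_genes, mean = 0, sd = 1), n_per_donor, n_genes),
  matrix(rnorm(n_per_donor * n_genes, mean = 2, sd = 3), n_per_donor, n_genes)
)

# Regress donor out of every gene at once; lm() accepts a matrix response.
resid_expr <- resid(lm(expr ~ donor))

# Per gene, the residual mean within each donor is zero (up to rounding).
max_abs_donor_mean <- max(abs(rowsum(resid_expr, donor) / n_per_donor))
print(max_abs_donor_mean)

# Cell-to-cell Euclidean distances computed on the residuals.
d <- as.matrix(dist(resid_expr))
is_a <- donor == "A"
mean_upper <- function(m) mean(m[upper.tri(m)])  # mean over distinct pairs

within_a <- mean_upper(d[is_a, is_a])
within_b <- mean_upper(d[!is_a, !is_a])
between  <- mean(d[is_a, !is_a])

# The distances still separate by donor even though each gene is
# orthogonal to donor.
print(c(within_A = within_a, between = between, within_B = within_b))
```

In this toy setting cells from donor A end up closer to each other than to cells from donor B, which is the kind of structure a distance-based method can still pick up.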

One thing to try would be to explicitly regress the nuisance covariates out of the distance matrix (rather than the expression matrix) and then run your favorite dimensionality reduction algorithm, e.g. resid(lm(distance_matrix ~ covariates)). There is also this recent paper that might be of interest:

Removal of Batch Effects using Distribution-Matching Residual Networks. Uri Shaham, Kelly P. Stanton, Jun Zhao, Huamin Li, Khadir Raddassi, Ruth Montgomery, Yuval Kluger. Bioinformatics 2017
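
The distance-matrix regression suggested above could look roughly like the sketch below. Everything in it is an assumption for illustration: the data are simulated, the names (`dist_mat`, `dist_resid`, `cdr`) are hypothetical, and applying the regression twice (once per side of the matrix) is one possible way to keep the result symmetric rather than something the one-liner specifies. PCA via `prcomp()` stands in for whichever dimensionality reduction is preferred.

```r
set.seed(7)
n_cells <- 60
donor <- factor(rep(c("A", "B", "C"), each = n_cells / 3))
cdr <- as.numeric(scale(runif(n_cells)))  # stand-in for a CDR covariate

# Simulated expression (cells x genes); the spread differs by donor.
# The scaling vector has one entry per cell, so it multiplies whole rows.
expr <- matrix(rnorm(n_cells * 30), n_cells, 30) * c(1, 2, 3)[as.integer(donor)]
dist_mat <- as.matrix(dist(expr))

# Step 1: regress every column (distances to one cell) on the covariates.
resid_once <- resid(lm(dist_mat ~ donor + cdr))

# Step 2: do the same for the other side, which restores symmetry.
dist_resid <- resid(lm(t(resid_once) ~ donor + cdr))

# Check: symmetric, and every column is orthogonal to the design matrix.
X <- model.matrix(~ donor + cdr)
max_asym  <- max(abs(dist_resid - t(dist_resid)))
max_ortho <- max(abs(crossprod(X, dist_resid)))
print(c(asymmetry = max_asym, orthogonality = max_ortho))

# Stand-in for the downstream step: PCA on the residualised matrix,
# treating each row as one cell's adjusted distance profile.
pcs <- prcomp(dist_resid)$x[, 1:2]
head(pcs)
```

Note that the residuals are no longer proper distances (they can be negative and the diagonal is not zero), so an algorithm that requires a true distance matrix may need a further adjustment before it can use them.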


