I have a dataset of 443 cells from 6 different healthy donors (136, 92, 117, 44, 28 and 26 cells per donor, respectively). We are trying to look at gene-gene correlations across the 443 cells. Before looking at the correlations, we clustered the cells to check whether the cell-to-cell variation is larger than the donor differences. To do that, we tried to remove the variation arising from the CDR (cellular detection rate) and from donor differences by taking the deviance residuals from MAST. I used a model like this:
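    library(MAST)

    # CDR covariate: scaled number of detected genes per cell
    # (this assumes testData is cells x genes)
    colData(scaRaw)$cngeneson <- scale(rowSums(testData != 0))
    colData(scaRaw)$donor <- as.factor(subjectArrayH)

    # Fit the model and collect the deviance residuals through the hook
    zlmResidDE <- zlm.SingleCellAssay(~cngeneson+donor, scaRaw, hook=deviance_residuals_hook)
    residDE <- zlmResidDE@hookOut

    # Stack the per-gene residuals and transpose to cells x genes
    resMatrix <- t(do.call(rbind, residDE))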
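One sanity check we could run on the residuals is to measure, per gene, how much variance donor still explains. Below is a minimal sketch in base R. It uses simulated stand-ins (`toyResid` and `toyDonor` are hypothetical names and values, not our data) in place of `resMatrix` and `subjectArrayH`, and a plain one-way linear model rather than the MAST hurdle model, so it only looks at leftover shifts in the mean:

```r
# Toy stand-ins (hypothetical): 60 cells x 5 genes, 3 donors.
set.seed(1)
toyDonor <- factor(rep(c("d1", "d2", "d3"), each = 20))
toyResid <- matrix(rnorm(60 * 5), nrow = 60,
                   dimnames = list(NULL, paste0("gene", 1:5)))

# Add a donor-specific shift to gene1 only, to mimic a leftover donor effect.
toyResid[, "gene1"] <- toyResid[, "gene1"] + 2 * as.integer(toyDonor)

# Fraction of each gene's variance explained by donor (one-way ANOVA R^2).
donorR2 <- apply(toyResid, 2, function(y) {
  summary(lm(y ~ toyDonor))$r.squared
})
round(donorR2, 2)
```

On real data, genes with a high value here would be candidates for driving the donor structure. A value near zero for every gene would suggest that the remaining donor signal is not a simple mean shift.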
We used the SIMLR algorithm with the 1000 most variable genes like this:
    library(SIMLR)

    disp = zlmResidDE@dispersion[,1]
    cut1 = 1000
    noClusters = 6

    # here we choose the most variable genes (the cut1 largest dispersions)
    test2 = resMatrix[,rank(-disp) <= cut1]

    # SIMLR takes a genes x cells matrix, hence the transpose
    res = SIMLR(t(test2), c = noClusters, cores.ratio = 0.5)
The result was this plot:
As you can see, the cells still cluster largely by donor, although the donors do not fall into fully separate clusters. This suggests that donor differences still mask the variation due to the biological state of the cell. Do you have any idea where this problem with the first dataset could come from? Or any suggestions for how we could improve on the deviance residuals?
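To put a number on the donor effect instead of reading it off the plot, one option would be to cross-tabulate cluster labels against donor. Below is a minimal sketch in base R. The donor sizes are the ones given above, but the cluster labels are random placeholders (`toyCluster` is hypothetical); with the real result I assume the labels would come from `res$y$cluster`:

```r
set.seed(2)

# Donor sizes from the post; donor names are made up for illustration.
donorSizes <- c(136, 92, 117, 44, 28, 26)
toyDonor <- factor(rep(paste0("donor", 1:6), times = donorSizes))

# Random placeholder labels (hypothetical); assumed to be res$y$cluster in practice.
toyCluster <- sample(1:6, length(toyDonor), replace = TRUE)

# Contingency table of cluster membership by donor.
tab <- table(cluster = toyCluster, donor = toyDonor)

# Donor composition of each cluster (each row sums to 1).
round(prop.table(tab, margin = 1), 2)

# Test for association; the p-value is simulated because some cells are small.
assoc <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
assoc$p.value
```

If the row proportions in a cluster differ strongly from the overall donor proportions, that cluster is dominated by donor rather than by cell state.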
ED: This question was originally posted here as an issue.