Hi, I've been using
seurat-wrappers (https://github.com/satijalab/seurat-wrappers) to align my datasets using MNN but I have some issues with downstream analysis. I'm having problem with projecting my dimension reduction onto the full dataset. I realized that the output of
RunFastMNN is a merged
Seurat object with the MNN dimensional reduction information but without the feature-level metadata such as variable feature information. (see issue here: https://github.com/satijalab/seurat-wrappers/issues/15)
However, it is nice to examine the projection of the data with this reduction and I would like to retain the information in my MNN-aligned
Seurat object. I worked around the issue by first performing MNN correction with
batchelor then converting it into a
Seurat object, then save the highly variable genes list into the variable feature slot in the
I wonder if what I did is correct & logical. Did anyone experience the same issue?
My code is as below. Do my steps seem logical?
Thank you very much.
## Load Seurat object so <- readRDS(file = paste0(output, "/PBMC/SO_merge.Rds")) ## Create SingleCellExperiment object sce <- as.SingleCellExperiment(so) rowData(sce) <- NULL reducedDim(sce) <- NULL reducedDim(sce, type = "UMAP") <- NULL ## Correct by sample ID s11 <- sce[ , grepl("S11", sce$orig.ident)] s12 <- sce[ , grepl("S12", sce$orig.ident)] s13 <- sce[ , grepl("S13", sce$orig.ident)] s14 <- sce[ , grepl("S14", sce$orig.ident)] s15 <- sce[ , grepl("S15", sce$orig.ident)] s16 <- sce[ , grepl("S16", sce$orig.ident)] all.sce <- list(S11 = s11, S12 = s12, S13 = s13, S14 = s14, S15 = s15, S16 = s16) ## Subset all batches to common universe of genes universe <- Reduce(intersect, lapply(all.sce, rownames)) all.sce <- lapply(all.sce, "[", i = universe,) ## Adjust scaling to equalize sequencing coverage normed.sce <- do.call(multiBatchNorm, all.sce) ## Find highly variable genes all.var <- lapply(all.sce, modelGeneVar) combined.var <- do.call(combineVar, all.var) hvg.list <- rownames(combined.var)[combined.var$bio > 0] ## Correct batch effect set.seed(920101) mnn.sce <- do.call(fastMNN, c(normed.sce, list(subset.row = hvg.list))) ## Save computed MNN into SCE object, then convert to Seurat object reducedDim(sce, "MNN") <- reducedDim(mnn.sce, "corrected") so.fastmnn <- as.Seurat(sce) ## Keep highly variable genes list into Seurat object so.fastmnn@assays$RNA@var.features <- hvg.list ## Scale data & project loadings so.fastmnn <- ScaleData(so.fastmnn) ProjectDim(so.fastmnn, reduction = "mnn", dims.print = 1:2, nfeatures.print = 5)
OK, thank you for your feedback.
Two additional questions here:
1) I'm aware that we can use the MNN reduced dimension for plotting & visualization. But is there any way we can decide the optimal number of reduced dimensions to use to run UMAP for example, or determine the number of MNN dimensions that contribute to significant structure in reduced dimensional space? Something like an elbow plot for principal components?
2) Is it logical to run PCA on a MNN-merged object?
Thank you very much.
1. In the past, I did, in fact, think that I wanted to get a better estimate of the number of PCs from
fastMNN. This was the motivation behind reporting the percentage of variance explained by
multiBatchPCA()and adding a
denoisePCANumber()function to scran. But IIRC it cut away too many PCs and was enriching for the batch effect, so I just forgot about it.
The problem is that you can't freely drop the later PCs because the earliest PCs are often dominated by a strong batch effect. Consider an elbow plot on the uncorrected data; for typical batch effects, you can expect that the elbow will be preceded by PCs that are totally driven by batch effects! For the corrected data, the problem then flips around because - in the best case where those batch effects are removed - the earlier PCs may no longer explain the most variance, potentially resulting in a strange-looking scree plot with multiple elbows and valleys and what have you.
Yes, it is possible to get around all this (see my answer below) but is it really worth it? What is so bad about the default settings that you feel the need to "optimize" it? What does the optimum even mean? There are certainly mathematical definitions of the optimal number of PCs but these have less-than-appealing consequences for biological data analysis, e.g., retaining a large number of PCs to accurately represent real but uninteresting processes (e.g., metabolic activity) in later PCs.
2. If you're running it with the same number of PCs as that used in
fastMNN(and you're planning to use all of those PCs), then no, it's just a rotation of the data. However, I suppose you could use it to get yourself a nice-looking elbow plot that would permit you to make a decision on how to discard the later PCs. Whether that is sensible is subject to the same considerations as for the uncorrected data.
However, I think that's a waste of time. As with most discussions about the optimal number of PCs, it's usually much faster to just try a bunch of options and see how much your conclusions change. Or you can just stick with one "reasonable" choice; the important thing is whether you can answer your scientific question, you're not obliged to squeeze every last bit of biological information from your data.
Thanks a lot for your reply, Aaron!
Your explanation is really helpful. Reason I'm asking question (1) is because I was going to perform a gene module analysis and from what I can find, people usually determine the number of PCs contributing to significant structure in PC space then pick the first and last 50 genes from each dimension.
As I couldn't do that, I arbitrary used 30 dimensions for this step. The thing is my PI asked me why I pick 30 dimensions and how I decided on this number, in which I could not explain myself.
But your explanation helped a lot and I will try to talk about this again with him! Thank you very much for your reply! I agree with you that the important thing is being able to answer my scientific questions... Cheers!
It sounds like you should do your module analysis within each batch and then do some meta-analysis across batches. I don't do much de novo module analysis but you can certainly do some good ol' gene set tests within each batch and combine the p-values across batches.
I usually pick the number of PCs equal to my age at time of analysis, with the expectation that I would be long retired before I started putting too many PCs into downstream steps.
Seriously, though, as long as you hold onto enough PCs to capture the biology of interest, any choice is as good as the other. Consider the elbow plot, and see how quickly the plateau is reached; if your choice of the number of PCs already lies in this plateau, there's not much harm in picking a handful of more PCs. You're barely adding any extra variance anyway from these later PCs, the biggest practical effect is to increase computational time.
OK, thanks for your suggestion. I will try it out.
Thanks for the explanation. It's really clear.
Aaron, I appreciate all your comments & feedback. Thanks a lot again!