Carry over variable features when converting FastMNN SCE object to Seurat object
1
0
Entering edit mode
@heir_of_isildur88-15346
Last seen 2.8 years ago

Hi, I've been using RunFastMNN from seurat-wrappers (https://github.com/satijalab/seurat-wrappers) to align my datasets using MNN but I have some issues with downstream analysis. I'm having problem with projecting my dimension reduction onto the full dataset. I realized that the output of RunFastMNN is a merged Seurat object with the MNN dimensional reduction information but without the feature-level metadata such as variable feature information. (see issue here: https://github.com/satijalab/seurat-wrappers/issues/15)

However, it is nice to examine the projection of the data with this reduction and I would like to retain the information in my MNN-aligned Seurat object. I worked around the issue by first performing MNN correction with batchelor then converting it into a Seurat object, then save the highly variable genes list into the variable feature slot in the Seurat object.

I wonder if what I did is correct & logical. Did anyone experience the same issue?

My code is as below. Do my steps seem logical?

Thank you very much.

## Load Seurat object
so <- readRDS(file = paste0(output, "/PBMC/SO_merge.Rds"))

## Create SingleCellExperiment object
sce <- as.SingleCellExperiment(so)
rowData(sce) <- NULL
reducedDim(sce) <- NULL
reducedDim(sce, type = "UMAP") <- NULL

## Correct by sample ID
s11 <- sce[ , grepl("S11", sce$orig.ident)] s12 <- sce[ , grepl("S12", sce$orig.ident)]
s13 <- sce[ , grepl("S13", sce$orig.ident)] s14 <- sce[ , grepl("S14", sce$orig.ident)]
s15 <- sce[ , grepl("S15", sce$orig.ident)] s16 <- sce[ , grepl("S16", sce$orig.ident)]

all.sce <- list(S11 = s11, S12 = s12, S13 = s13, S14 = s14, S15 = s15, S16 = s16)

## Subset all batches to common universe of genes
universe <- Reduce(intersect, lapply(all.sce, rownames))
all.sce <- lapply(all.sce, "[", i = universe,)

## Adjust scaling to equalize sequencing coverage
normed.sce <- do.call(multiBatchNorm, all.sce)

## Find highly variable genes
all.var <- lapply(all.sce, modelGeneVar)
combined.var <- do.call(combineVar, all.var)
hvg.list <- rownames(combined.var)[combined.var$bio > 0] ## Correct batch effect set.seed(920101) mnn.sce <- do.call(fastMNN, c(normed.sce, list(subset.row = hvg.list))) ## Save computed MNN into SCE object, then convert to Seurat object reducedDim(sce, "MNN") <- reducedDim(mnn.sce, "corrected") so.fastmnn <- as.Seurat(sce) ## Keep highly variable genes list into Seurat object so.fastmnn@assays$RNA@var.features <- hvg.list

so.fastmnn <- ScaleData(so.fastmnn)
ProjectDim(so.fastmnn, reduction = "mnn", dims.print = 1:2, nfeatures.print = 5)

mnn batchelor scrna-seq seurat • 1.9k views
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 1 hour ago
The city by the bay

The steps related to scran and batchelor look fine. I can't comment on the stuff before and after that, Seurat is unfortunately not a Bioconductor package and thus is out of scope for this forum.

0
Entering edit mode

OK, thank you for your feedback.

1) I'm aware that we can use the MNN reduced dimension for plotting & visualization. But is there any way we can decide the optimal number of reduced dimensions to use to run UMAP for example, or determine the number of MNN dimensions that contribute to significant structure in reduced dimensional space? Something like an elbow plot for principal components?

2) Is it logical to run PCA on a MNN-merged object?

Thank you very much.

1
Entering edit mode

1. In the past, I did, in fact, think that I wanted to get a better estimate of the number of PCs from fastMNN. This was the motivation behind reporting the percentage of variance explained by multiBatchPCA() and adding a denoisePCANumber() function to scran. But IIRC it cut away too many PCs and was enriching for the batch effect, so I just forgot about it.

The problem is that you can't freely drop the later PCs because the earliest PCs are often dominated by a strong batch effect. Consider an elbow plot on the uncorrected data; for typical batch effects, you can expect that the elbow will be preceded by PCs that are totally driven by batch effects! For the corrected data, the problem then flips around because - in the best case where those batch effects are removed - the earlier PCs may no longer explain the most variance, potentially resulting in a strange-looking scree plot with multiple elbows and valleys and what have you.

Yes, it is possible to get around all this (see my answer below) but is it really worth it? What is so bad about the default settings that you feel the need to "optimize" it? What does the optimum even mean? There are certainly mathematical definitions of the optimal number of PCs but these have less-than-appealing consequences for biological data analysis, e.g., retaining a large number of PCs to accurately represent real but uninteresting processes (e.g., metabolic activity) in later PCs.

2. If you're running it with the same number of PCs as that used in fastMNN (and you're planning to use all of those PCs), then no, it's just a rotation of the data. However, I suppose you could use it to get yourself a nice-looking elbow plot that would permit you to make a decision on how to discard the later PCs. Whether that is sensible is subject to the same considerations as for the uncorrected data.

However, I think that's a waste of time. As with most discussions about the optimal number of PCs, it's usually much faster to just try a bunch of options and see how much your conclusions change. Or you can just stick with one "reasonable" choice; the important thing is whether you can answer your scientific question, you're not obliged to squeeze every last bit of biological information from your data.

0
Entering edit mode

Your explanation is really helpful. Reason I'm asking question (1) is because I was going to perform a gene module analysis and from what I can find, people usually determine the number of PCs contributing to significant structure in PC space then pick the first and last 50 genes from each dimension.

As I couldn't do that, I arbitrary used 30 dimensions for this step. The thing is my PI asked me why I pick 30 dimensions and how I decided on this number, in which I could not explain myself.

But your explanation helped a lot and I will try to talk about this again with him! Thank you very much for your reply! I agree with you that the important thing is being able to answer my scientific questions... Cheers!

0
Entering edit mode

Reason I'm asking question (1) is because I was going to perform a gene module analysis and from what I can find, people usually determine the number of PCs contributing to significant structure in PC space then pick the first and last 50 genes from each dimension.

It sounds like you should do your module analysis within each batch and then do some meta-analysis across batches. I don't do much de novo module analysis but you can certainly do some good ol' gene set tests within each batch and combine the p-values across batches.

As I couldn't do that, I arbitrary used 30 dimensions for this step. The thing is my PI asked me why I pick 30 dimensions and how I decided on this number, in which I could not explain myself.

I usually pick the number of PCs equal to my age at time of analysis, with the expectation that I would be long retired before I started putting too many PCs into downstream steps.

Seriously, though, as long as you hold onto enough PCs to capture the biology of interest, any choice is as good as the other. Consider the elbow plot, and see how quickly the plateau is reached; if your choice of the number of PCs already lies in this plateau, there's not much harm in picking a handful of more PCs. You're barely adding any extra variance anyway from these later PCs, the biggest practical effect is to increase computational time.

0
Entering edit mode

It sounds like you should do your module analysis within each batch and then do some meta-analysis across batches. I don't do much de novo module analysis but you can certainly do some good ol' gene set tests within each batch and combine the p-values across batches.

OK, thanks for your suggestion. I will try it out.

Seriously, though, as long as you hold onto enough PCs to capture the biology of interest, any choice is as good as the other. Consider the elbow plot, and see how quickly the plateau is reached; if your choice of the number of PCs already lies in this plateau, there's not much harm in picking a handful of more PCs. You're barely adding any extra variance anyway from these later PCs, the biggest practical effect is to increase computational time.

Thanks for the explanation. It's really clear.

Aaron, I appreciate all your comments & feedback. Thanks a lot again!