Hi, I am trying to reanalyze the data in the package MouseGastrulationData and have two questions need help:
Q1) I am interested in only E8.25 cells and did the following 4 steps. However, the cell classification of the merged data of 3 samples (using fastMNN for batcheffect control) did poor agreement with the original celltype assignment. Could anyone tell me what is wrong ?
Q2) If I want to focused on the cells expressing one particular marker (e.g., Gata1 counts>0), can I used the strategy above by subset the counts matrix of each sample, merge, and correct batch?
Thank you, Holly
1) find out the sample of E8.25######################
<h6>#</h6>subset(AtlasSampleMetadata, stage=='E8.25')
# sample stage pool_index seq_batch ncells
# 23 24 E8.25 19 2 6707
# 24 25 E8.25 19 2 7289
# 27 28 E8.25 21 2 4646
2) extract the data one-by-one, e.g. for sample 23######################
<h6>#</h6>sce.23 <- EmbryoAtlasData(samples = 23)
drop <- (sce.23$doublet | sce.23$stripped)
sce.23 <- sce.23[,!drop]
sce.23 <- logNormCounts(sce.23)
3) merge and correct batcheffct for all 3 datasets######################
<h6>#</h6>library(batchelor)
sce <- multiBatchNorm(sce.23, sce.24, sce.27)
sce.23 <- sce[[1]]
sce.24 <- sce[[2]]
sce.27 <- sce[[3]]
chosen.hvgs <- dec23$bio > 0 & dec24$bio & dec27$bio
sum(chosen.hvgs) # 11305
mnn.out <- fastMNN(sce.23, sce.24, sce.27, d=50, k=20, subset.row=chosen.hvgs,
BSPARAM=BiocSingular::RandomParam(deferred=TRUE))
mnn.out
# class: SingleCellExperiment
# dim: 11305 13358
# metadata(2): merge.info pca.info
# assays(1): reconstructed
# rownames(11305): Xkr4 Rp1 ... AC149090.1 DHRSX
# rowData names(1): rotation
# colnames(13358): cell_61791 cell_61792 ... cell_91079 cell_91080
# colData names(1): batch
# reducedDimNames(1): corrected
# altExpNames(0):
#
# add back the colData information for cell annotations
tmp <- rbind(colData(sce.23), colData(sce.24), colData(sce.27))
colData(mnn.out) <- cbind(colData(mnn.out), tmp)
4) cluster cells and compare with the original celltype annotation ######################
<h6>#</h6>library(scran)
snn.gr <- buildSNNGraph(mnn.out, use.dimred="corrected")
clusters.mnn <- igraph::cluster_walktrap(snn.gr)$membership
library(bluster)
ri23 <- pairwiseRand(clusters.mnn[mnn.out$batch==1], mnn.out$celltype[mnn.out$batch==1], mode="index")
ri23 ## [1] 0.7516883
ri24 <- pairwiseRand(clusters.mnn[mnn.out$batch==2], mnn.out$celltype[mnn.out$batch==2], mode="index")
ri24 ## 0.3265493 !!! why disagree ??
ri27 <- pairwiseRand(clusters.mnn[mnn.out$batch==3], mnn.out$celltype[mnn.out$batch==3], mode="index")
ri27 ## 0.675513
sessionInfo( )
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods
[9] base
other attached packages:
[1] bluster_1.0.0 rstudioapi_0.13
[3] MouseGastrulationData_1.4.0 scater_1.18.3
[5] ggplot2_3.3.2 scran_1.18.1
[7] SingleCellExperiment_1.11.9 SummarizedExperiment_1.19.9
[9] Biobase_2.49.1 GenomicRanges_1.41.6
[11] GenomeInfoDb_1.25.11 IRanges_2.23.10
[13] S4Vectors_0.27.14 BiocGenerics_0.35.4
[15] MatrixGenerics_1.1.8 matrixStats_0.57.0
loaded via a namespace (and not attached):
[1] bitops_1.0-6 bit64_4.0.5
[3] RColorBrewer_1.1-2 RcppAnnoy_0.0.16
[5] httr_1.4.2 tools_4.0.2
[7] R6_2.5.0 irlba_2.3.3
[9] vipor_0.4.5 uwot_0.1.8
[11] DBI_1.1.0 colorspace_1.4-1
[13] withr_2.3.0 tidyselect_1.1.0
[15] gridExtra_2.3 bit_4.0.4
[17] curl_4.3 compiler_4.0.2
[19] BiocNeighbors_1.8.1 DelayedArray_0.15.17
[21] labeling_0.4.2 scales_1.1.1
[23] rappdirs_0.3.1 digest_0.6.27
[25] XVector_0.29.3 htmltools_0.5.0
[27] pkgconfig_2.0.3 sparseMatrixStats_1.2.0
[29] fastmap_1.0.1 dbplyr_2.0.0
[31] limma_3.46.0 rlang_0.4.8
[33] RSQLite_2.2.1 shiny_1.5.0
[35] DelayedMatrixStats_1.12.0 farver_2.0.3
[37] generics_0.1.0 BiocParallel_1.23.3
[39] dplyr_1.0.2 RCurl_1.98-1.2
[41] magrittr_1.5 BiocSingular_1.6.0
[43] GenomeInfoDbData_1.2.4 scuttle_1.0.0
[45] Matrix_1.2-18 Rcpp_1.0.5
[47] ggbeeswarm_0.6.0 munsell_0.5.0
[49] viridis_0.5.1 lifecycle_0.2.0
[51] yaml_2.2.1 edgeR_3.32.0
[53] zlibbioc_1.35.0 Rtsne_0.15
[55] BiocFileCache_1.14.0 AnnotationHub_2.22.0
[57] grid_4.0.2 blob_1.2.1
[59] promises_1.1.1 dqrng_0.2.1
[61] ExperimentHub_1.16.0 crayon_1.3.4
[63] lattice_0.20-41 cowplot_1.1.0
[65] beachmat_2.6.1 locfit_1.5-9.4
[67] pillar_1.4.6 igraph_1.2.6
[69] codetools_0.2-18 glue_1.4.2
[71] BiocVersion_3.12.0 BiocManager_1.30.10
[73] httpuv_1.5.4 vctrs_0.3.4
[75] gtable_0.3.0 purrr_0.3.4
[77] assertthat_0.2.1 mime_0.9
[79] rsvd_1.0.3 xtable_1.8-4
[81] RSpectra_0.16-0 later_1.1.0.1
[83] viridisLite_0.3.0 tibble_3.0.4
[85] AnnotationDbi_1.52.0 beeswarm_0.2.3
[87] memoise_1.1.0 statmod_1.4.35
[89] ellipsis_0.3.1 interactiveDisplayBase_1.28.0
I figured it out. The dataset IDs turn to be different with sample IDs. I should use samples 24, 25, and 28 rather than samples 23, 24, and 27!
Interestingly, similar results came from fastMNN (for batcheffect control) and uncorrected (cbind samples).
And if someone could comment more on my questions 2, that would be much appreciated.
Holly