Question

extract certain subpopulation of MouseGastrulationData

0

Entering edit mode

xyang2 ▴ 120

@xyang2-4387

Last seen 4.6 years ago

Hi, I am trying to reanalyze the data in the package MouseGastrulationData and have two questions need help:

Q1) I am interested in only E8.25 cells and did the following 4 steps. However, the cell classification of the merged data of 3 samples (using fastMNN for batcheffect control) did poor agreement with the original celltype assignment. Could anyone tell me what is wrong ?

Q2) If I want to focused on the cells expressing one particular marker (e.g., Gata1 counts>0), can I used the strategy above by subset the counts matrix of each sample, merge, and correct batch?

Thank you, Holly

1) find out the sample of E8.25######################

subset(AtlasSampleMetadata, stage=='E8.25')
#     sample stage pool_index seq_batch ncells
# 23     24 E8.25         19         2   6707
# 24     25 E8.25         19         2   7289
# 27     28 E8.25         21         2   4646

2) extract the data one-by-one, e.g. for sample 23######################

sce.23 <- EmbryoAtlasData(samples = 23)
drop <- (sce.23$doublet | sce.23$stripped)
sce.23 <- sce.23[,!drop]
sce.23 <- logNormCounts(sce.23)

3) merge and correct batcheffct for all 3 datasets######################

library(batchelor)
sce <- multiBatchNorm(sce.23, sce.24, sce.27)
sce.23 <- sce[[1]]
sce.24 <- sce[[2]]
sce.27 <- sce[[3]]
chosen.hvgs <- dec23$bio > 0 & dec24$bio & dec27$bio
sum(chosen.hvgs)  # 11305

mnn.out <- fastMNN(sce.23, sce.24, sce.27, d=50, k=20, subset.row=chosen.hvgs,
                   BSPARAM=BiocSingular::RandomParam(deferred=TRUE))
mnn.out
# class: SingleCellExperiment 
# dim: 11305 13358 
# metadata(2): merge.info pca.info
# assays(1): reconstructed
# rownames(11305): Xkr4 Rp1 ... AC149090.1 DHRSX
# rowData names(1): rotation
# colnames(13358): cell_61791 cell_61792 ... cell_91079 cell_91080
# colData names(1): batch
# reducedDimNames(1): corrected
# altExpNames(0):
#   

# add back the colData information for cell annotations
tmp <- rbind(colData(sce.23), colData(sce.24), colData(sce.27))
colData(mnn.out) <- cbind(colData(mnn.out), tmp)

4) cluster cells and compare with the original celltype annotation ######################

library(scran)
snn.gr <- buildSNNGraph(mnn.out, use.dimred="corrected")
clusters.mnn <- igraph::cluster_walktrap(snn.gr)$membership

library(bluster)
ri23 <- pairwiseRand(clusters.mnn[mnn.out$batch==1], mnn.out$celltype[mnn.out$batch==1], mode="index")
ri23 ## [1]  0.7516883
ri24 <- pairwiseRand(clusters.mnn[mnn.out$batch==2], mnn.out$celltype[mnn.out$batch==2], mode="index")
ri24  ## 0.3265493  !!!  why disagree ??
ri27 <- pairwiseRand(clusters.mnn[mnn.out$batch==3], mnn.out$celltype[mnn.out$batch==3], mode="index")
ri27 ## 0.675513

sessionInfo( )
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] bluster_1.0.0               rstudioapi_0.13            
 [3] MouseGastrulationData_1.4.0 scater_1.18.3              
 [5] ggplot2_3.3.2               scran_1.18.1               
 [7] SingleCellExperiment_1.11.9 SummarizedExperiment_1.19.9
 [9] Biobase_2.49.1              GenomicRanges_1.41.6       
[11] GenomeInfoDb_1.25.11        IRanges_2.23.10            
[13] S4Vectors_0.27.14           BiocGenerics_0.35.4        
[15] MatrixGenerics_1.1.8        matrixStats_0.57.0         

loaded via a namespace (and not attached):
 [1] bitops_1.0-6                  bit64_4.0.5                  
 [3] RColorBrewer_1.1-2            RcppAnnoy_0.0.16             
 [5] httr_1.4.2                    tools_4.0.2                  
 [7] R6_2.5.0                      irlba_2.3.3                  
 [9] vipor_0.4.5                   uwot_0.1.8                   
[11] DBI_1.1.0                     colorspace_1.4-1             
[13] withr_2.3.0                   tidyselect_1.1.0             
[15] gridExtra_2.3                 bit_4.0.4                    
[17] curl_4.3                      compiler_4.0.2               
[19] BiocNeighbors_1.8.1           DelayedArray_0.15.17         
[21] labeling_0.4.2                scales_1.1.1                 
[23] rappdirs_0.3.1                digest_0.6.27                
[25] XVector_0.29.3                htmltools_0.5.0              
[27] pkgconfig_2.0.3               sparseMatrixStats_1.2.0      
[29] fastmap_1.0.1                 dbplyr_2.0.0                 
[31] limma_3.46.0                  rlang_0.4.8                  
[33] RSQLite_2.2.1                 shiny_1.5.0                  
[35] DelayedMatrixStats_1.12.0     farver_2.0.3                 
[37] generics_0.1.0                BiocParallel_1.23.3          
[39] dplyr_1.0.2                   RCurl_1.98-1.2               
[41] magrittr_1.5                  BiocSingular_1.6.0           
[43] GenomeInfoDbData_1.2.4        scuttle_1.0.0                
[45] Matrix_1.2-18                 Rcpp_1.0.5                   
[47] ggbeeswarm_0.6.0              munsell_0.5.0                
[49] viridis_0.5.1                 lifecycle_0.2.0              
[51] yaml_2.2.1                    edgeR_3.32.0                 
[53] zlibbioc_1.35.0               Rtsne_0.15                   
[55] BiocFileCache_1.14.0          AnnotationHub_2.22.0         
[57] grid_4.0.2                    blob_1.2.1                   
[59] promises_1.1.1                dqrng_0.2.1                  
[61] ExperimentHub_1.16.0          crayon_1.3.4                 
[63] lattice_0.20-41               cowplot_1.1.0                
[65] beachmat_2.6.1                locfit_1.5-9.4               
[67] pillar_1.4.6                  igraph_1.2.6                 
[69] codetools_0.2-18              glue_1.4.2                   
[71] BiocVersion_3.12.0            BiocManager_1.30.10          
[73] httpuv_1.5.4                  vctrs_0.3.4                  
[75] gtable_0.3.0                  purrr_0.3.4                  
[77] assertthat_0.2.1              mime_0.9                     
[79] rsvd_1.0.3                    xtable_1.8-4                 
[81] RSpectra_0.16-0               later_1.1.0.1                
[83] viridisLite_0.3.0             tibble_3.0.4                 
[85] AnnotationDbi_1.52.0          beeswarm_0.2.3               
[87] memoise_1.1.0                 statmod_1.4.35               
[89] ellipsis_0.3.1                interactiveDisplayBase_1.28.0

MouseGastrulationData Normalization BatchEffect • 1.3k views

ADD COMMENT • link updated 4.6 years ago by Jonathan Griffiths ▴ 90 • written 4.6 years ago by xyang2 ▴ 120

0

Entering edit mode

I figured it out. The dataset IDs turn to be different with sample IDs. I should use samples 24, 25, and 28 rather than samples 23, 24, and 27!

Interestingly, similar results came from fastMNN (for batcheffect control) and uncorrected (cbind samples).

And if someone could comment more on my questions 2, that would be much appreciated.

Holly

ADD REPLY • link 4.6 years ago xyang2 ▴ 120

score 0 · Answer 1 · 2020-11-18

Hi Holly,

Every time you subset the SCE it is useful to run logNormCounts because it will center the size factors for you, which lots of scran’s downstream methods are expecting - at least, this used to be the case. I'm sure Aaron will correct me if that has been changed.

If I recall correctly, I think the E8.25 samples were not subject to a very large batch effect, so there might be little to gain by correction. This is not generally true, though. When I am working with the data, I tend to use reducedDim(sce, “pca.corrected”) - which is the output of my fastMNN correction over the whole atlas. This tends to do well. If I need to look very closely at a particular set of cells (e.g., the developing endoderm), then I tend to use the approach that scran now has in it’s multi-batch correction function on only the subset of cells of interest.

On your second point, I think it’s probably better to identify clusters of cells that express Gata1 rather than individual cells, because we usually cannot "trust" individual measurements of a single gene in a single cell due to sampling noise. About correction of that cell subset, remember the assumption of MNN that there are some cells in common that are shared across the samples to be merged. I don’t know the expression pattern of Gata1, but if it is expressed in very different lineages at different timepoints, these may be falsely corrected, and different cell types merged. It would be safer to use the pca.corrected in this case, I imagine.

Jonny