Using limma duplicateCorrelation to account for donor/dataset effects in scRNA-Seq pseudobulk counts from multiple sources
Entering edit mode
Last seen 4 weeks ago

Dear all (especially limma people),

I'm trying to do something fairly ambitious and do differential gene expression between T cell subsets from multiple tumour and normal single cell datasets. I have compiled 15 datasets (comprised of a few hundred samples) from multiple tumour+normal, just tumour or just normal sources. I'm starting from the raw counts submitted to places like GEO. The key question I am trying to ask is: in a given T cell population (e.g CD4+ T cells), what genes are generally DE between cells taken from tumours and cells taken from normal tissues.

I've used scMatch to filter down to cells that generally look like T cells (some datasets are also immune selected). Using Seurat v3 to do dataset integration and clustering, I've seemingly been able to identify clusters comprising at least CD4+ and CD8+ T cells and NK cells. These clusters are generally supported by marker gene expression, scMatch and and one of the datasets where TCR sequencing was also done (although I would doubt 100% of these annotations are correct). I have then pseudobulked (summed) raw counts from cells from these clusters based on their donor, so patient_1_NK, patient_1_CD8, patient_2_NK etc.

I have then used limma to do DE between tumour and normal tissues for CD8+ T cells as an example. Obviously there are many, many sources of variation (including 3 different single cell protocols used) that could interfere with this comparison, but I have tried to account for the major sources and hope that the number of samples that I'm using gives me enough replication to still get some sensible results.

Since I've got multiple cell types per patient, I was wondering if it makes sense to use duplicateCorrelation to model patient as a random effect (in addition to the other batch effects I've included in my design matrix), as outlined below.

It is difficult for me to conceptualise, but I was also wondering if duplicateCorrelation would also go some way to accounting for dataset level effects since the within sample correlation will also be driven by by the dataset. Maybe there is a better way to account for dataset?

Thanks in advance for any advice,

Cheers, Andrew

# Simplified version the relevant parts of my setup
# CellType_location here would be tumour, normal tissue or blood and the cell type from clustering e.g. CD8_tumour
design <- model.matrix(~0 + CellType_location + gender + protocol, data = pseudobulk_anno)

# rnaseq is expressed genes with TMM scaling factors calcualted
v <- voom(rnaseq, design)

corfit <- duplicateCorrelation(v, design, block = pseudobulk_anno$patient)

v <- voom(rnaseq, design, block = pseudobulk_anno$patient, correlation = corfit$consensus.correlation)

fit <- lmFit(v, design, block = pseudobulk_anno$patient, correlation = corfit$consensus.correlation)

cont.matrix <- makeContrasts(CD8_T_vs_N = CD8_tumour - CD8_normal, levels = design)

sessionInfo( )
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.4 (Ootpa)

Matrix products: default
BLAS/LAPACK: /usr/lib64/

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Glimma_2.2.0                GSA_1.03.1                  topconfects_1.8.0           schex_1.6.3                 shiny_1.6.0                
 [6] plotly_4.9.4.1              ggrepel_0.9.1               edgeR_3.34.0                limma_3.48.3                SingleCellExperiment_1.14.1
[11] SummarizedExperiment_1.22.0 GenomicRanges_1.44.0        GenomeInfoDb_1.28.1         IRanges_2.26.0              S4Vectors_0.30.0           
[16] MatrixGenerics_1.4.2        matrixStats_0.60.0          glmGamPoi_1.4.0             sctransform_0.3.2           GEOquery_2.60.0            
[21] Biobase_2.52.0              BiocGenerics_0.38.0         forcats_0.5.1               stringr_1.4.0               dplyr_1.0.7                
[26] purrr_0.3.4                 readr_1.4.0                 tidyr_1.1.3                 tibble_3.1.3                ggplot2_3.3.5              
[31] tidyverse_1.3.1             SeuratObject_4.0.2          Seurat_4.0.3               

loaded via a namespace (and not attached):
  [1] utf8_1.2.2             reticulate_1.20        tidyselect_1.1.1       RSQLite_2.2.7          AnnotationDbi_1.54.1   htmlwidgets_1.5.3     
  [7] grid_4.1.0             BiocParallel_1.26.1    Rtsne_0.15             munsell_0.5.0          codetools_0.2-18       ica_1.0-2             
 [13] future_1.21.0          miniUI_0.1.1.1         withr_2.4.2            colorspace_2.0-2       rstudioapi_0.13        ROCR_1.0-11           
 [19] tensor_1.5             listenv_0.8.0          labeling_0.4.2         GenomeInfoDbData_1.2.6 polyclip_1.10-0        bit64_4.0.5           
 [25] farver_2.1.0           parallelly_1.27.0      vctrs_0.3.8            generics_0.1.0         R6_2.5.0               locfit_1.5-9.4        
 [31] concaveman_1.1.0       bitops_1.0-7           spatstat.utils_2.2-0   cachem_1.0.5           DelayedArray_0.18.0    assertthat_0.2.1      
 [37] promises_1.2.0.1       scales_1.1.1           gtable_0.3.0           globals_0.14.0         goftest_1.2-2          rlang_0.4.11          
 [43] genefilter_1.74.0      splines_4.1.0          lazyeval_0.2.2         hexbin_1.28.2          spatstat.geom_2.2-0    broom_0.7.8           
 [49] reshape2_1.4.4         abind_1.4-5            modelr_0.1.8           backports_1.2.1        httpuv_1.6.1           tools_4.1.0           
 [55] ellipsis_0.3.2         spatstat.core_2.2-0    RColorBrewer_1.1-2     ggridges_0.5.3         Rcpp_1.0.7             plyr_1.8.6            
 [61] zlibbioc_1.38.0        RCurl_1.98-1.3         rpart_4.1-15           deldir_0.2-10          pbapply_1.4-3          cowplot_1.1.1         
 [67] zoo_1.8-9              haven_2.4.1            cluster_2.1.2          fs_1.5.0               magrittr_2.0.1         RSpectra_0.16-0       
 [73] data.table_1.14.0      scattermore_0.7        lmtest_0.9-38          reprex_2.0.0           RANN_2.6.1             fitdistrplus_1.1-5    
 [79] hms_1.1.0              patchwork_1.1.1        mime_0.11              xtable_1.8-4           XML_3.99-0.7           readxl_1.3.1          
 [85] gridExtra_2.3          compiler_4.1.0         KernSmooth_2.23-20     crayon_1.4.1           htmltools_0.5.1.1      entropy_1.3.0         
 [91] mgcv_1.8-35            later_1.2.0            geneplotter_1.70.0     lubridate_1.7.10       DBI_1.1.1              tweenr_1.0.2          
 [97] dbplyr_2.1.1           MASS_7.3-54            Matrix_1.3-3           cli_3.0.1              igraph_1.2.6           pkgconfig_2.0.3       
[103] spatstat.sparse_2.0-0  xml2_1.3.2             annotate_1.70.0        XVector_0.32.0         rvest_1.0.0            digest_0.6.27         
[109] RcppAnnoy_0.0.19       spatstat.data_2.1-0    Biostrings_2.60.2      cellranger_1.1.0       leiden_0.3.8           uwot_0.1.10           
[115] lifecycle_1.0.0        nlme_3.1-152           jsonlite_1.7.2         viridisLite_0.4.0      fansi_0.5.0            pillar_1.6.2          
[121] lattice_0.20-44        KEGGREST_1.32.0        fastmap_1.1.0          httr_1.4.2             survival_3.2-11        glue_1.4.2            
[127] png_0.1-7              bit_4.0.4              ggforce_0.3.3          stringi_1.7.3          blob_1.2.1             DESeq2_1.32.0         
[133] memoise_2.0.0          irlba_2.3.3            future.apply_1.7.0
limma SingleCell • 203 views
Entering edit mode
Last seen 8 hours ago
WEHI, Melbourne, Australia

It would make sense to use duplicateCorrelation to estimate the correlation between cell clusters from the same patient. That would certainly be better than treating the counts from the same patient as independent.

I suggest using the new function edgeR::voomLmFit, which will automate the estimation of voom weights and correlations, and will also be robust to having all zero counts for certain genes in certain cell types.

Entering edit mode

Awesome, that looks even better. Super handy when looking at cell types with very different patterns of gene expression I'm guessing. I hadn't thought about that. Thanks for the quick reply as well.


Login before adding your answer.

Traffic: 374 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6