Question

scRNA-seq pseudo-bulk differential expression analysis with pseudoBulkDGE()

enricoferrero ▴ 680

@enricoferrero-6037

Last seen 4.2 years ago

Switzerland

Hello,

I'm trying to perform a simple differential expression analysis between two conditions within each cell cluster, using pseudo-bulking of scRNA-seq data. Here is a toy example:

# load libraries
library(tidyverse)
library(scater)
library(scuttle)
library(scran)

# create sce
set.seed(16)
sce <- mockSCE(ncells = 1000)
sce <- logNormCounts(sce)

# annotate sce
sce$sample <- rep(str_c("sample", 1:10), each = 100)
sce$celltype <- str_c("celltype", unname(kmeans(t(logcounts(sce)), centers=3)$cluster))
sce$condition <- rep(c("healthy", "disease"), each = 500)

# create pseudobulk se
se <- aggregateAcrossCells(sce, ids = colData(sce)[, c("sample", "celltype")])

Assuming I'm using aggregateAcrossCells() correctly, I would now like to compute, for each cell type, which genes are differentially expressed between disease and healthy pseudosamples. For my own clarity, I would like to be able to explicitly set the design to ~ 0 + condition and then get the results for the contrast conditiondisease - conditionhealthy.

How do I do this?

I tried a few things:

# differential expression analysis
dea <- pseudoBulkDGE(se, label = se$cluster, condition = se$condition, design = ~ 0 + condition, coef = "conditionhealthy")
dea <- pseudoBulkDGE(se, label = se$cluster, condition = se$condition, design = ~ 0 + condition, contrast = "conditiondisease - conditionhealthy")

but in both cases I get a List of length 0 as a result, so I must be doing something wrong.

Thank you!

My sessionInfo() output:

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/prog/OpenBLAS/0.2.20-GCC-6.4.0-2.28/lib/libopenblas_haswellp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] scran_1.17.14               scuttle_0.99.11             scater_1.17.4              
 [4] SingleCellExperiment_1.10.1 SummarizedExperiment_1.18.2 DelayedArray_0.14.1        
 [7] matrixStats_0.56.0          Biobase_2.48.0              GenomicRanges_1.40.0       
[10] GenomeInfoDb_1.24.2         IRanges_2.22.2              S4Vectors_0.26.1           
[13] BiocGenerics_0.34.0         forcats_0.5.0               stringr_1.4.0              
[16] dplyr_1.0.1                 purrr_0.3.4                 readr_1.3.1                
[19] tidyr_1.1.1                 tibble_3.0.3                ggplot2_3.3.2              
[22] tidyverse_1.3.0            

loaded via a namespace (and not attached):
 [1] bitops_1.0-6              fs_1.5.0                  lubridate_1.7.9          
 [4] httr_1.4.2                tools_4.0.2               backports_1.1.8          
 [7] R6_2.4.1                  irlba_2.3.3               vipor_0.4.5              
[10] DBI_1.1.0                 colorspace_1.4-1          withr_2.2.0              
[13] tidyselect_1.1.0          gridExtra_2.3             compiler_4.0.2           
[16] cli_2.0.2                 rvest_0.3.6               BiocNeighbors_1.6.0      
[19] xml2_1.3.2                scales_1.1.1              XVector_0.28.0           
[22] pkgconfig_2.0.3           dbplyr_1.4.4              limma_3.44.3             
[25] rlang_0.4.7               readxl_1.3.1              rstudioapi_0.11          
[28] DelayedMatrixStats_1.10.1 generics_0.0.2            jsonlite_1.7.0           
[31] BiocParallel_1.22.0       RCurl_1.98-1.2            magrittr_1.5             
[34] BiocSingular_1.4.0        GenomeInfoDbData_1.2.3    Matrix_1.2-18            
[37] Rcpp_1.0.4.6              ggbeeswarm_0.6.0          munsell_0.5.0            
[40] fansi_0.4.1               viridis_0.5.1             lifecycle_0.2.0          
[43] stringi_1.4.6             edgeR_3.30.3              zlibbioc_1.34.0          
[46] grid_4.0.2                blob_1.2.1                dqrng_0.2.1              
[49] crayon_1.3.4              lattice_0.20-41           haven_2.3.1              
[52] hms_0.5.3                 locfit_1.5-9.4            pillar_1.4.6             
[55] igraph_1.2.5              reprex_0.3.0              glue_1.4.1               
[58] packrat_0.5.0             modelr_0.1.8              vctrs_0.3.2              
[61] cellranger_1.1.0          gtable_0.3.0              assertthat_0.2.1         
[64] rsvd_1.0.3                broom_0.7.0               viridisLite_0.3.0        
[67] beeswarm_0.2.3            statmod_1.4.34            bluster_0.99.1           
[70] ellipsis_0.3.1

scran scater scuttle • 9.5k views

updated 5.5 years ago by Aaron Lun ★ 29k • written 5.5 years ago by enricoferrero ▴ 680

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 12 weeks ago

United States

I think your pseudoBulkDGE() step is going one step too far.

Take a look at the examples in the OSCA book.

Assuming this step worked:

se <- aggregateAcrossCells(sce, ids = colData(sce)[, c("sample", "celltype")])

You would then create a DGEList from that and go wild, i.e., something like:

y.all <- DGEList(counts(se), samples = colData(se))
# ... then do the standard edgeR DGE moves from here on out ...
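As a rough sketch of what those "standard edgeR moves" might look like when repeated per cell type: the counts and sample table below are simulated stand-ins for counts(se) and colData(se), and every object name is hypothetical. It assumes edgeR's quasi-likelihood pipeline and the ~ 0 + condition design from the question; it is not the only reasonable workflow.

```r
library(edgeR)

set.seed(1)
# Simulated stand-in for the pseudo-bulk data (assumption, not the real se):
# 200 genes x 12 columns, one column per (sample, celltype) combination.
samples <- data.frame(
  sample    = rep(paste0("sample", 1:6), times = 2),
  celltype  = rep(c("celltype1", "celltype2"), each = 6),
  condition = rep(rep(c("healthy", "disease"), each = 3), times = 2)
)
counts <- matrix(
  rnbinom(200 * 12, mu = 50, size = 5), nrow = 200,
  dimnames = list(paste0("gene", 1:200),
                  paste(samples$sample, samples$celltype, sep = "_"))
)

# Run one edgeR analysis per cell type.
results <- lapply(split(seq_len(nrow(samples)), samples$celltype), function(idx) {
  y <- DGEList(counts[, idx], samples = samples[idx, ])
  design <- model.matrix(~ 0 + condition, data = y$samples)

  # Drop lowly expressed genes, then normalize and estimate dispersions.
  keep <- filterByExpr(y, design = design)
  y <- y[keep, , keep.lib.sizes = FALSE]
  y <- calcNormFactors(y)
  y <- estimateDisp(y, design)

  # Columns are conditiondisease, conditionhealthy, so c(1, -1) tests
  # disease - healthy.
  fit <- glmQLFit(y, design)
  res <- glmQLFTest(fit, contrast = c(1, -1))
  topTags(res, n = Inf)$table
})
```

Each element of results is then a per-gene table for one cell type, e.g. head(results$celltype1).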

5.5 years ago Steve Lianoglou ★ 13k

score 3 · Accepted Answer · 2020-08-06

dea <- pseudoBulkDGE(se, label = se$cluster, condition = se$condition, design = ~ 0 + condition, coef = "conditionhealthy")

For starters, se$cluster is NULL. I assume you meant se$celltype, which yields the expected List. Also, your specified comparison doesn't make any sense; in a no-intercept model, coef="conditionhealthy" represents the null hypothesis that the healthy condition has an average log-expression of zero, rather than the log-fold change being zero.

dea <- pseudoBulkDGE(se, label = se$cluster, condition = se$condition, design = ~ 0 + condition, contrast = "conditiondisease - conditionhealthy")

The contrast is better this time, but the function isn't smart enough to run contrast through makeContrasts(). You need to give it some hints, so the final form of your function call should look like:

dea <- pseudoBulkDGE(se, label = se$celltype, condition = se$condition, design = ~ 0 + condition, coef=NULL, contrast = c(1, -1))

I guess I should add an option to pass in a string as well, in which case makeContrasts() is automatically called. I should also fix it so that you don't have to set coef=NULL here when you're specifying the contrast already. If someone can add an issue on the GitHub page to remind me, that would help.
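To see why c(1, -1) corresponds to conditiondisease - conditionhealthy, it may help to look at the design matrix directly. The sketch below uses base R only, with a toy condition vector standing in for se$condition; building the contrast by column name is just one way to avoid depending on column order.

```r
# Toy condition labels (assumed stand-in for se$condition).
condition <- rep(c("healthy", "disease"), each = 3)

# No-intercept design: one column per condition. Character input is
# converted to a factor with alphabetically sorted levels, so the
# "disease" column comes before the "healthy" column.
design <- model.matrix(~ 0 + condition)
colnames(design)

# Build the numeric contrast by name rather than by position.
contrast <- setNames(numeric(ncol(design)), colnames(design))
contrast["conditiondisease"] <- 1
contrast["conditionhealthy"] <- -1
contrast
```

The resulting vector can be passed as the contrast argument in place of a hard-coded c(1, -1).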

I think your pseudoBulkDGE() step is going one step too far.

Tell me about it.

I was tired of manually setting up a for loop to repeat the process for multiple cell types, hence this function. Note that the intermediate DGEList and DGEGLM objects are packed into the metadata() of each DataFrame, so you can always pull them out and do stuff quickly on those objects. Of course, for anything more complex that doesn't follow the standard edgeR/voom pipeline, you'll have to write everything yourself.

Note that the muscat package has a pbDS function that does something very similar, but it expects a slightly different input. Specifically, it expects a SummarizedExperiment where each column corresponds to a sample-of-origin and each assay corresponds to a cell type, whereas pseudoBulkDGE expects a SummarizedExperiment with a single count matrix containing columns for all combinations of cell types and samples. Sometimes you happen to have the former, sometimes you get the latter, so you can choose how to proceed based on what you've got.
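The difference between the two layouts can be illustrated in base R. This is only a sketch with made-up counts and hypothetical object names; it does not call muscat or scran, and simply reshapes a single matrix with one column per (sample, cell type) combination into one matrix per cell type with one column per sample.

```r
set.seed(1)
# Layout described for pseudoBulkDGE(): a single count matrix with one
# column per (sample, celltype) combination. All values are simulated.
coldata <- expand.grid(sample = paste0("sample", 1:4),
                       celltype = paste0("celltype", 1:2),
                       stringsAsFactors = FALSE)
counts <- matrix(
  rpois(5 * nrow(coldata), lambda = 20), nrow = 5,
  dimnames = list(paste0("gene", 1:5),
                  paste(coldata$sample, coldata$celltype, sep = "_"))
)

# Layout described for pbDS(): one matrix ("assay") per cell type,
# each with one column per sample of origin.
by_celltype <- lapply(split(seq_len(nrow(coldata)), coldata$celltype), function(idx) {
  m <- counts[, idx, drop = FALSE]
  colnames(m) <- coldata$sample[idx]
  m
})

names(by_celltype)
dim(by_celltype$celltype1)
```

Note that this simple split assumes every sample contributes cells to every cell type; missing combinations would need to be padded before the per-cell-type matrices could share the same columns.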