Hello,
I'm trying to perform a simple differential expression analysis between two conditions across cell clusters, using pseudo-bulking of scRNA-seq data, here is a toy example:
# load libraries
library(tidyverse)
library(scater)
library(scuttle)
library(scran)
# create sce
set.seed(16)
sce <- mockSCE(ncells = 1000)
sce <- logNormCounts(sce)
# annotate sce
sce$sample <- rep(str_c("sample", 1:10), each = 100)
sce$celltype <- str_c("celltype", unname(kmeans(t(logcounts(sce)), centers=3)$cluster))
sce$condition <- rep(c("healthy", "disease"), each = 500)
# create pseudobulk se
se <- aggregateAcrossCells(sce, ids = colData(sce)[, c("sample", "celltype")])
Assuming I'm using aggregateAcrossCells()
correctly, I would now like to compute, for each cell type, which genes are differentially expressed between disease and healthy pseudosamples.
For my own clarity, I would like to be able to explicitly set the design to ~ 0 + condition
and then get the results for the contrast conditiondisease - conditionhealthy
.
How do I do this?
I tried a few things:
# differential expression analysis
dea <- pseudoBulkDGE(se, label = se$cluster, condition = se$condition, design = ~ 0 + condition, coef = "conditionhealthy")
dea <- pseudoBulkDGE(se, label = se$cluster, condition = se$condition, design = ~ 0 + condition, contrast = "conditiondisease - conditionhealthy")
but in both cases I get a List of length 0
as a result, so I must be doing something wrong.
Thank you!
My sessionInfo()
output:
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /usr/prog/OpenBLAS/0.2.20-GCC-6.4.0-2.28/lib/libopenblas_haswellp-r0.2.20.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] scran_1.17.14 scuttle_0.99.11 scater_1.17.4
[4] SingleCellExperiment_1.10.1 SummarizedExperiment_1.18.2 DelayedArray_0.14.1
[7] matrixStats_0.56.0 Biobase_2.48.0 GenomicRanges_1.40.0
[10] GenomeInfoDb_1.24.2 IRanges_2.22.2 S4Vectors_0.26.1
[13] BiocGenerics_0.34.0 forcats_0.5.0 stringr_1.4.0
[16] dplyr_1.0.1 purrr_0.3.4 readr_1.3.1
[19] tidyr_1.1.1 tibble_3.0.3 ggplot2_3.3.2
[22] tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] bitops_1.0-6 fs_1.5.0 lubridate_1.7.9
[4] httr_1.4.2 tools_4.0.2 backports_1.1.8
[7] R6_2.4.1 irlba_2.3.3 vipor_0.4.5
[10] DBI_1.1.0 colorspace_1.4-1 withr_2.2.0
[13] tidyselect_1.1.0 gridExtra_2.3 compiler_4.0.2
[16] cli_2.0.2 rvest_0.3.6 BiocNeighbors_1.6.0
[19] xml2_1.3.2 scales_1.1.1 XVector_0.28.0
[22] pkgconfig_2.0.3 dbplyr_1.4.4 limma_3.44.3
[25] rlang_0.4.7 readxl_1.3.1 rstudioapi_0.11
[28] DelayedMatrixStats_1.10.1 generics_0.0.2 jsonlite_1.7.0
[31] BiocParallel_1.22.0 RCurl_1.98-1.2 magrittr_1.5
[34] BiocSingular_1.4.0 GenomeInfoDbData_1.2.3 Matrix_1.2-18
[37] Rcpp_1.0.4.6 ggbeeswarm_0.6.0 munsell_0.5.0
[40] fansi_0.4.1 viridis_0.5.1 lifecycle_0.2.0
[43] stringi_1.4.6 edgeR_3.30.3 zlibbioc_1.34.0
[46] grid_4.0.2 blob_1.2.1 dqrng_0.2.1
[49] crayon_1.3.4 lattice_0.20-41 haven_2.3.1
[52] hms_0.5.3 locfit_1.5-9.4 pillar_1.4.6
[55] igraph_1.2.5 reprex_0.3.0 glue_1.4.1
[58] packrat_0.5.0 modelr_0.1.8 vctrs_0.3.2
[61] cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1
[64] rsvd_1.0.3 broom_0.7.0 viridisLite_0.3.0
[67] beeswarm_0.2.3 statmod_1.4.34 bluster_0.99.1
[70] ellipsis_0.3.1
Thank you Aaron!
I have opened an issue here on the
coef
andcontrast
arguments ofpseudobulkDGE()
.I have two follow-up questions:
One reason I dislike numeric vector contrasts is that I'm never sure what is contrasted with what (and therefore the directionality of the resulting fold changes), especially in cases like this where I can't see the design matrix. How did you know that you had to set it up as
c(1, -1)
and notc(-1, 1)
to modelconditiondisease
overconditionhealthy
? I'm guessing that depends on the ordering of these$condition
factor? Assuming that's in alphabetical order (which may not always be the case), thendisease
is indeed the first level of the factor. Am I missing something?In my real-world example, all columns of all resulting
DataFrame
s containNA
s. Do you have any ideas on what might be causing this?i.e.:
1. In usual analyses, you would be able to see the design matrix so it is usually not too hard to manually construct an appropriate contrast matrix. In fact, if the columns of your design matrix do not have syntactically valid names, manual construction is really your only option because
makeContrasts()
won't be able to parse your expression properly.In this particular case, because the design matrix is constructed inside the
pseudoBulkDGE()
function, you have to guess as to what the columns will be. Supplying a string tocontrast=
takes the guesswork out of the order of the columns. However, you will still be susceptible to the absence of levels (e.g., because a cell type isn't present in a particular condition), which can cause coefficients to drop out of the design matrix and for yourcoef=
orcontrast=
to not make sense anymore.2. This happens when your comparison cannot be performed, e.g., because the design matrix is not of full rank or when your comparison doesn't make sense for a cell type that only occurs in one condition. The BioC-devel version of the function will not bother to report
DataFrame
s for such failures, instead registering the failed runs in themetadata()
of the outputList
.