I'm trying to perform a simple differential expression analysis between two conditions across cell clusters, using pseudo-bulking of scRNA-seq data, here is a toy example:
# load libraries library(tidyverse) library(scater) library(scuttle) library(scran) # create sce set.seed(16) sce <- mockSCE(ncells = 1000) sce <- logNormCounts(sce) # annotate sce sce$sample <- rep(str_c("sample", 1:10), each = 100) sce$celltype <- str_c("celltype", unname(kmeans(t(logcounts(sce)), centers=3)$cluster)) sce$condition <- rep(c("healthy", "disease"), each = 500) # create pseudobulk se se <- aggregateAcrossCells(sce, ids = colData(sce)[, c("sample", "celltype")])
Assuming I'm using
aggregateAcrossCells() correctly, I would now like to compute, for each cell type, which genes are differentially expressed between disease and healthy pseudosamples.
For my own clarity, I would like to be able to explicitly set the design to
~ 0 + condition and then get the results for the contrast
conditiondisease - conditionhealthy.
How do I do this?
I tried a few things:
# differential expression analysis dea <- pseudoBulkDGE(se, label = se$cluster, condition = se$condition, design = ~ 0 + condition, coef = "conditionhealthy") dea <- pseudoBulkDGE(se, label = se$cluster, condition = se$condition, design = ~ 0 + condition, contrast = "conditiondisease - conditionhealthy")
but in both cases I get a
List of length 0 as a result, so I must be doing something wrong.
R version 4.0.2 (2020-06-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core) Matrix products: default BLAS/LAPACK: /usr/prog/OpenBLAS/0.2.20-GCC-6.4.0-2.28/lib/libopenblas_haswellp-r0.2.20.so locale:  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8  LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8  LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C  LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages:  parallel stats4 stats graphics grDevices utils datasets methods base other attached packages:  scran_1.17.14 scuttle_0.99.11 scater_1.17.4  SingleCellExperiment_1.10.1 SummarizedExperiment_1.18.2 DelayedArray_0.14.1  matrixStats_0.56.0 Biobase_2.48.0 GenomicRanges_1.40.0  GenomeInfoDb_1.24.2 IRanges_2.22.2 S4Vectors_0.26.1  BiocGenerics_0.34.0 forcats_0.5.0 stringr_1.4.0  dplyr_1.0.1 purrr_0.3.4 readr_1.3.1  tidyr_1.1.1 tibble_3.0.3 ggplot2_3.3.2  tidyverse_1.3.0 loaded via a namespace (and not attached):  bitops_1.0-6 fs_1.5.0 lubridate_1.7.9  httr_1.4.2 tools_4.0.2 backports_1.1.8  R6_2.4.1 irlba_2.3.3 vipor_0.4.5  DBI_1.1.0 colorspace_1.4-1 withr_2.2.0  tidyselect_1.1.0 gridExtra_2.3 compiler_4.0.2  cli_2.0.2 rvest_0.3.6 BiocNeighbors_1.6.0  xml2_1.3.2 scales_1.1.1 XVector_0.28.0  pkgconfig_2.0.3 dbplyr_1.4.4 limma_3.44.3  rlang_0.4.7 readxl_1.3.1 rstudioapi_0.11  DelayedMatrixStats_1.10.1 generics_0.0.2 jsonlite_1.7.0  BiocParallel_1.22.0 RCurl_1.98-1.2 magrittr_1.5  BiocSingular_1.4.0 GenomeInfoDbData_1.2.3 Matrix_1.2-18  Rcpp_188.8.131.52 ggbeeswarm_0.6.0 munsell_0.5.0  fansi_0.4.1 viridis_0.5.1 lifecycle_0.2.0  stringi_1.4.6 edgeR_3.30.3 zlibbioc_1.34.0  grid_4.0.2 blob_1.2.1 dqrng_0.2.1  crayon_1.3.4 lattice_0.20-41 haven_2.3.1  hms_0.5.3 locfit_1.5-9.4 pillar_1.4.6  igraph_1.2.5 reprex_0.3.0 glue_1.4.1  packrat_0.5.0 modelr_0.1.8 vctrs_0.3.2  cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1  rsvd_1.0.3 broom_0.7.0 viridisLite_0.3.0  beeswarm_0.2.3 statmod_1.4.34 bluster_0.99.1  ellipsis_0.3.1
Thank you Aaron!
I have opened an issue here on the
I have two follow-up questions:
One reason I dislike numeric vector contrasts is that I'm never sure what is contrasted with what (and therefore the directionality of the resulting fold changes), especially in cases like this where I can't see the design matrix. How did you know that you had to set it up as
c(1, -1)and not
c(-1, 1)to model
conditionhealthy? I'm guessing that depends on the ordering of the
se$conditionfactor? Assuming that's in alphabetical order (which may not always be the case), then
diseaseis indeed the first level of the factor. Am I missing something?
In my real-world example, all columns of all resulting
NAs. Do you have any ideas on what might be causing this?
1. In usual analyses, you would be able to see the design matrix so it is usually not too hard to manually construct an appropriate contrast matrix. In fact, if the columns of your design matrix do not have syntactically valid names, manual construction is really your only option because
makeContrasts()won't be able to parse your expression properly.
In this particular case, because the design matrix is constructed inside the
pseudoBulkDGE()function, you have to guess as to what the columns will be. Supplying a string to
contrast=takes the guesswork out of the order of the columns. However, you will still be susceptible to the absence of levels (e.g., because a cell type isn't present in a particular condition), which can cause coefficients to drop out of the design matrix and for your
contrast=to not make sense anymore.
2. This happens when your comparison cannot be performed, e.g., because the design matrix is not of full rank or when your comparison doesn't make sense for a cell type that only occurs in one condition. The BioC-devel version of the function will not bother to report
DataFrames for such failures, instead registering the failed runs in the
metadata()of the output