Question

DESeq2 Shrinkage methods for scRNA-Seq

1

Entering edit mode

Pedro Tan ▴ 10

@2746a461

Last seen 2.8 years ago

Netherlands

Hi everyone,

I started using log fold change shrinkage from the DESeq2 package in single-cell RNA-seq datasets, and I was comparing the "normal" and "apeglm" methods.

I noticed that for many genes, the normal LFC shrinkage switched the sign of the estimated LFC (4524/19243 genes), which doesn't occur for apeglm (4/19243). This was surprising, as I didn't expect a shrinkage method to turn any positive LFC into (small) negative values. Is this expected behavior from the normal shrinkage?

Comparison of Normal and not shrunk LFC. Note that positive non shrunk genes cross zero

Upon inspection of some of these genes, very few single cells had non-zero counts, so it makes sense these should be shrunk. However, apeglm maintained a high LFC for genes with almost no counts. I included a couple of examples:

Comparison of normal and apeglm shrinkage for a few genes Frequency and cumulative frequency of counts in groups used for DE

Now I'm wondering if these genes should be used at all if only a few cells have one or two counts and the vast majority (>90%) has none. I thought shrinkage would always handle these cases. The normal method apparently does, shrinking most genes to 0 (or slightly negative values?), but apeglm method seems too conservative. Should it only be used after a filter on the minimum number of counts?

Sorry if I'm missing something. Any insights?

Thanks in advance,

Pedro Tan

scRNA_cn <- readRDS(file = "output_sc/Bakhoum_scRNA.rds")
sc_counts <- Seurat::GetAssayData(scRNA_cn, assay = "RNA", slot = "counts")
df.diff.design <- scRNA_cn@meta.data %>% select(CIN_status) %>% rownames_to_column("Sample") %>%
  mutate(Sample = as.factor(Sample),
    CIN_status = as.factor(CIN_status))

deseq.dataset <- DESeqDataSetFromMatrix(
  countData=as.matrix(sc_counts),
  colData=df.diff.design,
  design=~CIN_status
) %>% estimateSizeFactors()

ddsWald <- DESeq(deseq.dataset, test = "Wald", useT = TRUE, minmu = 1e-6, minReplicatesForReplace=Inf)

ddsWald_shrunk <- lfcShrink(ddsWald, coef = "CIN_status_CIN.high_vs_CIN.low", type = "normal")

ddsWald_apeglm <- lfcShrink(ddsWald, coef = "CIN_status_CIN.high_vs_CIN.low", type = "apeglm")

sessionInfo( )
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] biomaRt_2.46.3              AneuFinder_1.18.0           AneuFinderData_1.18.0      
 [4] cowplot_1.1.1               infercnv_1.6.0              DESeq2_1.30.1              
 [7] SummarizedExperiment_1.20.0 GenomicRanges_1.42.0        GenomeInfoDb_1.26.7        
[10] glmGamPoi_1.2.0             gplots_3.1.1                pheatmap_1.0.12            
[13] plotly_4.9.3                fgsea_1.16.0                proxy_0.4-26               
[16] forcats_0.5.1               stringr_1.4.0               dplyr_1.0.6                
[19] purrr_0.3.4                 readr_1.4.0                 tidyr_1.1.3                
[22] tibble_3.1.1                tidyverse_1.3.1             org.Hs.eg.db_3.12.0        
[25] AnnotationDbi_1.52.0        IRanges_2.24.1              S4Vectors_0.28.1           
[28] Biobase_2.50.0              BiocGenerics_0.36.1         msigdbr_7.4.1              
[31] sparseMatrixStats_1.2.1     MatrixGenerics_1.2.1        matrixStats_0.58.0         
[34] ggrepel_0.9.1               ggplot2_3.3.3               patchwork_1.1.1            
[37] SeuratObject_4.0.1          Seurat_4.0.1

DESeq2 • 1.3k views

ADD COMMENT • link 2.8 years ago Pedro Tan ▴ 10

score 1 · Answer 1 · 2021-07-27

This was surprising, as I didn't expect a shrinkage method to turn any positive LFC into (small) negative values. Is this expected behavior from the normal shrinkage?

This is not surprising to me. There is no "true" LFC (e.g. you can get change in sign if you compare log2 of ratio of means, or of medians, of two groups of counts), and modeling choices then make a big difference in the sign. We don't in general recommend "normal" anymore (since the introduction of apeglm 4 years ago), as I don't think the prior is a good fit to the distribution of effect sizes in transcriptomics that we often see: some very large effects that are incompatible with normal prior.

Now I'm wondering if these genes should be used at all if only a few cells have one or two counts and the vast majority (>90%) has none.

From my experience with single cell, what you describe is common for many genes. And the high LFC for TNS4 seems justified.

I would recommend you use an alternative size factor though than the default, see the last bullet here:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#recommendations-for-single-cell-analysis

So before running DESeq() you can use an alternative method for estimating size factors, sf, then set:

sizeFactors(dds) <- sf