DESeq2 Shrinkage methods for scRNA-Seq
1
1
Entering edit mode
Pedro Tan ▴ 10
@2746a461
Last seen 2.6 years ago
Netherlands

Hi everyone,

I started using log fold change shrinkage from the DESeq2 package in single-cell RNA-seq datasets, and I was comparing the "normal" and "apeglm" methods.

I noticed that for many genes, the normal LFC shrinkage switched the sign of the estimated LFC (4524/19243 genes), which doesn't occur for apeglm (4/19243). This was surprising, as I didn't expect a shrinkage method to turn any positive LFC into (small) negative values. Is this expected behavior from the normal shrinkage?

Comparison of Normal and not shrunk LFC. Note that positive non shrunk genes cross zero

Upon inspection of some of these genes, very few single cells had non-zero counts, so it makes sense these should be shrunk. However, apeglm maintained a high LFC for genes with almost no counts. I included a couple of examples:

Comparison of normal and apeglm shrinkage for a few genes Frequency and cumulative frequency of counts in groups used for DE

Now I'm wondering if these genes should be used at all if only a few cells have one or two counts and the vast majority (>90%) has none. I thought shrinkage would always handle these cases. The normal method apparently does, shrinking most genes to 0 (or slightly negative values?), but apeglm method seems too conservative. Should it only be used after a filter on the minimum number of counts?

Sorry if I'm missing something. Any insights?

Thanks in advance,

Pedro Tan

scRNA_cn <- readRDS(file = "output_sc/Bakhoum_scRNA.rds")
sc_counts <- Seurat::GetAssayData(scRNA_cn, assay = "RNA", slot = "counts")
df.diff.design <- scRNA_cn@meta.data %>% select(CIN_status) %>% rownames_to_column("Sample") %>%
  mutate(Sample = as.factor(Sample),
    CIN_status = as.factor(CIN_status))

deseq.dataset <- DESeqDataSetFromMatrix(
  countData=as.matrix(sc_counts),
  colData=df.diff.design,
  design=~CIN_status
) %>% estimateSizeFactors()

ddsWald <- DESeq(deseq.dataset, test = "Wald", useT = TRUE, minmu = 1e-6, minReplicatesForReplace=Inf)

ddsWald_shrunk <- lfcShrink(ddsWald, coef = "CIN_status_CIN.high_vs_CIN.low", type = "normal")

ddsWald_apeglm <- lfcShrink(ddsWald, coef = "CIN_status_CIN.high_vs_CIN.low", type = "apeglm")

sessionInfo( )
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] biomaRt_2.46.3              AneuFinder_1.18.0           AneuFinderData_1.18.0      
 [4] cowplot_1.1.1               infercnv_1.6.0              DESeq2_1.30.1              
 [7] SummarizedExperiment_1.20.0 GenomicRanges_1.42.0        GenomeInfoDb_1.26.7        
[10] glmGamPoi_1.2.0             gplots_3.1.1                pheatmap_1.0.12            
[13] plotly_4.9.3                fgsea_1.16.0                proxy_0.4-26               
[16] forcats_0.5.1               stringr_1.4.0               dplyr_1.0.6                
[19] purrr_0.3.4                 readr_1.4.0                 tidyr_1.1.3                
[22] tibble_3.1.1                tidyverse_1.3.1             org.Hs.eg.db_3.12.0        
[25] AnnotationDbi_1.52.0        IRanges_2.24.1              S4Vectors_0.28.1           
[28] Biobase_2.50.0              BiocGenerics_0.36.1         msigdbr_7.4.1              
[31] sparseMatrixStats_1.2.1     MatrixGenerics_1.2.1        matrixStats_0.58.0         
[34] ggrepel_0.9.1               ggplot2_3.3.3               patchwork_1.1.1            
[37] SeuratObject_4.0.1          Seurat_4.0.1
DESeq2 • 1.2k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 14 hours ago
United States

This was surprising, as I didn't expect a shrinkage method to turn any positive LFC into (small) negative values. Is this expected behavior from the normal shrinkage?

This is not surprising to me. There is no "true" LFC (e.g. you can get change in sign if you compare log2 of ratio of means, or of medians, of two groups of counts), and modeling choices then make a big difference in the sign. We don't in general recommend "normal" anymore (since the introduction of apeglm 4 years ago), as I don't think the prior is a good fit to the distribution of effect sizes in transcriptomics that we often see: some very large effects that are incompatible with normal prior.

Now I'm wondering if these genes should be used at all if only a few cells have one or two counts and the vast majority (>90%) has none.

From my experience with single cell, what you describe is common for many genes. And the high LFC for TNS4 seems justified.

I would recommend you use an alternative size factor though than the default, see the last bullet here:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#recommendations-for-single-cell-analysis

So before running DESeq() you can use an alternative method for estimating size factors, sf, then set:

sizeFactors(dds) <- sf
ADD COMMENT
0
Entering edit mode

Thanks Michael! I will stick to apeglm and incorporate alternate size factors for single cell data.

All the best, Pedro

ADD REPLY

Login before adding your answer.

Traffic: 806 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6