Questions regarding DESeq2 experimental setup and lfcShrink
Entering edit mode
zroger499 • 0
Last seen 15 months ago

Hello Bioconductor comunity I'm working on an RNA-seq project where we have the following experimental setup:


Sample_name condition   individual  batch
guard_cell_1    guard_cell  1   smart-seq
guard_cell_2    guard_cell  2   smart-seq
guard_cell_3    guard_cell  3   smart-seq
meso_1  meso    1   smart-seq
meso_2  meso    2   smart-seq
meso_3  meso    3   smart-seq
leaves_1    leaves  1   smart-seq
leaves_2    leaves  2   smart-seq
totalptt_1  totalptt    1   smart-seq
totalptt_2  totalptt    2   smart-seq
totalptt_3  totalptt    3   smart-seq
P1F1Ext F1Ext   4   True-seq
P2F1Ext F1Ext   5   True-seq
P3F1Ext F1Ext   6   True-seq
P1F1Int F1Int   10  True-seq
P2F1Int F1Int   11  True-seq
P3F1Int F1Int   12  True-seq
P1F4Ext F4Ext   7   True-seq
P2F4Ext F4Ext   8   True-seq
P3F4Ext F4Ext   9   True-seq
P1F4Int F4Int   13  True-seq
P2F4Int F4Int   14  True-seq
P3F4Int F4Int   15  True-seq

My group is interested in the DEG (particularly the upregulated genes) in: a) guard_cells (compared with meso) b) F1Ext (compared with F1Int) c) F4Ext (compared with F4Int)

For this goal I made 3 DESeq objects, each using 6 samples (guard_cell + meso, F1Ext + F1Int, F4Ext + F4Int) and performed each comparison individually. In addition, I also filtered the gene expression matrix of F1 and F4 to only include genes expressed in guard_cells. Below is the code I used to analyze the first comparison (the second and third were created in a similar way, with the exception of the inclusion of "individual" in the experimental design)

featureCounts_matrix <- read.table(file = "_featureCounts/quant_featureCounts.tsv", row.names = 1, header =T)
levels <- c("meso","guard_cell")
meso_guard_expression_matrix <- featureCounts_matrix[,c(6:11)]
meso_guard_annotation <- annotation_matrix[c(1:6),]
dd <- DESeqDataSetFromMatrix(countData = meso_guard_expression_matrix, 
                             colData = meso_guard_annotation, 
                             design = ~ individual + condition)

colData(dd)$condition <- factor(colData(dd)$condition, levels = levels)

#filter 0 counts rows
dd <- dd[rowSums(counts(dd)) > 0, ]

#perform analysis
dds <- DESeq(dd)

#Get DGE
res <- results(dds, alpha = 0.05)
#shrink the log2fold change using apeglm package (default)
resLFC <- lfcShrink(dds, res = res, coef="condition_guard_vs_meso", type="apeglm")

#filter the results 
padj.cutoff <- 0.05
lfc.cutoff <- 1 

sig <- subset(data.frame(resLFC), padj < padj.cutoff & abs(log2FoldChange) > lfc.cutoff)
sig <- sig[order(sig$padj),]

Regarding my approach I have two questions:

1) Should I load all the data of my experiment in a single DESeq2 object and get my DEG using res(dds, contrast = (factor, numerator, denominator)) for each of the comparisons I want to make? And in that case, should I include batch in the design since I do not have samples from the same tissue in different batches (in addition I also performed a HC which suggested the batch effects do not play a large role).

2) Is it correct to perform log-fold change shrinkage and then filter the results based on a log-fold change thresholdd? The tutorials only mentioned plotting and ranking (the latter is also relevant to me).

Thanks in advance

sessionInfo( )

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] apeglm_1.10.0               RColorBrewer_1.1-2          EnhancedVolcano_1.6.0      
 [4] ggrepel_0.9.1               ggplot2_3.3.3               pheatmap_1.0.12            
 [7] DESeq2_1.28.1               SummarizedExperiment_1.18.2 DelayedArray_0.14.1        
[10] matrixStats_0.56.0          Biobase_2.48.0              GenomicRanges_1.40.0       
[13] GenomeInfoDb_1.24.2         IRanges_2.22.2              S4Vectors_0.26.1           
[16] BiocGenerics_0.34.0        

loaded via a namespace (and not attached):
 [1] bdsmatrix_1.3-4        Rcpp_1.0.5             locfit_1.5-9.4         mvtnorm_1.1-1         
 [5] lattice_0.20-41        digest_0.6.25          utf8_1.1.4             plyr_1.8.6            
 [9] R6_2.5.0               emdbook_1.3.12         coda_0.19-4            RSQLite_2.2.0         
[13] pillar_1.5.1           zlibbioc_1.34.0        rlang_0.4.10           rstudioapi_0.13       
[17] annotate_1.66.0        blob_1.2.1             Matrix_1.2-18          bbmle_1.0.23.1        
[21] labeling_0.4.2         splines_4.0.2          BiocParallel_1.22.0    geneplotter_1.66.0    
[25] RCurl_1.98-1.2         bit_4.0.4              munsell_0.5.0          numDeriv_2016.8-1.1   
[29] compiler_4.0.2         pkgconfig_2.0.3        tidyselect_1.1.0       tibble_3.0.3          
[33] GenomeInfoDbData_1.2.3 XML_3.99-0.5           fansi_0.4.2            crayon_1.4.1          
[37] dplyr_1.0.2            withr_2.4.1            MASS_7.3-51.6          bitops_1.0-6          
[41] grid_4.0.2             xtable_1.8-4           gtable_0.3.0           lifecycle_1.0.0       
[45] DBI_1.1.1              magrittr_2.0.1         scales_1.1.1           cachem_1.0.4          
[49] farver_2.1.0           XVector_0.28.0         genefilter_1.70.0      ellipsis_0.3.1        
[53] vctrs_0.3.4            generics_0.1.0         tools_4.0.2            bit64_4.0.5           
[57] glue_1.4.2             purrr_0.3.4            fastmap_1.1.0          survival_3.2-3        
[61] AnnotationDbi_1.50.3   colorspace_1.4-1       memoise_2.0.0
DESeq2 RNA-seq • 647 views
Entering edit mode
Last seen 3 days ago
United States

Q1 is a FAQ in the vignette. Choice of design I have to leave up to the user, it's worth consulting with a statistician on this.

Q2 we have two ways to introduce an lfcThreshold: in results() it is incorporated into the null hypothesis testing. In lfcShrink it will generate s-values corresponding to aggregate probabilities derived from the posterior distribution per gene (see Zhu et al reference for details).

Entering edit mode

I've taken a look at Zhu et al. 2018, but my knowledge in Bayesian statistics is very narrow. If I use the second approach, the lfcThreshold doesn't need to be included as well (and then filter the results with s-value < 0.05)?

resLFC <- lfcShrink(dds, coef="condition_guard_vs_meso", type="apeglm", svalue = T, lfcThreshold = 1)

Notice that in the above command the res argument was removed.

Entering edit mode

That looks good to me: it will give you genes where you have good evidence of |LFC| > 1.


Login before adding your answer.

Traffic: 370 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6