Hello Bioconductor comunity I'm working on an RNA-seq project where we have the following experimental setup:
View(annotation_matrix)
Sample_name condition individual batch
guard_cell_1 guard_cell 1 smart-seq
guard_cell_2 guard_cell 2 smart-seq
guard_cell_3 guard_cell 3 smart-seq
meso_1 meso 1 smart-seq
meso_2 meso 2 smart-seq
meso_3 meso 3 smart-seq
leaves_1 leaves 1 smart-seq
leaves_2 leaves 2 smart-seq
totalptt_1 totalptt 1 smart-seq
totalptt_2 totalptt 2 smart-seq
totalptt_3 totalptt 3 smart-seq
P1F1Ext F1Ext 4 True-seq
P2F1Ext F1Ext 5 True-seq
P3F1Ext F1Ext 6 True-seq
P1F1Int F1Int 10 True-seq
P2F1Int F1Int 11 True-seq
P3F1Int F1Int 12 True-seq
P1F4Ext F4Ext 7 True-seq
P2F4Ext F4Ext 8 True-seq
P3F4Ext F4Ext 9 True-seq
P1F4Int F4Int 13 True-seq
P2F4Int F4Int 14 True-seq
P3F4Int F4Int 15 True-seq
My group is interested in the DEG (particularly the upregulated genes) in: a) guard_cells (compared with meso) b) F1Ext (compared with F1Int) c) F4Ext (compared with F4Int)
For this goal I made 3 DESeq objects, each using 6 samples (guard_cell + meso, F1Ext + F1Int, F4Ext + F4Int) and performed each comparison individually. In addition, I also filtered the gene expression matrix of F1 and F4 to only include genes expressed in guard_cells. Below is the code I used to analyze the first comparison (the second and third were created in a similar way, with the exception of the inclusion of "individual" in the experimental design)
featureCounts_matrix <- read.table(file = "_featureCounts/quant_featureCounts.tsv", row.names = 1, header =T)
levels <- c("meso","guard_cell")
meso_guard_expression_matrix <- featureCounts_matrix[,c(6:11)]
meso_guard_annotation <- annotation_matrix[c(1:6),]
dd <- DESeqDataSetFromMatrix(countData = meso_guard_expression_matrix,
colData = meso_guard_annotation,
design = ~ individual + condition)
colData(dd)$condition <- factor(colData(dd)$condition, levels = levels)
#filter 0 counts rows
dd <- dd[rowSums(counts(dd)) > 0, ]
#perform analysis
dds <- DESeq(dd)
#Get DGE
res <- results(dds, alpha = 0.05)
#shrink the log2fold change using apeglm package (default)
resLFC <- lfcShrink(dds, res = res, coef="condition_guard_vs_meso", type="apeglm")
#filter the results
padj.cutoff <- 0.05
lfc.cutoff <- 1
sig <- subset(data.frame(resLFC), padj < padj.cutoff & abs(log2FoldChange) > lfc.cutoff)
sig <- sig[order(sig$padj),]
Regarding my approach I have two questions:
1) Should I load all the data of my experiment in a single DESeq2 object and get my DEG using res(dds, contrast = (factor, numerator, denominator))
for each of the comparisons I want to make? And in that case, should I include batch in the design since I do not have samples from the same tissue in different batches (in addition I also performed a HC which suggested the batch effects do not play a large role).
2) Is it correct to perform log-fold change shrinkage and then filter the results based on a log-fold change thresholdd? The tutorials only mentioned plotting and ranking (the latter is also relevant to me).
Thanks in advance
sessionInfo( )
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] apeglm_1.10.0 RColorBrewer_1.1-2 EnhancedVolcano_1.6.0
[4] ggrepel_0.9.1 ggplot2_3.3.3 pheatmap_1.0.12
[7] DESeq2_1.28.1 SummarizedExperiment_1.18.2 DelayedArray_0.14.1
[10] matrixStats_0.56.0 Biobase_2.48.0 GenomicRanges_1.40.0
[13] GenomeInfoDb_1.24.2 IRanges_2.22.2 S4Vectors_0.26.1
[16] BiocGenerics_0.34.0
loaded via a namespace (and not attached):
[1] bdsmatrix_1.3-4 Rcpp_1.0.5 locfit_1.5-9.4 mvtnorm_1.1-1
[5] lattice_0.20-41 digest_0.6.25 utf8_1.1.4 plyr_1.8.6
[9] R6_2.5.0 emdbook_1.3.12 coda_0.19-4 RSQLite_2.2.0
[13] pillar_1.5.1 zlibbioc_1.34.0 rlang_0.4.10 rstudioapi_0.13
[17] annotate_1.66.0 blob_1.2.1 Matrix_1.2-18 bbmle_1.0.23.1
[21] labeling_0.4.2 splines_4.0.2 BiocParallel_1.22.0 geneplotter_1.66.0
[25] RCurl_1.98-1.2 bit_4.0.4 munsell_0.5.0 numDeriv_2016.8-1.1
[29] compiler_4.0.2 pkgconfig_2.0.3 tidyselect_1.1.0 tibble_3.0.3
[33] GenomeInfoDbData_1.2.3 XML_3.99-0.5 fansi_0.4.2 crayon_1.4.1
[37] dplyr_1.0.2 withr_2.4.1 MASS_7.3-51.6 bitops_1.0-6
[41] grid_4.0.2 xtable_1.8-4 gtable_0.3.0 lifecycle_1.0.0
[45] DBI_1.1.1 magrittr_2.0.1 scales_1.1.1 cachem_1.0.4
[49] farver_2.1.0 XVector_0.28.0 genefilter_1.70.0 ellipsis_0.3.1
[53] vctrs_0.3.4 generics_0.1.0 tools_4.0.2 bit64_4.0.5
[57] glue_1.4.2 purrr_0.3.4 fastmap_1.1.0 survival_3.2-3
[61] AnnotationDbi_1.50.3 colorspace_1.4-1 memoise_2.0.0
I've taken a look at Zhu et al. 2018, but my knowledge in Bayesian statistics is very narrow. If I use the second approach, the lfcThreshold doesn't need to be included as well (and then filter the results with s-value < 0.05)?
Notice that in the above command the res argument was removed.
That looks good to me: it will give you genes where you have good evidence of |LFC| > 1.