Hello,
I am conducting a differential expression analysis of RNA-seq data where I have samples from three treatment doses (control, low, high) at three developmental timepoints (W6, W8, GS45). Note these are not paired samples (i.e. different individuals sampled at each timepoints). I am interested in differentially expressed genes caused by the treatments at each developmental timepoint.
After reading the vignette, particularly the section on interactions (https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions), I decided to add a third factor which is a combination of treatment and timepoint, i.e.:
sample_metadata$group <- factor(paste0(sample_metadata$timepoint, "_", sample_metadata$dose))
Then I can extract the results for each comparison by supplying the appropriate contrast to the results function.
If I don't explicitly set a reference level, DESeq2 sets it to "GS45_control," as expected alphabetically. My understanding from the vignette and from other posts is that it doesn't matter what the reference level is set to if you are explicitly supplying a contrast to the results function. For example, if I want to find differentially expressed genes in "W6_low" compared to "W6_control", I would run:
w6_low <- results(dds, alpha = alpha, contrast = c("group", "W6_low", "W6_control"))
However, I noticed that I get a different number of significant DEGs if I do relevel dds$group first. I've produced an example below:
library(DESeq2)
counts_file = "input/gene_count_matrix.csv"
metadata_file = "input/sample_data.txt"
# Read in counts and metadata
count_data <- as.matrix(read.csv(counts_file,
row.names = "gene_id"))
sample_metadata <- read.csv(metadata_file,
sep = "\t",
row.names = 1,
stringsAsFactors = TRUE)
sample_metadata$group <- factor(paste0(sample_metadata$timepoint, "_", sample_metadata$dose))
count_data <- count_data[, rownames(sample_metadata)]
# Verify structure of data
stopifnot(all(rownames(sample_metadata) %in% colnames(count_data)))
stopifnot(all(rownames(sample_metadata) == colnames(count_data)))
dds <- DESeqDataSetFromMatrix(countData = count_data,
colData = sample_metadata,
design = ~ group)
dds_re <- DESeqDataSetFromMatrix(countData = count_data,
colData = sample_metadata,
design = ~ group)
dds_re$group <- relevel(dds_re$group, ref = "W6_control")
# Filter rows with no counts, or only a single count across all samples
keep <- rowSums(counts(dds)) > 1
dds <- dds[keep, ]
keep_re <- rowSums(counts(dds_re)) > 1
dds_re <- dds_re[keep_re, ]
dds <- DESeq(dds)
dds_re <- DESeq(dds_re)
results_w6_low <- results(dds, alpha = 0.05, contrast = c("group", "W6_low", "W6_control"))
results_w6_low_re <- results(dds_re, alpha = 0.05, contrast = c("group", "W6_low", "W6_control"))
results_w6_low <- results_w6_low[complete.cases(results_w6_low$log2FoldChange) &
complete.cases(results_w6_low$padj), ]
results_w6_low_re <- results_w6_low_re[complete.cases(results_w6_low_re$log2FoldChange) &
complete.cases(results_w6_low_re$padj), ]
nrow(results_w6_low[results_w6_low$padj < 0.05, ])
[1] 1238
nrow(results_w6_low_re[results_w6_low_re$padj < 0.05, ])
[1] 1241
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.0
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Toronto
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.42.0 SummarizedExperiment_1.32.0 Biobase_2.62.0
[4] MatrixGenerics_1.14.0 matrixStats_1.0.0 GenomicRanges_1.54.1
[7] GenomeInfoDb_1.38.0 IRanges_2.36.0 S4Vectors_0.40.1
[10] BiocGenerics_0.48.0
loaded via a namespace (and not attached):
[1] utf8_1.2.4 generics_0.1.3 SparseArray_1.2.0 bitops_1.0-7
[5] lattice_0.22-5 magrittr_2.0.3 grid_4.3.1 Matrix_1.6-1.1
[9] sessioninfo_1.2.2 fansi_1.0.5 scales_1.2.1 codetools_0.2-19
[13] abind_1.4-5 cli_3.6.1 rlang_1.1.1 crayon_1.5.2
[17] XVector_0.42.0 munsell_0.5.0 DelayedArray_0.28.0 S4Arrays_1.2.0
[21] tools_4.3.1 parallel_4.3.1 BiocParallel_1.36.0 dplyr_1.1.3
[25] colorspace_2.1-0 ggplot2_3.4.4 locfit_1.5-9.8 GenomeInfoDbData_1.2.11
[29] vctrs_0.6.4 R6_2.5.1 lifecycle_1.0.3 zlibbioc_1.48.0
[33] pkgconfig_2.0.3 pillar_1.9.0 gtable_0.3.4 glue_1.6.2
[37] Rcpp_1.0.11 tibble_3.2.1 tidyselect_1.2.0 rstudioapi_0.15.0
[41] compiler_4.3.1 RCurl_1.98-1.12
Is this expected behaviour and I am misinterpreting the vignette? Should I be releveling to each timepoint's control and re-running DESeq before extracting the two companions for each timepoint?
Thanks for any insight or assistance.
