Question

DESeq2: different results with and without releveling when providing contrast

0

Entering edit mode

drew • 0

@270fc3f7

Last seen 2.1 years ago

Canada

Hello,

I am conducting a differential expression analysis of RNA-seq data where I have samples from three treatment doses (control, low, high) at three developmental timepoints (W6, W8, GS45). Note these are not paired samples (i.e. different individuals sampled at each timepoints). I am interested in differentially expressed genes caused by the treatments at each developmental timepoint.

After reading the vignette, particularly the section on interactions (https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions), I decided to add a third factor which is a combination of treatment and timepoint, i.e.:

sample_metadata$group <- factor(paste0(sample_metadata$timepoint, "_", sample_metadata$dose))

Then I can extract the results for each comparison by supplying the appropriate contrast to the results function.

If I don't explicitly set a reference level, DESeq2 sets it to "GS45_control," as expected alphabetically. My understanding from the vignette and from other posts is that it doesn't matter what the reference level is set to if you are explicitly supplying a contrast to the results function. For example, if I want to find differentially expressed genes in "W6_low" compared to "W6_control", I would run:

w6_low <- results(dds, alpha = alpha, contrast = c("group", "W6_low", "W6_control"))

However, I noticed that I get a different number of significant DEGs if I do relevel dds$group first. I've produced an example below:

library(DESeq2)

counts_file = "input/gene_count_matrix.csv"
metadata_file = "input/sample_data.txt"

# Read in counts and metadata
count_data <- as.matrix(read.csv(counts_file,
                                 row.names = "gene_id"))
sample_metadata <- read.csv(metadata_file,
                            sep = "\t",
                            row.names = 1,
                            stringsAsFactors = TRUE)
sample_metadata$group <- factor(paste0(sample_metadata$timepoint, "_", sample_metadata$dose))
count_data <- count_data[, rownames(sample_metadata)]

# Verify structure of data
stopifnot(all(rownames(sample_metadata) %in% colnames(count_data)))
stopifnot(all(rownames(sample_metadata) == colnames(count_data)))

dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_metadata,
                              design = ~ group)

dds_re <- DESeqDataSetFromMatrix(countData = count_data,
                                 colData = sample_metadata,
                                 design = ~ group)
dds_re$group <- relevel(dds_re$group, ref = "W6_control")

# Filter rows with no counts, or only a single count across all samples
keep <- rowSums(counts(dds)) > 1
dds <- dds[keep, ]
keep_re <- rowSums(counts(dds_re)) > 1
dds_re <- dds_re[keep_re, ]

dds <- DESeq(dds)
dds_re <- DESeq(dds_re)

results_w6_low <- results(dds, alpha = 0.05, contrast = c("group", "W6_low", "W6_control"))
results_w6_low_re <- results(dds_re, alpha = 0.05, contrast = c("group", "W6_low", "W6_control"))

results_w6_low <- results_w6_low[complete.cases(results_w6_low$log2FoldChange) &
                                     complete.cases(results_w6_low$padj), ]
results_w6_low_re <- results_w6_low_re[complete.cases(results_w6_low_re$log2FoldChange) &
                                           complete.cases(results_w6_low_re$padj), ]

nrow(results_w6_low[results_w6_low$padj < 0.05, ])
[1] 1238
nrow(results_w6_low_re[results_w6_low_re$padj < 0.05, ])
[1] 1241

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.0

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Toronto
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.42.0               SummarizedExperiment_1.32.0 Biobase_2.62.0             
 [4] MatrixGenerics_1.14.0       matrixStats_1.0.0           GenomicRanges_1.54.1       
 [7] GenomeInfoDb_1.38.0         IRanges_2.36.0              S4Vectors_0.40.1           
[10] BiocGenerics_0.48.0        

loaded via a namespace (and not attached):
 [1] utf8_1.2.4              generics_0.1.3          SparseArray_1.2.0       bitops_1.0-7           
 [5] lattice_0.22-5          magrittr_2.0.3          grid_4.3.1              Matrix_1.6-1.1         
 [9] sessioninfo_1.2.2       fansi_1.0.5             scales_1.2.1            codetools_0.2-19       
[13] abind_1.4-5             cli_3.6.1               rlang_1.1.1             crayon_1.5.2           
[17] XVector_0.42.0          munsell_0.5.0           DelayedArray_0.28.0     S4Arrays_1.2.0         
[21] tools_4.3.1             parallel_4.3.1          BiocParallel_1.36.0     dplyr_1.1.3            
[25] colorspace_2.1-0        ggplot2_3.4.4           locfit_1.5-9.8          GenomeInfoDbData_1.2.11
[29] vctrs_0.6.4             R6_2.5.1                lifecycle_1.0.3         zlibbioc_1.48.0        
[33] pkgconfig_2.0.3         pillar_1.9.0            gtable_0.3.4            glue_1.6.2             
[37] Rcpp_1.0.11             tibble_3.2.1            tidyselect_1.2.0        rstudioapi_0.15.0      
[41] compiler_4.3.1          RCurl_1.98-1.12

Is this expected behaviour and I am misinterpreting the vignette? Should I be releveling to each timepoint's control and re-running DESeq before extracting the two companions for each timepoint?

Thanks for any insight or assistance.

DESeq2 • 935 views

ADD COMMENT • link updated 2.2 years ago by Michael Love 43k • written 2.2 years ago by drew • 0

score 0 · Answer 1 · 2023-11-07

The fluctuation is just due to small changes in the betas when different groups are set as reference. You can probably minimize with stricter filtering of low count genes.

You may want to work this statistical design question out with a local statistician or someone familiar with linear models in R.

Unfortunately, I have to limit my answers on the support site to software related questions.