Question

DESeq2: Difference in DEG numbers when switching reference condition (numerator, denominator) using contrast

0

Entering edit mode

Suvrat • 0

@d0b29f16

Last seen 3 months ago

Germany

Hello everyone,

I am comparing gene expression profiles between 8 different conditions (developmental stages in this case) with a design = ~ condition. I donot specifically have a "control" condition per se to set as a reference so I used the contrast= option when calling results(). I wanted to check whether switching the reference condition, i.e. switching the numerator and denominator terms in contrast=, gives similar results and used the code below to make the comparisons. Note: I did not do the pre-filtering step and only include the relevant lines of code for space reasons.

### I3 as denominator
ddsTxi.tr$condition <- relevel(ddsTxi.tr$condition, "I3")  
dds.tr <- DESeq(ddsTxi.tr)
res_contrast_I5_vs_I3 <- results(dds.tr, contrast = c("condition", "I5", "I3")  )
write.csv(as.data.frame(res_contrast_I5_vs_I3), file = "res_contrast_I5_vs_I3_padjOrdered_noleaf.csv")
res_contrast_I5_vs_I3 <- results(dds.tr, contrast = c("condition", "I5", "I3"), lfcThreshold = 1, alpha = 0.05)
summary(res_contrast_I5_vs_I3)
>out of 128220 with nonzero total read count
>adjusted p-value < 0.05
>LFC > 1.00 (up)    : 4285, 3.3%
>LFC < -1.00 (down) : 1118, 0.87%
>outliers [1]       : 793, 0.62%
>low counts [2]     : 24737, 19%
>(mean count < 4)
>[1] see 'cooksCutoff' argument of ?results
>[2] see 'independentFiltering' argument of ?results

### I5 as denominator
ddsTxi.tr$condition <- relevel(ddsTxi.tr$condition, "I5")
dds.tr <- DESeq(ddsTxi.tr)
res_contrast_I3_vs_I5 <- results(dds.tr, contrast = c("condition", "I3", "I5")  )
write.csv(as.data.frame(res_contrast_I3_vs_I5), file = "res_contrast_I3_vs_I5_padjOrdered_noleaf.csv")
res_contrast_I3_vs_I5 <- results(dds.tr, contrast = c("condition", "I3", "I5"), lfcThreshold = 1, alpha = 0.05)
summary(res_contrast_I3_vs_I5)
>out of 128220 with nonzero total read count
>adjusted p-value < 0.05
>LFC > 1.00 (up)    : 1169, 0.91%
>LFC < -1.00 (down) : 4229, 3.3%
>outliers [1]       : 793, 0.62%
>low counts [2]     : 24737, 19%
>(mean count < 4)
>[1] see 'cooksCutoff' argument of ?results
>[2] see 'independentFiltering' argument of ?results

As you can see the results are not exactly the same. I was wondering why this happens and searched for relevant posts in the discussion forums but did not find any posts which gave a definite answer. I extracted the gene Ids from both results to compare the overlap and there are DEGs exclusively being reported when one or the other condition is used as denominator (shown in venn diagram).
Overlap of DEG Ids from the results after removing outliers and low counts

I looked at the results tables for these 'exclusive genes' (in the .csv file written before filtering as they do not appear in the results after filtering) and found that their lfc and p-values also change (somtimes drastically) depending on the denominator being used. I expected that flipping reference levels should only change the sign of the lfc and not add/remove some genes entirely. Was my expectation wrong?

Why this difference in the DEGs when switching reference levels? Let me know if I have missed any relevant info that is needed Ps: I have the count data for these 'exclusive genes' and can upload it if necessary.

Best regards,

Suvrat


sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_India.utf8  LC_CTYPE=English_India.utf8    LC_MONETARY=English_India.utf8 LC_NUMERIC=C                  
[5] LC_TIME=English_India.utf8    

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] RColorBrewer_1.1-3          pheatmap_1.0.12             ggplot2_3.4.4               dplyr_1.1.4                
 [5] jsonlite_1.8.7              tximport_1.30.0             readr_2.1.4                 DESeq2_1.42.0              
 [9] SummarizedExperiment_1.32.0 Biobase_2.62.0              MatrixGenerics_1.14.0       matrixStats_1.1.0          
[13] GenomicRanges_1.54.1        GenomeInfoDb_1.38.1         IRanges_2.36.0              S4Vectors_0.40.1           
[17] BiocGenerics_0.48.1        

loaded via a namespace (and not attached):
 [1] utf8_1.2.4              generics_0.1.3          SparseArray_1.2.2       bitops_1.0-7            lattice_0.22-5         
 [6] hms_1.1.3               magrittr_2.0.3          grid_4.3.2              Matrix_1.6-3            BiocManager_1.30.22    
[11] fansi_1.0.5             scales_1.3.0            codetools_0.2-19        abind_1.4-5             cli_3.6.1              
[16] rlang_1.1.2             crayon_1.5.2            XVector_0.42.0          munsell_0.5.0           withr_2.5.2            
[21] DelayedArray_0.28.0     S4Arrays_1.2.0          tools_4.3.2             parallel_4.3.2          tzdb_0.4.0             
[26] BiocParallel_1.36.0     colorspace_2.1-0        locfit_1.5-9.8          GenomeInfoDbData_1.2.11 vctrs_0.6.4            
[31] R6_2.5.1                lifecycle_1.0.4         zlibbioc_1.48.0         pkgconfig_2.0.3         pillar_1.9.0           
[36] gtable_0.3.4            glue_1.6.2              Rcpp_1.0.11             tibble_3.2.1            tidyselect_1.2.0       
[41] rstudioapi_0.15.0       compiler_4.3.2          RCurl_1.98-1.13

DESeq2 • 482 views

ADD COMMENT • link updated 4 months ago by Michael Love 42k • written 5 months ago by Suvrat • 0

score 0 · Answer 1 · 2023-12-21

0

Entering edit mode

Michael Love 42k

@mikelove

Last seen 11 hours ago

United States

The reference level can modify the LFC/p-value when estimates are on the boundary of the parameter space, e.g. if the reference level has no counts then the LFC is undefined and LFC and SE head to infinity.

ADD COMMENT • link 4 months ago Michael Love 42k

0

Entering edit mode

Ah, I see. Thank you for the quick reply! The reference counts for these 'exclusive genes' are in fact dominated by zeroes. How would you suggest I continue? I am assuming I should choose one reference group and stick with it for all my comparisons. Also should I filter out these exclusive genes from my final list of DEGs? I am still learning to do this sort of analysis and unsure about the best way to do it.

ADD REPLY • link 4 months ago Suvrat • 0

0

Entering edit mode

Yes I recommend to stick to one reference group for the analysis.

You do not need to filter them. Sometimes users filter genes where all conditions have mostly zeros. But this is optional.

ADD REPLY • link 4 months ago Michael Love 42k