DESeq2 doesn't return the most drastically DEG
Entering edit mode
ndjayne • 0
Last seen 2.6 years ago

I am exploring DEGs using a portion of the TCGA dataset of 151 patients. 7 of which contain a fusion gene. A part of this fusion gene is not normally expressed (in the cancer of interest) and therefore we expect to see it as one of if not the most deferentially expressed gene.

I ran the 7 fusion-gene patients vs the other 144 and was surprised to find that the padj value for this gene was 0.9033839 and Log2FC was actually negative at -0.96897848. Turning independent filtering off in results (independentFiltering=F) and that actually made the p-value worse, approaching 1.

However, when looking at the normalized counts:


dds <- DESeq(dds)
normalized_dds <- counts(dds,normalized=T)
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9022   10792   11562   12638   14963   16375


 Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   0.000    1.054    2.958   22.377    7.240 1524.108

This gene is clearly expressed many fold higher in the patients but this is not reflected in results:

res <- results(dds, contrast = c("cohort", "fusion", "others"), independentFiltering=T)
X baseMean log2FoldChange   lfcSE       stat    pvalue      padj                                        
1448 ENSfusiongene 21.50914     -0.9689785 1.01709 -0.9526968 0.3407437 0.9033839

I'm not sure if it's being filtered in some way, 7 vs 144 patients affects this or if the gene expression being so low in the non-fusion patients somehow clouds this result.

Session Info:

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.4.0               stringr_1.4.0               purrr_0.3.2                 readr_1.3.1                
 [5] tidyr_1.0.0                 tibble_2.1.3                tidyverse_1.2.1           
 [9] AnnotationDbi_1.47.1        plotly_4.9.0                ggrepel_0.8.1               ggplot2_3.2.1              
[13] DESeq2_1.25.13              SummarizedExperiment_1.15.9 DelayedArray_0.11.7         BiocParallel_1.19.1        
[17] matrixStats_0.55.0          Biobase_2.45.1              GenomicRanges_1.37.16       GenomeInfoDb_1.21.2        
[21] IRanges_2.19.16             S4Vectors_0.23.23           dplyr_0.8.3                 BiocFileCache_1.9.1        
[25] dbplyr_1.4.2                BiocGenerics_0.31.6        

loaded via a namespace (and not attached):
 [1] nlme_3.1-141           bitops_1.0-6           lubridate_1.7.4        bit64_0.9-7            RColorBrewer_1.1-2    
 [6] httr_1.4.1             tools_3.6.1            backports_1.1.5        R6_2.4.0               rpart_4.1-15          
[11] Hmisc_4.2-0            DBI_1.0.0              lazyeval_0.2.2         colorspace_1.4-1       nnet_7.3-12           
[16] withr_2.1.2            tidyselect_0.2.5       gridExtra_2.3          bit_1.1-14             curl_4.2              
[21] compiler_3.6.1         cli_1.1.0              rvest_0.3.4            htmlTable_1.13.2       xml2_1.2.2            
[26] scales_1.0.0           checkmate_1.9.4        genefilter_1.67.1      rappdirs_0.3.1         digest_0.6.21         
[31] foreign_0.8-72         XVector_0.25.0         base64enc_0.1-3        pkgconfig_2.0.3        htmltools_0.3.6       
[36] readxl_1.3.1           htmlwidgets_1.5.1      rlang_0.4.0            rstudioapi_0.10        RSQLite_2.1.2         
[41] generics_0.0.2         jsonlite_1.6           acepack_1.4.1          RCurl_1.95-4.12        magrittr_1.5          
[46] GenomeInfoDbData_1.2.2 Formula_1.2-3          Matrix_1.2-17          Rcpp_1.0.2             munsell_0.5.0         
[51] lifecycle_0.1.0        stringi_1.4.3          yaml_2.2.0             zlibbioc_1.31.0        grid_3.6.1            
[56] blob_1.2.0             crayon_1.3.4           lattice_0.20-38        haven_2.1.1            splines_3.6.1         
[61] annotate_1.63.0        hms_0.5.1              locfit_1.5-9.1         zeallot_0.1.0          knitr_1.25            
[66] pillar_1.4.2           geneplotter_1.63.0     XML_3.98-1.20          glue_1.3.1             latticeExtra_0.6-28   
[71] modelr_0.1.5           data.table_1.12.4      BiocManager_1.30.9     vctrs_0.2.0            cellranger_1.1.0      
[76] gtable_0.3.0           assertthat_0.2.1       xfun_0.10              xtable_1.8-4           broom_0.5.2           
[81] survival_2.44-1.1      viridisLite_0.3.0      memoise_1.1.0          cluster_2.1.0
deseq2 tcga • 335 views
Entering edit mode
Last seen 1 day ago
United States

The wrong sign indicates you may have a coding bug somewhere. Make some plots eg plotCounts() for this gene.

Entering edit mode

Thank you, we caught the error! You were correct, the structure of coldata was out of order which was seen in the plotCounts() for the gene where it was clear that only 1 of the 7 patients was in the correct cohort. Running the analysis properly resulted in the expected result of log2FoldChange 9.14 and padj was 5.5e-17

We weren't sure if we had to make the IDs in colData into rownames but did so anyway just to be safe.

Entering edit mode

Good to hear. Thanks for the follow up.


Login before adding your answer.

Traffic: 522 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6