Question

DESeq2 doesn't return the most drastically DEG

0

Entering edit mode

ndjayne • 0

@ndjayne-22255

Last seen 4.5 years ago

I am exploring DEGs using a portion of the TCGA dataset of 151 patients. 7 of which contain a fusion gene. A part of this fusion gene is not normally expressed (in the cancer of interest) and therefore we expect to see it as one of if not the most deferentially expressed gene.

I ran the 7 fusion-gene patients vs the other 144 and was surprised to find that the padj value for this gene was 0.9033839 and Log2FC was actually negative at -0.96897848. Turning independent filtering off in results (independentFiltering=F) and that actually made the p-value worse, approaching 1.

However, when looking at the normalized counts:

FUSION PATIENTS:

dds <- DESeq(dds)
normalized_dds <- counts(dds,normalized=T)
summary(normalized_dds["ENSgene_of_interest",Seven_fusion_patients_indexes])
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9022   10792   11562   12638   14963   16375

NON-FUSION PATIENTS

summary(normalized_dds["ENSgene_of_interest",Other_patients_indexes])
 Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   0.000    1.054    2.958   22.377    7.240 1524.108

This gene is clearly expressed many fold higher in the patients but this is not reflected in results:

res <- results(dds, contrast = c("cohort", "fusion", "others"), independentFiltering=T)
X baseMean log2FoldChange   lfcSE       stat    pvalue      padj                                        
1448 ENSfusiongene 21.50914     -0.9689785 1.01709 -0.9526968 0.3407437 0.9033839

I'm not sure if it's being filtered in some way, 7 vs 144 patients affects this or if the gene expression being so low in the non-fusion patients somehow clouds this result.

Session Info:

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.4.0               stringr_1.4.0               purrr_0.3.2                 readr_1.3.1                
 [5] tidyr_1.0.0                 tibble_2.1.3                tidyverse_1.2.1             org.Hs.eg.db_3.10.0        
 [9] AnnotationDbi_1.47.1        plotly_4.9.0                ggrepel_0.8.1               ggplot2_3.2.1              
[13] DESeq2_1.25.13              SummarizedExperiment_1.15.9 DelayedArray_0.11.7         BiocParallel_1.19.1        
[17] matrixStats_0.55.0          Biobase_2.45.1              GenomicRanges_1.37.16       GenomeInfoDb_1.21.2        
[21] IRanges_2.19.16             S4Vectors_0.23.23           dplyr_0.8.3                 BiocFileCache_1.9.1        
[25] dbplyr_1.4.2                BiocGenerics_0.31.6        

loaded via a namespace (and not attached):
 [1] nlme_3.1-141           bitops_1.0-6           lubridate_1.7.4        bit64_0.9-7            RColorBrewer_1.1-2    
 [6] httr_1.4.1             tools_3.6.1            backports_1.1.5        R6_2.4.0               rpart_4.1-15          
[11] Hmisc_4.2-0            DBI_1.0.0              lazyeval_0.2.2         colorspace_1.4-1       nnet_7.3-12           
[16] withr_2.1.2            tidyselect_0.2.5       gridExtra_2.3          bit_1.1-14             curl_4.2              
[21] compiler_3.6.1         cli_1.1.0              rvest_0.3.4            htmlTable_1.13.2       xml2_1.2.2            
[26] scales_1.0.0           checkmate_1.9.4        genefilter_1.67.1      rappdirs_0.3.1         digest_0.6.21         
[31] foreign_0.8-72         XVector_0.25.0         base64enc_0.1-3        pkgconfig_2.0.3        htmltools_0.3.6       
[36] readxl_1.3.1           htmlwidgets_1.5.1      rlang_0.4.0            rstudioapi_0.10        RSQLite_2.1.2         
[41] generics_0.0.2         jsonlite_1.6           acepack_1.4.1          RCurl_1.95-4.12        magrittr_1.5          
[46] GenomeInfoDbData_1.2.2 Formula_1.2-3          Matrix_1.2-17          Rcpp_1.0.2             munsell_0.5.0         
[51] lifecycle_0.1.0        stringi_1.4.3          yaml_2.2.0             zlibbioc_1.31.0        grid_3.6.1            
[56] blob_1.2.0             crayon_1.3.4           lattice_0.20-38        haven_2.1.1            splines_3.6.1         
[61] annotate_1.63.0        hms_0.5.1              locfit_1.5-9.1         zeallot_0.1.0          knitr_1.25            
[66] pillar_1.4.2           geneplotter_1.63.0     XML_3.98-1.20          glue_1.3.1             latticeExtra_0.6-28   
[71] modelr_0.1.5           data.table_1.12.4      BiocManager_1.30.9     vctrs_0.2.0            cellranger_1.1.0      
[76] gtable_0.3.0           assertthat_0.2.1       xfun_0.10              xtable_1.8-4           broom_0.5.2           
[81] survival_2.44-1.1      viridisLite_0.3.0      memoise_1.1.0          cluster_2.1.0

deseq2 tcga • 792 views

ADD COMMENT • link 4.5 years ago ndjayne • 0

score 2 · Accepted Answer · 2019-10-31

2

Entering edit mode

Michael Love 41k

@mikelove

Last seen 1 day ago

United States

The wrong sign indicates you may have a coding bug somewhere. Make some plots eg plotCounts() for this gene.

ADD COMMENT • link 4.5 years ago Michael Love 41k

0

Entering edit mode

Thank you, we caught the error! You were correct, the structure of coldata was out of order which was seen in the plotCounts() for the gene where it was clear that only 1 of the 7 patients was in the correct cohort. Running the analysis properly resulted in the expected result of log2FoldChange 9.14 and padj was 5.5e-17

We weren't sure if we had to make the IDs in colData into rownames but did so anyway just to be safe.