I am exploring DEGs using a portion of the TCGA dataset of 151 patients. 7 of which contain a fusion gene. A part of this fusion gene is not normally expressed (in the cancer of interest) and therefore we expect to see it as one of if not the most deferentially expressed gene.
I ran the 7 fusion-gene patients vs the other 144 and was surprised to find that the padj value for this gene was 0.9033839 and Log2FC was actually negative at -0.96897848. Turning independent filtering off in results (independentFiltering=F) and that actually made the p-value worse, approaching 1.
However, when looking at the normalized counts:
FUSION PATIENTS:
dds <- DESeq(dds)
normalized_dds <- counts(dds,normalized=T)
summary(normalized_dds["ENSgene_of_interest",Seven_fusion_patients_indexes])
Min. 1st Qu. Median Mean 3rd Qu. Max.
9022 10792 11562 12638 14963 16375
NON-FUSION PATIENTS
summary(normalized_dds["ENSgene_of_interest",Other_patients_indexes])
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.054 2.958 22.377 7.240 1524.108
This gene is clearly expressed many fold higher in the patients but this is not reflected in results:
res <- results(dds, contrast = c("cohort", "fusion", "others"), independentFiltering=T)
X baseMean log2FoldChange lfcSE stat pvalue padj
1448 ENSfusiongene 21.50914 -0.9689785 1.01709 -0.9526968 0.3407437 0.9033839
I'm not sure if it's being filtered in some way, 7 vs 144 patients affects this or if the gene expression being so low in the non-fusion patients somehow clouds this result.
Session Info:
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.4.0 stringr_1.4.0 purrr_0.3.2 readr_1.3.1
[5] tidyr_1.0.0 tibble_2.1.3 tidyverse_1.2.1 org.Hs.eg.db_3.10.0
[9] AnnotationDbi_1.47.1 plotly_4.9.0 ggrepel_0.8.1 ggplot2_3.2.1
[13] DESeq2_1.25.13 SummarizedExperiment_1.15.9 DelayedArray_0.11.7 BiocParallel_1.19.1
[17] matrixStats_0.55.0 Biobase_2.45.1 GenomicRanges_1.37.16 GenomeInfoDb_1.21.2
[21] IRanges_2.19.16 S4Vectors_0.23.23 dplyr_0.8.3 BiocFileCache_1.9.1
[25] dbplyr_1.4.2 BiocGenerics_0.31.6
loaded via a namespace (and not attached):
[1] nlme_3.1-141 bitops_1.0-6 lubridate_1.7.4 bit64_0.9-7 RColorBrewer_1.1-2
[6] httr_1.4.1 tools_3.6.1 backports_1.1.5 R6_2.4.0 rpart_4.1-15
[11] Hmisc_4.2-0 DBI_1.0.0 lazyeval_0.2.2 colorspace_1.4-1 nnet_7.3-12
[16] withr_2.1.2 tidyselect_0.2.5 gridExtra_2.3 bit_1.1-14 curl_4.2
[21] compiler_3.6.1 cli_1.1.0 rvest_0.3.4 htmlTable_1.13.2 xml2_1.2.2
[26] scales_1.0.0 checkmate_1.9.4 genefilter_1.67.1 rappdirs_0.3.1 digest_0.6.21
[31] foreign_0.8-72 XVector_0.25.0 base64enc_0.1-3 pkgconfig_2.0.3 htmltools_0.3.6
[36] readxl_1.3.1 htmlwidgets_1.5.1 rlang_0.4.0 rstudioapi_0.10 RSQLite_2.1.2
[41] generics_0.0.2 jsonlite_1.6 acepack_1.4.1 RCurl_1.95-4.12 magrittr_1.5
[46] GenomeInfoDbData_1.2.2 Formula_1.2-3 Matrix_1.2-17 Rcpp_1.0.2 munsell_0.5.0
[51] lifecycle_0.1.0 stringi_1.4.3 yaml_2.2.0 zlibbioc_1.31.0 grid_3.6.1
[56] blob_1.2.0 crayon_1.3.4 lattice_0.20-38 haven_2.1.1 splines_3.6.1
[61] annotate_1.63.0 hms_0.5.1 locfit_1.5-9.1 zeallot_0.1.0 knitr_1.25
[66] pillar_1.4.2 geneplotter_1.63.0 XML_3.98-1.20 glue_1.3.1 latticeExtra_0.6-28
[71] modelr_0.1.5 data.table_1.12.4 BiocManager_1.30.9 vctrs_0.2.0 cellranger_1.1.0
[76] gtable_0.3.0 assertthat_0.2.1 xfun_0.10 xtable_1.8-4 broom_0.5.2
[81] survival_2.44-1.1 viridisLite_0.3.0 memoise_1.1.0 cluster_2.1.0
Thank you, we caught the error! You were correct, the structure of coldata was out of order which was seen in the plotCounts() for the gene where it was clear that only 1 of the 7 patients was in the correct cohort. Running the analysis properly resulted in the expected result of log2FoldChange 9.14 and padj was 5.5e-17
We weren't sure if we had to make the IDs in colData into rownames but did so anyway just to be safe.
Good to hear. Thanks for the follow up.