Hello,
I’m getting some curious results from DEXSeq and am trying to get some perspective / advice. I am conducting an analysis of age-differences in female and male mice. I have samples from mice of two ages and both sexes (two factors, two levels, n = 5). For the purposes of this post, I am interested in identifying deferentially used exons as a function of age in each sex separately. I conducted two separate DEXSeq analysis (one on male samples, another for female samples) in which only files from cDNA libraries derived from one sex or the other were analyzed with DEXseq. In summary, my findings are that there is a large number (3607) of deferentially used exons in female mice of different ages, and a substantially smaller number (1) for DUE in males in the analogous comparison. For reasons described below, I am suspicious of these results.
I initially started my DEXSeq analysis with the comparison of male samples, I proceeded according to the DEXseq vignette. I’ve included details for some of the analysis objects to help others determine if I’ve structured the analysis correctly. My full script contains several conditional statements that make importing particular sets of files easier. Therefore, I have abbreviated the output. I can provide more complete output if need be:
> BPPARAM class: SnowParam bpjobname:BPJOB; bpworkers:10; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE bplog:FALSE; bpthreshold:INFO; bplogdir:NA bpstopOnError:FALSE; bpprogressbar:TRUE bpresultdir:NA cluster type: SOCK > countFiles [1] "input/wt_2m_m/1050Q_tu_sort.txt" "input/wt_2m_m/1052O_tu_sort.txt" "input/wt_2m_m/1053K_tu_sort.txt" [4] "input/wt_2m_m/1053L_tu_sort.txt" "input/wt_2m_m/1053M_tu_sort.txt" "input/wt_4m_m/1048H_tu_sort.txt" [7] "input/wt_4m_m/1048I_tu_sort.txt" "input/wt_4m_m/1048K_tu_sort.txt" "input/wt_4m_m/1056G_tu_sort.txt" [10] "input/wt_4m_m/1056H_tu_sort.txt" > flattenedFile [1] "C:/DEXSeq/input/Mmus.GRCm38.68.gff" > sampleTable condition sex 1050Q 2m m 1052O 2m m 1053K 2m m 1053L 2m m 1053M 2m m 1048H 4m m 1048I 4m m 1048K 4m m 1056G 4m m 1056H 4m m > dxd = DEXSeqDataSetFromHTSeq( + countFiles, + sampleData=sampleTable, + design = ~sample+exon+condition:exon, + flattenedfile = flattenedFile) converting counts to integer mode > colData(dxd) DataFrame with 20 rows and 4 columns sample condition sex exon <factor> <factor> <factor> <factor> 1 1050Q 2m m this 2 1052O 2m m this 3 1053K 2m m this 4 1053L 2m m this 5 1053M 2m m this ... ... ... ... ... 16 1048H 4m m others 17 1048I 4m m others 18 1048K 4m m others 19 1056G 4m m others 20 1056H 4m m others > head(counts(dxd),5) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] ENSMUSG00000000001:E001 729 525 410 456 515 546 420 255 281 416 677 623 381 444 431 511 397 239 ENSMUSG00000000001:E002 106 99 75 55 85 96 49 28 58 62 1300 1049 716 845 861 961 768 466 ENSMUSG00000000001:E003 90 96 72 59 76 95 58 33 47 55 1316 1052 719 841 870 962 759 461 ENSMUSG00000000001:E004 90 69 54 51 59 72 46 34 44 48 1316 1079 737 849 887 985 771 460 ENSMUSG00000000001:E005 103 81 33 55 55 61 58 39 44 39 1303 1067 758 845 891 996 759 455 [,19] [,20] ENSMUSG00000000001:E001 305 337 ENSMUSG00000000001:E002 528 691 ENSMUSG00000000001:E003 539 698 ENSMUSG00000000001:E004 542 705 ENSMUSG00000000001:E005 542 714 > dxd = estimateSizeFactors(dxd) > dxd = estimateDispersions(dxd, BPPARAM = BPPARAM) > dxd = testForDEU(dxd) > dxr1_male = DEXSeqResults( dxd) > table ( dxr1_male$padj < 0.05 )
FALSE TRUE 283784 1
As I interpret it, this analysis indicates that there is a single deferentially used exon at p.adj < 0.05. Also, there are 283784 non-NA padj values (significance statistics that survived independent filtering). I conducted the same analysis on female samples of 2 and 4 months of age using the same R script but changing only the name of the DEXSeqResults object, and the CountFiles and SampleTable objects such that the appropriate .txt files were read into the R environment and referenced appropriately. The output from the last step of that analysis is below:
> dxr1_female = DEXSeqResults( dxd) > table ( dxr1_female$padj < 0.05 )
FALSE TRUE 52311 3607
Assuming I've implemented DEXSeq appropriately, I might interpret these results to mean that there is substantial differential exon usage in females from 2 to 4 months of age. The same finding was not recovered in males, suggesting that perhaps females and males are different in this regard. However, as I interpret these results, it seems that independent filtering in the comparison of female samples was considerably more aggressive than independent filtering in the male comparison (the sum of the FALSE and TRUE values for each comparison are quite different). This is confirmed by checking the full .csv output files which show a very large number (295,113) of "NA"s in the padj column for the female results object.
Questions:
1. Is my implementation of DEXSeq flawed or inappropriate?
2. Is it valid to compare the results of two comparisons that underwent separate rounds independent filtering for p-value FDR adjustment?
3. If #2 is not valid, is it advisable to simply adjust the unfiltered p-values myself with p.adjust() such that a more comparable set of p-values are corrected in each comparison?
I understand that an alternative is simply to input all data into DEXSeq and formally test for a sex by time interaction to discern if there is a sex-difference in exon usage over time. I am simply trying to get a sense of what comparisons are appropriate, and whether I've structured this analysis correctly to begin with.
Thanks,
Joseph Bundy
Graduate Student
Florida State University College of Medicine
> sessionInfo() R version 3.2.2 (2015-08-14) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1
locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages: [1] DEXSeq_1.16.1 DESeq2_1.10.0 RcppArmadillo_0.5.400.2.0 Rcpp_0.12.1 [5] SummarizedExperiment_1.0.1 GenomicRanges_1.22.1 GenomeInfoDb_1.6.1 IRanges_2.4.1 [9] S4Vectors_0.8.2 Biobase_2.30.0 BiocGenerics_0.16.1 BiocParallel_1.4.0 [13] RevoUtilsMath_3.2.2
loaded via a namespace (and not attached): [1] genefilter_1.52.0 statmod_1.4.21 locfit_1.5-9.1 reshape2_1.4.1 splines_3.2.2 [6] lattice_0.20-33 colorspace_1.2-6 snow_0.3-13 survival_2.38-3 XML_3.98-1.3 [11] foreign_0.8-66 DBI_0.3.1 RColorBrewer_1.1-2 lambda.r_1.1.7 plyr_1.8.3 [16] stringr_1.0.0 zlibbioc_1.16.0 Biostrings_2.38.0 munsell_0.4.2 gtable_0.1.2 [21] futile.logger_1.4.1 hwriter_1.3.2 latticeExtra_0.6-26 geneplotter_1.48.0 biomaRt_2.26.0 [26] AnnotationDbi_1.32.0 proto_0.3-10 acepack_1.3-3.3 xtable_1.7-4 scales_0.3.0 [31] Hmisc_3.16-0 annotate_1.48.0 XVector_0.10.0 Rsamtools_1.22.0 gridExtra_2.0.0 [36] ggplot2_1.0.1 digest_0.6.8 stringi_0.5-5 grid_3.2.2 bitops_1.0-6 [41] tools_3.2.2 magrittr_1.5 RCurl_1.95-4.7 RSQLite_1.0.0 Formula_1.2-1 [46] cluster_2.0.3 futile.options_1.0.0 MASS_7.3-43 rpart_4.1-10 nnet_7.3-10
Thanks for your help Alejandro! You understood my objective perfectly. Thanks for taking the time to read my very long post.
I will likely take you up on your suggestion to test the interaction.
-Joseph