Search
Question: DEXSeq and Independent filtering
0
2.9 years ago by
Joseph Bundy20
United States
Joseph Bundy20 wrote:

Hello,

I’m getting some curious results from DEXSeq and am trying to get some perspective / advice. I am conducting an analysis of age-differences in female and male mice.  I have samples from mice of two ages and both sexes (two factors, two levels, n = 5). For the purposes of this post, I am interested in identifying deferentially used exons as a function of age in each sex separately. I conducted two separate DEXSeq analysis (one on male samples, another for female samples) in which only files from cDNA libraries derived from one sex or the other were analyzed with DEXseq. In summary, my findings are that there is a large number (3607) of deferentially used exons in female mice of different ages, and a substantially smaller number (1) for DUE in males in the analogous comparison. For reasons described below, I am suspicious of these results.

I initially started my DEXSeq analysis with the comparison of male samples, I proceeded according to the DEXseq vignette. I’ve included details for some of the analysis objects to help others determine if I’ve structured the analysis correctly. My full script contains several conditional statements that make importing particular sets of files easier. Therefore, I have abbreviated the output. I can provide more complete output if need be:

> BPPARAM
class: SnowParam
bpjobname:BPJOB; bpworkers:10; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
bplog:FALSE; bpthreshold:INFO; bplogdir:NA
bpstopOnError:FALSE; bpprogressbar:TRUE
bpresultdir:NA
cluster type: SOCK
> countFiles
[1] "input/wt_2m_m/1050Q_tu_sort.txt" "input/wt_2m_m/1052O_tu_sort.txt" "input/wt_2m_m/1053K_tu_sort.txt"
[4] "input/wt_2m_m/1053L_tu_sort.txt" "input/wt_2m_m/1053M_tu_sort.txt" "input/wt_4m_m/1048H_tu_sort.txt"
[7] "input/wt_4m_m/1048I_tu_sort.txt" "input/wt_4m_m/1048K_tu_sort.txt" "input/wt_4m_m/1056G_tu_sort.txt"
[10] "input/wt_4m_m/1056H_tu_sort.txt"
> flattenedFile
[1] "C:/DEXSeq/input/Mmus.GRCm38.68.gff"
> sampleTable
condition sex
1050Q        2m   m
1052O        2m   m
1053K        2m   m
1053L        2m   m
1053M        2m   m
1048H        4m   m
1048I        4m   m
1048K        4m   m
1056G        4m   m
1056H        4m   m
> dxd = DEXSeqDataSetFromHTSeq(
+   countFiles,
+   sampleData=sampleTable,
+   design = ~sample+exon+condition:exon,
+   flattenedfile = flattenedFile)
converting counts to integer mode
> colData(dxd)
DataFrame with 20 rows and 4 columns
sample condition      sex     exon
<factor>  <factor> <factor> <factor>
1      1050Q        2m        m     this
2      1052O        2m        m     this
3      1053K        2m        m     this
4      1053L        2m        m     this
5      1053M        2m        m     this
...      ...       ...      ...      ...
16     1048H        4m        m   others
17     1048I        4m        m   others
18     1048K        4m        m   others
19     1056G        4m        m   others
20     1056H        4m        m   others
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18]
ENSMUSG00000000001:E001  729  525  410  456  515  546  420  255  281   416   677   623   381   444   431   511   397   239
ENSMUSG00000000001:E002  106   99   75   55   85   96   49   28   58    62  1300  1049   716   845   861   961   768   466
ENSMUSG00000000001:E003   90   96   72   59   76   95   58   33   47    55  1316  1052   719   841   870   962   759   461
ENSMUSG00000000001:E004   90   69   54   51   59   72   46   34   44    48  1316  1079   737   849   887   985   771   460
ENSMUSG00000000001:E005  103   81   33   55   55   61   58   39   44    39  1303  1067   758   845   891   996   759   455
[,19] [,20]
ENSMUSG00000000001:E001   305   337
ENSMUSG00000000001:E002   528   691
ENSMUSG00000000001:E003   539   698
ENSMUSG00000000001:E004   542   705
ENSMUSG00000000001:E005   542   714
> dxd = estimateSizeFactors(dxd)
> dxd = estimateDispersions(dxd, BPPARAM = BPPARAM)
> dxd = testForDEU(dxd)
> dxr1_male = DEXSeqResults( dxd)
> table ( dxr1_male$padj < 0.05 )  FALSE TRUE 283784 1  As I interpret it, this analysis indicates that there is a single deferentially used exon at p.adj < 0.05. Also, there are 283784 non-NA padj values (significance statistics that survived independent filtering). I conducted the same analysis on female samples of 2 and 4 months of age using the same R script but changing only the name of the DEXSeqResults object, and the CountFiles and SampleTable objects such that the appropriate .txt files were read into the R environment and referenced appropriately. The output from the last step of that analysis is below: > dxr1_female = DEXSeqResults( dxd) > table ( dxr1_female$padj < 0.05 )

FALSE  TRUE
52311  3607 

Assuming I've implemented DEXSeq appropriately, I might interpret these results to mean that there is substantial differential exon usage in females from 2 to 4 months of age. The same finding was not recovered in males, suggesting that perhaps females and males are different in this regard.  However, as I interpret these results, it seems that independent filtering in the comparison of female samples was considerably more aggressive than independent filtering in the male comparison (the sum of the FALSE and TRUE values for each comparison are quite different). This is confirmed by checking the full .csv output files which show a very large number (295,113) of "NA"s in the padj column for the female results object.

Questions:

1. Is my implementation of DEXSeq flawed or inappropriate?
2. Is it valid to compare the results of two comparisons that underwent separate rounds independent filtering for p-value FDR adjustment?
3. If #2 is not valid, is it advisable to simply adjust the unfiltered p-values myself with p.adjust() such that a more comparable set of p-values are corrected in each comparison?

I understand that an alternative is simply to input all data into DEXSeq and formally test for a sex by time interaction to discern if there is a sex-difference in exon usage over time. I am simply trying to get a sense of what comparisons are appropriate, and whether I've structured this analysis correctly to begin with.

Thanks,

Joseph Bundy
Florida State University College of Medicine

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[5] SummarizedExperiment_1.0.1 GenomicRanges_1.22.1       GenomeInfoDb_1.6.1         IRanges_2.4.1
[9] S4Vectors_0.8.2            Biobase_2.30.0             BiocGenerics_0.16.1        BiocParallel_1.4.0
[13] RevoUtilsMath_3.2.2       

loaded via a namespace (and not attached):
[1] genefilter_1.52.0    statmod_1.4.21       locfit_1.5-9.1       reshape2_1.4.1       splines_3.2.2
[6] lattice_0.20-33      colorspace_1.2-6     snow_0.3-13          survival_2.38-3      XML_3.98-1.3
[11] foreign_0.8-66       DBI_0.3.1            RColorBrewer_1.1-2   lambda.r_1.1.7       plyr_1.8.3
[16] stringr_1.0.0        zlibbioc_1.16.0      Biostrings_2.38.0    munsell_0.4.2        gtable_0.1.2
[21] futile.logger_1.4.1  hwriter_1.3.2        latticeExtra_0.6-26  geneplotter_1.48.0   biomaRt_2.26.0
[26] AnnotationDbi_1.32.0 proto_0.3-10         acepack_1.3-3.3      xtable_1.7-4         scales_0.3.0
[31] Hmisc_3.16-0         annotate_1.48.0      XVector_0.10.0       Rsamtools_1.22.0     gridExtra_2.0.0
[36] ggplot2_1.0.1        digest_0.6.8         stringi_0.5-5        grid_3.2.2           bitops_1.0-6
[41] tools_3.2.2          magrittr_1.5         RCurl_1.95-4.7       RSQLite_1.0.0        Formula_1.2-1
[46] cluster_2.0.3        futile.options_1.0.0 MASS_7.3-43          rpart_4.1-10         nnet_7.3-10  

modified 2.9 years ago by Wolfgang Huber13k • written 2.9 years ago by Joseph Bundy20
3
2.9 years ago by
Alejandro Reyes1.6k
Dana-Farber Cancer Institute, Boston, USA
Alejandro Reyes1.6k wrote:

Hi josephlbundy,

1. Is my implementation of DEXSeq flawed or inappropriate?

2. Is it valid to compare the results of two comparisons that underwent separate rounds independent filtering for p-value FDR adjustment?
3. If #2 is not valid, is it advisable to simply adjust the unfiltered p-values myself with p.adjust() such that a more comparable set of p-values are corrected in each comparison?

I think you are right that two different independent filtering rounds would give you different number of tests for each round, and the number of significant genes might not be comparable. You could try to see if removing the independent filtering step improves your results, but this step is precisely done to increase the number of rejections, so this might not be so good for your male comparisons. Comparing the number of DEU genes as you are doing could be confounded by strange effects, for example, if you have different power to detect exons in within the different sexes (e.g. differences on sequencing depth, etc) or if the variance across replicates is different within different sexes...

If I got it right, your final objective is to detect differences in exon usage due to the age, that are different between males and females? If this is the case what I would do is to build a DEXSeqDataSet that contains both males and females, and test directly for this hypothesis by using the following models:

fullModel <-  ~sample + exon + sex:exon + age:exon + age:sex:exon

reducedModel <- ~sample + exon + sex:exon + age:exon

Alejandro

Thanks for your help Alejandro!  You understood my objective perfectly. Thanks for taking the time to read my very long post.

I will likely take you up on your suggestion to test the interaction.

-Joseph

0
2.9 years ago by
EMBL European Molecular Biology Laboratory
Wolfgang Huber13k wrote:

Dear Joseph

before homing in too quickly on p-values, it might be good to do exploratory data analyses (PCAs, heatmaps, DEXSeq plots for important 'control' genes) on the fitted coefficients and residuals to see what is going on in the data. Not least for quality control (e.g. what if among the male samples there was a poor library? or a sample swap? ... etc.)

Wolfgang

Thanks for the advice Dr. Huber,

Adhering closely to the DESeq2 vignette (which has been very, very helpful) and the bioconductor RNA-seq workflow page, I've performed most of the exploratory analyses I can think of.  There is considerable variation in sequencing depth across my full experiment (60 libraries in total), but this doesn't vary systematically across treatment groups. The level of independent filtering was also very similar among different comparisons with DESeq2. This is not true for my comparisons with DEXSeq, hence my suspicion.

Switching female and male samples was a mistake of which I was admittedly terrified during sample preparation.  However, I've checked the normalized counts for Y-linked genes and genes that are known to escape X-inactivation and confirmed that I managed to not mix up female and male samples.

Joseph