Hi, this is mainly aimed for DEXSeq package developers, but any help will be appreciated.
I ran standard DEXSeq pipeline on my data (RNAseq, 2x125bp reads,ctrl vs treatment,3 replicates each) according to DEXSeq vignette. All steps finished fine, but somehow I end up with NA values in most of the padj column.
my commands:
python scripts:
python /homes2/marek/R/x86_64-pc-linux-gnu-library/3.1/DEXSeq/python_scripts/dexseq_prepare_annotation.py ~/annotation/Ensembl_GRCh37.75.gtf Ensembl_GRCh37.75_dexseq.gtf
ls *.bam | while read line; do python /homes2/marek/R/x86_64-pc-linux-gnu-library/3.1/DEXSeq/python_scripts/dexseq_count.py -p yes -f bam -r name Ensembl_GRCh37.75_dexseq.gtf $line ${line/.bam/.htseq} & done
and R:
library("DEXSeq")
sampleTable <- data.frame(row.names = c("WT_1","WT_2","WT_3","KO_1","KO_2","KO_3"), condition = c("control","control","control","knock-out","knock-out","knock-out"),libType=c("paired-end","paired-end","paired-end","paired-end","paired-end","paired-end"))
flattenedFile <- list.files(pattern="Ensembl_GRCh37.75_dexseq.gtf")
dxd <- DEXSeqDataSetFromHTSeq(countFiles,sampleData=sampleTable,design= ~ sample + exon + condition:exon,flattenedfile=flattenedFile)
dxd <- estimateSizeFactors(dxd)
dxd <- estimateDispersions(dxd)
dxd <- testForDEU(dxd)
dxr1 <- DEXSeqResults(dxd)
head(dxr1)
DataFrame with 6 rows and 10 columns
                             groupID   featureID exonBaseMean  dispersion
                         <character> <character>    <numeric>   <numeric>
ENSG00000000003:E001 ENSG00000000003        E001     9.333333 0.002639292
ENSG00000000003:E002 ENSG00000000003        E002     1.166667 0.062487260
ENSG00000000003:E003 ENSG00000000003        E003     1.000000 0.023675620
ENSG00000000003:E004 ENSG00000000003        E004     1.500000 0.013644089
ENSG00000000003:E005 ENSG00000000003        E005     2.500000 0.005624283
ENSG00000000003:E006 ENSG00000000003        E006     3.166667 0.004502899
                             stat    pvalue      padj
                        <numeric> <numeric> <numeric>
ENSG00000000003:E001 0.2084887235 0.6479545 0.9999723
ENSG00000000003:E002 0.0003735887 0.9845791        NA
ENSG00000000003:E003 0.0901488852 0.7639880        NA
ENSG00000000003:E004 0.0012700347 0.9715714        NA
ENSG00000000003:E005 0.7070434297 0.4004271        NA
ENSG00000000003:E006 0.0240073599 0.8768662        NA
                                     genomicData countData transcripts
                                       <GRanges> <integer>      <list>
ENSG00000000003:E001 chrX:-:[99883667, 99884983]        15    ########
ENSG00000000003:E002 chrX:-:[99885756, 99885863]         2    ########
ENSG00000000003:E003 chrX:-:[99887482, 99887537]         0    ########
ENSG00000000003:E004 chrX:-:[99887538, 99887565]         1    ########
ENSG00000000003:E005 chrX:-:[99888402, 99888438]         3    ########
ENSG00000000003:E006 chrX:-:[99888439, 99888536]         2    ########
I get NA in 612398/644354 rows
Is it possible NA is reported, because p-values are too large? Should not be padj close to 1 reported in that case?
I came up with this workaround:
dxr1$padj<-p.adjust(dxr1$pvalue,method="BH")
Is it consistent with DEXSeq way to adjust the p-value (I know it is Benjamini-Hochberg, but there might be some trick in the procedure)?
Thank you for any help
sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     
other attached packages:
 [1] DEXSeq_1.10.8           BiocParallel_1.0.0      DESeq2_1.6.3           
 [4] RcppArmadillo_0.4.600.0 Rcpp_0.11.3             GenomicRanges_1.18.4   
 [7] GenomeInfoDb_1.2.4      IRanges_2.0.1           S4Vectors_0.4.0        
[10] Biobase_2.26.0          BiocGenerics_0.12.1    
loaded via a namespace (and not attached):
 [1] acepack_1.3-3.3      annotate_1.44.0      AnnotationDbi_1.28.1
 [4] base64enc_0.1-2      BatchJobs_1.5        BBmisc_1.8          
 [7] biomaRt_2.22.0       Biostrings_2.34.1    bitops_1.0-6        
[10] brew_1.0-6           checkmate_1.5.1      cluster_1.15.3      
[13] codetools_0.2-9      colorspace_1.2-4     DBI_0.3.1           
[16] digest_0.6.8         fail_1.2             foreach_1.4.2       
[19] foreign_0.8-62       Formula_1.1-2        genefilter_1.48.1   
[22] geneplotter_1.44.0   ggplot2_1.0.0        grid_3.1.2          
[25] gtable_0.1.2         Hmisc_3.14-6         hwriter_1.3.2       
[28] iterators_1.0.7      lattice_0.20-29      latticeExtra_0.6-26
[31] locfit_1.5-9.1       MASS_7.3-37          munsell_0.4.2       
[34] nnet_7.3-8           plyr_1.8.1           proto_0.3-10        
[37] RColorBrewer_1.1-2   RCurl_1.95-4.5       reshape2_1.4.1      
[40] rpart_4.1-8          Rsamtools_1.18.2     RSQLite_1.0.0       
[43] scales_0.2.4         sendmailR_1.2-1      splines_3.1.2       
[46] statmod_1.4.20       stringr_0.6.2        survival_2.37-7     
[49] tools_3.1.2          XML_3.98-1.1         xtable_1.7-4        
[52] XVector_0.6.0        zlibbioc_1.12.0     

I have the same feeling as Wolfgang, also the MA plot would help to see if this is the problem.
Alejandro
I can see the problem now. The counts are very low indeed. Even in the counts file produced by python script, I have only cca 5M reads (and 4M fall in no feature). I will have to look into this, because i have about 120M uniquely mapped reads (60M pairs) in bam file.
I am currently re-running the python script and saving the log, I will let know as soon as it is finished.
FYI, histogram of mean:
MA plot: