Hi, this is mainly aimed for DEXSeq package developers, but any help will be appreciated.
I ran standard DEXSeq pipeline on my data (RNAseq, 2x125bp reads,ctrl vs treatment,3 replicates each) according to DEXSeq vignette. All steps finished fine, but somehow I end up with NA values in most of the padj column.
my commands:
python scripts:
python /homes2/marek/R/x86_64-pc-linux-gnu-library/3.1/DEXSeq/python_scripts/dexseq_prepare_annotation.py ~/annotation/Ensembl_GRCh37.75.gtf Ensembl_GRCh37.75_dexseq.gtf ls *.bam | while read line; do python /homes2/marek/R/x86_64-pc-linux-gnu-library/3.1/DEXSeq/python_scripts/dexseq_count.py -p yes -f bam -r name Ensembl_GRCh37.75_dexseq.gtf $line ${line/.bam/.htseq} & done
and R:
library("DEXSeq") sampleTable <- data.frame(row.names = c("WT_1","WT_2","WT_3","KO_1","KO_2","KO_3"), condition = c("control","control","control","knock-out","knock-out","knock-out"),libType=c("paired-end","paired-end","paired-end","paired-end","paired-end","paired-end")) flattenedFile <- list.files(pattern="Ensembl_GRCh37.75_dexseq.gtf") dxd <- DEXSeqDataSetFromHTSeq(countFiles,sampleData=sampleTable,design= ~ sample + exon + condition:exon,flattenedfile=flattenedFile) dxd <- estimateSizeFactors(dxd) dxd <- estimateDispersions(dxd) dxd <- testForDEU(dxd) dxr1 <- DEXSeqResults(dxd)
head(dxr1)
DataFrame with 6 rows and 10 columns
groupID featureID exonBaseMean dispersion
<character> <character> <numeric> <numeric>
ENSG00000000003:E001 ENSG00000000003 E001 9.333333 0.002639292
ENSG00000000003:E002 ENSG00000000003 E002 1.166667 0.062487260
ENSG00000000003:E003 ENSG00000000003 E003 1.000000 0.023675620
ENSG00000000003:E004 ENSG00000000003 E004 1.500000 0.013644089
ENSG00000000003:E005 ENSG00000000003 E005 2.500000 0.005624283
ENSG00000000003:E006 ENSG00000000003 E006 3.166667 0.004502899
stat pvalue padj
<numeric> <numeric> <numeric>
ENSG00000000003:E001 0.2084887235 0.6479545 0.9999723
ENSG00000000003:E002 0.0003735887 0.9845791 NA
ENSG00000000003:E003 0.0901488852 0.7639880 NA
ENSG00000000003:E004 0.0012700347 0.9715714 NA
ENSG00000000003:E005 0.7070434297 0.4004271 NA
ENSG00000000003:E006 0.0240073599 0.8768662 NA
genomicData countData transcripts
<GRanges> <integer> <list>
ENSG00000000003:E001 chrX:-:[99883667, 99884983] 15 ########
ENSG00000000003:E002 chrX:-:[99885756, 99885863] 2 ########
ENSG00000000003:E003 chrX:-:[99887482, 99887537] 0 ########
ENSG00000000003:E004 chrX:-:[99887538, 99887565] 1 ########
ENSG00000000003:E005 chrX:-:[99888402, 99888438] 3 ########
ENSG00000000003:E006 chrX:-:[99888439, 99888536] 2 ########
I get NA in 612398/644354 rows
Is it possible NA is reported, because p-values are too large? Should not be padj close to 1 reported in that case?
I came up with this workaround:
dxr1$padj<-p.adjust(dxr1$pvalue,method="BH")
Is it consistent with DEXSeq way to adjust the p-value (I know it is Benjamini-Hochberg, but there might be some trick in the procedure)?
Thank you for any help
sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] DEXSeq_1.10.8 BiocParallel_1.0.0 DESeq2_1.6.3
[4] RcppArmadillo_0.4.600.0 Rcpp_0.11.3 GenomicRanges_1.18.4
[7] GenomeInfoDb_1.2.4 IRanges_2.0.1 S4Vectors_0.4.0
[10] Biobase_2.26.0 BiocGenerics_0.12.1
loaded via a namespace (and not attached):
[1] acepack_1.3-3.3 annotate_1.44.0 AnnotationDbi_1.28.1
[4] base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.8
[7] biomaRt_2.22.0 Biostrings_2.34.1 bitops_1.0-6
[10] brew_1.0-6 checkmate_1.5.1 cluster_1.15.3
[13] codetools_0.2-9 colorspace_1.2-4 DBI_0.3.1
[16] digest_0.6.8 fail_1.2 foreach_1.4.2
[19] foreign_0.8-62 Formula_1.1-2 genefilter_1.48.1
[22] geneplotter_1.44.0 ggplot2_1.0.0 grid_3.1.2
[25] gtable_0.1.2 Hmisc_3.14-6 hwriter_1.3.2
[28] iterators_1.0.7 lattice_0.20-29 latticeExtra_0.6-26
[31] locfit_1.5-9.1 MASS_7.3-37 munsell_0.4.2
[34] nnet_7.3-8 plyr_1.8.1 proto_0.3-10
[37] RColorBrewer_1.1-2 RCurl_1.95-4.5 reshape2_1.4.1
[40] rpart_4.1-8 Rsamtools_1.18.2 RSQLite_1.0.0
[43] scales_0.2.4 sendmailR_1.2-1 splines_3.1.2
[46] statmod_1.4.20 stringr_0.6.2 survival_2.37-7
[49] tools_3.1.2 XML_3.98-1.1 xtable_1.7-4
[52] XVector_0.6.0 zlibbioc_1.12.0
I have the same feeling as Wolfgang, also the MA plot would help to see if this is the problem.
Alejandro
I can see the problem now. The counts are very low indeed. Even in the counts file produced by python script, I have only cca 5M reads (and 4M fall in no feature). I will have to look into this, because i have about 120M uniquely mapped reads (60M pairs) in bam file.
I am currently re-running the python script and saving the log, I will let know as soon as it is finished.
FYI, histogram of mean:
MA plot: