DEXseq padj NA problem
1
0
Entering edit mode
@bartosovicmarek-7297
Last seen 7.6 years ago
Czech Republic

Hi, this is mainly aimed for DEXSeq package developers, but any help will be appreciated.

I ran standard DEXSeq pipeline on my data (RNAseq, 2x125bp reads,ctrl vs treatment,3 replicates each) according to DEXSeq vignette. All steps finished fine, but somehow I end up with NA values in most of the padj column.

my commands:

python scripts:

python /homes2/marek/R/x86_64-pc-linux-gnu-library/3.1/DEXSeq/python_scripts/dexseq_prepare_annotation.py ~/annotation/Ensembl_GRCh37.75.gtf Ensembl_GRCh37.75_dexseq.gtf
ls *.bam | while read line; do python /homes2/marek/R/x86_64-pc-linux-gnu-library/3.1/DEXSeq/python_scripts/dexseq_count.py -p yes -f bam -r name Ensembl_GRCh37.75_dexseq.gtf $line ${line/.bam/.htseq} & done

and R:

library("DEXSeq")
sampleTable <- data.frame(row.names = c("WT_1","WT_2","WT_3","KO_1","KO_2","KO_3"), condition = c("control","control","control","knock-out","knock-out","knock-out"),libType=c("paired-end","paired-end","paired-end","paired-end","paired-end","paired-end"))
flattenedFile <- list.files(pattern="Ensembl_GRCh37.75_dexseq.gtf")
dxd <- DEXSeqDataSetFromHTSeq(countFiles,sampleData=sampleTable,design= ~ sample + exon + condition:exon,flattenedfile=flattenedFile)
dxd <- estimateSizeFactors(dxd)
dxd <- estimateDispersions(dxd)
dxd <- testForDEU(dxd)
dxr1 <- DEXSeqResults(dxd)
head(dxr1)

DataFrame with 6 rows and 10 columns


                             groupID   featureID exonBaseMean  dispersion
                         <character> <character>    <numeric>   <numeric>
ENSG00000000003:E001 ENSG00000000003        E001     9.333333 0.002639292
ENSG00000000003:E002 ENSG00000000003        E002     1.166667 0.062487260
ENSG00000000003:E003 ENSG00000000003        E003     1.000000 0.023675620
ENSG00000000003:E004 ENSG00000000003        E004     1.500000 0.013644089
ENSG00000000003:E005 ENSG00000000003        E005     2.500000 0.005624283
ENSG00000000003:E006 ENSG00000000003        E006     3.166667 0.004502899
                             stat    pvalue      padj
                        <numeric> <numeric> <numeric>
ENSG00000000003:E001 0.2084887235 0.6479545 0.9999723
ENSG00000000003:E002 0.0003735887 0.9845791        NA
ENSG00000000003:E003 0.0901488852 0.7639880        NA
ENSG00000000003:E004 0.0012700347 0.9715714        NA
ENSG00000000003:E005 0.7070434297 0.4004271        NA
ENSG00000000003:E006 0.0240073599 0.8768662        NA
                                     genomicData countData transcripts
                                       <GRanges> <integer>      <list>
ENSG00000000003:E001 chrX:-:[99883667, 99884983]        15    ########
ENSG00000000003:E002 chrX:-:[99885756, 99885863]         2    ########
ENSG00000000003:E003 chrX:-:[99887482, 99887537]         0    ########
ENSG00000000003:E004 chrX:-:[99887538, 99887565]         1    ########
ENSG00000000003:E005 chrX:-:[99888402, 99888438]         3    ########
ENSG00000000003:E006 chrX:-:[99888439, 99888536]         2    ########


I get NA in 612398/644354 rows

Is it possible NA is reported, because p-values are too large? Should not be padj close to 1 reported in that case?

I came up with this workaround:

dxr1$padj<-p.adjust(dxr1$pvalue,method="BH")

Is it consistent with DEXSeq way to adjust the p-value (I know it is Benjamini-Hochberg, but there might be some trick in the procedure)?

 

Thank you for any help

 

 

sessionInfo()

R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] DEXSeq_1.10.8           BiocParallel_1.0.0      DESeq2_1.6.3           
 [4] RcppArmadillo_0.4.600.0 Rcpp_0.11.3             GenomicRanges_1.18.4   
 [7] GenomeInfoDb_1.2.4      IRanges_2.0.1           S4Vectors_0.4.0        
[10] Biobase_2.26.0          BiocGenerics_0.12.1    

loaded via a namespace (and not attached):
 [1] acepack_1.3-3.3      annotate_1.44.0      AnnotationDbi_1.28.1
 [4] base64enc_0.1-2      BatchJobs_1.5        BBmisc_1.8          
 [7] biomaRt_2.22.0       Biostrings_2.34.1    bitops_1.0-6        
[10] brew_1.0-6           checkmate_1.5.1      cluster_1.15.3      
[13] codetools_0.2-9      colorspace_1.2-4     DBI_0.3.1           
[16] digest_0.6.8         fail_1.2             foreach_1.4.2       
[19] foreign_0.8-62       Formula_1.1-2        genefilter_1.48.1   
[22] geneplotter_1.44.0   ggplot2_1.0.0        grid_3.1.2          
[25] gtable_0.1.2         Hmisc_3.14-6         hwriter_1.3.2       
[28] iterators_1.0.7      lattice_0.20-29      latticeExtra_0.6-26
[31] locfit_1.5-9.1       MASS_7.3-37          munsell_0.4.2       
[34] nnet_7.3-8           plyr_1.8.1           proto_0.3-10        
[37] RColorBrewer_1.1-2   RCurl_1.95-4.5       reshape2_1.4.1      
[40] rpart_4.1-8          Rsamtools_1.18.2     RSQLite_1.0.0       
[43] scales_0.2.4         sendmailR_1.2-1      splines_3.1.2       
[46] statmod_1.4.20       stringr_0.6.2        survival_2.37-7     
[49] tools_3.1.2          XML_3.98-1.1         xtable_1.7-4        
[52] XVector_0.6.0        zlibbioc_1.12.0     

 

 

dexseq • 1.7k views
ADD COMMENT
0
Entering edit mode
@wolfgang-huber-3550
Last seen 18 days ago
EMBL European Molecular Biology Laborat…

Hi Marek

this is perhaps a problem of insufficient library size / coverage. To explore this, what do you get from 

hist(dxr1$exonBaseMean, breaks=100)

hist( log2(1 + dxr1$exonBaseMean), breaks=100)

and

with(dxr1, plot(rank(exonBaseMean), -log10(pvalue), pch=16))

?

Wolfgang

ADD COMMENT
0
Entering edit mode

I have the same feeling as Wolfgang, also the MA plot would help to see if this is the problem.

plotMA( dxr1 ) 

Alejandro

ADD REPLY
0
Entering edit mode

I can see the problem now. The counts are very low indeed. Even in the counts file produced by python script, I have only cca 5M reads (and 4M fall in no feature). I will have to look into this, because i have about 120M uniquely mapped reads (60M pairs) in bam file.

I am currently re-running the python script and saving the log, I will let know as soon as it is finished.

 

FYI, histogram of mean:

 

MA plot:

ADD REPLY

Login before adding your answer.

Traffic: 963 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6