Question

DEXseq padj NA problem

0

Entering edit mode

bartosovic.marek • 0

@bartosovicmarek-7297

Last seen 8.2 years ago

Czech Republic

Hi, this is mainly aimed for DEXSeq package developers, but any help will be appreciated.

I ran standard DEXSeq pipeline on my data (RNAseq, 2x125bp reads,ctrl vs treatment,3 replicates each) according to DEXSeq vignette. All steps finished fine, but somehow I end up with NA values in most of the padj column.

my commands:

python scripts:

python /homes2/marek/R/x86_64-pc-linux-gnu-library/3.1/DEXSeq/python_scripts/dexseq_prepare_annotation.py ~/annotation/Ensembl_GRCh37.75.gtf Ensembl_GRCh37.75_dexseq.gtf
ls *.bam | while read line; do python /homes2/marek/R/x86_64-pc-linux-gnu-library/3.1/DEXSeq/python_scripts/dexseq_count.py -p yes -f bam -r name Ensembl_GRCh37.75_dexseq.gtf $line ${line/.bam/.htseq} & done

and R:

library("DEXSeq")
sampleTable <- data.frame(row.names = c("WT_1","WT_2","WT_3","KO_1","KO_2","KO_3"), condition = c("control","control","control","knock-out","knock-out","knock-out"),libType=c("paired-end","paired-end","paired-end","paired-end","paired-end","paired-end"))
flattenedFile <- list.files(pattern="Ensembl_GRCh37.75_dexseq.gtf")
dxd <- DEXSeqDataSetFromHTSeq(countFiles,sampleData=sampleTable,design= ~ sample + exon + condition:exon,flattenedfile=flattenedFile)
dxd <- estimateSizeFactors(dxd)
dxd <- estimateDispersions(dxd)
dxd <- testForDEU(dxd)
dxr1 <- DEXSeqResults(dxd)

head(dxr1)

DataFrame with 6 rows and 10 columns

                             groupID   featureID exonBaseMean dispersion
                         <character> <character>    <numeric>   <numeric>
ENSG00000000003:E001 ENSG00000000003        E001     9.333333 0.002639292
ENSG00000000003:E002 ENSG00000000003        E002     1.166667 0.062487260
ENSG00000000003:E003 ENSG00000000003        E003     1.000000 0.023675620
ENSG00000000003:E004 ENSG00000000003        E004     1.500000 0.013644089
ENSG00000000003:E005 ENSG00000000003        E005     2.500000 0.005624283
ENSG00000000003:E006 ENSG00000000003        E006     3.166667 0.004502899
                             stat    pvalue      padj
                        <numeric> <numeric> <numeric>
ENSG00000000003:E001 0.2084887235 0.6479545 0.9999723
ENSG00000000003:E002 0.0003735887 0.9845791        NA
ENSG00000000003:E003 0.0901488852 0.7639880        NA
ENSG00000000003:E004 0.0012700347 0.9715714        NA
ENSG00000000003:E005 0.7070434297 0.4004271        NA
ENSG00000000003:E006 0.0240073599 0.8768662        NA
                                     genomicData countData transcripts
                                       <GRanges> <integer>      <list>
ENSG00000000003:E001 chrX:-:[99883667, 99884983]        15    ########
ENSG00000000003:E002 chrX:-:[99885756, 99885863]         2    ########
ENSG00000000003:E003 chrX:-:[99887482, 99887537]         0    ########
ENSG00000000003:E004 chrX:-:[99887538, 99887565]         1    ########
ENSG00000000003:E005 chrX:-:[99888402, 99888438]         3    ########
ENSG00000000003:E006 chrX:-:[99888439, 99888536]         2    ########

I get NA in 612398/644354 rows

Is it possible NA is reported, because p-values are too large? Should not be padj close to 1 reported in that case?

I came up with this workaround:

dxr1$padj<-p.adjust(dxr1$pvalue,method="BH")

Is it consistent with DEXSeq way to adjust the p-value (I know it is Benjamini-Hochberg, but there might be some trick in the procedure)?

Thank you for any help

sessionInfo()

R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] DEXSeq_1.10.8           BiocParallel_1.0.0      DESeq2_1.6.3
[4] RcppArmadillo_0.4.600.0 Rcpp_0.11.3             GenomicRanges_1.18.4
[7] GenomeInfoDb_1.2.4      IRanges_2.0.1           S4Vectors_0.4.0
[10] Biobase_2.26.0          BiocGenerics_0.12.1

loaded via a namespace (and not attached):
[1] acepack_1.3-3.3      annotate_1.44.0      AnnotationDbi_1.28.1
[4] base64enc_0.1-2      BatchJobs_1.5        BBmisc_1.8
[7] biomaRt_2.22.0       Biostrings_2.34.1    bitops_1.0-6
[10] brew_1.0-6           checkmate_1.5.1      cluster_1.15.3
[13] codetools_0.2-9      colorspace_1.2-4     DBI_0.3.1
[16] digest_0.6.8         fail_1.2             foreach_1.4.2
[19] foreign_0.8-62       Formula_1.1-2        genefilter_1.48.1
[22] geneplotter_1.44.0   ggplot2_1.0.0        grid_3.1.2
[25] gtable_0.1.2         Hmisc_3.14-6         hwriter_1.3.2
[28] iterators_1.0.7      lattice_0.20-29      latticeExtra_0.6-26
[31] locfit_1.5-9.1       MASS_7.3-37          munsell_0.4.2
[34] nnet_7.3-8           plyr_1.8.1           proto_0.3-10
[37] RColorBrewer_1.1-2   RCurl_1.95-4.5       reshape2_1.4.1
[40] rpart_4.1-8          Rsamtools_1.18.2     RSQLite_1.0.0
[43] scales_0.2.4         sendmailR_1.2-1      splines_3.1.2
[46] statmod_1.4.20       stringr_0.6.2        survival_2.37-7
[49] tools_3.1.2          XML_3.98-1.1         xtable_1.7-4
[52] XVector_0.6.0        zlibbioc_1.12.0

dexseq • 1.8k views

ADD COMMENT • link updated 9.9 years ago by Wolfgang Huber ★ 13k • written 9.9 years ago by bartosovic.marek • 0

score 0 · Answer 1 · 2015-01-28

0

Entering edit mode

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 3 months ago

EMBL European Molecular Biology Laborat…

Hi Marek

this is perhaps a problem of insufficient library size / coverage. To explore this, what do you get from

hist(dxr1$exonBaseMean, breaks=100)

hist( log2(1 + dxr1$exonBaseMean), breaks=100)

and

with(dxr1, plot(rank(exonBaseMean), -log10(pvalue), pch=16))

?

Wolfgang

ADD COMMENT • link 9.9 years ago Wolfgang Huber ★ 13k

0

Entering edit mode

I have the same feeling as Wolfgang, also the MA plot would help to see if this is the problem.

plotMA( dxr1 )

Alejandro

ADD REPLY • link 9.9 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

I can see the problem now. The counts are very low indeed. Even in the counts file produced by python script, I have only cca 5M reads (and 4M fall in no feature). I will have to look into this, because i have about 120M uniquely mapped reads (60M pairs) in bam file.

I am currently re-running the python script and saving the log, I will let know as soon as it is finished.

FYI, histogram of mean:

MA plot:

ADD REPLY • link 9.9 years ago bartosovic.marek • 0