Using dispersion per gene to run Wald test (DESeq2)
1
0
Entering edit mode
meisan406 • 0
@meisan406-7061
Last seen 7.0 years ago
United States

I am trying to run the DESeq2 with only 1378 genes. This subset of genes are defined by the GO term "DNA binding proteins", as I am only interested in the transcription factors in this data set. I have tried with just the standard workflow of using a common dispersion trend but I think due to the low number of genes, it is underpowered to detect any significant differences (FDR 10%). So I thought of using the dispersion per gene (based on a suggestion here: http://seqanswers.com/forums/showthread.php?t=33618).

Here are my codes, where cds.dbp is now a DESeq2 object limited to the 1378 genes:
 
cds.dbp <- estimateSizeFactors( cds.dbp )
cds.dbp <- estimateDispersionsGeneEst( cds.dbp)
dispersions(cds.dbp) <- mcols(cds.dbp)$dispGeneEst
cds.dbp.2 <- nbinomWaldTest( cds.dbp )

But I received the following error message from the nbinomWaldTest step:

Error in approx(cumsum(wts), x, xout = c(low, high), method = "constant",  : 
  zero non-NA points

Here're the first few rows of mcols(cds.dbp):

DataFrame with 1378 rows and 5 columns
        baseMean      baseVar   allZero  dispGeneEst   dispersion
       <numeric>    <numeric> <logical>    <numeric>    <numeric>
1       5.556475 4.148693e+00     FALSE 0.0000000100 0.0000000100
2    1438.080398 3.912841e+04     FALSE 0.0122872985 0.0122872985
3    2634.390047 2.243086e+05     FALSE 0.0100523772 0.0100523772
4     684.371282 2.216909e+03     FALSE 0.0009121098 0.0009121098
5     172.285501 1.240765e+03     FALSE 0.0270022439 0.0270022439


I am unable to figure out the problem and would appreciate some help. I am also happy to learn if there's any suggestions in general on how test just a subset of genes with the DESeq2 workflow. Thanks.

sessionInfo():

R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.10.1              RcppArmadillo_0.6.400.2.2  Rcpp_0.12.2               
 [4] SummarizedExperiment_1.0.1 Biobase_2.30.0             GenomicRanges_1.22.2      
 [7] GenomeInfoDb_1.6.1         IRanges_2.4.6              S4Vectors_0.8.5           
[10] BiocGenerics_0.16.1        data.table_1.9.6          

loaded via a namespace (and not attached):
 [1] RColorBrewer_1.1-2   futile.logger_1.4.1  plyr_1.8.3           XVector_0.10.0      
 [5] futile.options_1.0.0 tools_3.2.3          zlibbioc_1.16.0      rpart_4.1-10        
 [9] RSQLite_1.0.0        annotate_1.48.0      gtable_0.1.2         lattice_0.20-33     
[13] DBI_0.3.1            gridExtra_2.0.0      genefilter_1.52.0    cluster_2.0.3       
[17] locfit_1.5-9.1       grid_3.2.3           nnet_7.3-11          AnnotationDbi_1.32.3
[21] XML_3.98-1.3         survival_2.38-3      BiocParallel_1.4.3   foreign_0.8-66      
[25] latticeExtra_0.6-26  Formula_1.2-1        geneplotter_1.48.0   ggplot2_2.0.0       
[29] lambda.r_1.1.7       Hmisc_3.17-1         scales_0.3.0         splines_3.2.3       
[33] xtable_1.8-0         colorspace_1.2-6     acepack_1.3-3.3      munsell_0.4.2       
[37] chron_2.3-47

deseq2 • 1.4k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 1 day ago
United States

I think this is not related to low power. In fact, the information sharing from fitting a dispersion trend only increases power, which you are skipping here. Instead, you likely have too few samples to observe any differences above the biological or technical noise at a 10% FDR cutoff.

All the estimation procedures are generally better using all the data. If for any reason you did want to subset to test only a subset of genes, I would recommend you do that *after* using all genes in the dataset for estimation steps. (You can handle pvalue adjustment yourself with p.adjust).

Even with only dozens of genes, I wouldn't recommend skipping the information sharing steps, which provide a boost in power.

Regarding the error, I've fixed this in the devel branch to provide a fallback when fitted dispersion estimates are missing.

ADD COMMENT

Login before adding your answer.

Traffic: 839 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6