Hello Everyone,
I am using DESeq2 to look at differential gene expression. For some of the comparisons of interest, a strikingly low number of differentially expressed genes was reported. Cuffdiff reported many more genes as differentially expressed. I would like to understand what is causing this difference in outcome and want to estimate gene-wise dispersion as a first step. Is this possible?
Here's what I have:
object.DESeq <- DESeqDataSetFromMatrix(counts, colData, design = ~ virus + red + time + virus:red + virus:time + red:time + virus:red:time)
fit <- estimateSizeFactors(object.DESeq)
fit <- estimateDispersionsGeneEst(fit)
fit <- nbinomLRT(fit, betaPrior = FALSE)
However, the last line (containing nbinomLRT) returns this error:
Error in nbinomLRT(fit, betaPrior = FALSE) : testing requires dispersion estimates, first call estimateDispersions()
Additionally, what else may account for the large difference in DEG calls between Cuffdiff and DESeq2?
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.6.3 RcppArmadillo_0.4.650.1.1 Rcpp_0.11.4 GenomicRanges_1.18.4 GenomeInfoDb_1.2.4
[6] IRanges_2.0.1 S4Vectors_0.4.0 BiocGenerics_0.12.1
loaded via a namespace (and not attached):
[1] acepack_1.3-3.3 annotate_1.44.0 AnnotationDbi_1.28.1 base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.9
[7] Biobase_2.26.0 BiocParallel_1.0.3 brew_1.0-6 checkmate_1.5.1 cluster_1.15.3 codetools_0.2-9
[13] colorspace_1.2-5 DBI_0.3.1 digest_0.6.8 fail_1.2 foreach_1.4.2 foreign_0.8-61
[19] Formula_1.2-0 genefilter_1.48.1 geneplotter_1.44.0 ggplot2_1.0.0 grid_3.1.2 gtable_0.1.2
[25] Hmisc_3.15-0 iterators_1.0.7 lattice_0.20-29 latticeExtra_0.6-26 locfit_1.5-9.1 MASS_7.3-35
[31] munsell_0.4.2 nnet_7.3-8 plyr_1.8.1 proto_0.3-10 RColorBrewer_1.1-2 reshape2_1.4.1
[37] rpart_4.1-8 RSQLite_1.0.0 scales_0.2.4 sendmailR_1.2-1 splines_3.1.2 stringr_0.6.2
[43] survival_2.37-7 tcltk_3.1.2 tools_3.1.2 XML_3.98-1.1 xtable_1.7-4 XVector_0.6.0
Hi Michael,
I did use the standard series of commands initially, but I am now trying to make the changes described above, using the code written in the original question.
So if you want to use a likelihood ratio test, you can just do:
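dds <- DESeq(dds, test = "LRT", reduced = ~ ...)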
where you fill in the "..." with the reduced formula. DESeq() will handle all of the size factor estimation and dispersion estimation steps. (Your code above was missing two of the three dispersion estimation steps.) (Also, the likelihood ratio test sets betaPrior=FALSE automatically, so you can skip that argument.)
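For reference, a minimal sketch of the manual steps that DESeq() (or estimateDispersions()) wraps, continuing from the object.DESeq in your question; the reduced formula is left as a placeholder:

library(DESeq2)
fit <- estimateSizeFactors(object.DESeq)
fit <- estimateDispersionsGeneEst(fit)  # gene-wise dispersion estimates
fit <- estimateDispersionsFit(fit)      # fit the dispersion trend over the mean
fit <- estimateDispersionsMAP(fit)      # shrink gene-wise estimates toward the trend
fit <- nbinomLRT(fit, reduced = ~ ...)  # fill in "..." with your reduced formula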
Can you say more about what your biological question is?
You are using a model with all 1st and 2nd order interaction terms. Note that if you want to compare all possible combinations of virus, red and time, you can use a simpler setup, by combining the factors into one:
dds$group <- factor(paste0(dds$virus, dds$red, dds$time))
design(dds) <- ~ group
dds <- DESeq(dds)
resultsNames(dds)
results(dds, contrast=c("group", "...", "..."))
Where you fill in the two "..." spaces with the names of combinations that you want to compare.
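For example, a call might look like the following (the two group names here are only hypothetical; check resultsNames(dds) for your actual levels):

# the two group names below are hypothetical; see resultsNames(dds) for the actual levels
results(dds, contrast = c("group", "VirusRed1", "ControlRed1"))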
Hi Michael,
For my experimental design, I have:
2 settings (Control/Virus) × 2 developmental classes per treatment (Green/Red) × 4 time points × 3 biological replicates = 48 libraries
I'm interested in the effects of a virus on fruit ripening progression and on the relationship between "Green" and "Red".
To clarify my issue:
Number of differentially expressed genes called by Cuffdiff over time for some comparisons:
Number of differentially expressed genes called by DESeq2 over time for some comparisons:
Not only is the number of DEGs called by Cuffdiff substantially higher (not necessarily a problem), but the trends in the number of DEGs also differ for some of the comparisons depending on the method (compare the blue and red lines in both figures). I'd like to know what alterations can be made to the standard DESeq() command to see what is causing this inconsistency.
*Edit: Just realized the figures didn't show up. Here's a link:
https://docs.google.com/spreadsheets/d/1JT-RCnIeWBUnWtS0EzD3zDcBEVfida64kLAYU_YvwGM/edit?usp=sharing
Firstly, there are many differences between these methods, so it's not surprising if the number of genes at some FDR cutoff is different. p-values are tail probabilities, so even small changes in model parameters can result in a big change in the absolute number of genes below an FDR cutoff. To me, the trends in the spreadsheet actually look quite similar. But without knowing the DESeq() and results() calls you used to make the results tables, and which comparisons you are making with Cuffdiff, it's hard to say what's happening.
Large designs with lots of interactions can make things quite complicated. What helps to simplify things is to get back to specific biological questions, such as: "which genes show virus-specific changes over time?" For questions like this, interaction models are useful. If your questions are more like, "which genes are differentially expressed for virus-treated red samples at time 1?", then I would recommend using the code example I provided above, which greatly simplifies building results tables for questions like this.
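As an illustration of the interaction-model approach for the first kind of question, here is a minimal sketch using only the virus and time factors from your design (your actual model may also need the red factor and its interactions):

# LRT comparing a model with the virus:time interaction to one without it;
# genes with small p-values show virus-specific changes over time
design(dds) <- ~ virus + time + virus:time
dds <- DESeq(dds, test = "LRT", reduced = ~ virus + time)
res <- results(dds)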
Here's what my R code looks like:
https://drive.google.com/open?id=0B2Ew0euSvASeN0RjUGNrTTRBV0k&authuser=0
And here are my Cuffdiff commands:
https://drive.google.com/open?id=0B2Ew0euSvASeMGZzTC0wYWptVUE&authuser=0
The comparison naming system (specified after -o in Cuffdiff) is identical in both.