Low number differentially expressed genes with DESeq2, want to estimate dispersion gene-wise
1
0
Entering edit mode
A • 0
@a-7557
Last seen 3.6 years ago
United States

 

Hello Everyone, 

I am using DESeq2 to look at differential gene expression. For some of the comparisons of interest, a strikingly low number of differentially expressed genes was reported. Cuffdiff reported many  more genes as differentially expressed. I would like to understand what is causing this difference in outcome and want to estimate gene-wise dispersion as a first step. Is this possible?

Here's what I have:

object.DESeq <- DESeqDataSetFromMatrix(counts, colData, design = ~ virus + red + time + virus:red + virus:time + red:time + virus:red:time)
fit <- estimateSizeFactors(object.DESeq)
fit <- estimateDispersionsGeneEst(fit)
fit <- nbinomLRT(fit, betaPrior = FALSE)

However, the third line (containing nbinomLRT) returns this error:

Error in nbinomLRT(fit, betaPrior = FALSE) : testing requires dispersion estimates, first call estimateDispersions()

Additionally, what else may account for the large difference in DEG calls between Cuffdiff and DESeq2?

> sessionInfo()

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] DESeq2_1.6.3              RcppArmadillo_0.4.650.1.1 Rcpp_0.11.4               GenomicRanges_1.18.4      GenomeInfoDb_1.2.4       
[6] IRanges_2.0.1             S4Vectors_0.4.0           BiocGenerics_0.12.1      

loaded via a namespace (and not attached):
 [1] acepack_1.3-3.3      annotate_1.44.0      AnnotationDbi_1.28.1 base64enc_0.1-2      BatchJobs_1.5        BBmisc_1.9          
 [7] Biobase_2.26.0       BiocParallel_1.0.3   brew_1.0-6           checkmate_1.5.1      cluster_1.15.3       codetools_0.2-9     
[13] colorspace_1.2-5     DBI_0.3.1            digest_0.6.8         fail_1.2             foreach_1.4.2        foreign_0.8-61      
[19] Formula_1.2-0        genefilter_1.48.1    geneplotter_1.44.0   ggplot2_1.0.0        grid_3.1.2           gtable_0.1.2        
[25] Hmisc_3.15-0         iterators_1.0.7      lattice_0.20-29      latticeExtra_0.6-26  locfit_1.5-9.1       MASS_7.3-35         
[31] munsell_0.4.2        nnet_7.3-8           plyr_1.8.1           proto_0.3-10         RColorBrewer_1.1-2   reshape2_1.4.1      
[37] rpart_4.1-8          RSQLite_1.0.0        scales_0.2.4         sendmailR_1.2-1      splines_3.1.2        stringr_0.6.2       
[43] survival_2.37-7      tcltk_3.1.2          tools_3.1.2          XML_3.98-1.1         xtable_1.7-4         XVector_0.6.0       

 

 

 

DESeq2 RNASeq differential gene expression • 2.0k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 5 hours ago
United States
Where did these lines of code come from? The recommended step is to call DESeq() Check the software vignette for the recommended steps to take.
ADD COMMENT
0
Entering edit mode

Hi Michael, 

I did use the standard series of commands initially

object.DESeq <- DESeqDataSetFromMatrix(counts, colData, design = ~ virus + red + time + virus:red + virus:time + red:time + virus:red:time)
fit <- DESeq(object.DESeq, betaPrior = FALSE)

but am now trying to make the changes described above. I attempted to do this using the code written in the original question.

 

 

ADD REPLY
0
Entering edit mode

So if you want to use a likelihood ratio test, you can just do:

dds = DESeq(dds, test="LRT", reduced=...)

where you fill in the "..." with the reduced formula. DESeq() will handle all the size factor estimation and dispersion estimation steps. (your code above was missing 2/3 of the dispersion estimation steps) (also, the likelihood ratio test will set betaPrior=FALSE, so you can skip that)

Can you say more about what your biological question is?

You are using a model with all 1st and 2nd order interaction terms. Note that if you want to compare all possible combinations of virus, red and time, you can use a simpler setup, by combining the factors into one:

dds$group <- factor(paste0(dds$virus, dds$red, dds$time))
design(dds) <- ~ group
dds <- DESeq(dds)
resultsNames(dds)
results(dds, contrast=c("group", "...", "..."))

Where you fill in the two "..." spaces with the names of combinations that you want to compare.

ADD REPLY
0
Entering edit mode

Hi Michael, 

For my experimental design, I have:

2 Settings (Control/Virus)* 2 Developmental classes per treatment (Green/Red)* 4 Time points* 3 Biological Replicates=48 libraries

I'm interested in the effects of a virus on fruit ripening progression and on the relationship between "Green" and "Red".

 

 

ADD REPLY
0
Entering edit mode

To clarify my issue:

Number of differentially expressed genes called by Cuffdiff over time for some comparisons:

Number of differentially expressed genes called by DESeq2 over time for some comparisons:

Not only is the number of DEG called by Cuffdiff substantially more (not necessarily a problem), but the trends in the number of DEG differ for some of the comparisons based on the method (compare the blue and red lines in both figures). I'd like to know what alterations can be made to the standard DESeq() command to see what is causing this inconsistency. 

*Edit: Just realized the figures didn't show up. Here's a link:

https://docs.google.com/spreadsheets/d/1JT-RCnIeWBUnWtS0EzD3zDcBEVfida64kLAYU_YvwGM/edit?usp=sharing

ADD REPLY
1
Entering edit mode

Firstly, there are many differences between these methods, so it's not surprising if the # of genes at some FDR cutoff is different. p-values are tail probabilities, and so even small changes in model parameters can result in a big change in the absolute # below an FDR cutoff. To me, the trends in the spreadsheet actually look quite similar. But without knowing the DESeq() or results() calls you used to make results tables and what comparisons you are making with Cuffdiff it's hard to say what's happening.

Large designs with lots of interactions can make things quite complicated. What helps to simplify things is to get back to specific biological questions, such as: "which genes show virus-specfic changes over time?" For questions like this, interaction models are useful. If you questions are more like, "which genes are differentially expressed for virus-treated red samples at time 1?", then I would recommend using the code example I provided above, which greatly simplifies building results tables for questions like this.

ADD REPLY
0
Entering edit mode

Here's what my RCode looks like:

https://drive.google.com/open?id=0B2Ew0euSvASeN0RjUGNrTTRBV0k&authuser=0

And here are my Cuffdiff commands:

https://drive.google.com/open?id=0B2Ew0euSvASeMGZzTC0wYWptVUE&authuser=0

The comparison naming system (specified after -o in Cuffdiff) is identical in both.

ADD REPLY

Login before adding your answer.

Traffic: 827 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6