DESeq2 timecourse - How to set the experimental design
5
1
Entering edit mode
nafeesa ▴ 10
@nafeesa-6842
Last seen 5.8 years ago
Germany

Hi,

I am trying to analyze RNA-Seq Data with DESeq2 and could use some help. I have 2 genotypes and 14 timepoints. I want to find differences in gene expression between both genotypes overall and at every timepoint and between every two timepoints in each genotype. After reading a lot of manuals, paper and posts here and elsewhere, I believe that the correct design for my experiment is ~genotype+time+time:genotype in the full and ~genotype+time in the reduced formula. Here is what I did so far:

library("DESeq2")

directory<-"............."
sampleFiles <- grep("??",list.files(directory),value=TRUE)

time <- factor(c("T12", "T13", "T14", "T1", "T11", "T5", "T7", "T2", "T9", "T3", "T6", "T8", "T4", "T10", "T12", "T13", "T14", "T1", "T11", "T5", "T7", "T2", "T9", "T3", "T6", "T8", "T4", "T10" ))
genotype <- factor(c("GF","GF","GF","GF","GF","GF","GF","GF","GF","GF","GF","GF","GF","GF","VB","VB","VB","VB","VB","VB","VB","VB","VB","VB","VB","VB","VB","VB"))
sampleTable<-data.frame(sampleName=sampleFiles, fileName=sampleFiles,  genotype=genotype, time=time)
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design=~genotype+time+time:genotype)

ddsHTSeq$genotype<-factor(ddsHTSeq$genotype, levels=c("GF","VB"))
ddsHTSeq$time <- factor(ddsHTSeq$time, levels=c("T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8", "T9", "T10", "T11", "T12", "T13", "T14"))

dds <- DESeq(ddsHTSeq)
res <- results(dds)

ddsHTSeq <- estimateSizeFactors(ddsHTSeq)
ddsHTSeq <- estimateDispersions(ddsHTSeq)
ddsLRT <- nbinomLRT(dds, reduced = formula(~genotype+time)

resLRT<- results(ddsLRT)

Unfortunately I get:

> sumis.na(res$pvalue))
[1] 10295
> sum(res$pvalue<0.05, na.rm=TRUE)
[1] 0
> sum(res$padj<0.1, na.rm=TRUE)
[1] 0
So this means, there is no time influenced difference between genotypes. Is that correct? The count data tells me soemthing else though....

I also want to test for differences in gene expression between individual time of each genotype seperately. T2 vs T1, T3 vs T2, T4 vs T3 and so on. How can I access that?

I would be greatful and appreciate any help. Thank you!

Nadia

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets methods
[8] base

other attached packages:
[1] DESeq2_1.4.5              RcppArmadillo_0.4.450.1.0
[3] Rcpp_0.11.2               GenomicRanges_1.16.4
[5] GenomeInfoDb_1.0.2        IRanges_1.22.10
[7] BiocGenerics_0.10.0

loaded via a namespace (and not attached):
 [1] annotate_1.42.1      AnnotationDbi_1.22.6 Biobase_2.20.1
 [4] DBI_0.2-5            genefilter_1.46.1    geneplotter_1.42.0
 [7] grid_3.1.1           lattice_0.20-29      locfit_1.5-9.1
[10] RColorBrewer_1.0-5   RSQLite_0.11.2       splines_3.1.1
[13] stats4_3.1.1         survival_2.37-7      XML_3.98-1.1
[16] xtable_1.7-1         XVector_0.4.0

deseq2 geneexpression timecourse • 9.2k views
ADD COMMENT
3
Entering edit mode
@wolfgang-huber-3550
Last seen 12 days ago
EMBL European Molecular Biology Laborat…

Nadia,

I'd like to contribute a somewhat more optimistic assessment than the previous two replies ("you don't have any replicates ... so you have no chance to ... perform proper statistical inference"; "hence you cannot test anything"), but for that it'll be necessary to think of time not as a factor but as a continuous covariate, and of the response not as a set of 14 different numbers but as a function ("element of a Hilbert space"). And it's not a standard out-of-the-box DESeq2 workflow, it'll require some more customization, and maybe you need to involve a local statistician or physicist.

Essentially, you can estimate a continuous, smooth function of time for each gene and each genotype from the 14 time points, and then compare these functions between genotypes, to see for which genes they are significantly different. If the functions are set up with a parametric model, or with local polynomials (a la loess, locfit etc) then you can keep the degrees of freedom well below 14, and use the rest to estimate dispersions or standard deviations, as a basis for the statistical test. I would probably approach this as Simon suggests by rlog-transforming, then calling locfit with confidence bands, and test, for each gene, for significant difference between the genotypes. Perhaps also include independent filtering beforehand to throw out those genes where nothing happens anyway.

A second approach is to go more exploratory, and cluster or classify the time courses into the major patterns, and then perhaps see which genes change cluster/class between genotypes.

 

 

ADD COMMENT
0
Entering edit mode

Wolfgang, I'm not always as pessimistic as you make me sound. ;-) If these are independent measurements then Nadia has replicates anyway, and everything is fine.

In this case, fitting a curve or doing some kind of spline regression will be quite useful. Mike and I had planned since quite a while to add a section to the DESeq2 vignette to show how to do this within DESeq2 (which is actually possible). As usual, finding good example data is the difficulty. Anybody out there has something to offer?

However: If these are just repeated measurements of one sample of each genotype, then one is just comparing two samples, and one should not attempt any formal statistical testing. Any p values of confidence bands are bound to be misleading. This is why I recommend doing something simple like finding the genes with the biggest distance between the expression vectors over time course, then plotting these, and simply looking at the plots.

ADD REPLY
1
Entering edit mode

I've prepared this SummarizedExperiment with help from the original authors:

http://bioconductor.org/packages/devel/data/experiment/html/fission.html

36 samples, 2 conditions, 6 time points.

And have an less-than-developed section using this in the workflow which will be released with Bioc 3.

ADD REPLY
2
Entering edit mode
Simon Anders ★ 3.7k
@simon-anders-3855
Last seen 3.7 years ago
Zentrum für Molekularbiologie, Universi…

Your formula does not have any degrees of freedom left, because you do not have replicates: each time course is done only once, hence you cannot test anything. I'm a bit surprised the DESeq2 did not reject the design with an error.

Are these repeated measurements on the same two sample? Then you don't have replication and cannot perform statistical tests. I'd recommend to use the rlog transform and simply look for genes with strong differences, for example by summing up the gene's rlog fold changes between genotypes.

Or have you prepared, for each genotype, 14 independent samples which you have incubated for different lengths of time? Then, you can legitimately ask which genes show a genotype-dependent expression profile over time by comparing the full model '~ time + genotype' with the reduced model '~time'.

ADD COMMENT
2
Entering edit mode
@wolfgang-huber-3550
Last seen 12 days ago
EMBL European Molecular Biology Laborat…

Nadia

By clustering, I meant using a clustering algorithm like k-means or hierarchical clustering. A bit of googling for example brought up this thread: http://stats.stackexchange.com/questions/3271/clustering-genes-in-a-time-course-experiment Also, you could see what Simon did here: http://genomebiology.com/2010/11/9/R93 esp. Figs. 2, 3.

The main point is that after applying the rlog to the RNA-Seq data, essentially all the time course related methods that people developed for microarrays over the last 15 years are in principle applicable.

Regarding your other questions, I think they exceed what can be reasonably provided by this forum. There are probably two good, not mutually exclusive solutions: you learn a little bit of statistics and R to understand the difference between factors and  continuous variables, and linear models, and study the documentation of locfit and how it computes confidence bands. Perhaps the gptk package on CRAN would also be a good place to look (and more recent). Which brings up the second suggestion, I think your data and question are complex enough that they could benefit from collaborating with a statistician or mathematician.

 

 

ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 1 day ago
United States

hi Nadia,

One note: we recommend users store phenotypic data in CSV files and then read them into R. One typo and your analysis can be thrown off if you keep the phenotypic data inside the R script. Having it in a CSV also helps you scan the information for a single sample easily.

If I am not mistaken, you don't have any replicates (it looks like for each genotype at each time point you only have one sample), so you have no chance to estimate the dispersion using the design formula with interactions specified above or perform proper statistical inference. See the help page for the DESeq function under "Experiments without replicates". You should have gotten a message from the R console after running DESeq() which says "same number of samples and coefficients to fit, estimating dispersion by treating samples as replicates". So it's not surprising you don't have any genes with small p-values.

Note: in order to use the likelihood ratio test just do this:

dds = DESeq(ddsHTSeq, test="LRT", reduced=~genotype + time)
res = results(dds)

DESeq() is a wrapper for the following steps: estimating size factors, dispersions and then nbinomWaldTest or nbinomLRT. So it doesn't make sense to run DESeq and then rerun all the steps individually.

"I also want to test for differences in gene expression between individual time of each genotype seperately. T2 vs T1, T3 vs T2, T4 vs T3 and so on. How can I access that?"

I'm not sure how you can compare time points for each genotype, given that you have no biological replicates and there is no way to estimate the dispersion of the counts.

(edited)

ADD COMMENT
0
Entering edit mode
nafeesa ▴ 10
@nafeesa-6842
Last seen 5.8 years ago
Germany

Dear all, thank you for your replies.

I have 14 independent samples for each genotype and each was "incubated" for a different length of time but yes, I also have only one replicate for each genotype and each timepoint.

Wolfgang, thank you very much. Can you give me a little help on how to call locfit with confidence bands and on how to set time as a continuous covariate,? I am not very familiar with R yet. By clustering the time courses into the major patterns you mean for example to sum up the first 7 and the last 7 timepoints and compare both clusters, am I right?

Nadia

ADD COMMENT

Login before adding your answer.

Traffic: 388 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6