Question

DESeq2 timecourse - How to set the experimental design

1

Entering edit mode

nafeesa ▴ 10

@nafeesa-6842

Last seen 6.5 years ago

Germany

Hi,

I am trying to analyze RNA-Seq Data with DESeq2 and could use some help. I have 2 genotypes and 14 timepoints. I want to find differences in gene expression between both genotypes overall and at every timepoint and between every two timepoints in each genotype. After reading a lot of manuals, paper and posts here and elsewhere, I believe that the correct design for my experiment is ~genotype+time+time:genotype in the full and ~genotype+time in the reduced formula. Here is what I did so far:

library("DESeq2")

directory<-"............."
sampleFiles <- grep("??",list.files(directory),value=TRUE)

time <- factor(c("T12", "T13", "T14", "T1", "T11", "T5", "T7", "T2", "T9", "T3", "T6", "T8", "T4", "T10", "T12", "T13", "T14", "T1", "T11", "T5", "T7", "T2", "T9", "T3", "T6", "T8", "T4", "T10" ))
genotype <- factor(c("GF","GF","GF","GF","GF","GF","GF","GF","GF","GF","GF","GF","GF","GF","VB","VB","VB","VB","VB","VB","VB","VB","VB","VB","VB","VB","VB","VB"))
sampleTable<-data.frame(sampleName=sampleFiles, fileName=sampleFiles, genotype=genotype, time=time)
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design=~genotype+time+time:genotype)

ddsHTSeq$genotype<-factor(ddsHTSeq$genotype, levels=c("GF","VB"))
ddsHTSeq$time <- factor(ddsHTSeq$time, levels=c("T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8", "T9", "T10", "T11", "T12", "T13", "T14"))

dds <- DESeq(ddsHTSeq)
res <- results(dds)

ddsHTSeq <- estimateSizeFactors(ddsHTSeq)
ddsHTSeq <- estimateDispersions(ddsHTSeq)
ddsLRT <- nbinomLRT(dds, reduced = formula(~genotype+time)

resLRT<- results(ddsLRT)

Unfortunately I get:

> sumis.na(res$pvalue))
[1] 10295
> sum(res$pvalue<0.05, na.rm=TRUE)
[1] 0
> sum(res$padj<0.1, na.rm=TRUE)
[1] 0
So this means, there is no time influenced difference between genotypes. Is that correct? The count data tells me soemthing else though....

I also want to test for differences in gene expression between individual time of each genotype seperately. T2 vs T1, T3 vs T2, T4 vs T3 and so on. How can I access that?

I would be greatful and appreciate any help. Thank you!

Nadia

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] DESeq2_1.4.5              RcppArmadillo_0.4.450.1.0
[3] Rcpp_0.11.2               GenomicRanges_1.16.4
[5] GenomeInfoDb_1.0.2        IRanges_1.22.10
[7] BiocGenerics_0.10.0

loaded via a namespace (and not attached):
[1] annotate_1.42.1      AnnotationDbi_1.22.6 Biobase_2.20.1
[4] DBI_0.2-5            genefilter_1.46.1    geneplotter_1.42.0
[7] grid_3.1.1           lattice_0.20-29      locfit_1.5-9.1
[10] RColorBrewer_1.0-5   RSQLite_0.11.2       splines_3.1.1
[13] stats4_3.1.1         survival_2.37-7      XML_3.98-1.1
[16] xtable_1.7-1         XVector_0.4.0

deseq2 geneexpression timecourse • 9.7k views

ADD COMMENT • link updated 10.2 years ago by Wolfgang Huber ★ 13k • written 10.2 years ago by nafeesa ▴ 10

score 3 · Answer 1 · 2014-10-08

Nadia,

I'd like to contribute a somewhat more optimistic assessment than the previous two replies ("you don't have any replicates ... so you have no chance to ... perform proper statistical inference"; "hence you cannot test anything"), but for that it'll be necessary to think of time not as a factor but as a continuous covariate, and of the response not as a set of 14 different numbers but as a function ("element of a Hilbert space"). And it's not a standard out-of-the-box DESeq2 workflow, it'll require some more customization, and maybe you need to involve a local statistician or physicist.

Essentially, you can estimate a continuous, smooth function of time for each gene and each genotype from the 14 time points, and then compare these functions between genotypes, to see for which genes they are significantly different. If the functions are set up with a parametric model, or with local polynomials (a la loess, locfit etc) then you can keep the degrees of freedom well below 14, and use the rest to estimate dispersions or standard deviations, as a basis for the statistical test. I would probably approach this as Simon suggests by rlog-transforming, then calling locfit with confidence bands, and test, for each gene, for significant difference between the genotypes. Perhaps also include independent filtering beforehand to throw out those genes where nothing happens anyway.

A second approach is to go more exploratory, and cluster or classify the time courses into the major patterns, and then perhaps see which genes change cluster/class between genotypes.

score 2 · Answer 2 · 2014-10-08

Your formula does not have any degrees of freedom left, because you do not have replicates: each time course is done only once, hence you cannot test anything. I'm a bit surprised the DESeq2 did not reject the design with an error.

Are these repeated measurements on the same two sample? Then you don't have replication and cannot perform statistical tests. I'd recommend to use the rlog transform and simply look for genes with strong differences, for example by summing up the gene's rlog fold changes between genotypes.

Or have you prepared, for each genotype, 14 independent samples which you have incubated for different lengths of time? Then, you can legitimately ask which genes show a genotype-dependent expression profile over time by comparing the full model '~ time + genotype' with the reduced model '~time'.

score 2 · Answer 3 · 2014-10-12

Nadia

By clustering, I meant using a clustering algorithm like k-means or hierarchical clustering. A bit of googling for example brought up this thread: http://stats.stackexchange.com/questions/3271/clustering-genes-in-a-time-course-experiment Also, you could see what Simon did here: http://genomebiology.com/2010/11/9/R93 esp. Figs. 2, 3.

The main point is that after applying the rlog to the RNA-Seq data, essentially all the time course related methods that people developed for microarrays over the last 15 years are in principle applicable.

Regarding your other questions, I think they exceed what can be reasonably provided by this forum. There are probably two good, not mutually exclusive solutions: you learn a little bit of statistics and R to understand the difference between factors and continuous variables, and linear models, and study the documentation of locfit and how it computes confidence bands. Perhaps the gptk package on CRAN would also be a good place to look (and more recent). Which brings up the second suggestion, I think your data and question are complex enough that they could benefit from collaborating with a statistician or mathematician.

score 0 · Answer 4 · 2014-10-08

hi Nadia,

One note: we recommend users store phenotypic data in CSV files and then read them into R. One typo and your analysis can be thrown off if you keep the phenotypic data inside the R script. Having it in a CSV also helps you scan the information for a single sample easily.

If I am not mistaken, you don't have any replicates (it looks like for each genotype at each time point you only have one sample), so you have no chance to estimate the dispersion using the design formula with interactions specified above ~~or perform proper statistical inference~~. See the help page for the DESeq function under "Experiments without replicates". You should have gotten a message from the R console after running DESeq() which says "same number of samples and coefficients to fit, estimating dispersion by treating samples as replicates". So it's not surprising you don't have any genes with small p-values.

Note: in order to use the likelihood ratio test just do this:

dds = DESeq(ddsHTSeq, test="LRT", reduced=~genotype + time)
res = results(dds)

DESeq() is a wrapper for the following steps: estimating size factors, dispersions and then nbinomWaldTest or nbinomLRT. So it doesn't make sense to run DESeq and then rerun all the steps individually.

"I also want to test for differences in gene expression between individual time of each genotype seperately. T2 vs T1, T3 vs T2, T4 vs T3 and so on. How can I access that?"

I'm not sure how you can compare time points for each genotype, given that you have no biological replicates and there is no way to estimate the dispersion of the counts.

(edited)

score 0 · Answer 5 · 2014-10-09

Dear all, thank you for your replies.

I have 14 independent samples for each genotype and each was "incubated" for a different length of time but yes, I also have only one replicate for each genotype and each timepoint.

Wolfgang, thank you very much. Can you give me a little help on how to call locfit with confidence bands and on how to set time as a continuous covariate,? I am not very familiar with R yet. By clustering the time courses into the major patterns you mean for example to sum up the first 7 and the last 7 timepoints and compare both clusters, am I right?

Nadia