DESeq2 for time series using time as continuous and looking at effect of time on DE
1
0
Entering edit mode
ap874 • 0
@ap874-16253
Last seen 3.4 years ago

Hello,

My question is (I believe) more of a conceptual nature of what I can do with a time course RNA Seq dataset. I will still include the code below (and please let me know if I leave something out that should be in here and I’ll make sure to include it).

My dataset consists of 4 genotypes that were collected every 24h until they showed a given phenotype, which resulted in 10, 13, 15, or 17 data points with 3 replicates. (For those that work with plants, dormant buds collected from the field and placed into forcing conditions, collected daily for RNA Seq until budbreak).

Because what I’m mostly looking for are similarities between the genotypes, I’m treating the datasets separately.

I’m using time as a continuous variable, with up to time^4. By doing this I’m covering for transient expression of genes (alternatively I could use splines, but that wouldn’t make my life easier I think in doing what I want to do). In DESeq, I compare a full model to one with simply an intercept, and with that I end up getting ~20k DE genes – essentially everything that has a non-zero expression at some point. What I did then was extract the parameter estimates and intercepts for the model, and then filtered the genes for what had a max predicted counts lower than 10 and logFC of 2 within the time frame of each genotype. This filtering step was done manually on excel. In the end I had ~8k DE genes for each of the genotypes, which I then used for pathway enrichment analysis and qualitative comparisons between the genotypes (e.g. a gene related to growth that starts going up for genotype A earlier, and genotype D later, matching the time to phenotype idea).

I think my approach is appropriate, but I wanted some confirmation of this. I couldn’t find anything that treated RNA Seq data in this way. Most of the experiments only use time as a factor in an interaction term with a treatment, but not the effect of time itself.

sampleTable<-data.frame(sampleName=z$sampleFiles,fileName=z$fileName, time=z$Time, time2=z$Time2,time3=z$Time3,time4=z$Time4)


dds=0

library(DESeq2)
ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design = ~time+time2+time3+time4)

dds<-DESeq(ddsHTSeq, test="LRT", reduced=~1, fitType = "parametric")

res<-results(dds)
summary(res)

head(res[order(res$padj),]) a=dds@rowRanges@elementMetadata@listData$Intercept
b=dds@rowRanges@elementMetadata@listData$time c=dds@rowRanges@elementMetadata@listData$time2
d=dds@rowRanges@elementMetadata@listData$time3 e=dds@rowRanges@elementMetadata@listData$time4
f=dds@rowRanges@partitioning@NAMES
g=results(dds)$padj t=data.frame(cbind(f,a,b,c,d,e,g)) colnames(t)=c("GeneID","Intercept", "betatime","betatime2","betatime3","betatime4","padjumodel") summary(t) write.csv(t,"DESeqresults.csv") ADD COMMENT 1 Entering edit mode @mikelove Last seen 28 minutes ago United States First, you can use e.g. mcols(dds)$Intercept to pull out coefficients.

See colnames(mcols(dds)) and mcols(mcols(dds), use.names=TRUE).

Secondly, there is probably a way to better extract the genes you want for pathway enrichment using DESeq2, and with all the samples in the same dataset, which is recommended for better parameter estimation. Can you describe what kind of genes you want to pull out?

0
Entering edit mode

What I want to pull out are genes that are DE when compared to themselves at any of the other points in time. So rather than doing pairwise comparisons for every day in every combination (i.e., 0 vs 1, 0 vs 2, ..., 0 vs 17, 1 vs 2, ..., 16 vs 17), my idea was to use the fact that DESeq2 does the LRT test between a full and a null models, essentially telling me for which genes I have an effect of time (full model ~t+t^2+t^3+t^4, null ~1). [I understand that the t^4 (even t^3) may be something that is likely increasing the number of DE genes I get, but when looking at that number of time points, I think it is appropriate]

From these I want to pull out genes that I believe are likely giving me the most effect over the phenotypes observed, so that's when I look for log(max(counts),2) / log(min(counts), 2) >2 (at least a 4 fold change in expression), and that has more than 10 counts at the highest point. These filters I apply to the predicted values (as opposed to actual counts) for any given day, based on the fact that the models are good fits for the data.

The decision to separate the genotypes in different datasets is that there is one phenotype that is explained by time, and another that is explained by each genotype's response to temperature. Combining them would result in me probably getting significant interaction terms for each time coefficient. And regardless, the fact that I have time^2 and on makes it already hard to compare the actual coefficients between genotypes.

One potential benefit of using all genotypes in the same dataset would be that I could see the specific protein homologs that are differentially expressed, but since I'm using 3 species and aligning to the genome of one of them, it is possible that the homolog specific information is not true.

Thanks for the tips for the coefficients! Slims down the code :)

1
Entering edit mode

It seems like you've got a set of rules which is giving you the genes that you are interested in. I'd say as long as you do some plotting of the normalized counts over time for the genes of interest, and these look correct, then go for it.