Dear all, I would very much appreciate any assistance in the analysis of our RNA-SEQ data, i feel that i'm making a mistake somewhere but cant quite lay my finger on where this is happening.
Study design: 8 wasp allergic patients were stung by a wasp, peripheral blood samples were drawn before the sting (T0) and at 5 set time points afterwards (T5, T11, T15, T30 and T60 minutes). 3 patients developed a systemic reaction (WelReactie), 5 patients dit not develop a systemic reaction (GeenReactie).
Study question: Which genes are deferentially expressed at any time-point between the group "WelReactie" and the group "GeenReactie"
Code
library(DESeq2) directory = 'C:/Users/Default.COLTON/Documents/R/Longitudinal' setwd(directory) sampleFiles <- grep(".htseq", list.files(directory),value=T) sampleFiles sampleCondition <- c('GeenReactie','GeenReactie','GeenReactie','GeenReactie','GeenReactie','GeenReactie', 'GeenReactie','GeenReactie','GeenReactie','GeenReactie','GeenReactie','GeenReactie', 'GeenReactie','GeenReactie','GeenReactie','GeenReactie','GeenReactie','GeenReactie', 'GeenReactie','GeenReactie','GeenReactie','GeenReactie','GeenReactie','GeenReactie', 'GeenReactie','GeenReactie','GeenReactie','GeenReactie','GeenReactie','GeenReactie', 'WelReactie','WelReactie','WelReactie','WelReactie','WelReactie','WelReactie', 'WelReactie','WelReactie','WelReactie','WelReactie','WelReactie','WelReactie', 'WelReactie','WelReactie','WelReactie','WelReactie','WelReactie','WelReactie') time <- factor(c('T0', 'T11', 'T15', 'T30', 'T5', 'T60', 'T0', 'T11', 'T15', 'T30', 'T5', 'T60', 'T0', 'T11', 'T15', 'T30', 'T5', 'T60', 'T0', 'T11', 'T15', 'T30', 'T5', 'T60', 'T0', 'T11', 'T15', 'T30', 'T5', 'T60', 'T0', 'T11', 'T15', 'T30', 'T5', 'T60', 'T0', 'T11', 'T15', 'T30', 'T5', 'T60', 'T0', 'T11', 'T15', 'T30', 'T5', 'T60')) sampleTable<-data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition, time=time) ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design=~condition+time+time:condition) ddsHTSeq$condition<-factor(ddsHTSeq$condition, levels=c("GeenReactie","WelReactie")) ddsHTSeq$time <- factor(ddsHTSeq$time, levels=c("T0", "T5", "T11", "T15", "T30", "T60")) dds = DESeq(ddsHTSeq, test="LRT", reduced=~condition + time) resTC <- results(dds) resTC$symbol <- mcols(ddsTC)$symbol head(resTC[order(resTC$padj),],4) write.csv(res, file="DATE-Longitudinal.csv")
Results:
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | |
ENSG00000124882 | 11.87947 | 3.418625 | 0.899554 | 25.94876 | 9.13E-05 | 1 |
ENSG00000198918 | 34.99821 | 2.464948 | 0.583217 | 21.557 | 0.000635 | 1 |
ENSG00000170632 | 49.10306 | 1.171724 | 0.285648 | 20.24493 | 0.001124 | 1 |
ENSG00000085274 | 90.14553 | 0.950642 | 0.231124 | 19.91252 | 0.001298 | 1 |
Outcome: To my surprise, all adjusted P values were exactly 1, whereas just a standard crossectional analysis for each time point sepperatly does indicate DEG.
Bioconductor question: are there no DEG or, far more likely, what am i doing wrong?
Dear Micheal,
To clarify what I meant: deseq analysis of all 8 samples taken at baseline indicate 24 DEG (adjp<0.05), between the groups with and without a systemic reaction.
In addition, deseq analysis of all 8 samples taken at T60 minutes indicates 300 DEG (adjp<0.05), between the groups with and without a systemic reaction. To me, this is an indication of some time dependent differential expression of genes.
Attached is the plotCounts() of the top genes based on not adjusted p value, as all adjusted P values are 1.
https://postimg.org/image/d6chg26il/
It looks like there is DE at T60, but the other time points are not so clear, and the within-condition variability is very high, which leads to not-so-small p-values. So it could be the case that you have more power to detect the differences at T60 than when you do the LRT.
Given that there is a lot of within-condition variability here, I'm wondering if you would have improved power by reducing the number of terms you test in the LRT. Usually, I don't recommend to use time as a linear covariate, but it could potentially help here to recover a few genes which differ in the slope of log counts over time.
What if you try: