I'm following the edgeR documentation and I'm unclear on a few things regarding setting up a design matrix and contrasts for an experiment with more than one time point. I think I may have it set up correctly but I'm not confident.
My experiment consists of 2 time points, 16 hours and 32 hours, and each time point has its own negative control. My design matrix looks like this:
Sample         Condition  Time  Group
HPI16_2A_1     control    16hr  control.16hr
HPI16_2A_2     control    16hr  control.16hr
HPI16_2A_3     control    16hr  control.16hr
HPI16_miniT_1  miniT      16hr  miniT.16hr
HPI16_miniT_2  miniT      16hr  miniT.16hr
HPI16_miniT_3  miniT      16hr  miniT.16hr
HPI32_2A_1     control    32hr  control.32hr
HPI32_2A_2     control    32hr  control.32hr
HPI32_2A_3     control    32hr  control.32hr
HPI32_miniT_1  miniT      32hr  miniT.32hr
HPI32_miniT_2  miniT      32hr  miniT.32hr
HPI32_miniT_3  miniT      32hr  miniT.32hr
I want to compare miniT.32hr to miniT.16hr while incorporating their respective controls. Is this the appropriate contrast?
my.contrasts <- makeContrasts(
miniTvscontrol.32hr = (miniT.32hr - control.32hr) - (miniT.16hr - control.16hr),
levels=design)
Hopefully I have provided enough information!
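For reference, here is a minimal sketch of how a `design` object with matching column names could be built from the sample table above. It assumes a no-intercept `~ 0 + Group` parametrization, which the question does not state, and it writes the contrast out by hand as a coefficient vector in base R rather than calling `makeContrasts`. The names `targets` and `contrast_vec` are made up for illustration.

```r
# Sample annotation reconstructed from the table in the question
# (3 replicates per Condition x Time combination).
targets <- data.frame(
  Condition = rep(c("control", "miniT", "control", "miniT"), each = 3),
  Time      = rep(c("16hr", "32hr"), each = 6),
  row.names = c(paste0("HPI16_2A_", 1:3), paste0("HPI16_miniT_", 1:3),
                paste0("HPI32_2A_", 1:3), paste0("HPI32_miniT_", 1:3))
)
targets$Group <- factor(paste(targets$Condition, targets$Time, sep = "."))

# Assumption: one column per group, no intercept.
design <- model.matrix(~ 0 + Group, data = targets)
colnames(design) <- levels(targets$Group)

# (miniT.32hr - control.32hr) - (miniT.16hr - control.16hr)
# expressed as one coefficient per design column.
contrast_vec <- c(control.16hr = 1, control.32hr = -1,
                  miniT.16hr = -1, miniT.32hr = 1)
contrast_vec <- contrast_vec[colnames(design)]  # align with column order

print(colnames(design))
print(contrast_vec)
```

With a design set up this way, the group names used in the contrast expression correspond directly to the design columns, which is what `makeContrasts(..., levels=design)` relies on.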
Yes, my mistake, thanks for pointing that out. Thank you for the feedback as well.
I've been thinking about this comparison a bit more, and now I'm not quite sure I'm interpreting it correctly. I'm worried I'm not properly normalizing (this probably isn't the right term) or accounting for each time point's control.
How is this contrast (miniT.32hr - control.32hr) - (miniT.16hr - control.16hr) working and how would it differ from (miniT.32hr - miniT.16hr) - (control.32hr - control.16hr)?
Is it possible that a gene is not a DEG at either time point individually (i.e., it's just background noise), but when doing this sort of contrast (32hr vs 16hr) it may be detected as a DEG? This is the scenario I'm most worried about.
The contrast does exactly what it says. I don't know what else to say.
It doesn't differ. The two contrasts are identical.
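To spell out why the two are the same, expanding the brackets and regrouping the terms gives:

```latex
\begin{aligned}
(\text{miniT.32hr} - \text{control.32hr}) - (\text{miniT.16hr} - \text{control.16hr})
  &= \text{miniT.32hr} - \text{control.32hr} - \text{miniT.16hr} + \text{control.16hr} \\
  &= (\text{miniT.32hr} - \text{miniT.16hr}) - (\text{control.32hr} - \text{control.16hr})
\end{aligned}
```

Either way, each group mean enters with the same sign (+1 for miniT.32hr and control.16hr, -1 for control.32hr and miniT.16hr), so both expressions define the same contrast.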
Well, no. I'm not really understanding your concern. Clearly, if
T32 = miniT.32hr - control.32hr
and
T16 = miniT.16hr - control.16hr
are both small, then the whole contrast must be small as well. The contrast tests whether the treatment effect differs between the two time points. If the treatment effect truly differs between the times, then inevitably the true treatment effect must be nonzero for at least one of the times.
On the other hand, it could be that T32 is positive but just misses out on being statistically significant as a time-specific contrast, and T16 is negative but just misses out on being significant by itself, and the complete contrast T32 - T16 becomes significant by contrasting the positive and negative effects. So the interaction can pick up some genes that are non-significant by individual testing at both time points. This is intended behavior: it's an advantage, not a disadvantage, because the time-specific testing will obviously have some false negatives.
Thanks, once again. However, I have more questions!
The number of genes/tags differs between the two time points (16 and 32 hours). These data were collected from cells infected with a virus, so these are time points post infection. As a result, there are more genes/tags at 32 hours post infection than at 16. In order to process this in edgeR I have merged the data frames; however, this obviously creates NA values. There are only a few NA values though, maybe 3-4%. I'd rather not merge the results and toss out genes that are only present at one time point, because these are useful data points. How should I handle the NAs? Can I replace them with 1?
Your new questions are not follow-ups to the original question above but are about quite different issues. Rather than asking new questions as comments, please open a new question.
When you do so, you need to give proper context. What technology are you using? Why do you have missing values? Is this thread a continuation of your previous question about proteomics data? As it is, your new question is unexpected to say the least. You cannot possibly be using edgeR in the first place if you have NAs. I cannot possibly advise you how to impute NAs while knowing absolutely nothing about your data type. Replacing NAs with 1 would be really strange and is not recommended by anyone as far as I know. In your previous question you said that you already knew how to impute NAs, so I am quite bamboozled.