I have a CASE/CONTROL RNA-seq dataset with three time point measurements (PRE/MID/POST) for both groups. We're trying to test whether there's a difference in expression between the two groups at any of the time points. From a physiological point of view we hypothesize that there's no big difference between the groups at the PRE and POST measurements, but we should see a difference at the MID time point. In total we have a little over 100 RNA libraries, so there are quite a few biological replicates but no technical replicates. We have eliminated possible outliers based on PCA and heatmaps, and we are using RUVSeq to remove unwanted variation from the data.
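For context, the RUVSeq step looks roughly like this (a sketch, not our exact code; `counts`, `coldata` and the in-silico control gene set `empirical` are assumed to exist already):

```r
library(RUVSeq)

# counts: gene-by-sample count matrix; coldata: sample annotation data.frame
set <- newSeqExpressionSet(counts,
                           phenoData = AnnotatedDataFrame(coldata))

# 'empirical' is assumed to be a vector of in-silico control genes,
# e.g. the least significant genes from a first-pass DESeq2 run
set <- RUVg(set, empirical, k = 1)

# The estimated factor of unwanted variation can then be carried into
# the DESeq2 design, e.g. ~ W_1 + CASE_CONTROL + PRE_MID_POST + ...
coldata$W_1 <- pData(set)$W_1
```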
There's a good example of such a model in the DESeq2 vignette, but we also want to account for within-subject variability (I think). In our colData we have CASE/CONTROL status, PRE_MID_POST status, and SUBJECT_ID (which tells us which three RNA libraries belong to which subject).
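To make the layout concrete, here is a toy colData with the column names from our setup (the subject labels are made up). Since each subject belongs to only one group, subjects are nested within CASE_CONTROL, so the vignette's "individuals nested within groups" trick of re-numbering subjects within each group would apply:

```r
# Toy colData: each subject contributes PRE, MID and POST libraries
coldata <- data.frame(
  SUBJECT_ID   = factor(rep(c("S1", "S2", "S3", "S4"), each = 3)),
  CASE_CONTROL = factor(rep(c("CONTROL", "CASE"), each = 6)),
  PRE_MID_POST = factor(rep(c("PRE", "MID", "POST"), 4),
                        levels = c("PRE", "MID", "POST"))
)

# Re-number subjects within each group (1, 2, ... per group), as the
# DESeq2 vignette does for individuals nested within groups:
coldata$ind.n <- factor(ave(as.integer(coldata$SUBJECT_ID),
                            coldata$CASE_CONTROL,
                            FUN = function(x) as.integer(factor(x))))
```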
I think the right model to test a case/control time course should be full: ~ CASE_CONTROL + PRE_MID_POST + CASE_CONTROL:PRE_MID_POST and reduced: ~ CASE_CONTROL + PRE_MID_POST, using the LRT in DESeq2. However, the p-values don't look right, and the results aren't what you would expect: it seems DESeq2 isn't flagging outliers or filtering low-count genes at all, etc. I'm not sure whether this is because we should also account for SUBJECT_ID in the model, so that within-subject variability is handled. I tried this both by adding it to the formula and by building a separate model matrix following the vignette (individuals nested within groups), but passing the matrix through design() doesn't work:

Error in (function (classes, fdef, mtable) :
  unable to find an inherited method for function 'design' for signature '"matrix"'
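For reference, this is roughly what we are attempting (a sketch; `counts` and `coldata` with an `ind.n` column are assumed). Note that `design(dds)` only accepts a formula; model matrices have to be handed to `DESeq()` through its `full=` and `reduced=` arguments, which is presumably why the error above appears:

```r
library(DESeq2)

## (a) Group-by-time interaction test via the formula interface:
dds <- DESeqDataSetFromMatrix(counts, coldata,
         design = ~ CASE_CONTROL + PRE_MID_POST + CASE_CONTROL:PRE_MID_POST)
dds <- DESeq(dds, test = "LRT",
             reduced = ~ CASE_CONTROL + PRE_MID_POST)
res <- results(dds)

## (b) Adding subject as a term nested within group: build the model
## matrices explicitly and pass them to DESeq() -- not to design().
full <- model.matrix(~ CASE_CONTROL + CASE_CONTROL:ind.n +
                       CASE_CONTROL:PRE_MID_POST, coldata)
full <- full[, apply(full, 2, function(x) !all(x == 0))]  # drop all-zero
                                           # columns, as in the vignette
reduced <- model.matrix(~ CASE_CONTROL + CASE_CONTROL:ind.n +
                          PRE_MID_POST, coldata)
dds <- DESeq(dds, test = "LRT", full = full, reduced = reduced)
```

The reduced matrix in (b) is just one possibility (dropping the group-specific time terms); the right reduced model depends on exactly which null hypothesis we should be testing, which is part of the question.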
I'm quite confused about which model we should use, because when we subset only the cases, the results seem plausible and DESeq2 appears to calculate things correctly. But when we include the control group as well, for some reason there are big problems.
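The cases-only check I mean is along these lines (a sketch; within a single group SUBJECT_ID can go straight into the formula, since it is no longer confounded with CASE_CONTROL):

```r
# Subset the DESeqDataSet to cases only and drop the unused factor level
dds.case <- dds[, dds$CASE_CONTROL == "CASE"]
dds.case$CASE_CONTROL <- droplevels(dds.case$CASE_CONTROL)

# Paired time-course test within cases: subject as a blocking factor
design(dds.case) <- ~ SUBJECT_ID + PRE_MID_POST
dds.case <- DESeq(dds.case, test = "LRT", reduced = ~ SUBJECT_ID)
```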
So my questions are:
1) Should we account for the SUBJECT_ID in the model?
2) Is the LRT the right approach?
3) What sort of model would be the right solution for our case?
Help would be really appreciated, especially from Mike Love if possible.