Hi all,
I'm a university student experienced with R but new to Bioconductor. For my research class, I'm reproducing this study on methylphenidate's (MPH) impact on gene expression for ADHD patients.
I was able to fully reproduce the processed data and model results, but I'm confused about the model interpretation. Page 44 of limma's user guide describes the exact model used in this study, a paired sample with a block size of two. The code for this is on the bottom for reference.
In my case, the four RNA pools "consisted of samples from seven individuals each, with the same individuals present in both treatment groups". So my assumption is that we're testing the impact of MPH on gene expression, but we're also testing how it specifically affects individuals with ADHD. For this model, the design matrix can be found here.
Fitting the model and running decideTests() results in this table
My question is, does this design only explicitly test for MPH effects on the unaffected (non-ADHD) sample? If I wanted to know which genes are significantly differentially expressed for ADHD individuals, would I have to change the factor levels and run another model? I've made a folder containing all the code and files needed to reproduce everything, so if you'd like to take a look it can be found here.
Thank you for your time!
# Code from page 45 of the limma user guide
SibShip <- factor(targets$SibShip)
Treat <- factor(targets$Treatment, levels=c("C","T"))
design <- model.matrix(~SibShip+Treat)
fit <- lmFit(eset, design)
fit <- eBayes(fit)
topTable(fit, coef="TreatT")
Thanks so much for your answer! I've been trying to fit this model, but I'm getting a "No residual df: setting dispersion to NA" while trying to estimate the dispersion, along with a "Design matrix not of full rank" error when I fit the model.
I can post all the details, but I'm almost certain this is because the expression matrix only has 4 columns. So the design matrix has more columns then the expression matrix. While this study looks at 13 samples, they're pooled so there are only two measurements per group here, one before and one after the treatment, so four total columns. Would there be any way around this? Or am I limited to a maximum of 3 coefficients? Thanks again for your time!
I pointed you to a specific example. Did that not work? I don't see how the pooling has any bearing on the situation, but maybe I am missing something?
Anyway, without the details nobody can help diagnose, so you should probably post code.
Sure thing, all the code, and files are at the Google Drive link here: https://drive.google.com/open?id=1Cm7BuwSek5nRYMvOPKnSuCwzQ24EstlK
I've just added an "edgeR_code" file which is my attempt at trying section 3.5 of edgeR, but the relevant parts are posted below.
I think the pooling may be relevant because, if I'm understanding correctly, it means I have fewer columns in the expression matrix (4) even though there were 13 samples. Thanks so much for all the help, I really appreciate it.
If you only have four samples (pooled or not), then you are correct. Remember that what you are trying to do is simultaneously solve for a set of unknown values. This is just simple algebra. If you have one unknown, you need one equation to solve for that value, like
is easily solvable. But if you have
You can't solve that equation because there are an infinite number of solutions! When you fit a linear model you are doing that same exact thing. If you have two unknowns, you need two equations. If you have four equations, you cannot solve for more than four coefficients. So if you really only have four observations you can't analyze your data while blocking on subject, because that's six unknowns and four equations.