Hello,
I have performed an analysis of my data, the first portion at least, based on input from this forum, but I wanted to make sure I am doing this correctly before moving on. As I've mentioned in another post, I have a dataset set up as follows, it is leaf sample collections from trees subjected to three different treatments, 4 reps per treatment, from various locations on the trees, starting with the top and moving down:
id node_id treatment node_loc main_lateral
1 15_A_main 1 A main main
2 15_A_middle 17 A middle lateral
3 15_A_middle 14 A middle lateral
4 15_A_upper 9 A upper lateral
5 15_A_lower 26 A lower lateral
6 15_A_lower 20 A lower lateral
7 16_C_main 1 C main main
8 16_C_middle 14 C middle lateral
9 16_C_upper 7 C upper lateral
10 16_C_lower 25 C lower lateral
11 16_C_lower 22 C lower lateral
12 16_C_middle 17 C middle lateral
There are actually 3 different treatments - A,B,C. We expect that there will be a significant amount of difference between the three treatments in gene expression as sampling moves down the tree (the node_id column starts at 1 at the top and increases as sampling moves down the tree). However, we are interested in simply finding genes that vary with node_id regardless of the treatment at first.
So I have done the following for the design:
X <- ns(y$samples$node_id, df = 3)
design <- model.matrix(~X)
For the actual DE testing, I did:
y <- estimateDisp(y, design, robust=TRUE)
fit.spline <- glmQLFit(y, design, robust=TRUE)
fit.spline <- glmQLFTest(fit.spline, coef = 2:4)
Is this all I need to do to answer this particular question ("which genes are differentially expressed with node_id regardless of treatment")?
Thanks very much!
Erik
OK, thank you!!! Just to be clear, when I do that the output of
colnames(design)
becomesTherefore, to test for DE, I would simply use
Correct?
Thanks again for the help!!!
Erik
No. As Gordon already explained, testing the node coefficients simply tells you if the expression values change depending on where you collected the leaves. You already assume that to be true! The interesting thing is to know if the treatment affects expression, after controlling for the fact that you have collected the leaves at different places in the different plants. The treatmentB coefficient tests for differences between treatmentB and treatmentA, and treatmentC does the same (e.g., treatmentC vs treatmentA). And if you want to know differences between treatmentB and treatmentC, you can use a contrast comparing the two.
Right, that is what we are interested in for the initial analysis, the genes that change with collection location without regard to which treatment was received. I think the reason I have been asked to do this is so that heatmaps and dendrograms can be generated for this group of genes downstream, comparing their expression in the different treatments and presenting it visually. That's my guess, anyway, since you are definitely correct that it wouldn't be of the most experimental interest.
In that case, one more complication. There is the possibility that the changes across collection locations are dependent on the treatment. The model you are fitting ignores that possibility. If you do
You will add coefficients that estimate the difference in expression at collection site, depending on the treatment. Or put another way, genes that have different curves depending on the treatment (see the example in the limma User's Guide, section 9.6.2). Those genes do change with collection location, but not without regard to the treatment. The genes that change with collection location without regard to the treatment will be those with a significant F-test for coef = 4:6, that are not significant for coef = 7:12.
That should be
for the coefficients to be in the order that using coef = 4:6 and coef = 7:12 will work as I describe.
That makes sense, thanks so much for the clear and detailed explanations!!!! I will look again at the limma guide to try to better my understanding also.