Spline-fitted data analysis in edgeR question
1
0
Entering edit mode
@3746f5fd
Last seen 9 weeks ago
United States

Hello,

I have performed an analysis of my data, the first portion at least, based on input from this forum, but I wanted to make sure I am doing this correctly before moving on. As I've mentioned in another post, I have a dataset set up as follows, it is leaf sample collections from trees subjected to three different treatments, 4 reps per treatment, from various locations on the trees, starting with the top and moving down:

            id node_id treatment node_loc main_lateral
1    15_A_main       1         A     main         main
2  15_A_middle      17         A   middle      lateral
3  15_A_middle      14         A   middle      lateral
4   15_A_upper       9         A    upper      lateral
5   15_A_lower      26         A    lower      lateral
6   15_A_lower      20         A    lower      lateral
7    16_C_main       1         C     main         main
8  16_C_middle      14         C   middle      lateral
9   16_C_upper       7         C    upper      lateral
10  16_C_lower      25         C    lower      lateral
11  16_C_lower      22         C    lower      lateral
12 16_C_middle      17         C   middle      lateral


There are actually 3 different treatments - A,B,C. We expect that there will be a significant amount of difference between the three treatments in gene expression as sampling moves down the tree (the node_id column starts at 1 at the top and increases as sampling moves down the tree). However, we are interested in simply finding genes that vary with node_id regardless of the treatment at first.

So I have done the following for the design:

X <- ns(y$samples$node_id, df = 3)
design <- model.matrix(~X)


For the actual DE testing, I did:

y <- estimateDisp(y, design, robust=TRUE)
fit.spline <- glmQLFit(y, design, robust=TRUE)
fit.spline <- glmQLFTest(fit.spline, coef = 2:4)


Is this all I need to do to answer this particular question ("which genes are differentially expressed with node_id regardless of treatment")?

Thanks very much!

Erik

edgeR • 545 views
1
Entering edit mode
@gordon-smyth
Last seen 2 hours ago
WEHI, Melbourne, Australia

You have correctly fitted a smooth curve through the expression values as a function of node_id, and you have tested whether expression varies with node_id.

Your code assumes there is no treatment effect, which seems unrealistic since testing the treatment effect is the whole point of your experiment. Defining

design <- model.matrix(~treatment+X)


would instead find genes whose expression varies with node_id in the same way for both treatments, without assuming that the baseline expression is the same for both treatments.

0
Entering edit mode

OK, thank you!!! Just to be clear, when I do that the output of colnames(design) becomes

"(Intercept)"        "treatmentB"   "treatmentC"   "X1"                 "X2"                 "X3"


Therefore, to test for DE, I would simply use

y <- estimateDisp(y, design, robust=TRUE)

fit.spline <- glmQLFit(y, design, robust=TRUE)

fit.spline <- glmQLFTest(fit.spline, coef = 4:6)


Correct?

Thanks again for the help!!!

Erik

1
Entering edit mode

No. As Gordon already explained, testing the node coefficients simply tells you if the expression values change depending on where you collected the leaves. You already assume that to be true! The interesting thing is to know if the treatment affects expression, after controlling for the fact that you have collected the leaves at different places in the different plants. The treatmentB coefficient tests for differences between treatmentB and treatmentA, and treatmentC does the same (e.g., treatmentC vs treatmentA). And if you want to know differences between treatmentB and treatmentC, you can use a contrast comparing the two.

0
Entering edit mode

As Gordon already explained, testing the node coefficients simply tells you if the expression values change depending on where you collected the leaves.

Right, that is what we are interested in for the initial analysis, the genes that change with collection location without regard to which treatment was received. I think the reason I have been asked to do this is so that heatmaps and dendrograms can be generated for this group of genes downstream, comparing their expression in the different treatments and presenting it visually. That's my guess, anyway, since you are definitely correct that it wouldn't be of the most experimental interest.

1
Entering edit mode

In that case, one more complication. There is the possibility that the changes across collection locations are dependent on the treatment. The model you are fitting ignores that possibility. If you do

design <- model.matrix(~X*treatment)


You will add coefficients that estimate the difference in expression at collection site, depending on the treatment. Or put another way, genes that have different curves depending on the treatment (see the example in the limma User's Guide, section 9.6.2). Those genes do change with collection location, but not without regard to the treatment. The genes that change with collection location without regard to the treatment will be those with a significant F-test for coef = 4:6, that are not significant for coef = 7:12.

1
Entering edit mode

That should be

design <- model.matix(~treatment*X)


for the coefficients to be in the order that using coef = 4:6 and coef = 7:12 will work as I describe.

0
Entering edit mode

That makes sense, thanks so much for the clear and detailed explanations!!!! I will look again at the limma guide to try to better my understanding also.