I have analysed the first portion of my data, based on input from this forum, but I want to make sure I am doing it correctly before moving on. As I mentioned in another post, my dataset consists of leaf samples collected from trees subjected to three different treatments, with 4 reps per treatment. Samples were taken from various locations on each tree, starting at the top and moving down:
        id            node_id  treatment  node_loc  main_lateral
    1   15_A_main     1        A          main      main
    2   15_A_middle   17       A          middle    lateral
    3   15_A_middle   14       A          middle    lateral
    4   15_A_upper    9        A          upper     lateral
    5   15_A_lower    26       A          lower     lateral
    6   15_A_lower    20       A          lower     lateral
    7   16_C_main     1        C          main      main
    8   16_C_middle   14       C          middle    lateral
    9   16_C_upper    7        C          upper     lateral
    10  16_C_lower    25       C          lower     lateral
    11  16_C_lower    22       C          lower     lateral
    12  16_C_middle   17       C          middle    lateral
There are three treatments in total: A, B and C. We expect substantial differences in gene expression between the three treatments as sampling moves down the tree (the node_id column starts at 1 at the top and increases as sampling moves down the tree). At first, however, we are simply interested in finding genes that vary with node_id regardless of treatment.
So I have done the following for the design:
    library(splines)  # provides ns()
    X <- ns(y$samples$node_id, df = 3)
    design <- model.matrix(~X)
For the actual DE testing, I did:
    y <- estimateDisp(y, design, robust=TRUE)
    fit.spline <- glmQLFit(y, design, robust=TRUE)
    fit.spline <- glmQLFTest(fit.spline, coef = 2:4)
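To check that coef = 2:4 really covers every spline term, the design can be inspected on its own. This is only a minimal sketch: the node_id vector below is a stand-in built from the twelve rows shown in the table above, not the full y$samples$node_id, and spline_cols is a name made up for illustration.

```r
library(splines)  # ships with R; provides ns()

# Stand-in for y$samples$node_id (assumption: only the 12 rows shown above)
node_id <- c(1, 17, 14, 9, 26, 20, 1, 14, 7, 25, 22, 17)

X <- ns(node_id, df = 3)      # natural cubic spline basis with 3 columns
design <- model.matrix(~X)    # intercept plus the 3 spline columns

colnames(design)              # "(Intercept)" "X1" "X2" "X3"

# Positions of every non-intercept column, i.e. the spline terms
spline_cols <- which(colnames(design) != "(Intercept)")
spline_cols                   # 2 3 4
```

If the spline columns come out as 2, 3 and 4, then testing coef = 2:4 is a joint test of all three spline coefficients, which is what "does expression change with node_id at all" amounts to under this design.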
Is this all I need to do to answer this particular question ("which genes are differentially expressed with node_id regardless of treatment")?
Thanks very much!