I have a dataset set up as follows, it is leaf sample collections from trees subjected to three different treatments, 4 reps per treatment, from various locations on the trees, starting with the top and moving down:
id node_id treatment node_loc main_lateral 1 15_A_main 1 A main main 2 15_A_middle 17 A middle lateral 3 15_A_middle 14 A middle lateral 4 15_A_upper 9 A upper lateral 5 15_A_lower 26 A lower lateral 6 15_A_lower 20 A lower lateral 7 16_C_main 1 C main main 8 16_C_middle 14 C middle lateral 9 16_C_upper 7 C upper lateral 10 16_C_lower 25 C lower lateral 11 16_C_lower 22 C lower lateral 12 16_C_middle 17 C middle lateral 13 20_A_main 1 A main main 14 20_A_upper 9 A upper lateral 15 20_A_lower 19 A lower lateral 16 20_A_lower 17 A lower lateral 17 20_A_lower 16 A lower lateral 18 20_A_middle 12 A middle lateral 19 23_A_main 1 A main main 20 23_A_middle 15 A middle lateral
It's only partially shown here, but there are three treatments (A, B, C), a node_id column (a continuous variable in which all samples begin at 1, the collection from the main or top of the tree, and continue down the tree counted by number of nodes with variable distances from rep to rep), a node_loc column (these are designations for a binning the node_id by their collection point on the tree), and a main_lateral column (which simply delineates the collections from the main or top of the tree from the collection from lateral shoots, or a node_id of 1 versus and node_id of something >1).
I set up my analysis like this:
metadata <- read.delim("2020_branch_treatment_metadata.txt", check.names=FALSE) x <- read.delim("2020_branch_treatment_counts.txt", check.names=FALSE, header=TRUE, row.names="Geneid") x.cpm <- cpm(x) threshold <- x.cpm > 0.5 keep <- rowSums(threshold) >= 2 x.keep <- x[keep,] y <- DGEList(counts=x.keep, samples = metadata) y <- calcNormFactors(y)
Based on what I have seen in this forum and based on the response, I received earlier, I decided to try to fit the continuous data (node_id) to a spline since it will almost certainly not be linear, which I did as follows:
X <- ns(y$samples$node_id, df = 3)
I think I understand how to find DE genes that differ with the node_id which is one thing we want to investigate, based on this post ([edgeR time series analysis with no replicates]), and that it should be set up like:
treatment <- factor(y$samples$treatment) design <- model.matrix(~0 + treatment + treatment:X)
There are two other things we would like to do with these data:
1) We would like to find DE genes with expression changes associated with their distance from the main/top of the tree and then group them into heatmaps/dendrograms based on their mean expression values, but delineated by the more easy-to-understand node_loc (treatment_main/middle/lower/upper). I'm not really sure how or if the data fitted to a spline would fit into that or how to set up the design. If not would the following make sense?
treatment <- factor(y$samples$treatment) node_loc <- factor(y$samples$node_loc) design <- model.matrix(~0 + treatment + node_loc)
2) We are also interested in genes that are uniquely differentially expressed in the main versus those in the laterals, especially those in treatment A and B versus C. Again, I'm not sure whether or not to incorporate the spline into the design or not. Would the following work to do this?
treatment <- factor(y$samples$treatment) main_lateral <- factor(y$samples$main_lateral) design <- model.matrix(~0 + treatment + main_lateral + treatment:main_lateral)
If anyone could let me know whether or not I'm on the right track or not, I'd greatly appreciate it. I'm just really confused on how to handle these different categories of data that are mostly redundant descriptions of the node_id, which is the only variable other than treatment that seems to have any actual measurable value.
Thanks very much! Your response cleared it up for me. I did have a couple additional questions though, if you don't mind. The first one is regarding the spline-fitted data in my original post. When I run that, I get the following coefficients:
When I create the comparisons to find DE, do I want to include the first 3 coefficients or leave them out. In other words, is it better or is there any real difference in doing:
versus just doing:
My second question involves the one-way portion of the analysis and is more a matter of my own curiosity. The link you provided explains how to run that portion of my analysis very nicely, but I wanted to know if there is a difference between treating two pieces of metadata as an interaction in the design any different from pasting those portions of metadata together and treating them as just another group in the design. In other words, is the following:
any different from doing:
Thanks very much for your time and your help!
As stated in time course case study in the edgeR user's guide, the three coefficients
X3do not have any particular meaning. So it doesn't make sense to compare each of them between your treatment groups.
For your second question, the one-way layout design matrix should be
It is equivalent to the interactive design but is much easier for making comparison contrasts.