I have a dataset set up as follows, it is leaf sample collections from trees subjected to three different treatments, 4 reps per treatment, from various locations on the trees, starting with the top and moving down:
id node_id treatment node_loc main_lateral 1 15_A_main 1 A main main 2 15_A_middle 17 A middle lateral 3 15_A_middle 14 A middle lateral 4 15_A_upper 9 A upper lateral 5 15_A_lower 26 A lower lateral 6 15_A_lower 20 A lower lateral 7 16_C_main 1 C main main 8 16_C_middle 14 C middle lateral 9 16_C_upper 7 C upper lateral 10 16_C_lower 25 C lower lateral 11 16_C_lower 22 C lower lateral 12 16_C_middle 17 C middle lateral 13 20_A_main 1 A main main 14 20_A_upper 9 A upper lateral 15 20_A_lower 19 A lower lateral 16 20_A_lower 17 A lower lateral 17 20_A_lower 16 A lower lateral 18 20_A_middle 12 A middle lateral 19 23_A_main 1 A main main 20 23_A_middle 15 A middle lateral
It's only partially shown here, but there are three treatments (A, B, C), a node_id column (a continuous variable in which all samples begin at 1, the collection from the main or top of the tree, and continue down the tree counted by number of nodes with variable distances from rep to rep), a node_loc column (these are designations for a binning the node_id by their collection point on the tree), and a main_lateral column (which simply delineates the collections from the main or top of the tree from the collection from lateral shoots, or a node_id of 1 versus and node_id of something >1).
I set up my analysis like this:
metadata <- read.delim("2020_branch_treatment_metadata.txt", check.names=FALSE) x <- read.delim("2020_branch_treatment_counts.txt", check.names=FALSE, header=TRUE, row.names="Geneid") x.cpm <- cpm(x) threshold <- x.cpm > 0.5 keep <- rowSums(threshold) >= 2 x.keep <- x[keep,] y <- DGEList(counts=x.keep, samples = metadata) y <- calcNormFactors(y)
Based on what I have seen in this forum and based on the response, I received earlier, I decided to try to fit the continuous data (node_id) to a spline since it will almost certainly not be linear, which I did as follows:
X <- ns(y$samples$node_id, df = 3)
I think I understand how to find DE genes that differ with the node_id which is one thing we want to investigate, based on this post ([edgeR time series analysis with no replicates]), and that it should be set up like:
treatment <- factor(y$samples$treatment) design <- model.matrix(~0 + treatment + treatment:X)
There are two other things we would like to do with these data:
1) We would like to find DE genes with expression changes associated with their distance from the main/top of the tree and then group them into heatmaps/dendrograms based on their mean expression values, but delineated by the more easy-to-understand node_loc (treatment_main/middle/lower/upper). I'm not really sure how or if the data fitted to a spline would fit into that or how to set up the design. If not would the following make sense?
treatment <- factor(y$samples$treatment) node_loc <- factor(y$samples$node_loc) design <- model.matrix(~0 + treatment + node_loc)
2) We are also interested in genes that are uniquely differentially expressed in the main versus those in the laterals, especially those in treatment A and B versus C. Again, I'm not sure whether or not to incorporate the spline into the design or not. Would the following work to do this?
treatment <- factor(y$samples$treatment) main_lateral <- factor(y$samples$main_lateral) design <- model.matrix(~0 + treatment + main_lateral + treatment:main_lateral)
If anyone could let me know whether or not I'm on the right track or not, I'd greatly appreciate it. I'm just really confused on how to handle these different categories of data that are mostly redundant descriptions of the node_id, which is the only variable other than treatment that seems to have any actual measurable value.