Confusion using edgeR and spline-fitted data
1
0
Entering edit mode
@3746f5fd
Last seen 10 weeks ago
United States

Hello,

I have a dataset set up as follows, it is leaf sample collections from trees subjected to three different treatments, 4 reps per treatment, from various locations on the trees, starting with the top and moving down:

            id node_id treatment node_loc main_lateral
1    15_A_main       1         A     main         main
2  15_A_middle      17         A   middle      lateral
3  15_A_middle      14         A   middle      lateral
4   15_A_upper       9         A    upper      lateral
5   15_A_lower      26         A    lower      lateral
6   15_A_lower      20         A    lower      lateral
7    16_C_main       1         C     main         main
8  16_C_middle      14         C   middle      lateral
9   16_C_upper       7         C    upper      lateral
10  16_C_lower      25         C    lower      lateral
11  16_C_lower      22         C    lower      lateral
12 16_C_middle      17         C   middle      lateral
13   20_A_main       1         A     main         main
14  20_A_upper       9         A    upper      lateral
15  20_A_lower      19         A    lower      lateral
16  20_A_lower      17         A    lower      lateral
17  20_A_lower      16         A    lower      lateral
18 20_A_middle      12         A   middle      lateral
19   23_A_main       1         A     main         main
20 23_A_middle      15         A   middle      lateral


It's only partially shown here, but there are three treatments (A, B, C), a node_id column (a continuous variable in which all samples begin at 1, the collection from the main or top of the tree, and continue down the tree counted by number of nodes with variable distances from rep to rep), a node_loc column (these are designations for a binning the node_id by their collection point on the tree), and a main_lateral column (which simply delineates the collections from the main or top of the tree from the collection from lateral shoots, or a node_id of 1 versus and node_id of something >1).

I set up my analysis like this:

metadata <- read.delim("2020_branch_treatment_metadata.txt", check.names=FALSE)

x.cpm <- cpm(x)
threshold <- x.cpm > 0.5
keep <- rowSums(threshold) >= 2
x.keep <- x[keep,]

y <- DGEList(counts=x.keep, samples = metadata)
y <- calcNormFactors(y)


Based on what I have seen in this forum and based on the response, I received earlier, I decided to try to fit the continuous data (node_id) to a spline since it will almost certainly not be linear, which I did as follows:

X <- ns(y$samples$node_id, df = 3)


I think I understand how to find DE genes that differ with the node_id which is one thing we want to investigate, based on this post ([edgeR time series analysis with no replicates][1]), and that it should be set up like:

treatment <- factor(y$samples$treatment)
design <- model.matrix(~0 + treatment + treatment:X)


There are two other things we would like to do with these data:

1) We would like to find DE genes with expression changes associated with their distance from the main/top of the tree and then group them into heatmaps/dendrograms based on their mean expression values, but delineated by the more easy-to-understand node_loc (treatment_main/middle/lower/upper). I'm not really sure how or if the data fitted to a spline would fit into that or how to set up the design. If not would the following make sense?

treatment <- factor(y$samples$treatment)
node_loc <- factor(y$samples$node_loc)
design <- model.matrix(~0 + treatment + node_loc)


2) We are also interested in genes that are uniquely differentially expressed in the main versus those in the laterals, especially those in treatment A and B versus C. Again, I'm not sure whether or not to incorporate the spline into the design or not. Would the following work to do this?

treatment <- factor(y$samples$treatment)
main_lateral <- factor(y$samples$main_lateral)
design <- model.matrix(~0 + treatment + main_lateral + treatment:main_lateral)


If anyone could let me know whether or not I'm on the right track or not, I'd greatly appreciate it. I'm just really confused on how to handle these different categories of data that are mostly redundant descriptions of the node_id, which is the only variable other than treatment that seems to have any actual measurable value.

Thanks,

Erik

edgeR • 214 views
1
Entering edit mode
Yunshun Chen ▴ 770
@yunshun-chen-5451
Last seen 6 weeks ago
Australia

1) We would like to find DE genes with expression changes associated with their distance from the main/top of the tree and then group them into heatmaps/dendrograms based on their mean expression values, but delineated by the more easy-to-understand node_loc (treatment_main/middle/lower/upper). I'm not really sure how or if the data fitted to a spline would fit into that or how to set up the design. If not would the following make sense?

Yes, if you want to interpret them at the node_loc level, then you can use your design matrix design <- model.matrix(~0 + treatment + node_loc).

2) We are also interested in genes that are uniquely differentially expressed in the main versus those in the laterals, especially those in treatment A and B versus C. Again, I'm not sure whether or not to incorporate the spline into the design or not. Would the following work to do this?

You don't need spline for this. Your design matrix design <- model.matrix(~0 + treatment + main_lateral + treatment:main_lateral) is fine. But it would be better to use a one-way layout design matrix for this (so that the testing contrasts are more intuitive and easier to construct). (See https://f1000research.com/articles/5-1438 for an example)

0
Entering edit mode

Thanks very much! Your response cleared it up for me. I did have a couple additional questions though, if you don't mind. The first one is regarding the spline-fitted data in my original post. When I run that, I get the following coefficients:

colnames(design)
[1] "treatmentA"    "treatmentB"    "treatmentC"    "treatmentA:X1" "treatmentB:X1" "treatmentC:X1" "treatmentA:X2" "treatmentB:X2" "treatmentC:X2" "treatmentA:X3" "treatmentB:X3" "treatmentC:X3"


When I create the comparisons to find DE, do I want to include the first 3 coefficients or leave them out. In other words, is it better or is there any real difference in doing:

con <- makeContrasts(AB = treatmentA-treatmentB, AC = treatmentA-treatmentC, BC= treatmentB-treatmentC, ABX1 = treatmentA.X1-treatmentB.X1, ACX1 = treatmentA.X1-treatmentC.X1, BCX1= treatmentBX1-treatmentCX1,  ABX2 = treatmentA.X2-treatmentB.X2, ACX2 = treatmentA.X2-treatmentC.X2, BCX2= treatmentBX2-treatmentCX2, ABX3 = treatmentA.X3-treatmentB.X3, ACX3 = treatmentA.X3-treatmentC.X3, BCX3= treatmentBX3-treatmentCX3, levels = design)


versus just doing:

con <- makeContrasts( ABX1 = treatmentA.X1-treatmentB.X1, ACX1 = treatmentA.X1-treatmentC.X1, BCX1= treatmentBX1-treatmentCX1,  ABX2 = treatmentA.X2-treatmentB.X2, ACX2 = treatmentA.X2-treatmentC.X2, BCX2= treatmentBX2-treatmentCX2, ABX3 = treatmentA.X3-treatmentB.X3, ACX3 = treatmentA.X3-treatmentC.X3, BCX3= treatmentBX3-treatmentCX3, levels = design)


My second question involves the one-way portion of the analysis and is more a matter of my own curiosity. The link you provided explains how to run that portion of my analysis very nicely, but I wanted to know if there is a difference between treating two pieces of metadata as an interaction in the design any different from pasting those portions of metadata together and treating them as just another group in the design. In other words, is the following:

design <- model.matrix(~0 + treatment + main_lateral + treatment:main_lateral)


any different from doing:

new_group <- factor(paste0(treatment, main_lateral))
design <- model.matrix(~0 + treatment + main_lateral + new_group)


?

Erik

1
Entering edit mode

As stated in time course case study in the edgeR user's guide, the three coefficients X1, X2 and X3 do not have any particular meaning. So it doesn't make sense to compare each of them between your treatment groups.

For your second question, the one-way layout design matrix should be

design <- model.matrix(~0 + new_group)


It is equivalent to the interactive design but is much easier for making comparison contrasts.