Question

Nested design in edgeR

ashley.lu • 0

Dear all,

I have a question about how to use a nested design matrix to analyse my RNA-seq data.

Here is a description of my data. I have one continuous covariate: the level of protein A, measured by immunostaining. The second factor is categorical and represents 3 different regions of the brain (BS, HP, CX).

The key questions we want to ask are:

1. Are any genes differentially expressed according to the level of protein A measured by immunostaining?
2. Are there regional differences among the genes that respond to protein A? (This is the question we are more interested in.)

Here is what the meta data looks like,

> head(meta)

protein           regions
9.28453159356615  BS
10.9531507793882  BS
11.1858664750967  HP
8.85570185738541  CX
11.2412805317039  HP
10.3046406236311  BS
9.82840322124053  CX
10.4616254492949  HP
10.2353912875853  HP
9.94273889586413  CX
6.92469668752532  HP
3.13141359991371  BS
5.59111480089629  BS
5.75868326124029  CX
3.48120813508149  BS
3.25876039462462  BS
6.13335849940239  CX
6.78688204460976  CX
6.57238798262957  CX
7.53613331510434  HP

I have been trying three designs:

> design1a=model.matrix(~0+protein+protein:regions,meta)
> head(design1a)
    protein protein:regionsBS protein:regionsCX protein:regionsHP
1  8.820494          0.000000          8.820494           0.00000
2 10.576712          0.000000         10.576712           0.00000
3  9.722720          0.000000          0.000000           9.72272
4 10.042670         10.042670          0.000000           0.00000
5  8.459912          8.459912          0.000000           0.00000
6  9.694727          0.000000          9.694727           0.00000

> design1b=model.matrix(~protein+protein:regions,meta)
> head(design1b)
  (Intercept)   protein protein:regionsCX protein:regionsHP
1           1  8.820494          8.820494           0.00000
2           1 10.576712         10.576712           0.00000
3           1  9.722720          0.000000           9.72272
4           1 10.042670          0.000000           0.00000
5           1  8.459912          0.000000           0.00000
6           1  9.694727          9.694727           0.00000

> design2=model.matrix(~regions+protein:regions,meta)
> head(design2)
  (Intercept) regionsCX regionsHP regionsBS:protein regionsCX:protein regionsHP:protein
1           1         1         0          0.000000          8.820494           0.00000
2           1         1         0          0.000000         10.576712           0.00000
3           1         0         1          0.000000          0.000000           9.72272
4           1         0         0         10.042670          0.000000           0.00000
5           1         0         0          8.459912          0.000000           0.00000
6           1         1         0          0.000000          9.694727           0.00000

Design 1a seems easy to understand: I can estimate coefficient 2 for the main effect of protein A, and use glmQLFTest(fit_design1a, coef=3:5) to test for any regional differences in the response to protein A. However, this design matrix is not full rank.

So design 1b is probably more correct, but I am not sure how to interpret it.

If I use glmQLFTest(fit_design1b, coef=3:4), does it give me the genes whose response to protein A varies most across regions, or the genes whose response differs most from that of the base level, region BS?

With design2, I can no longer measure the main effect of protein A. But if I am only interested in the regional effects in response to protein A (or regional differences in the effect of protein A), can I still use this design, ignore the main effect, and test only the nested coefficients with glmQLFTest(fit_design2, coef=4:6)?

edger • 843 views

updated 6.0 years ago by Gordon Smyth 50k • written 6.0 years ago by ashley.lu • 0

score 5 · Accepted Answer · 2018-05-07

There are two possible formulations here, depending on what assumptions you want to make. The first:

designX <- model.matrix(~protein:regions, meta)
head(designX)
##   (Intercept) protein:regionsBS protein:regionsCX protein:regionsHP
## 1           1          9.284532          0.000000           0.00000
## 2           1         10.953151          0.000000           0.00000
## 3           1          0.000000          0.000000          11.18587
## 4           1          0.000000          8.855702           0.00000
## 5           1          0.000000          0.000000          11.24128
## 6           1         10.304641          0.000000           0.00000

... assumes that all regions have the same gene expression when you have no protein A. The intercept represents the log-expression in each/all regions when protein A has zero intensity (as measured by immunostaining). Each of the three remaining terms represents the region-specific log-increase in gene expression with increasing protein A staining.
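
Written as a formula, this first model is roughly the following. The notation is my own, introduced only for illustration: mu_gi is the expected count for gene g in sample i, N_i is the effective library size, x_i is the protein A staining intensity, and r(i) is the region of sample i. The coefficients are on the natural-log scale used by the log-link GLM.

```latex
\log \mu_{gi} \;=\; \log N_i \;+\; \beta_{g0} \;+\; \beta_{g,\,r(i)}\, x_i ,
\qquad r(i) \in \{\mathrm{BS}, \mathrm{CX}, \mathrm{HP}\}
```

Here \beta_{g0} is the single intercept shared by all regions, and \beta_{g,\mathrm{BS}}, \beta_{g,\mathrm{CX}}, \beta_{g,\mathrm{HP}} are the three region-specific gradients.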

The second model is effectively your design2, but I will make it a bit easier to parse:

designY <- model.matrix(~0 + regions + protein:regions, meta)
head(designY)
##   regionsBS regionsCX regionsHP regionsBS:protein regionsCX:protein
## 1         1         0         0          9.284532          0.000000
## 2         1         0         0         10.953151          0.000000
## 3         0         0         1          0.000000          0.000000
## 4         0         1         0          0.000000          8.855702
## 5         0         0         1          0.000000          0.000000
## 6         1         0         0         10.304641          0.000000
##   regionsHP:protein
## 1           0.00000
## 2           0.00000
## 3          11.18587
## 4           0.00000
## 5          11.24128
## 6           0.00000

This avoids the assumption that the baseline gene expression (i.e., in the absence of detected protein A) is equal across regions. The first three terms represent the baseline expression in each region, while the last three terms are the same as before. If you think in terms of a simple linear regression, you're effectively fitting a line to the gene expression against protein A staining intensity for each region, and getting region-specific intercepts (regions*) and gradients (regions*:protein). I would say that this is a more suitable model; assuming that gene expression is the same in each region seems troublesome to me.
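
As a rough sketch of the "one line per region" interpretation, the code below rebuilds the metadata from the question (values rounded to 6 decimals), checks the rank of the two designs, and shows with ordinary linear regression that designY reproduces separate per-region fits. The response `y_log` is simulated and purely hypothetical, and `lm()` is only an analogy here; edgeR fits a negative binomial GLM rather than a linear model.

```r
# Metadata copied from the question (20 samples), rounded to 6 decimals.
meta <- data.frame(
  protein = c(9.284532, 10.953151, 11.185866, 8.855702, 11.241281,
              10.304641, 9.828403, 10.461625, 10.235391, 9.942739,
              6.924697, 3.131414, 5.591115, 5.758683, 3.481208,
              3.258760, 6.133358, 6.786882, 6.572388, 7.536133),
  regions = factor(c("BS", "BS", "HP", "CX", "HP", "BS", "CX", "HP", "HP", "CX",
                     "HP", "BS", "BS", "CX", "BS", "BS", "CX", "CX", "CX", "HP"))
)

# designY: region-specific intercepts plus region-specific gradients.
designY  <- model.matrix(~0 + regions + protein:regions, meta)
# design1a from the question, which the asker reports as not full rank.
design1a <- model.matrix(~0 + protein + protein:regions, meta)

# Compare the numerical rank with the number of columns.
rank_Y  <- qr(designY)$rank
rank_1a <- qr(design1a)$rank
cat("designY : rank", rank_Y,  "of", ncol(designY),  "columns\n")
cat("design1a: rank", rank_1a, "of", ncol(design1a), "columns\n")

# Hypothetical log-expression values for one gene (simulated, not real data).
set.seed(1)
y_log <- 2 + 0.3 * meta$protein + rnorm(nrow(meta), sd = 0.2)

# Joint fit with the designY formula ...
joint <- coef(lm(y_log ~ 0 + regions + protein:regions, data = meta))
# ... versus a separate straight-line fit within region BS only.
bs_only <- coef(lm(y_log ~ protein, data = meta, subset = regions == "BS"))

# The BS intercept and gradient from the joint fit match the BS-only fit.
print(rbind(joint   = joint[c("regionsBS", "regionsBS:protein")],
            bs_only = bs_only))
```

The same comparison could be repeated for CX and HP by changing the subset.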

Assuming you're using designY, your question 1 is easy to answer. Just set coef=4:6 in glmQLFTest, which will do an ANODEV (analysis of deviance) to test whether protein A staining has any effect on (or, strictly speaking, any association with) each gene's expression in any region. You can also test each of coefficients 4-6 separately to determine the association within each region.
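
A minimal sketch of that test is below. The counts and metadata are simulated with no real signal, so everything except the edgeR function calls is a placeholder; substitute your own `counts` and `meta`, and treat the gene filtering and normalization choices as assumptions.

```r
library(edgeR)

set.seed(1)
# Simulated placeholder data: 200 genes, 20 samples, 3 regions.
meta <- data.frame(
  protein = runif(20, min = 3, max = 11),
  regions = factor(rep(c("BS", "CX", "HP"), length.out = 20))
)
counts <- matrix(rnbinom(200 * 20, mu = 100, size = 10), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), NULL))

# Region-specific intercepts (columns 1-3) and gradients (columns 4-6).
designY <- model.matrix(~0 + regions + protein:regions, meta)

y <- DGEList(counts)
y <- calcNormFactors(y)
y <- estimateDisp(y, designY)
fit_designY <- glmQLFit(y, designY)

# Question 1: is expression associated with protein A in any region?
res_any <- glmQLFTest(fit_designY, coef = 4:6)
topTags(res_any)

# Association within a single region, e.g. BS (coefficient 4).
res_BS <- glmQLFTest(fit_designY, coef = 4)
topTags(res_BS)
```

Because the simulated counts carry no protein A effect, few or no genes should pass a sensible FDR threshold here.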

Answering question 2 just involves testing the coefficients against each other, e.g.:

glmQLFTest(fit_designY, contrast=c(0,0,0,1,0,-1))

... which will test whether the association between protein A staining and gene expression in region BS is significantly different from that in region HP. In short, you're testing the difference of gradients. Just make sure to report the gradients themselves as well, which makes the interpretation easier. This is because a positive difference in the gradients (here BS minus HP) can arise in several ways: the gradient is negative in HP and positive in BS; very negative in HP and less negative in BS; or positive in HP and more positive in BS. These may all have different biological implications.
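
One way to report the gradients next to the contrast is sketched below, again on simulated placeholder data. The `report` data frame and its column names are my own invention for illustration; the logFC values from glmQLFTest are on the log2 scale, per unit of protein A staining.

```r
library(edgeR)

set.seed(42)
# Simulated placeholder data: 300 genes, 21 samples, 3 regions.
meta <- data.frame(
  protein = runif(21, min = 3, max = 11),
  regions = factor(rep(c("BS", "CX", "HP"), length.out = 21))
)
counts <- matrix(rnbinom(300 * 21, mu = 50, size = 5), nrow = 300,
                 dimnames = list(paste0("gene", 1:300), NULL))

designY <- model.matrix(~0 + regions + protein:regions, meta)
y <- estimateDisp(calcNormFactors(DGEList(counts)), designY)
fit_designY <- glmQLFit(y, designY)

# Difference of gradients: BS (coefficient 4) minus HP (coefficient 6).
diff_BS_HP <- glmQLFTest(fit_designY, contrast = c(0, 0, 0, 1, 0, -1))
# The individual gradients, tested against zero.
slope_BS <- glmQLFTest(fit_designY, coef = 4)
slope_HP <- glmQLFTest(fit_designY, coef = 6)

# Hypothetical summary table: both gradients alongside their difference.
report <- data.frame(
  gene       = rownames(diff_BS_HP$table),
  slope_BS   = slope_BS$table$logFC,
  slope_HP   = slope_HP$table$logFC,
  slope_diff = diff_BS_HP$table$logFC,
  PValue     = diff_BS_HP$table$PValue
)
report$FDR <- p.adjust(report$PValue, method = "BH")
head(report[order(report$PValue), ])
```

Reading slope_BS and slope_HP together with slope_diff shows which of the scenarios above applies to a given gene.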