Nested design in edgeR
1
0
Entering edit mode
ashley.lu • 0
@ashleylu-15735
Last seen 5.6 years ago

Dear all,

I have a question on how to used the nested matrix to analyse my RNAseq data. 

Here is the description of my data. I have one continuous factor measuring the level of protein A by immunostaining.  The second factor is categorical representing 3 different regions (BS, HP, CX) of the brain. 

The key questions we want to ask are:

  1. Is there any genes differentially expressed according to the level of protein A measured by immunostaining .

  2. If there  regional differences for genes which are affected / in response by protein A (which is more interested to us).

Here is what the meta data looks like,

>head(meta)

protein regions
9.28453159356615 BS
10.9531507793882 BS
11.1858664750967 HP
8.85570185738541 CX
11.2412805317039 HP
10.3046406236311 BS
9.82840322124053 CX
10.4616254492949 HP
10.2353912875853 HP
9.94273889586413 CX
6.92469668752532 HP
3.13141359991371 BS
5.59111480089629 BS
5.75868326124029 CX
3.48120813508149 BS
3.25876039462462 BS
6.13335849940239 CX
6.78688204460976 CX
6.57238798262957 CX
7.53613331510434 HP

 

I have been trying three designs:

> design1a=model.matrix(~0+protein+protein:regions,meta)
> head(design1a)
    protein protein:regionsBS protein:regionsCX protein:regionsHP
1  8.820494          0.000000          8.820494           0.00000
2 10.576712          0.000000         10.576712           0.00000
3  9.722720          0.000000          0.000000           9.72272
4 10.042670         10.042670          0.000000           0.00000
5  8.459912          8.459912          0.000000           0.00000
6  9.694727          0.000000          9.694727           0.00000

> design1b=model.matrix(~protein+protein:regions,meta)
> head(design1b)
  (Intercept)   protein protein:regionsCX protein:regionsHP
1           1  8.820494          8.820494           0.00000
2           1 10.576712         10.576712           0.00000
3           1  9.722720          0.000000           9.72272
4           1 10.042670          0.000000           0.00000
5           1  8.459912          0.000000           0.00000
6           1  9.694727          9.694727           0.00000

> design2=model.matrix(~regions+protein:regions,meta)
> head(design2)
  (Intercept) regionsCX regionsHP regionsBS:protein regionsCX:protein regionsHP:protein
1           1         1         0          0.000000          8.820494           0.00000
2           1         1         0          0.000000         10.576712           0.00000
3           1         0         1          0.000000          0.000000           9.72272
4           1         0         0         10.042670          0.000000           0.00000
5           1         0         0          8.459912          0.000000           0.00000
6           1         1         0          0.000000          9.694727           0.00000

 

For design 1a, it seems easy for me to understand, as i can estimate coefficient 2 for the main effect of protein A, and estimate glmQLFTest(fit_design1a, coef=3:5) for any regional differences in response to protein A. But this design matrix is not full rank. 

So design 1b is probably more correct. But i am not sure how to interpret design 1b

 If I try to used glmQLFTest(fit_design1b, coef=3:4), does it give me those genes which are most variable across regions in response to protein A, or does it give me genes which are most different from the base level regionBS in response to protein A? 

And design2 , which i can no longer measure the main effect of protein A, but if am only interested in the regional effects in response to protein A ( or regional differences of protein A), can I still use this design and ignore the main effects, only measure the nested coeffcients. glmQLFTest(fit_design2, coef=4:6)?

 

edger • 843 views
ADD COMMENT
5
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 1 hour ago
The city by the bay

There are two possible formulations here, depending on what assumptions you want to make. The first:

designX <- model.matrix(~protein:regions, meta)
head(designX)
##   (Intercept) protein:regionsBS protein:regionsCX protein:regionsHP
## 1           1          9.284532          0.000000           0.00000
## 2           1         10.953151          0.000000           0.00000
## 3           1          0.000000          0.000000          11.18587
## 4           1          0.000000          8.855702           0.00000
## 5           1          0.000000          0.000000          11.24128
## 6           1         10.304641          0.000000           0.00000

... assumes that all regions have the same gene expression when you have no protein A. The intercept represents the log-expression in each/all regions when protein A has zero intensity (as measured by immunostaining). Each of the three remaining terms represents the region-specific log-increase in gene expression with increasing protein A staining.

The second model is effectively your design2, but I will make it a bit easier to parse:

designY <- model.matrix(~0 + regions + protein:regions, meta)
head(designY)
##   regionsBS regionsCX regionsHP regionsBS:protein regionsCX:protein
## 1         1         0         0          9.284532          0.000000
## 2         1         0         0         10.953151          0.000000
## 3         0         0         1          0.000000          0.000000
## 4         0         1         0          0.000000          8.855702
## 5         0         0         1          0.000000          0.000000
## 6         1         0         0         10.304641          0.000000
##   regionsHP:protein
## 1           0.00000
## 2           0.00000
## 3          11.18587
## 4           0.00000
## 5          11.24128
## 6           0.00000

This avoids the assumption that the baseline gene expression (i.e., in the absence of detected protein A) is equal across regions. The first three terms represent the baseline expression in each region, while the last three terms are the same as before. If you think in terms of a simple linear regression, you're effectively fitting a line to the gene expression against protein A staining intensity for each region, and getting region-specific intercepts (regions*) and gradients (regions*:protein). I would say that this is a more suitable model; assuming that gene expression is the same in each region seems troublesome to me.

Assuming you're using designY, your question 1 is easy to answer. Just set coef=4:6, which will do an ANODEV to test for whether there is any effect (or more strictly speaking, association) between protein A staining and each gene's expression. You can also test each of coefficients 4-6 separately to determine the association within each region.

Answering question 2 just involves testing the coefficients against each other, e.g.:

glmQLFTest(fit_designY, contrast=c(0,0,0,1,0,-1))

... which will test whether the association between protein A staining in region BS is significantly different from that in region HP. In short, you're testing the difference of gradients. Just make sure to report the gradients as well, which makes the interpretation easier. This is because a positive difference in the gradients could be due to the fact that the gradient was negative and turned positive; was very negative and turned less negative; or was positive and turned more positive; all of which may have different biological implications.

ADD COMMENT
0
Entering edit mode

Dear Aaron, 

Thank you very much for the detail explanation. And designY is what i've been looking for. 

I still have a question related to this design.

1. "set coef=4:6, which will do an ANODEV to test for whether there is any effect (or more strictly speaking, association) between protein A staining and each gene's expression". 

I am a little confused here. So wouldn't a simple design also test this (association of proteinA staining with each gene expression ? For example, using model.matrix(~protein).

My initial interpretation of setting coef=4:6   was that this will test whether there is any differences between regions in how proteinA associates with each gene expression. Or maybe I misunderstood here?

 

ADD REPLY
0
Entering edit mode

Your simple design is probably insufficient, as it fails to account for region-to-region differences in the intercept or gradient of the relationship between gene expression and protein A staining. As a result, the variance will be inflated which will reduce your power to detect a significant relationship. At worst, there might be a non-zero gradient in each region (potentially of differing signs), but the net gradient is zero when all regions are combined together in the simple model. This would result in a complete failure to detect a non-zero effect in each tissue.

ADD REPLY
0
Entering edit mode

Dear Aaron,

Thank you very much for clarifying the differences. So I followed the instruction and performed the following step:

 

## 01.create DGE objects on filtered raw counts
subcounts=DGEList(subcounts_raw[,row.names(designY)])

## 02.calculate library size factor for all downstream analysis
subcounts=calcNormFactors(subcounts)

## 03.calculate commond, trend, tagwise dispersion for EdgeR model fitting

subcounts_disp=estimateDisp(subcounts,designY,robust=T)

fit_designY=glmQLFit(design = designY,y = subcounts_disp,robust = T)

res_QL=glmQLFTest(fit_designY,coef =4:6 )

tb=topTags(res_QL,n= Inf )$table

head(tb,n=3)
 
logFC.BS.protein
logFC.CX.protein
logFC.HP.protein
logCPM
F
PValue
FDR
Ctsd
0.5483445
0.4842517
0.5306680
8.952918
527.7258
1.966415e-290
9.134784e-286
Cst7
0.9908010
0.9383780
0.9913678
6.612004
491.6846
2.519786e-273
5.852708e-269
Tyrobp
0.6734945
0.6444256
0.6772371
7.481808
405.6746
4.994127e-231
7.733239e-227

 

So by setting coef =4:6 , i am testing if there is any genes significantly associated with protein A after accounts the variations between different regions. For example, can I interpret the above example Ctsd is significantly associated with proteinA , and with one unit changes in the immunostaining of protein A , there is a 0.5483 changes in Ctsd expression in the BS area?

Also does a significant p-values here indicate there is at least one region out of three are significantly associated with protein A , or does it indicate all three regions are significantly associated with protein A?

And lastly, if I want to know which genes are significantly different between regions in their association with protein A, is it correct to use contrast to do pairwise comparison between all levels (coefficients 4:6) and then pool all the p-values to do a BH p-adjust? 

 

 

 

ADD REPLY
0
Entering edit mode

Question 1: Yes. Though the ANODEV will only tell you that there is a significant association in any region; it doesn't tell you that the association in the BS region is significant. If you want that, you'll have to test coef=4 directly.

Question 2: A rejection means that at least one region has a non-zero gradient.

Question 3: You can still do an ANODEV with the comparisons between coefficients, e.g.,

contrast=cbind(c(0,0,0,1,0,-1), c(0,0,0,1,-1,0))
ADD REPLY

Login before adding your answer.

Traffic: 575 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6