Question: design model matrix for nested factors (edgeR)
gravatar for silvia.paolucci81
4.7 years ago by
silvia.paolucci810 wrote:

Dear all, 

I have a question on how to design the model matrix to properly analyze my RNAseq data.

Here the summary of my data: I have two conditions (young and old). In each condition I have 3 biological replicates (sample) and each biological replicate is replicated twice (technical replicate), in total 12 libraries. My technical replicates are not different runs or different sequencing machines, they are two libraries generated from one single biological replicate. All 12 samples were sequenced together. So it is different from a batch effect. 

Here is the structure of my target

> target
   condition sample technical
1      young          1         a
2      young          1         b
3      young          2         a
4      young          2         b
5      young          3         a
6      young          3         b
7        old          4         c
8        old          4         d
9        old          5         c
10       old          5         d
11       old          6         c
12       old          6         d

My aim: I want to know the differentially expressed genes between young and old.

I think I need to design the model.matrix taking into account the technical replicates nested within sample nested within condition (if it was a standard glm I would write it like this: ~condition/sample/technical) but I am not sure how to do it in edgeR 

I tried


design<-model.matrix(~condition + technical, data=target) 

fit <-glmFit(d,design)  



but I do not think this is correct because my technical replicates should be nested within sample.


From the edgeR userguide, I got the following suggestion:


design<-model.matrix(~condition + condition:technical, data=target) 

fit <-glmFit(d,design)  



but still I am missing the sample term which I think it is necessary to give as input for the full nested design.


The following two options will not work as they give errors at the dispersion estimate step:

design<-model.matrix(~condition + condition:sample:technical, data=target) 

design<-model.matrix(~condition:technical, data=target) 


Any idea is really really welcome!

Thanks for your help!



ADD COMMENTlink modified 4.7 years ago by Aaron Lun25k • written 4.7 years ago by silvia.paolucci810
Answer: design model matrix for nested factors (edgeR)
gravatar for Aaron Lun
4.7 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

The simplest approach is to just check whether the variability between the technical replicates is Poisson. If so, you can just pool each pair of technical replicates into a single library for each biological replicate. To check for Poisson behaviour, you can set up a design matrix with:

design.temp <- model.matrix(~factor(target$sample))

and then check whether the NB dispersions are near-zero (assuming that d is your DGEList object):

d.temp <- estimateDisp(d, design.temp)

If so, then pooling is the most straightforward option, as you won't be losing any information about technical variability (as this is expected to be Poisson). Pooling can be done by running:

d2 <- d
d2$samples$group <- target$condition
d2 <- sumTechReps(d2, target$sample)

You can then fit a GLM to the pooled libraries with:

design <- model.matrix(~factor(d2$samples$group))
fit <- glmFit(d2, design)

In fact, even if it is not Poisson, pooling would probably be okay. Any extra-Poisson technical variability would be modelled by the empirical NB dispersion estimate in the analysis with design. However, it is probably unwise to include the technical replicates as separate entities in the design matrix. This is because they would be treated as biological replicates, and incorrectly deflate the dispersion estimates for the actual biological variability.

ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by Aaron Lun25k

Hi Aaron,

thank you very much for your suggestion!

I am trying to do what you suggested and it looks like it's working. I thought about pooling the technical replicates, but I was not sure at which point I was supposed to pool. 

And this is what I get when I check the NB dispersions

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.05544 0.26690 0.45460 0.42990 0.61060 0.75630 

I still think that I am probably loosing some information by pooling the technical replicates, but it is more conservative.

Thanks again


ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by silvia.paolucci810

Okay, those dispersions are pretty high. Make sure your design.temp is constructed using target$sample; my original post was incorrectly using target$technical. The other possibility is that you're getting high technical dispersions for low-abundance genes, so filtering might help; but I doubt it, because even the lowest trended dispersion is still quite high.

Anyway, regarding your last comment; if the counts were truly Poisson-distributed between replicates, you wouldn't lose any information. This is because the expression for the likelihood (conditional on the mean) is effectively the same regardless of whether you pool the counts or keep them separate. As a result, you will obtain the same coefficients and dispersion estimates from either approach. Of course, pooling is more convenient as it allows us to plug it in to the existing framework.

ADD REPLYlink written 4.7 years ago by Aaron Lun25k

Hi Aaron, thanks again for your comment. Yes, my design.temp is constructed using target$sample. Indeed the dispersions are quite high, this is why I don' t think pooling is the best option. However, at the moment it looks like it is the best I can do using edgeR. When I look at the heatmap of the significant differentially expressed genes that I get after pooling the technical replicates (heatmap generated using the Trinity command and the matrix.TMM_normalized.FPKM generated with Trinity), I can see that the technical replicates are rather different and not clustering as I would expect them to do...

By the way, do you have any suggestion (or a link) on how I could generate my heatmap for the differentially expressed genes, after pooling the replicates? I mean a heatmap where I would see only my 6 samples, 3 per condition.

Thank you very much!



ADD REPLYlink written 4.7 years ago by silvia.paolucci810

Well, the presence of these kinds of large technical overdispersion is a troubling issue. How are you getting your counts? If you're counting de novo transcripts assembled with Trinity, there might be extra variability introduced by assembly process, e.g., different assemblies formed between libraries.

ADD REPLYlink written 4.7 years ago by Aaron Lun25k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 375 users visited in the last hour