design model matrix for nested factors (edgeR)
1
0
Entering edit mode
@silviapaolucci81-7600
Last seen 9.8 years ago
Switzerland

Dear all, 

I have a question on how to design the model matrix to properly analyze my RNAseq data.

Here the summary of my data: I have two conditions (young and old). In each condition I have 3 biological replicates (sample) and each biological replicate is replicated twice (technical replicate), in total 12 libraries. My technical replicates are not different runs or different sequencing machines, they are two libraries generated from one single biological replicate. All 12 samples were sequenced together. So it is different from a batch effect. 

Here is the structure of my target

> target
   condition sample technical
1      young          1         a
2      young          1         b
3      young          2         a
4      young          2         b
5      young          3         a
6      young          3         b
7        old          4         c
8        old          4         d
9        old          5         c
10       old          5         d
11       old          6         c
12       old          6         d

My aim: I want to know the differentially expressed genes between young and old.

I think I need to design the model.matrix taking into account the technical replicates nested within sample nested within condition (if it was a standard glm I would write it like this: ~condition/sample/technical) but I am not sure how to do it in edgeR 

I tried

 

design<-model.matrix(~condition + technical, data=target) 

fit <-glmFit(d,design)  

lrt<-glmLRT(fit,coef="condition")

 

but I do not think this is correct because my technical replicates should be nested within sample.

 

From the edgeR userguide, I got the following suggestion:

 

design<-model.matrix(~condition + condition:technical, data=target) 

fit <-glmFit(d,design)  

lrt<-glmLRT(fit,coef="condition")

 

but still I am missing the sample term which I think it is necessary to give as input for the full nested design.

 

The following two options will not work as they give errors at the dispersion estimate step:

design<-model.matrix(~condition + condition:sample:technical, data=target) 

design<-model.matrix(~condition:technical, data=target) 

 

Any idea is really really welcome!

Thanks for your help!

Silvia

 

edger design matrix model matrix • 4.2k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 8 hours ago
The city by the bay

The simplest approach is to just check whether the variability between the technical replicates is Poisson. If so, you can just pool each pair of technical replicates into a single library for each biological replicate. To check for Poisson behaviour, you can set up a design matrix with:

design.temp <- model.matrix(~factor(target$sample))

and then check whether the NB dispersions are near-zero (assuming that d is your DGEList object):

d.temp <- estimateDisp(d, design.temp)
summary(d.temp$trended)

If so, then pooling is the most straightforward option, as you won't be losing any information about technical variability (as this is expected to be Poisson). Pooling can be done by running:

d2 <- d
d2$samples$group <- target$condition
d2 <- sumTechReps(d2, target$sample)

You can then fit a GLM to the pooled libraries with:

design <- model.matrix(~factor(d2$samples$group))
fit <- glmFit(d2, design)

In fact, even if it is not Poisson, pooling would probably be okay. Any extra-Poisson technical variability would be modelled by the empirical NB dispersion estimate in the analysis with design. However, it is probably unwise to include the technical replicates as separate entities in the design matrix. This is because they would be treated as biological replicates, and incorrectly deflate the dispersion estimates for the actual biological variability.

ADD COMMENT
0
Entering edit mode

Hi Aaron,

thank you very much for your suggestion!

I am trying to do what you suggested and it looks like it's working. I thought about pooling the technical replicates, but I was not sure at which point I was supposed to pool. 

And this is what I get when I check the NB dispersions

summary(d.temp$trended)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.05544 0.26690 0.45460 0.42990 0.61060 0.75630 

I still think that I am probably loosing some information by pooling the technical replicates, but it is more conservative.

Thanks again

Silvia

ADD REPLY
0
Entering edit mode

Okay, those dispersions are pretty high. Make sure your design.temp is constructed using target$sample; my original post was incorrectly using target$technical. The other possibility is that you're getting high technical dispersions for low-abundance genes, so filtering might help; but I doubt it, because even the lowest trended dispersion is still quite high.

Anyway, regarding your last comment; if the counts were truly Poisson-distributed between replicates, you wouldn't lose any information. This is because the expression for the likelihood (conditional on the mean) is effectively the same regardless of whether you pool the counts or keep them separate. As a result, you will obtain the same coefficients and dispersion estimates from either approach. Of course, pooling is more convenient as it allows us to plug it in to the existing framework.

ADD REPLY
0
Entering edit mode

Hi Aaron, thanks again for your comment. Yes, my design.temp is constructed using target$sample. Indeed the dispersions are quite high, this is why I don' t think pooling is the best option. However, at the moment it looks like it is the best I can do using edgeR. When I look at the heatmap of the significant differentially expressed genes that I get after pooling the technical replicates (heatmap generated using the Trinity command analyze_diff_expr.pl and the matrix.TMM_normalized.FPKM generated with Trinity), I can see that the technical replicates are rather different and not clustering as I would expect them to do...

By the way, do you have any suggestion (or a link) on how I could generate my heatmap for the differentially expressed genes, after pooling the replicates? I mean a heatmap where I would see only my 6 samples, 3 per condition.

Thank you very much!

Silvia

 

 
ADD REPLY
0
Entering edit mode

Well, the presence of these kinds of large technical overdispersion is a troubling issue. How are you getting your counts? If you're counting de novo transcripts assembled with Trinity, there might be extra variability introduced by assembly process, e.g., different assemblies formed between libraries.

ADD REPLY

Login before adding your answer.

Traffic: 873 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6