Dear all,
I have a question on how to design the model matrix to properly analyze my RNAseq data.
Here the summary of my data: I have two conditions (young and old). In each condition I have 3 biological replicates (sample) and each biological replicate is replicated twice (technical replicate), in total 12 libraries. My technical replicates are not different runs or different sequencing machines, they are two libraries generated from one single biological replicate. All 12 samples were sequenced together. So it is different from a batch effect.
Here is the structure of my target
> target
condition sample technical
1 young 1 a
2 young 1 b
3 young 2 a
4 young 2 b
5 young 3 a
6 young 3 b
7 old 4 c
8 old 4 d
9 old 5 c
10 old 5 d
11 old 6 c
12 old 6 d
My aim: I want to know the differentially expressed genes between young and old.
I think I need to design the model.matrix taking into account the technical replicates nested within sample nested within condition (if it was a standard glm I would write it like this: ~condition/sample/technical) but I am not sure how to do it in edgeR
I tried
design<-model.matrix(~condition + technical, data=target)
fit <-glmFit(d,design)
lrt<-glmLRT(fit,coef="condition")
but I do not think this is correct because my technical replicates should be nested within sample.
From the edgeR userguide, I got the following suggestion:
design<-model.matrix(~condition + condition:technical, data=target)
fit <-glmFit(d,design)
lrt<-glmLRT(fit,coef="condition")
but still I am missing the sample term which I think it is necessary to give as input for the full nested design.
The following two options will not work as they give errors at the dispersion estimate step:
design<-model.matrix(~condition + condition:sample:technical, data=target)
design<-model.matrix(~condition:technical, data=target)
Any idea is really really welcome!
Thanks for your help!
Silvia

