Question

edgeR | paired samples | multiple baseline | averaging baseline or not

0

Entering edit mode

gatk • 0

@gatk-9898

Last seen 7.9 years ago

We have a multi-factor RNA-Seq experiment with multiple baseline measures that we would like to model using edgeR:

2x pre-treatment samples per mouse [time=0]
1x post-treatment sample per mouse [time=1]
2x treatment groups (all mice are treated at post-treatment)

The goal is to evaluate the treatment effect between groups adjusted for baseline (see full R code below). Note, all mice are untreated at baseline. We identified the following two approaches to model this in edgeR:

Approach 1: use each of the two pre-treatment samples per mouse as part of the model

Model: mouse+time+time*treatment

2x pre-treatment samples per mouse [time=0]
1x post-treatment sample per mouse [time=1]
2x treatment groups (all mice are treated at post-treatment)

Approach 2: calculate the average counts of the two pre-treatment samples per mouse prior to modeling and use it as single baseline value

Model: mouse+time+time*treatment

1x one pre-treatment sample [average] per mouse [time=0]
1x post-treatment sample per mouse [time=1]
2x treatment groups (all mice are treated at post-treatment)

Which of the two approaches do you recommend for estimating the treatment effect for this experiment using edgeR?

Full R code :

library(edgeR);

y = matrix(rnbinom(12000,mu=10,size=10),ncol=12);

time = as.factor(rep(c(0,0,1),4));

treatment = as.factor(rep(c(1,2),2,each=3));

mouse = as.factor(rep(c(1:4),each=3));

d = DGEList(counts=y,group=treatment);

design = model.matrix(~mouse+time+time*treatment )[,-6];

d=estimateGLMCommonDisp(d,design)

d=estimateGLMTrendedDisp(d,design)

d=estimateGLMTagwiseDisp(d,design)

d.fit = glmFit(d,design);

d.lrt = glmLRT(d.fit,'time1:treatment2')

edger rnaseq baseline linear model negative binomial model • 1.8k views

ADD COMMENT • link updated 7.9 years ago by Gordon Smyth 50k • written 8.1 years ago by gatk • 0

score 1 · Answer 1 · 2016-03-11

Dear gatk,

I am not quite following what you are worried about.

You seem to have already completed a perfectly satisfactory analysis, as shown by your R code. Personally I would have setup the design matrix in, what seems to me to be, a more transparent fashion:

Treat1vsBaseline <- Treat2vsBaseline <- rep(0,12)
Treat1vsBaseline[time==1 & treatment==1] <- 1
Treat2vsBaseline[time==1 & treatment==2] <- 1
design <- model.matrix(~mouse+Treat1vsBaseline+Treat2vsBaseline)
d=estimateGLMCommonDisp(d,design)
d=estimateGLMTrendedDisp(d,design)
d=estimateGLMTagwiseDisp(d,design)
fit = glmFit(d,design)
lrt <- glmLRT(fit, contrast=c(0,0,0,0,-1,1))
topTags(lrt)

In my approach, we estimate the two treatment effects, each corrected for its baseline level, and then test for a difference between the two. However my code will give exactly the same DE results as yours.

What I don't see is why you are considering discarding this analysis and instead doing some complicated fudge after averaging the counts of two pre-treatment samples. Why would you want to average the samples? Averaging counts is obviously wrong, because edgeR analyses counts and the average of two counts is not a count. Averaging also throws out information unnecessarily. Do you think that edgeR can't figure out for itself that there are two samples or can't compute averages as appropriate as part of the analysis?

Are you worried about correlations between repeated samples from the same mouse? (You ask about that in offline emails to me, although you didn't mention this concern in your post here.) Including mouse as an additive term in the model fully accounts for any such correlation. And, anyway, correlations exist between all the samples on the same mouse, not just the time==0 samples, so averaging wouldn't help.