Question

edgeR vs voom

0

Entering edit mode

dongamit123 ▴ 20

@dongamit123-8091

Last seen 8.8 years ago

United States

Hi,

I have a paired study design similar to section 3.5 in edgeR manual - 25 subjects including with disease (D) and controls (C), expression montinored before drug (B) and after drug (A). I also wish to control for Batch (1/2) and other medication (0/1). The data looks something like this:

subject   Disease   treatment Batch other_medication
1   D   B   1   0
1   D   A   1   0
2   D   B   1   1
2   D   A   1   1
3   D   B   2   0
3   D   A   2   0
4   D   B   1   1
4   D   A   1   1
5   D   B   1   1
5   D   A   1   0
6   D   B   2   0
6   D   A   2   1
7   D   B   1   1
8   D   B   1   1
8   D   A   1   0
9   D   B   2   1
9   D   A   2   1
10   D   B   1   1
10   D   A   2   1
11   D   B   1   1
11   D   A   1   1
12   D   B   1   1
13   D   A   1   0
14   D   A   1   1
15   D   A   1   1
16   D   A   1   0
17   D   A   1   1
18   D   A   1   1
19   D   B   2   1
19   D   A   2   1
20   D   B   2   1
20   D   A   2   1
21   C   B   2   0
21   C   A   2   0
22   C   B   1   0
22   C   A   1   0
23   C   B   2   0
23   C   A   2   0
24   C   B   1   0
24   C   A   1   0
25   C   B   2   0
25   C   A   2   0

One of the camparisons I wish to do is to find DE genes between disease subjects and controls before taking drug. The dataset is unbalanced, but for just disease-control comparison @ baseline I guess it should not matter. I used both edgeR and limma-voom, but they gave very different results. I would really appreciate any help.

Voom:

Combined Disease and treatment into single variable combo:

combo <- factor(paste(Disease,Treatment,sep="."))

And used the following design:

design <- model.matrix(~0 + combo + Batch + other_medication)

And then block by subject using duplicate correlation:

v <- voom(rawcount,design, lib.size = colSums(rawcount)*y$samples$norm.factors)
corfit <- duplicateCorrelation(v, design, block=subject)
v <- voom(rawcount,design,lib.size = colSums(rawcount)*y$samples$norm.factors,
          block = subject,correlation=corfit$consensus)
fit <- lmFit(v,design,block = subject, correlation = corfit$consensus)
cm <- makeContrasts(DvsC.B = D.B - C.B, levels=design)
fit2 <- contrasts.fit(fit, cm)
fit2 <- eBayes(fit2)
top <- topTable(fit2, adjust="BH",n=1000000)

I get 4000 DE genes with FDR<0.05.

edgeR:

I used a subset of data with disease and control samples at before drug timepoint. And used the fowling design:

design <- model.matrix(~batch + other_medication + disease)
y <- DGEList(counts=rawcount, genes=row.names(rawcount))
y <- calcNormFactors(y)
y <- estimateGLMCommonDisp(y,design, verbose=TRUE)
y <- estimateGLMTrendedDisp(y,design)
y <- estimateGLMTagwiseDisp(y,design)
fit <- glmFit(y,design)
lrt <- glmLRT(fit,coef=4)

But this did not give me any DE genes.

Are these two approaches comparable? And which is more reliable? Thank you for the help!

AD

edger voom limma • 6.0k views

ADD COMMENT • link 8.8 years ago dongamit123 ▴ 20

score 1 · Answer 1 · 2015-07-22

No the two approaches are not comparable. The edgeR analysis is using a different subset of the data (apparently) and is not adjusting for baseline differences between subjects.

Actually this experiment is more complex than a "paired design". In the limma User's Guide, this is called a multilevel design. Given that you want to compare the two diseases, you can only analyse this experiment properly in limma. It is the only package that has the ability to use random effects to correlate repeated samples from the same subject. None of the negative binomial packages, including edgeR, can do that.

Do Batch and other_medication have large effects? If not, the analysis might work just as well without them in the model matrix. limma will compensate by increasing the consensus correlation.

BTW, you shouldn't need a complicated voom call like:

v <- voom(rawcount,design, lib.size = colSums(rawcount)*y$samples$norm.factors)

Why not simply:

v <- voom(y, design, plot=TRUE)

score 0 · Answer 2 · 2015-07-22

0

Entering edit mode

dongamit123 ▴ 20

@dongamit123-8091

Last seen 8.8 years ago

United States

Thank you for the comments.

The reason I included lib.size in voom call was to apply edgeR's TMM normalization to the counts.

ADD COMMENT • link 8.8 years ago dongamit123 ▴ 20

0

Entering edit mode

My point was that you don't need to intervene in this way. voom uses edgeR's normalization factors automatically, as the help page explains.

ADD REPLY • link 8.8 years ago Gordon Smyth 50k

score 0 · Answer 3 · 2015-07-23

Thanks a lot for your reply.

I have one more question regarding this study design. I wonder if there is a way in limma-voom to find genes whose expression depend on the disease irrespective of treatment. Can I do the following?

design <- model.matrix(~Disease + Treatment + Batch + other_medication)

>colnames(design)
[1] "(Intercept)"      "DiseaseD"    "TreatmentA" "Batch2"    
[5] "other_medication1"

v <- voom(y,design)
corfit <- duplicateCorrelation(v, design, block=subject)
v <- voom(y,design,block = subject,correlation=corfit$consensus)
fit <- lmFit(v,design,block = subject, correlation = corfit$consensus)
fit2 <- eBayes(fit)
top <- topTable(fit2, coef=2, adjust="BH", n=10000000)

Thank you again for all your help!

-------

Edit:

I don't think coef=2 will give genes that depend on the disease irrespective of treatment. In fact, if you could shed light on what would coef=2 and coef=3 will tell in the above design, that would be super helpful. Thank you!

score 0 · Answer 4 · 2015-07-23

0

Entering edit mode

dongamit123 ▴ 20

@dongamit123-8091

Last seen 8.8 years ago

United States

Thank you for the clarification!

ADD COMMENT • link 8.8 years ago dongamit123 ▴ 20

score 0 · Answer 5 · 2015-07-27

0

Entering edit mode

dongamit123 ▴ 20

@dongamit123-8091

Last seen 8.8 years ago

United States

Dear Gordon, Aaron,

In the analysis mentioned above I get many genes that are significantly DE between disease, but none between treatments. To reiterate the experimental design, I have a multi-level experiment, and I am analyzing the data using voom after combining disease and treatment into a single variable, the design matrix is of the form model.matrix(~0 + combo), and I block by subject using duplicate correlation.

Theoretially, both disease and treatment should have significant effect on gene expression. And if I subset the data into disease and control groups and perform DE analysis between treatments on the disease and control subjects separately, I get significant number of DE genes.

In the full dataset, the between disease comparison is inter-subject and between treatment comparison is intra-subject. So, the question is why don't I get any DE genes between treatments when I analyze all data together even if I account for intra-subject correlation using duplicate correlation function in voom?

I would really appreciate any comments/advice on this.

Thanks so much! -AD

ADD COMMENT • link 8.8 years ago dongamit123 ▴ 20

0

Entering edit mode

Show the exact code you've used to do all of the comparisons - we need to make sure you're doing something comparable across comparisons. For example, for the comparisons between treatments in the disease subset, make sure you use the same one-way layout as that of the full data set. You can guarantee this by subsetting combo appropriately, and then using it to construct the design matrix for the subset analysis.

P.S. Don't be afraid to use the "add comment" mechanism. This guarantees that your replies are linked to the post that you were originally replying to. Answers are not guaranteed to be linked to the preceding post that it was responding to, as the posts are continually resorted based on the number of votes.

ADD REPLY • link 8.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

Here's the code: For full data

combo <- factor(paste(disease,treatment,sep="."))
y <- DGEList(counts=rawcount, genes=row.names(rawcount))
y <- calcNormFactors(y)
design <- model.matrix(~0 + combo)

colnames(design)
[1] "C.B" "C.A" "D.B" "D.A"

v <- voom(y,design)
corfit <- duplicateCorrelation(v, design, block=subject)
v <- voom(y,design,plot=TRUE,block = subject,correlation=corfit$consensus)
fit <- lmFit(v,design,block = subject, correlation = corfit$consensus)
cm <- makeContrasts(DvsC.B = D.B - C.B,
                    DvsC.A = D.A - C.A,
                    AvsB.D = D.A - D.B,
                    AvsB.C = C.A - C.B, levels=design)
fit2 <- contrasts.fit(fit, cm)
fit2 <- eBayes(fit2)

top <- topTable(fit2, adjust="BH",n=1000000, coef="AvsB.C") nrow(top[top$adj.P.Val < 0.05,]) => 0 genes

top <- topTable(fit2, adjust="BH",n=1000000, coef="DvsC.B") nrow(top[top$adj.P.Val < 0.05,]) => 2000 genes

For control subset (get similar results for disease subset):

By using intra-subject correlation:

design <- model.matrix(~subject + treatment)
y <- DGEList(counts=rawcount, genes=row.names(rawcount))
y <- calcNormFactors(y)
v <- voom(y,design)
corfit <- duplicateCorrelation(v, design, block=subject)
v <- voom(y,design,plot=TRUE,block = subject,correlation=corfit$consensus)
fit <- lmFit(v,design,block = subject, correlation = corfit$consensus)
fit2 <- eBayes(fit)
top <- topTable(fit2, adjust="BH",n=1000000,coef="treatmentA")
nrow(top[top$adj.P.Val < 0.05,])

This gives 820 genes.

Not including intra-subject correlation and using subject as linear predictor

design <- model.matrix(~subject + treatment)
y <- DGEList(counts=rawcount, genes=row.names(rawcount))
y <- calcNormFactors(y)
v <- voom(y,design,plot=TRUE)
fit <- lmFit(v,design)
fit2 <- eBayes(fit)
top <- topTable(fit2, adjust="BH",n=1000000,coef="treatmentA")
nrow(top[top$adj.P.Val < 0.05,])

This gives 1300 genes

ADD REPLY • link 8.8 years ago dongamit123 ▴ 20

0

Entering edit mode

In your first approach, the patient-specific effects are absorbed into the variance estimate. This is appropriate for your disease comparison as you are comparing between different patients, which means you should account for patient-to-patient variability. However, the same variance estimates get used for the comparison between treatments in the AvsB.C contrast. This makes less sense, as any patient-specific differences in the magnitude of expression should have no bearing on the comparison between treatments within each patient (where patient-specific effects should cancel out).

In other words, comparing between treatments in your first approach will result in loss of power, because you're using larger variance estimates than what you really need. This suggests a two-pronged strategy that depends on the comparison being performed; using the one-way layout with duplicateCorrelation for comparing between diseases, and then a paired-sample analysis blocking on subject to compare between treatments. In this manner, you get a comparable number of DE genes in both comparisons (~2000 and ~1300 for disease and treatment, respectively).

ADD REPLY • link 8.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you so much, Aaron.

There is one additional complexity in my data. I have an unbalanced dataset with before or after timpoints missing for a few patients. So, to get the most out of the data for the between-treatment analysis, instead of subsetting the data into disease and control groups and doing paired analysis, can I use all samples with the following study design?

design<-model.matrix(~0+subject+treatment:disease)
design <- design[,-match(c("B:D", "B:C"), colnames(design))]

The above design however gives fewer DE genes compared from when I subset the data.

I am also interested in disease-specific differences in the change between before and after treatment. But, I do not get any genes with the above design and the contrast A.D - A.C. Is there any other way to test for that? Thanks again for all your help.

ADD REPLY • link 8.8 years ago dongamit123 ▴ 20

0

Entering edit mode

Patients missing "before" or "after" treatment samples will be effectively ignored in your model, as any information contained in those samples will be fully absorbed into the subject blocking factor. Nothing will be left for estimation of the variance or disease-specific treatment effects. This is inevitable when you try to do a paired-sample analysis without pairs. The only way to get around it is to use duplicateCorrelation, but that runs into other issues as I've discussed above. So, in your case, I'd recommend you grit your teeth and put up with some loss of information.

On a related note, your above design is more appropriate than subsetting by disease, because you're making full use of the data set. In particular, variances are estimated using all (paired) samples from the entire data set, which gives more residual d.f. and more precision/reliability. Some changes in the number of DE genes is to be expected due to the ensuing change in downstream statistics. A lower number of DE genes isn't necessarily bad if it gives more accurate results.

Finally, you've got the correct contrast for testing differential treatment effects between disease states. I assume you renamed the columns of the design matrix, because the equivalent columns would be treatmentA:diseaseD and treatmentA:diseaseC in my design. If you don't get any DE genes, then... well, that might just be the true state of affairs, and there's actually no differences in your data. Alternatively, you might not have enough power in your study, but given you're picking up ~1000 DE genes with the other contrasts, I don't think this is a problem.