Question

Limma-Voom which of two models is correct?

0

Entering edit mode

Javier Gutierrez-Achury • 0

@javier-gutierrez-achury-11591

Last seen 8.6 years ago

I'm analysing RNA-seq data with Limma-Voom and after reading and apply different methods of analysis I have arrived at two different results and I'm not sure which one is correct. Here the specs: n=60 (~30 per tissue) The experiment (Multi-level time course experiment):

dataForAnalysis = data.frame(Sample=c("s_1", "s_1", "s_2", "s_2", "s_1", "s_1", "s_2", "s_2"), Tissue=c("A", "A","A", "A", "B", "B", "B", "B"), Condition=c("before", "after","before", "after","before", "after","before", "after"))

Sample	Tissue	Condition
s_1	A	before
s_1	A	after
s_2	A	before
s_2	A	after
s_1	B	before
s_1	B	after
s_2	B	before
s_2	B	after

Analysis No. 1

designMatrix = model.matrix(~ Tissue + Condition, dataForAnalysis)

nf = calcNormFactors(counts)

forDgeData.voom <- voom(counts, designMatrix, plot=False,  lib.size=colSums(counts)*nf)

forDgeData.dups = duplicateCorrelation(forDgeData.voom, designMatrix, block= dataForAnalysis$Sample)

forDgeData.fitRobust = lmFit(forDgeData.voom,  forDgeData.voom$design, correlation= forDgeData.dups$consensus, block= dataForAnalysis$Sample, method='robust')

forDgeData.ebayes = eBayes(forDgeData.fitRobust, robust=True)

Using this method, these are my results:

summary(decideTests(forDgeData.ebayes))

(Intercept) TissueA ConditionAfter

-1 1615 6594 22

0 446 1573 15631

1 13631 7525 39

From these results, I'm taking the ones in the "ConditionAfter" to get the DE genes between the conditions using:

toptable(forDgeData.ebayes, coef="ConditionAfter", ...)

Analysis No. 2 (For this analysis I pasted tissue and condition for the design matrix and use contrasts)

predesignMatrix = factor(paste(dataForAnalysis$Tissue, dataForAnalysis$Condition, sep='.'))

designMatrix = model.matrix(~ 0 + predesignMatrix)

nf = calcNormFactors(counts)

forDgeData.voom <- voom(counts, designMatrix, plot=False,  lib.size=colSums(counts)*nf)

forDgeData.dups = duplicateCorrelation(forDgeData.voom, designMatrix, block= dataForAnalysis$Sample)

forDgeData.fitRobust = lmFit(forDgeData.voom,  forDgeData.voom$design, correlation= forDgeData.dups$consensus, block= dataForAnalysis$Sample, method='robust')

 forDgeData.ebayes = eBayes(forDgeData.fitRobust, robust=True)


designMatrix.forContrast = makeContrasts(diff_Before_vs_After = (A.after-A.before)-(B.after-B.before), levels= designMatrix)

forDgeData.fitRobust.fitRobust2 = contrasts.fit(forDgeData.fitRobust, designMatrix.forContrast)

forDgeData.ebayes = eBayes(forDgeData.fitRobust.fitRobust2)

Using this method, these are my results:

summary(decideTests(forDgeData.ebayes))

diff_Before_vs_After

-1 7

0 15680

1 5

So, comparing results from Analysis No. 1 and No. 2, only 3 genes overlap. My main question here is if there are DE genes between the conditions (before and after), taking into account the tissue factor and also that the samples are duplicated.

Any help or advice... more than welcome!

Thanks

limma-voom rnaseq differential gene expression • 1.7k views

ADD COMMENT • link 8.6 years ago Javier Gutierrez-Achury • 0

0

Entering edit mode

Aaron answered your main question already, but a minor comment: note that Gordon suggests to use duplicateCorrelation with voom by performing these calls twice each using two passes over the data:

A: using duplicateCorrelation with limma+voom for RNA-seq data

ADD REPLY • link 8.6 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Also, you should split the limma and voom tags, otherwise the maintainers don't get notified.

ADD REPLY • link 8.6 years ago Aaron Lun ★ 28k

score 1 · Answer 1 · 2016-10-03

Well, your comparisons are inherently different between the two analyses, so your results will also be different. In the first analysis, you're testing for genes that are differentially expressed between conditions, under the assumption that the log-fold change is not affected by the tissue (i.e., the log-fold change between conditions in one tissue is expected to be the same as that in the other tissue). In the second analysis, you're testing for genes where the log-fold changes between conditions differs across tissues.

This is a fundamental difference in the nature of the comparison. For example, a gene that has a log-fold change of 2 between conditions in tissue A and a log-fold change of -2 between conditions in tissue B will not be DE in the first analysis, because the overall log-fold change between conditions (averaged across tissues) would be zero. However, the same gene would be strongly DE in the second analysis, because the log-fold changes between conditions are clearly quite different between tissues, i.e., 2 vs -2.

Another contributor is the fact that the first analysis assumes additivity of the tissue and condition factors, while the second analysis accounts for possible interactions between them (i.e., that the condition effect can vary depending on the tissue). If the additivity assumption is false, then the variance estimates will be inflated as systematic tissue-specific differences are not properly modelled. The second analysis is more flexible at the cost of burning up some residual d.f. for variance estimation.

If you want to make your second analysis more comparable to your first, you should either compare the averages for each condition:

con <- makeContrasts((A.after + B.after)/2 - (A.before + B.before)/2, levels=designMatrix)

... or, for more stringency, you can test DE between conditions for A and B separately, and then intersect the resulting DE lists. Neither of these strategies will completely recapitulate the behaviour of the first analysis, but it's a lot closer than what you have now.

Finally, there's a whole bunch of slightly weird things with your analysis:

Is there a specific reason for using robust fitting? This generally seems unnecessary - it'll reduce efficiency (i.e., how well the information in the dataset is used during estimation) and it won't account for the correlations due to sample.
You should be using robust=TRUE (note the all-caps), True is not valid syntax as far as I know.
You should set up a DGEList object at the start of the analysis, rather than passing arguments manually to voom.
I guess you have more samples in your dataset than you've shown in the table above, otherwise duplicateCorrelation wouldn't be able to do much if it just has two levels of the blocking factor to play with.