Limma-Voom which of two models is correct?
1
0
Entering edit mode
@javier-gutierrez-achury-11591
Last seen 8.2 years ago

I'm analysing RNA-seq data with Limma-Voom and after reading and apply different methods of analysis I have arrived at two different results and I'm not sure which one is correct. Here the specs: n=60 (~30 per tissue) The experiment (Multi-level time course experiment):

dataForAnalysis = data.frame(Sample=c("s_1", "s_1", "s_2", "s_2", "s_1", "s_1", "s_2", "s_2"), Tissue=c("A", "A","A", "A", "B", "B", "B", "B"), Condition=c("before", "after","before", "after","before", "after","before", "after"))
Sample Tissue Condition
s_1 A before
s_1 A after
s_2 A before
s_2 A after
s_1 B before
s_1 B after
s_2 B before
s_2 B after

 

Analysis No. 1

designMatrix = model.matrix(~ Tissue + Condition, dataForAnalysis)

nf = calcNormFactors(counts)

forDgeData.voom <- voom(counts, designMatrix, plot=False,  lib.size=colSums(counts)*nf)

forDgeData.dups = duplicateCorrelation(forDgeData.voom, designMatrix, block= dataForAnalysis$Sample)

forDgeData.fitRobust = lmFit(forDgeData.voom,  forDgeData.voom$design, correlation= forDgeData.dups$consensus, block= dataForAnalysis$Sample, method='robust')

forDgeData.ebayes = eBayes(forDgeData.fitRobust, robust=True)

Using this method, these are my results:

summary(decideTests(forDgeData.ebayes))

(Intercept) TissueA ConditionAfter

-1 1615 6594 22

0 446 1573 15631

1 13631 7525 39

From these results, I'm taking the ones in the "ConditionAfter" to get the DE genes between the conditions using:

toptable(forDgeData.ebayes, coef="ConditionAfter", ...)

Analysis No. 2 (For this analysis I pasted tissue and condition for the design matrix and use contrasts)

predesignMatrix = factor(paste(dataForAnalysis$Tissue, dataForAnalysis$Condition, sep='.'))

designMatrix = model.matrix(~ 0 + predesignMatrix)

nf = calcNormFactors(counts)

forDgeData.voom <- voom(counts, designMatrix, plot=False,  lib.size=colSums(counts)*nf)

forDgeData.dups = duplicateCorrelation(forDgeData.voom, designMatrix, block= dataForAnalysis$Sample)

forDgeData.fitRobust = lmFit(forDgeData.voom,  forDgeData.voom$design, correlation= forDgeData.dups$consensus, block= dataForAnalysis$Sample, method='robust')

 forDgeData.ebayes = eBayes(forDgeData.fitRobust, robust=True)


designMatrix.forContrast = makeContrasts(diff_Before_vs_After = (A.after-A.before)-(B.after-B.before), levels= designMatrix)

forDgeData.fitRobust.fitRobust2 = contrasts.fit(forDgeData.fitRobust, designMatrix.forContrast)

forDgeData.ebayes = eBayes(forDgeData.fitRobust.fitRobust2)

Using this method, these are my results:

summary(decideTests(forDgeData.ebayes))

diff_Before_vs_After

-1 7

0 15680

1 5

So, comparing results from Analysis No. 1 and No. 2, only 3 genes overlap. My main question here is if there are DE genes between the conditions (before and after), taking into account the tissue factor and also that the samples are duplicated.

Any help or advice... more than welcome!

Thanks

limma-voom rnaseq differential gene expression • 1.6k views
0
Entering edit mode

Aaron answered your main question already, but a minor comment: note that Gordon suggests to use duplicateCorrelation with voom by performing these calls twice each using two passes over the data:

A: using duplicateCorrelation with limma+voom for RNA-seq data

 

ADD REPLY
0
Entering edit mode

Also, you should split the limma and voom tags, otherwise the maintainers don't get notified.

ADD REPLY
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 7 hours ago
The city by the bay

Well, your comparisons are inherently different between the two analyses, so your results will also be different. In the first analysis, you're testing for genes that are differentially expressed between conditions, under the assumption that the log-fold change is not affected by the tissue (i.e., the log-fold change between conditions in one tissue is expected to be the same as that in the other tissue). In the second analysis, you're testing for genes where the log-fold changes between conditions differs across tissues. 

This is a fundamental difference in the nature of the comparison. For example, a gene that has a log-fold change of 2 between conditions in tissue A and a log-fold change of -2 between conditions in tissue B will not be DE in the first analysis, because the overall log-fold change between conditions (averaged across tissues) would be zero. However, the same gene would be strongly DE in the second analysis, because the log-fold changes between conditions are clearly quite different between tissues, i.e., 2 vs -2.

Another contributor is the fact that the first analysis assumes additivity of the tissue and condition factors, while the second analysis accounts for possible interactions between them (i.e., that the condition effect can vary depending on the tissue). If the additivity assumption is false, then the variance estimates will be inflated as systematic tissue-specific differences are not properly modelled. The second analysis is more flexible at the cost of burning up some residual d.f. for variance estimation.

If you want to make your second analysis more comparable to your first, you should either compare the averages for each condition:

con <- makeContrasts((A.after + B.after)/2 - (A.before + B.before)/2, levels=designMatrix)

... or, for more stringency, you can test DE between conditions for A and B separately, and then intersect the resulting DE lists. Neither of these strategies will completely recapitulate the behaviour of the first analysis, but it's a lot closer than what you have now.

Finally, there's a whole bunch of slightly weird things with your analysis:

  • Is there a specific reason for using robust fitting? This generally seems unnecessary - it'll reduce efficiency (i.e., how well the information in the dataset is used during estimation) and it won't account for the correlations due to sample.
  • You should be using robust=TRUE (note the all-caps), True is not valid syntax as far as I know.
  • You should set up a DGEList object at the start of the analysis, rather than passing arguments manually to voom.
  • I guess you have more samples in your dataset than you've shown in the table above, otherwise duplicateCorrelation wouldn't be able to do much if it just has two levels of the blocking factor to play with.
ADD COMMENT
0
Entering edit mode

Hi Aaron, thanks for your reply. 

You're right, the full dataset contains 67 samples (A=29, B=38), Yes, I'm passing the object as DGEList and, yes I'm using TRUE (sorry my mistyping).   

About the robust fitting, I found a difference in the variance of the normalized counts between the tissues and with the method="ls" I was getting Zero DE genes.

ADD REPLY
1
Entering edit mode

Perhaps it would be advisable to try to figure out the cause for the difference in results with and without robust fitting. For example, do you have genes with outlier expression patterns? Do you have a number of low-quality samples? Being able to answer these questions would motivate the choice of parameters and provide some confidence in your analysis, rather than just blindly picking the combination of parameters that happens to give you some DE genes.

ADD REPLY

Login before adding your answer.

Traffic: 637 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6