Question

Gene expression associated with continuous (quantitative) traits (Limma/edgeR/correlation)

1

Entering edit mode

anna.cot.anna.cot ▴ 40

@annacotannacot-20795

Last seen 11 months ago

United States

Hi,

I'm trying to figure out which is the best model to go with in an experiment, so I'd appreciate any advice people can give, as I'm essentionally a wet lab person.

I'm aiming to find genes that associate with changes in triglyceride levels (TG) in the liver samples.
I used featureCounts to summarize the gene level.

> dge <- DGEList(counts=expr)
> keep <- filterByExpr(dge)
> dge <- dge[keep, , keep.lib.sizes=FALSE]
> dge <- calcNormFactors(dge)
> TG <- pheno_untranformed$triglycerides

1) edgeR

> design <- model.matrix( ~TG) 
> dge <- estimateDisp(dge, design,robust=TRUE) 
> fit <- glmFit(dge, design) lrt <- glmLRT(fit) 
> diffTable <- topTags(lrt, n=15, adjust.method="BH", sort.by="PValue", p.value=1)

2) Limma -Voom

> design <- model.matrix( ~TG)
> y <- voom(dge, design)
> fit <- lmFit(y, design)
> fit <- eBayes(fit)
> diffTable2 <- topTable(fit, coef="TG", number=15)

3) Regular Pearson correlation analysis

From all the approaches tested the edgeR gives noticeably longer list of significant results.

And finally time for questions:

My first question is which tool would be the best choice to answer my question (Limma, edgeR or Pearson correlation analysis) and why?
Are my assumptions correct?
Is it correct to use the untransformed variable (TG- in this case not normally distributed) as an independent variable, or should I rather used log2 transformed value?
As far as I understand logFC in case of continuous trait the beta coefficient, right?

Thank you very much for all the tips!

limma edger • 4.5k views

ADD COMMENT • link updated 6.6 years ago by Gordon Smyth 53k • written 6.6 years ago by anna.cot.anna.cot ▴ 40

score 3 · Accepted Answer · 2019-05-16

My first question is which tool would be the best choice to answer my question (Limma, edgeR or Pearson correlation analysis) and why?

Both edgeR and voom-limma work fine here. I don't know what you mean by "regular pearson correlation analysis" - I assume you're testing for a non-zero correlation. edgeR and limma have a few advantages over such a simple approach:

They will stabilize the variance estimates by sharing information between genes through empirical Bayes shrinkage, improving power to detect significant differences.
They will account for differences in precision across observations due to differences in library size between samples, something that you're probably not doing with an off-the-shelf test for non-zero correlations.
They can accommodate non-linear trends by using a spline basis matrix in the design.

You're getting a lot more DE genes from edgeR, probably because you're using the older GLM approach. Try using the quasi-likelihood methods, i.e., glmQLFit and glmQLFTest, that are closer to the behaviour of voom and limma in theory and in practice.

While both of these work fine, if you want a recommendation for a single method, I would probably go with voom-limma because it is faster and plugs more easily into downstream methods, e.g., gene set testing.

Are my assumptions correct?

I'm not sure what assumptions you're referring to. The only obvious one is that you're assuming that log-gene expression is linear with respect to increasing triglyceride level. Whether or not that is reasonable is debatable. If you want to avoid that assumption, and if you have enough samples, you can use a spline basis matrix instead. However, this comes at the cost of reducing the interpretability of your results.

Is it correct to use the untransformed variable (TG- in this case not normally distributed) as an independent variable, or should I rather used log2 transformed value?

See my previous point. Is log-gene expression linear with respect to the triglyceride levels, or linear with respect to the log-levels? Ideally, you'd have some biological knowledge about the system that you could use to make this decision. For example, I would say that it would make sense to log-transform the covariate, because I could easily imagine that many expression responses will be linear with respect to concentration (and thus log-expression will be linear with log-concentration). It's a bit harder - though not inconceivable - to imagine genes that increase exponentially with concentration, in which case linearity of log-expression values with the untransformed levels would be more appropriate.

If you can't make a reasonable decision a priori... well, just try them both. The better-fitting model should have lower average dispersion estimates and (in this simple case) probably more DE genes. Note that more DE genes doesn't always mean a better-fitting model; poor fits for models involving blocking factors can result in more false positives, for example. And, it must be said, choosing a model that gives you the lowest variance doesn't always return the "correct" model, due to natural stochasticity in the quality of the fit. Our hope is that a consistent effect over thousands of genes is a reliable indicator of the model's suitability.

An additional complication is that you can only use one or the other in any single limma/edgeR run. Some genes might respond linearly while other genes might respond exponentially, but you have to use the same design matrix for both. If this is a concern, you should use splines; then none of this would be an issue, because given enough knots you can approximate any curve for any gene.

The lack of normality in the covariate is completely irrelevant to edgeR and co.

As far as I understand logFC in case of continuous trait the beta coefficient, right?

In your case, the log-fold change in the log-change in expression per unit of increase in the triglyceride concentration. Or increase in the log-unit, if you decide to log-transform your covariates.