Gene expression associated with continuous (quantitative) traits (Limma/edgeR/correlation)
1
0
Entering edit mode
@annacotannacot-20795
Last seen 19 months ago
United States

Hi,

I'm trying to figure out which is the best model to go with in an experiment, so I'd appreciate any advice people can give, as I'm essentionally a wet lab person.

I'm aiming to find genes that associate with changes in triglyceride levels (TG) in the liver samples.
I used featureCounts to summarize the gene level.

> dge <- DGEList(counts=expr)
> keep <- filterByExpr(dge)
> dge <- dge[keep, , keep.lib.sizes=FALSE]
> dge <- calcNormFactors(dge)
> TG <- pheno_untranformed$triglycerides

1) edgeR

> design <- model.matrix( ~TG) 
> dge <- estimateDisp(dge, design,robust=TRUE) 
> fit <- glmFit(dge, design) lrt <- glmLRT(fit) 
> diffTable <- topTags(lrt, n=15, adjust.method="BH", sort.by="PValue", p.value=1)

2) Limma -Voom

> design <- model.matrix( ~TG)
> y <- voom(dge, design)
> fit <- lmFit(y, design)
> fit <- eBayes(fit)
> diffTable2 <- topTable(fit, coef="TG", number=15)

3) Regular Pearson correlation analysis

From all the approaches tested the edgeR gives noticeably longer list of significant results.

And finally time for questions:

  1. My first question is which tool would be the best choice to answer my question (Limma, edgeR or Pearson correlation analysis) and why?
  2. Are my assumptions correct?
  3. Is it correct to use the untransformed variable (TG- in this case not normally distributed) as an independent variable, or should I rather used log2 transformed value?
  4. As far as I understand logFC in case of continuous trait the beta coefficient, right?

Thank you very much for all the tips!

limma edger • 3.7k views
ADD COMMENT
3
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 7 hours ago
The city by the bay

My first question is which tool would be the best choice to answer my question (Limma, edgeR or Pearson correlation analysis) and why?

Both edgeR and voom-limma work fine here. I don't know what you mean by "regular pearson correlation analysis" - I assume you're testing for a non-zero correlation. edgeR and limma have a few advantages over such a simple approach:

  • They will stabilize the variance estimates by sharing information between genes through empirical Bayes shrinkage, improving power to detect significant differences.
  • They will account for differences in precision across observations due to differences in library size between samples, something that you're probably not doing with an off-the-shelf test for non-zero correlations.
  • They can accommodate non-linear trends by using a spline basis matrix in the design.

You're getting a lot more DE genes from edgeR, probably because you're using the older GLM approach. Try using the quasi-likelihood methods, i.e., glmQLFit and glmQLFTest, that are closer to the behaviour of voom and limma in theory and in practice.

While both of these work fine, if you want a recommendation for a single method, I would probably go with voom-limma because it is faster and plugs more easily into downstream methods, e.g., gene set testing.

Are my assumptions correct?

I'm not sure what assumptions you're referring to. The only obvious one is that you're assuming that log-gene expression is linear with respect to increasing triglyceride level. Whether or not that is reasonable is debatable. If you want to avoid that assumption, and if you have enough samples, you can use a spline basis matrix instead. However, this comes at the cost of reducing the interpretability of your results.

Is it correct to use the untransformed variable (TG- in this case not normally distributed) as an independent variable, or should I rather used log2 transformed value?

See my previous point. Is log-gene expression linear with respect to the triglyceride levels, or linear with respect to the log-levels? Ideally, you'd have some biological knowledge about the system that you could use to make this decision. For example, I would say that it would make sense to log-transform the covariate, because I could easily imagine that many expression responses will be linear with respect to concentration (and thus log-expression will be linear with log-concentration). It's a bit harder - though not inconceivable - to imagine genes that increase exponentially with concentration, in which case linearity of log-expression values with the untransformed levels would be more appropriate.

If you can't make a reasonable decision a priori... well, just try them both. The better-fitting model should have lower average dispersion estimates and (in this simple case) probably more DE genes. Note that more DE genes doesn't always mean a better-fitting model; poor fits for models involving blocking factors can result in more false positives, for example. And, it must be said, choosing a model that gives you the lowest variance doesn't always return the "correct" model, due to natural stochasticity in the quality of the fit. Our hope is that a consistent effect over thousands of genes is a reliable indicator of the model's suitability.

An additional complication is that you can only use one or the other in any single limma/edgeR run. Some genes might respond linearly while other genes might respond exponentially, but you have to use the same design matrix for both. If this is a concern, you should use splines; then none of this would be an issue, because given enough knots you can approximate any curve for any gene.

The lack of normality in the covariate is completely irrelevant to edgeR and co.

As far as I understand logFC in case of continuous trait the beta coefficient, right?

In your case, the log-fold change in the log-change in expression per unit of increase in the triglyceride concentration. Or increase in the log-unit, if you decide to log-transform your covariates.

ADD COMMENT
0
Entering edit mode

Thanks a lot Aaron for your clear answers!

ADD REPLY
0
Entering edit mode

Aaron, If I may ask one more thing. What is the point of including 0, I know it says "an instruction not to include an intercept column and instead to include a column for each group", but what does it do in simple words? Should my code include it?

> design <- model.matrix( ~TG) 
# or rather
> design <- model.matrix( ~0+TG) 

Thanks again!

ADD REPLY
1
Entering edit mode

It would be unwise to have ~0 in this case. This would remove the intercept, which means that you're requiring that all of your genes have an average log-expression of zero at a triglyceride concentration of zero. Such a requirement would be even more unreasonable than assuming linearity.

The use of ~0 is more common for models with factors, where it simplifies the interpretation of the coefficients in some situations. In your case, though, there's not much extra simplification that's needed. One coefficient is the intercept and another is the slope - pretty straightforward.

ADD REPLY
0
Entering edit mode

Thanks a million Aaron!

ADD REPLY

Login before adding your answer.

Traffic: 594 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6