Question: Gene expression associated with continuous (quantitative) traits (Limma/edgeR/correlation)
0
5 months ago by
anna.cot.anna.cot0 wrote:

Hi,

I'm trying to figure out which is the best model to go with in an experiment, so I'd appreciate any advice people can give, as I'm essentionally a wet lab person.

I'm aiming to find genes that associate with changes in triglyceride levels (TG) in the liver samples.
I used featureCounts to summarize the gene level.

> dge <- DGEList(counts=expr)
> keep <- filterByExpr(dge)
> dge <- dge[keep, , keep.lib.sizes=FALSE]
> dge <- calcNormFactors(dge)
> TG <- pheno_untranformed\$triglycerides


1) edgeR

> design <- model.matrix( ~TG)
> dge <- estimateDisp(dge, design,robust=TRUE)
> fit <- glmFit(dge, design) lrt <- glmLRT(fit)
> diffTable <- topTags(lrt, n=15, adjust.method="BH", sort.by="PValue", p.value=1)


2) Limma -Voom

> design <- model.matrix( ~TG)
> y <- voom(dge, design)
> fit <- lmFit(y, design)
> fit <- eBayes(fit)
> diffTable2 <- topTable(fit, coef="TG", number=15)


3) Regular Pearson correlation analysis

From all the approaches tested the edgeR gives noticeably longer list of significant results.

And finally time for questions:

1. My first question is which tool would be the best choice to answer my question (Limma, edgeR or Pearson correlation analysis) and why?
2. Are my assumptions correct?
3. Is it correct to use the untransformed variable (TG- in this case not normally distributed) as an independent variable, or should I rather used log2 transformed value?
4. As far as I understand logFC in case of continuous trait the beta coefficient, right?

Thank you very much for all the tips!

limma edger • 275 views
modified 5 months ago by Gordon Smyth39k • written 5 months ago by anna.cot.anna.cot0
Answer: Gene expression associated with continuous (quantitative) traits (Limma/edgeR/co
3
5 months ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

My first question is which tool would be the best choice to answer my question (Limma, edgeR or Pearson correlation analysis) and why?

Both edgeR and voom-limma work fine here. I don't know what you mean by "regular pearson correlation analysis" - I assume you're testing for a non-zero correlation. edgeR and limma have a few advantages over such a simple approach:

• They will stabilize the variance estimates by sharing information between genes through empirical Bayes shrinkage, improving power to detect significant differences.
• They will account for differences in precision across observations due to differences in library size between samples, something that you're probably not doing with an off-the-shelf test for non-zero correlations.
• They can accommodate non-linear trends by using a spline basis matrix in the design.

You're getting a lot more DE genes from edgeR, probably because you're using the older GLM approach. Try using the quasi-likelihood methods, i.e., glmQLFit and glmQLFTest, that are closer to the behaviour of voom and limma in theory and in practice.

While both of these work fine, if you want a recommendation for a single method, I would probably go with voom-limma because it is faster and plugs more easily into downstream methods, e.g., gene set testing.

Are my assumptions correct?

I'm not sure what assumptions you're referring to. The only obvious one is that you're assuming that log-gene expression is linear with respect to increasing triglyceride level. Whether or not that is reasonable is debatable. If you want to avoid that assumption, and if you have enough samples, you can use a spline basis matrix instead. However, this comes at the cost of reducing the interpretability of your results.

Is it correct to use the untransformed variable (TG- in this case not normally distributed) as an independent variable, or should I rather used log2 transformed value?

See my previous point. Is log-gene expression linear with respect to the triglyceride levels, or linear with respect to the log-levels? Ideally, you'd have some biological knowledge about the system that you could use to make this decision. For example, I would say that it would make sense to log-transform the covariate, because I could easily imagine that many expression responses will be linear with respect to concentration (and thus log-expression will be linear with log-concentration). It's a bit harder - though not inconceivable - to imagine genes that increase exponentially with concentration, in which case linearity of log-expression values with the untransformed levels would be more appropriate.

If you can't make a reasonable decision a priori... well, just try them both. The better-fitting model should have lower average dispersion estimates and (in this simple case) probably more DE genes. Note that more DE genes doesn't always mean a better-fitting model; poor fits for models involving blocking factors can result in more false positives, for example. And, it must be said, choosing a model that gives you the lowest variance doesn't always return the "correct" model, due to natural stochasticity in the quality of the fit. Our hope is that a consistent effect over thousands of genes is a reliable indicator of the model's suitability.

An additional complication is that you can only use one or the other in any single limma/edgeR run. Some genes might respond linearly while other genes might respond exponentially, but you have to use the same design matrix for both. If this is a concern, you should use splines; then none of this would be an issue, because given enough knots you can approximate any curve for any gene.

The lack of normality in the covariate is completely irrelevant to edgeR and co.

As far as I understand logFC in case of continuous trait the beta coefficient, right?

In your case, the log-fold change in the log-change in expression per unit of increase in the triglyceride concentration. Or increase in the log-unit, if you decide to log-transform your covariates.

Aaron, If I may ask one more thing. What is the point of including 0, I know it says "an instruction not to include an intercept column and instead to include a column for each group", but what does it do in simple words? Should my code include it?

> design <- model.matrix( ~TG)
# or rather
> design <- model.matrix( ~0+TG)


Thanks again!

1

It would be unwise to have ~0 in this case. This would remove the intercept, which means that you're requiring that all of your genes have an average log-expression of zero at a triglyceride concentration of zero. Such a requirement would be even more unreasonable than assuming linearity.

The use of ~0 is more common for models with factors, where it simplifies the interpretation of the coefficients in some situations. In your case, though, there's not much extra simplification that's needed. One coefficient is the intercept and another is the slope - pretty straightforward.

Thanks a million Aaron!

Answer: Gene expression associated with continuous (quantitative) traits (Limma/edgeR/co
0
5 months ago by
Netherlands
mikhael.manurung190 wrote:

I do not think you can use a continuous variable in your design matrix for either limma or edgeR. Well, you can use it as a covariate for adjustment of confounding or batch effects.

It is possible to correlate the expression of your genes with your continuous predictor, provided you have removed the mean-variance dependency of your gene exp data with either rlog or vst transformation from the DESeq2 package.

I think it is worth it to try looking into the CEMiTool package. This method will greatly reduce the dimensionality of your data by summarising your genes into gene modules. Afterwards, you can correlate the activity (i.e. eigengene) of the modules to your outcome!

I hope this helps.

Best, Mikhael