My first question is which tool would be the best choice to answer my question (Limma, edgeR or Pearson correlation analysis) and why?
Both edgeR and voom
-limma work fine here. I don't know what you mean by "regular pearson correlation analysis" - I assume you're testing for a non-zero correlation. edgeR and limma have a few advantages over such a simple approach:
- They will stabilize the variance estimates by sharing information between genes through empirical Bayes shrinkage, improving power to detect significant differences.
- They will account for differences in precision across observations due to differences in library size between samples, something that you're probably not doing with an off-the-shelf test for non-zero correlations.
- They can accommodate non-linear trends by using a spline basis matrix in the
design
.
You're getting a lot more DE genes from edgeR, probably because you're using the older GLM approach. Try using the quasi-likelihood methods, i.e., glmQLFit
and glmQLFTest
, that are closer to the behaviour of voom
and limma in theory and in practice.
While both of these work fine, if you want a recommendation for a single method, I would probably go with voom
-limma because it is faster and plugs more easily into downstream methods, e.g., gene set testing.
Are my assumptions correct?
I'm not sure what assumptions you're referring to. The only obvious one is that you're assuming that log-gene expression is linear with respect to increasing triglyceride level. Whether or not that is reasonable is debatable. If you want to avoid that assumption, and if you have enough samples, you can use a spline basis matrix instead. However, this comes at the cost of reducing the interpretability of your results.
Is it correct to use the untransformed variable (TG- in this case not normally distributed) as an independent variable, or should I rather used log2 transformed value?
See my previous point. Is log-gene expression linear with respect to the triglyceride levels, or linear with respect to the log-levels? Ideally, you'd have some biological knowledge about the system that you could use to make this decision. For example, I would say that it would make sense to log-transform the covariate, because I could easily imagine that many expression responses will be linear with respect to concentration (and thus log-expression will be linear with log-concentration). It's a bit harder - though not inconceivable - to imagine genes that increase exponentially with concentration, in which case linearity of log-expression values with the untransformed levels would be more appropriate.
If you can't make a reasonable decision a priori... well, just try them both. The better-fitting model should have lower average dispersion estimates and (in this simple case) probably more DE genes. Note that more DE genes doesn't always mean a better-fitting model; poor fits for models involving blocking factors can result in more false positives, for example. And, it must be said, choosing a model that gives you the lowest variance doesn't always return the "correct" model, due to natural stochasticity in the quality of the fit. Our hope is that a consistent effect over thousands of genes is a reliable indicator of the model's suitability.
An additional complication is that you can only use one or the other in any single limma/edgeR run. Some genes might respond linearly while other genes might respond exponentially, but you have to use the same design matrix for both. If this is a concern, you should use splines; then none of this would be an issue, because given enough knots you can approximate any curve for any gene.
The lack of normality in the covariate is completely irrelevant to edgeR and co.
As far as I understand logFC in case of continuous trait the beta coefficient, right?
In your case, the log-fold change in the log-change in expression per unit of increase in the triglyceride concentration. Or increase in the log-unit, if you decide to log-transform your covariates.
Thanks a lot Aaron for your clear answers!
Aaron, If I may ask one more thing. What is the point of including 0, I know it says "an instruction not to include an intercept column and instead to include a column for each group", but what does it do in simple words? Should my code include it?
Thanks again!
It would be unwise to have
~0
in this case. This would remove the intercept, which means that you're requiring that all of your genes have an average log-expression of zero at a triglyceride concentration of zero. Such a requirement would be even more unreasonable than assuming linearity.The use of
~0
is more common for models with factors, where it simplifies the interpretation of the coefficients in some situations. In your case, though, there's not much extra simplification that's needed. One coefficient is the intercept and another is the slope - pretty straightforward.Thanks a million Aaron!