Question

limma: R-squared for time-series data

ATpoint ★ 5.0k

Dear Gordon,

I am referring to section 8.7 of the RNAseq123 workflow. For time-series experiments of this sort, where we assume a certain pattern (circadian, for example), wouldn't it make sense to calculate an R-squared value in addition to the p-value, to assess how well the data fit the theoretical model? Since it is not in the workflow, there is probably a good reason. May I ask what that reason is, and whether R-squared makes sense with limma?

limma • 1.2k views

updated 3.2 years ago by Gordon Smyth 53k • written 3.2 years ago by ATpoint ★ 5.0k

score 2 · Accepted Answer · 2022-12-14

limma computes the F-test, which measures the sums of squares (SS) per df for the trend as compared to the SS per df for the residual. That seems to me to give a pretty good idea of goodness of fit. I view F-statistics as superior to R-square in pretty much every context because they take account of available df whereas R-square doesn't. In the limma context, F-statistics have the additional advantage of using empirical Bayes, whereas R-square can't take advantage of that.
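To make the df point concrete, here is a minimal Python sketch that converts an ordinary R-square into the ordinary F-statistic for the same fit. It is not limma code and it does not reproduce limma's moderated (empirical Bayes) F-statistic. The function name `f_from_r2` and the example numbers are made up for illustration. The identity itself follows from R-square = SS_trend / (SS_trend + SS_residual).

```python
def f_from_r2(r2, df_trend, df_resid):
    """Ordinary F-statistic implied by an R-square and its degrees of freedom.

    F = (SS_trend / df_trend) / (SS_resid / df_resid)
      = (r2 / df_trend) / ((1 - r2) / df_resid)
    """
    return (r2 / df_trend) / ((1.0 - r2) / df_resid)


# Hypothetical example: the same R-square under two different designs.
f_many_resid_df = f_from_r2(0.8, df_trend=2, df_resid=20)  # roughly 40
f_one_resid_df = f_from_r2(0.8, df_trend=4, df_resid=1)    # roughly 1

print(round(f_many_resid_df, 2), round(f_one_resid_df, 2))
```

The same R-square of 0.8 corresponds to a large F-statistic with 2 and 20 df but to an F-statistic of about 1 with 4 and 1 df, which is what pure noise would typically produce.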

So the reason is basically that I don't like R-square values. I wouldn't know what to do with them, and they are so open to misinterpretation. For example, if you fit a 5df trend to 6 observations then you expect to get an R-square of 5/6 just by chance.
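The 5/6 figure can be checked by simulation. The sketch below is an illustration in plain Python, not limma code, and it rests on two assumptions that the post does not spell out: the 5 df counts every fitted coefficient (here a degree-4 polynomial in time, chosen arbitrarily), and R-square is computed about zero (uncentred). With the more common centred R-square, an intercept plus 4 trend terms would give an expected value of 4/5 by the same argument, so the conclusion is the same. All function names are hypothetical.

```python
import random


def orthonormal_basis(columns):
    """Modified Gram-Schmidt: orthonormalise the design-matrix columns."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            dot = sum(q_i * v_i for q_i, v_i in zip(q, v))
            v = [v_i - dot * q_i for v_i, q_i in zip(v, q)]
        norm = sum(v_i * v_i for v_i in v) ** 0.5
        basis.append([v_i / norm for v_i in v])
    return basis


def uncentred_r2(y, basis):
    """Fitted SS divided by total SS, with total SS taken about zero."""
    fitted_ss = sum(sum(q_i * y_i for q_i, y_i in zip(q, y)) ** 2 for q in basis)
    total_ss = sum(y_i * y_i for y_i in y)
    return fitted_ss / total_ss


def mean_null_r2(n_obs=6, n_coef=5, n_sim=20000, seed=1):
    """Average R-square when a trend is fitted to pure Gaussian noise."""
    rng = random.Random(seed)
    # Equally spaced time points rescaled to [-1, 1] for numerical stability.
    times = [2 * i / (n_obs - 1) - 1 for i in range(n_obs)]
    # Polynomial trend: columns 1, t, t^2, ..., t^(n_coef - 1).
    columns = [[t ** p for t in times] for p in range(n_coef)]
    basis = orthonormal_basis(columns)
    total = 0.0
    for _ in range(n_sim):
        y = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]  # no true trend at all
        total += uncentred_r2(y, basis)
    return total / n_sim


mean_r2 = mean_null_r2()
print(round(mean_r2, 3))  # should land close to 5/6
```

Even though the simulated data contain no trend whatsoever, the average R-square comes out near 5/6, because the fit uses up 5 of the 6 available df.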

I think that "goodness of fit" is an over-used term. I don't think that it actually has any well-defined meaning for normal linear regression. The same remark would apply to any regression context where there is a variance parameter that needs to be estimated.

I only ever use R-square when there is no variance parameter to be estimated and when the number of observations is really huge compared to the number of parameters in the fitted model, for example Table 1 of this paper: https://doi.org/10.1101/2022.07.02.498573. That table uses percent deviance explained, which is a direct generalization of R-square for glms and more general likelihood models.