fitFDist() shows poor fit if df1=1
1
0
Entering edit mode
@frederik-ziebell-14676
Last seen 1 day ago
Heidelberg, Germany

If df1=1 and df2 is larger than about 25, the estimate of fitFDist() results in df2 = Inf rather than giving a good estimate. Here's an example:

set.seed(1)
scale <- .3
df1 <- 1
df2 <- 50
x <- scale*rf(10^6,df1,df2)
fitFDist(x,df1)

# $scale
# [1] 0.3262758
# $df2
# [1] Inf

Conversely, if one sets df1 <- 2 or df2 <- 20, a good estimate is returned. Is this a bug? How can we get a good estimate when df1 = 1 and df2 > 25?

limma • 733 views
ADD COMMENT
1
Entering edit mode
@gordon-smyth
Last seen 6 hours ago
WEHI, Melbourne, Australia

First, let me say that df2=Inf is a legimate value for df2 that is deliberately set by fitFDist whenever the variances are more homogeneous than expected, i.e., when the variance of log(x) is smaller than expected. The infinite value does not arise from floating underflow or numerical instabilility.

For the scenario in your simulation, the function is over-estimating both df2 and the scale somewhat, but there is a reason for this. Before I explain the reason, let me explain why this over-estimation isn't usually a problem. The purpose of limma is obviously to analyse expression data rather than to solve theoretical mathematical problems. The fitFDist function is not called directly by users and the scenario of df1=1 and df2=50 is almost impossible for real data because it implies almost no replication in the experiment and virtually identical genewise variances. Even in the rare cases that data like this could occur, the over-estimation of df2 does not have a dramatic effect on the DE results. The infinite value is not used to compute p-values and there is compensating over-estimation of the scale parameter that prevents the p-values from becoming over-significant.

The over-estimation of df2 in the above context is caused by limma's robustness checking. The fitFDist function checks for very small variances that can sometimes arise from rounding of expression data. Extremely small variances are reset to a larger value, equal to 1e-5 times the median of the genewise variances. The extreme scenario that you've simulated generates very small variances with some being 1e-14 times the median. When fitFDist resets the smallest variances, it reduces the variance of the input values on the log-scale and this leads to over-estimation of df2.

I don't plan to change this behaviour because robustness is a higher priority for limma than accuracy of hyperparameters in theoretical scenarios than don't occur in practice. Your simulation generates genewise variances that are extremely variable, with the smallest being 1e-16 or less times the largest value. It isn't realistic for limma to trust variances that are so small as being an accurate reflection of biological reproducibility because expression data isn't measured and recorded to this precision in practice. Expression data would have to be recorded to more than 16 significant figures to make such an analysis meaningful.

ADD COMMENT
0
Entering edit mode

Thank you for the clarification. We have an analysis with limited amount of data (just 1 residual degree of freedom) and figured out that the prior degrees of freedom were estimated as Inf, similar to the above scenario. I wouldn't expect that a real-valued df2 estimate would influence downstream results because df2 clearly dominates df1, but just wanted to check whether some numerical instability was causing this result.

I agree that with this limited amount of data, variance estimates are very unstable and so it is sensible that variances are set to the prior estimate.

ADD REPLY
0
Entering edit mode

No, it's not numerical instability. Inf is a legitimate value for df2 that is deliberatory set by fitFDist whenever the variances are more homogeneous than expected. If I modify fitFDist to omit the variance thresholding then it gives for your simulation:

> fitFDist(x,df1)
$scale
[1] 0.3003433

$df2
[1] 50.97547
ADD REPLY

Login before adding your answer.

Traffic: 929 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6