segfault in edgeR glmQLFit
Mark (@6ae9ccb7), United States

I am getting a segfault from the glmQLFit function when analyzing a 16S metagenomics data set with edgeR. The error is:

Error: segfault from C stack overflow

Other data sets run fine for me in this installation of edgeR; only this particular data set triggers the error. It has a large number of samples (155) and a small number of genomic features (13 bacterial orders), so I'm not sure whether those dimensions are part of the issue. Any help would be welcome.

Thanks, Mark

These are the commands I am running:

library(edgeR)
dat = read.delim("microbial_data.txt",row.names=1)
meta = read.delim("microbial_metadata.txt",row.names=1)

f1 = factor(meta[,1])
f2 = factor(meta[,2])
f3 = factor(meta[,3])
f4 = factor(meta[,4])

design = model.matrix(~ f1 + f2 + f3 + f4)

dge = DGEList(dat)
dge = calcNormFactors(dge)
dge = estimateDisp(dge, design)
fit = glmQLFit(dge, design)

Here is the sessionInfo():

R version 4.5.2 (2025-10-31)
Platform: x86_64-apple-darwin20
Running under: macOS Sequoia 15.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_4.8.0  limma_3.66.0

loaded via a namespace (and not attached):
[1] compiler_4.5.2  grid_4.5.2      locfit_1.5-9.12 lattice_0.22-7 
[5] statmod_1.5.1

The data sets, as well as the above R script and sessionInfo, are available here: Box link to data

@gordon-smyth
WEHI, Melbourne, Australia

Dear Mark,

Thank you for providing a reproducible example. I confirm that I can reproduce the segfault on my own computer. The problem is caused by the fact that glmQLFit() assumes NB dispersions <4, whereas estimateDisp() is returning dispersions of about 6 for this data. We thought that dispersions this large would not be useful for real datasets. Anyway, we will insert a check into glmQLFit() to prevent errors of this type from occurring in the future.
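
A minimal sketch of how one might check the dispersions returned by estimateDisp() before fitting. It uses simulated counts as a stand-in for the real data; the simulation settings (mu, size, a single two-level factor) and the variable name too_large are assumptions for illustration and do not come from the thread. glmQLFit() is deliberately not called here.

```r
library(edgeR)

# Simulated stand-in for the real data: 13 features x 155 samples with heavy
# overdispersion. mu and size are illustrative assumptions, not from the thread.
set.seed(1)
counts <- matrix(rnbinom(13 * 155, mu = 200, size = 0.2), nrow = 13)
group  <- factor(rep(c("a", "b"), length.out = 155))
design <- model.matrix(~ group)

dge <- DGEList(counts)
dge <- estimateDisp(dge, design)

# Inspect the estimated NB dispersions before handing them to glmQLFit().
print(dge$common.dispersion)
print(summary(dge$tagwise.dispersion))

# TRUE if any feature reaches the threshold of 4 quoted above.
too_large <- any(dge$tagwise.dispersion >= 4)
print(too_large)
```

If the check flags large values, that suggests the data are in the regime described above, where skipping estimateDisp() is the simpler route.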

In the meantime, let me say that estimateDisp() is no longer required when running glmQLFit() in edgeR v4. glmQLFit() prefers to estimate the NB dispersion itself. If you simply remove the estimateDisp() step, then your code will run without a segmentation fault.
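
As a rough illustration of that shorter workflow, the sketch below fits the model without calling estimateDisp(). The counts are simulated and the design has a single hypothetical factor f1; both are assumptions standing in for the files and four factors in the original script.

```r
library(edgeR)

# Simulated stand-in data: 13 features x 155 samples (illustrative values only).
set.seed(1)
counts <- matrix(rnbinom(13 * 155, mu = 200, size = 2), nrow = 13,
                 dimnames = list(paste0("order", 1:13), paste0("s", 1:155)))
f1 <- factor(rep(c("a", "b"), length.out = 155))
design <- model.matrix(~ f1)

dge <- DGEList(counts)
dge <- calcNormFactors(dge)

# No estimateDisp() step: glmQLFit() in edgeR v4 estimates the NB dispersion itself.
fit <- glmQLFit(dge, design)

# Quasi-likelihood F-test for the second coefficient (f1 level "b" vs "a").
res <- glmQLFTest(fit, coef = 2)
print(topTags(res))
```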

Apart from sheer variability, there is a problem here with the estimation of library size normalisation factors. The number of genomic features is very small (just 13), some of the library sizes are extremely small (as few as 82 reads), and the between-sample variation is enormous. I don't think that the basic assumption required by calcNormFactors(), that most rows are not DE, is satisfied here. You might consider just skipping the calcNormFactors() step as well.
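
A small sketch of what skipping calcNormFactors() looks like in practice, again on simulated counts (the data and the factor f1 are assumptions for illustration). DGEList() initialises every normalisation factor to 1, so leaving out calcNormFactors() means the raw library sizes are used as they are.

```r
library(edgeR)

# Simulated stand-in data: 13 features x 155 samples (illustrative values only).
set.seed(2)
counts <- matrix(rnbinom(13 * 155, mu = 200, size = 2), nrow = 13)
f1 <- factor(rep(c("a", "b"), length.out = 155))
design <- model.matrix(~ f1)

dge <- DGEList(counts)

# Look at the library sizes; very small totals are a warning sign.
print(summary(dge$samples$lib.size))

# Without calcNormFactors(), the normalisation factors stay at their default of 1.
factors_are_one <- all(dge$samples$norm.factors == 1)
print(factors_are_one)

# Fit directly, with neither calcNormFactors() nor estimateDisp().
fit <- glmQLFit(dge, design)
```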

Best wishes,
Gordon


Thanks! We had been considering some of these adjustments as well, given the small number of features and total reads. Your advice is well-taken.


We have committed a bug fix now to edgeR on both the release and developmental Bioconductor repositories.
