Question

segfault in edgeR glmQLFit

Mark • 0

I am getting a segfault from the glmQLFit function when analyzing a 16S metagenomics data set with edgeR. The error is:

Error: segfault from C stack overflow

Other data sets run fine for me in this installation of edgeR; only this particular data set triggers the error. It has a large number of samples (155) and a small number of genomic features (13 bacterial orders), so I'm not sure whether those dimensions are part of the issue. Any help would be welcome.

Thanks, Mark

These are the commands I am running:

library(edgeR)
dat = read.delim("microbial_data.txt",row.names=1)
meta = read.delim("microbial_metadata.txt",row.names=1)

f1 = factor(meta[,1])
f2 = factor(meta[,2])
f3 = factor(meta[,3])
f4 = factor(meta[,4])

design = model.matrix(~ f1 + f2 + f3 + f4)

dge = DGEList(dat)
dge = calcNormFactors(dge)
dge = estimateDisp(dge, design)
fit = glmQLFit(dge, design)

Here is the sessionInfo():

R version 4.5.2 (2025-10-31)
Platform: x86_64-apple-darwin20
Running under: macOS Sequoia 15.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_4.8.0  limma_3.66.0

loaded via a namespace (and not attached):
[1] compiler_4.5.2  grid_4.5.2      locfit_1.5-9.12 lattice_0.22-7 
[5] statmod_1.5.1

The data sets, as well as the above R script and sessionInfo, are available here: Box link to data

edgeR • 125 views

Updated 1 hour ago by Gordon Smyth • written 3 days ago by Mark

score 0 · Answer 1 · 2025-12-20

Dear Mark,

Thank you for providing a reproducible example. I can reproduce the segfault on my own computer. The problem is that glmQLFit() assumes NB dispersions <4, whereas estimateDisp() is returning dispersions of about 6 for this data. We thought that dispersions this large would not be useful for real datasets. Anyway, we will insert a check into glmQLFit() to prevent errors of this type from occurring in the future.
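
A minimal sketch of how one might check for this situation before calling glmQLFit(). It uses simulated counts as a stand-in for the real data, so the count matrix, the `group` factor and the variable names are assumptions for illustration only, and the simulated dispersions will not match the values of about 6 seen with the real data:

```r
library(edgeR)

# Simulated stand-in for the real data (assumption): 13 features x 155 samples
set.seed(1)
counts <- matrix(rnbinom(13 * 155, mu = 200, size = 2), nrow = 13)
group <- factor(rep(c("a", "b"), length.out = 155))
design <- model.matrix(~ group)

dge <- DGEList(counts)
dge <- calcNormFactors(dge)
dge <- estimateDisp(dge, design)

# Inspect the NB dispersions returned by estimateDisp()
dge$common.dispersion
summary(dge$tagwise.dispersion)

# Flag any dispersion at or above the limit of 4 mentioned above
# (c() silently drops components that are NULL)
max_disp <- max(c(dge$common.dispersion,
                  dge$trended.dispersion,
                  dge$tagwise.dispersion))
disp_too_large <- max_disp >= 4
disp_too_large
```

If `disp_too_large` is TRUE for a real data set, passing those dispersions on to glmQLFit() is the situation described above.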

In the meantime, let me say that estimateDisp() is no longer required when running glmQLFit() in edgeR v4. glmQLFit() prefers to estimate the NB dispersion itself. If you simply remove the estimateDisp() step, then your code will run without a segmentation fault.
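
A minimal sketch of the pipeline without the estimateDisp() step, again on simulated counts standing in for the real data (the count matrix, the `group` factor and the variable names are assumptions for illustration):

```r
library(edgeR)

# Simulated stand-in for the real data (assumption): 13 features x 155 samples
set.seed(1)
counts <- matrix(rnbinom(13 * 155, mu = 200, size = 2), nrow = 13)
group <- factor(rep(c("a", "b"), length.out = 155))
design <- model.matrix(~ group)

dge <- DGEList(counts)
dge <- calcNormFactors(dge)

# No estimateDisp() call, so the DGEList carries no dispersion estimates
is.null(dge$common.dispersion)

# glmQLFit() in edgeR v4 estimates the NB dispersion itself
fit <- glmQLFit(dge, design)
```

With the script from the question, the equivalent change is to delete the `dge = estimateDisp(dge, design)` line and leave the rest as it is.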

Apart from sheer variability, there is a problem here with the estimation of library size normalisation factors. The number of genomic features is very small (just 13), some of the library sizes are extremely small (as few as 82 reads), and the between-sample variation is enormous. I don't think that the basic assumption required by calcNormFactors(), that most rows are not DE (differentially expressed), is satisfied here. You might consider just skipping the calcNormFactors() step as well.
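
A minimal sketch of checking the library sizes and fitting without calcNormFactors(), on simulated counts standing in for the real data (the count matrix, the `group` factor and the variable names are assumptions for illustration). When calcNormFactors() is skipped, the normalisation factors keep the default value of 1 set by DGEList(), so only the raw library sizes are used:

```r
library(edgeR)

# Simulated stand-in for the real data (assumption): 13 features x 155 samples
set.seed(1)
counts <- matrix(rnbinom(13 * 155, mu = 200, size = 2), nrow = 13)
group <- factor(rep(c("a", "b"), length.out = 155))
design <- model.matrix(~ group)

dge <- DGEList(counts)

# Look for very small libraries before trusting any normalisation
summary(dge$samples$lib.size)

# calcNormFactors() is skipped, so the factors stay at the DGEList() default
norm_factors_unchanged <- all(dge$samples$norm.factors == 1)

# estimateDisp() is skipped as well; glmQLFit() estimates the dispersion
fit <- glmQLFit(dge, design)
```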

Best wishes,
Gordon