segfault in edgeR glmQLFit
Mark • 0
@6ae9ccb7
United States

I am getting a segfault from the glmQLFit function when analyzing a 16S metagenomics data set with edgeR. The error is:

Error: segfault from C stack overflow

Other data sets run fine for me in this installation of edgeR; only this particular data set triggers the error. It has a large number of samples (155) and a small number of genomic features (13 bacterial orders), and I'm not sure whether those dimensions are part of the issue. Any help would be welcome.

Thanks, Mark

These are the commands I am running:

library(edgeR)
dat = read.delim("microbial_data.txt",row.names=1)
meta = read.delim("microbial_metadata.txt",row.names=1)

f1 = factor(meta[,1])
f2 = factor(meta[,2])
f3 = factor(meta[,3])
f4 = factor(meta[,4])

design = model.matrix(~ f1 + f2 + f3 + f4)

dge = DGEList(dat)
dge = calcNormFactors(dge)
dge = estimateDisp(dge, design)
fit = glmQLFit(dge, design)

Here is the sessionInfo():

R version 4.5.2 (2025-10-31)
Platform: x86_64-apple-darwin20
Running under: macOS Sequoia 15.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_4.8.0  limma_3.66.0

loaded via a namespace (and not attached):
[1] compiler_4.5.2  grid_4.5.2      locfit_1.5-9.12 lattice_0.22-7 
[5] statmod_1.5.1

The data sets, as well as the above R script and sessionInfo, are available here: Box link to data

@gordon-smyth
WEHI, Melbourne, Australia

Dear Mark,

Thank you for providing a reproducible example. I confirm that I can reproduce the segfault myself on my own computer. We will have a look at this and will fix whatever the problem is.

In the meantime, let me say that estimateDisp() is no longer required when running glmQLFit() in edgeR v4. glmQLFit() prefers to estimate the NB dispersion itself. If you simply remove the estimateDisp() step, then your code will run without a segmentation fault.
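
As a rough sketch of that workaround, the pipeline from the question can be run with the estimateDisp() call dropped. The wrapper name fit_without_estimateDisp is hypothetical (it is not an edgeR function) and is used only to keep the example self-contained; everything inside it is taken from the original script.

```r
library(edgeR)

# Hypothetical wrapper (illustrative name, not part of edgeR):
# the pipeline from the question with the estimateDisp() step removed.
fit_without_estimateDisp <- function(counts, design) {
  dge <- DGEList(counts)
  dge <- calcNormFactors(dge)
  # No estimateDisp() here: as noted above, glmQLFit() in edgeR v4
  # estimates the NB dispersion itself when none has been supplied.
  glmQLFit(dge, design)
}

# Usage with the objects built in the question:
# fit <- fit_without_estimateDisp(dat, design)
```

Equivalently, deleting the single estimateDisp() line from the original script should have the same effect.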

The basic problem here has to do with the estimation of library size normalisation factors. The number of genomic features is very small (just 13), some of the library sizes are extremely small (as few as 82 reads), and the between-sample variation is enormous. I don't think that the basic assumption required by calcNormFactors(), that most rows are not DE, is satisfied here. You might consider just skipping the calcNormFactors() step as well.
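
To see whether a count table has the properties described above (few features, very small libraries, a wide spread of library sizes), a quick base-R summary along the following lines may help. The helper name summarise_libraries and the toy matrix are made up for illustration; they are not part of edgeR and are not the data from the question.

```r
# Hypothetical helper (illustrative name, base R only): summarises the
# dimensions and library sizes of a count table.
summarise_libraries <- function(counts) {
  lib_size <- colSums(counts)  # total reads per sample
  list(
    n_features   = nrow(counts),
    n_samples    = ncol(counts),
    min_lib_size = min(lib_size),
    max_lib_size = max(lib_size),
    # Ratio of largest to smallest library; Inf if a sample has zero reads
    lib_size_fold_range = max(lib_size) / min(lib_size)
  )
}

# Toy counts (invented for illustration): 3 features by 4 samples,
# filled column by column, with one very shallow sample.
toy <- matrix(c(10, 5, 2,
                400, 300, 100,
                50, 20, 12,
                9000, 4000, 1000),
              nrow = 3,
              dimnames = list(paste0("order", 1:3), paste0("s", 1:4)))
str(summarise_libraries(toy))
```

On the data from the question this would be called as summarise_libraries(dat). If calcNormFactors() is then skipped, the normalisation factors stay at the value of 1 that DGEList() assigns by default, so the effective library sizes are simply the column totals.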

Best wishes, Gordon
