[Bug] Segfault in edgeR::cpm
2
1
Entering edit mode
@b35c41e0
Last seen 6 months ago
Germany

Hello all,

I am trying to get cpm counts from this GEO accession [1] following this method. However I get a segfault ("memory not mapped") killing my entire R session.

Edit: My issue is not getting this to run at all, the issue is that I get a segfault with these precise inputs.
Edit2: As stated in this reply I originally ran into this issue on a compute server with about 1TB of RAM in terms of memory leeway. Following the related comment I also ran the example code on my personal laptop on WSL2 (30GB of RAM allocated + 100GB of swap), to the same result. Output appended at [2].

count_mat dimensions: nrow = 29'733, ncol = 100'064

Reproducible example (Edit: code used to generate the RDS provided below):

library("edgeR")
library("Matrix")

dl_url <- "https://bimsbstatic.mdc-berlin.de/akalin/jfreige/wu_raw_crash_objs.RDS"
dl_dest <- "wu_raw_crash_objs.RDS"
if (!file.exists(dl_dest)) {
  download.file(dl_url, dl_dest, timeout = 60^2)
}

crash_objs <- readRDS(dl_dest)
count_mat <- crash_objs$count_mat
norm_factors <- crash_objs$norm_factors

print(sessionInfo())

# This will generate a segmentation fault.
edgeR::cpm(count_mat, norm_factors * colSums(count_mat))

Output:

Loading required package: limma
trying URL 'https://bimsbstatic.mdc-berlin.de/akalin/jfreige/wu_raw_crash_objs.RDS'                                                                                       Content type 'application/octet-stream' length 412819144 bytes (393.7 MB)
==================================================
downloaded 393.7 MB

R version 4.3.1 (2023-06-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS/LAPACK: /gnu/store/d94sxy6axgzd3xr6bnp49l4v1nqzf825-openblas-0.3.20/lib/libopenblasp-r0.3.20.so;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] edgeR_3.42.4 limma_3.56.2

loaded via a namespace (and not attached):
[1] compiler_4.3.1 Rcpp_1.0.11    grid_4.3.1     locfit_1.5-9.8 lattice_0.21-9

 *** caught segfault ***
address 0x7fc0514ff320, cause 'memory not mapped'

Traceback:
 1: cpm.default(count_mat, norm_factors * colSums(count_mat))
 2: edgeR::cpm(count_mat, norm_factors * colSums(count_mat))
An irrecoverable exception occurred. R is aborting now ...
Segmentation fault (core dumped)

Entire example script for posterity, the file at https://bimsbstatic.mdc-berlin.de/akalin/jfreige/wu_raw_crash_objs.RDS may not stay there long after this is resolved:

library("edgeR")

options("timeout" = 60 * 60)

download_raw <- T

if (download_raw) {
  library("Matrix")

  files_needed <- c(
    cells = "count_matrix_barcodes.tsv",
    genes = "count_matrix_genes.tsv",
    matrix = "count_matrix_sparse.mtx"
  )

  dl_url <- "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE176nnn/GSE176078/suppl/GSE176078_Wu_etal_2021_BRCA_scRNASeq.tar.gz"
  dl_dest <- "wu_raw.tar"
  unpacked_dir_name <- "Wu_etal_2021_BRCA_scRNASeq/"

  needed_paths <- paste0(unpacked_dir_name, "/", files_needed)
  names(needed_paths) <- names(files_needed)

  if (!all(lapply(needed_paths, file.exists))) {
    download.file(url = dl_url, destfile = dl_dest)
    untar(dl_dest)
  }

  matrix <- Matrix::readMM(needed_paths["matrix"])

  matrix@Dimnames <- list(
    readLines(needed_paths["genes"]),
    readLines(needed_paths["cells"])
  )

  count_mat <- as.matrix(matrix)

  norm_factors <- edgeR::calcNormFactors(count_mat)

  save_crash_objs <- FALSE

  if (save_crash_objs) {
    saveRDS(
      list(
        count_mat = count_mat,
        norm_factors = norm_factors
      ),
      "wu_raw_crash_objs.RDS"
    )
  }
} else {
  dl_url <- "https://bimsbstatic.mdc-berlin.de/akalin/jfreige/wu_raw_crash_objs.RDS"
  dl_dest <- "wu_raw_crash_objs.RDS"
  if (!file.exists(dl_dest)) {
    download.file(url = dl_url, destfile = dl_dest)
  }

  crash_objs <- readRDS(dl_dest)
  count_mat <- crash_objs$count_mat
  norm_factors <- crash_objs$norm_factors
}

print(sessionInfo())

# This will generate a segmentation fault.
edgeR::cpm(count_mat, norm_factors * colSums(count_mat))

[1] Wu SZ, Al-Eryani G, Roden DL, Junankar S et al. A single-cell and spatially resolved atlas of human breast cancers. Nat Genet 2021 Sep;53(9):1334-1347. PMID: 34493872
[2] WSL2 Run output:

Loading required package: limma
trying URL 'https://bimsbstatic.mdc-berlin.de/akalin/jfreige/wu_raw_crash_objs.RDS'
Content type 'application/octet-stream' length 412819144 bytes (393.7 MB)
==================================================
downloaded 393.7 MB

R version 4.3.1 (2023-06-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS/LAPACK: /gnu/store/z9aavic2jyvp6jyfxnl9ka3i8vfk5phc-openblas-0.3.20/lib/libopenblasp-r0.3.20.so;  LAPACK version 3.9.0

locale:
[1] C

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Matrix_1.6-1.1 edgeR_3.42.4   limma_3.56.2

loaded via a namespace (and not attached):
[1] compiler_4.3.1 Rcpp_1.0.11    grid_4.3.1     locfit_1.5-9.8 lattice_0.21-9

 *** caught segfault ***
address 0x7fcba94ff300, cause 'memory not mapped'

Traceback:
 1: cpm.default(count_mat, norm_factors * colSums(count_mat))
 2: edgeR::cpm(count_mat, norm_factors * colSums(count_mat))
An irrecoverable exception occurred. R is aborting now ...
edgeR segfault • 1.4k views
ADD COMMENT
3
Entering edit mode
@gordon-smyth
Last seen 10 hours ago
WEHI, Melbourne, Australia

We were initially surprised by this error because we've never had any problems with edgeR::cpm before. After a day of investigation, we think we have identified the cause of the segfault.

Your matrix count_mat is a "long vector", meaning that length(count_mat) is greater than .Machine$integer.max. Long vectors were not allowed in R prior to R 3.0.0 and they still cause problems in some contexts. It seems that cpm does not work on double-precision matrices whose length is too large to be represented as an integer. I think you will be able to confirm on your system that

> sqrt(.Machine$integer.max)
[1] 46340.95
> z <- cpm(matrix(1,46340,46340))

works, but

> z <- cpm(matrix(1,46341,46341))

causes a segfault.

The problem arises because a 32-bit integer variable is somewhere being used to index the elements of count_mat. We don't do any such indexing in edgeR and there doesn't seem to be anything that we can change in edgeR. edgeR manages C++ code via the Rcpp package and the issue seems to be with Rcpp and perhaps also with how your version of R was compiled. We will raise the issue with the Rcpp maintainers.

For reasons that I don't understand, integer matrices behave differently from numeric matrices, so

> z <- cpm(matrix(1L,46341,46341))

works fine.

There are many work-arounds in the context of your analysis. Any of the following should avoid the segfault:

  • Reset the values of count_mat to be integer.

  • Filter the rows of count_mat to exclude genes that are detected in very few cells. That will avoid the segfault if the number of rows remaining is 21461 or less. That is a good idea anyway because edgeR analyses assume that such filtering has been done.

  • Compute cpm using plain R by z <- t(t(count_mat)/L) where L is the vector of normalized library sizes in millions.

  • Compute cpm using Matrix sparse arithmetic by z <- matrix %*% Diagonal(x=1/L).

ADD COMMENT
1
Entering edit mode

That must have been quite annoying to tease out, thank you very much for taking the time! Indeed running cpm on an numeric matrix with more than 46'340 rows/columns generated a segfault for me as well, while doing the same with an integer matrix did not.

Thank you also very much for compiling this list of options for avoiding this issue!

ADD REPLY
1
Entering edit mode

https://github.com/RcppCore/Rcpp/issues/1280 if you want to follow the progress

ADD REPLY
0
Entering edit mode
@james-w-macdonald-5106
Last seen 19 hours ago
United States

Trying to use scRNA-Seq data as if it were bulk RNA-Seq is not likely to work well. Instead you should be analyzing as if it were scRNA-Seq instead, using e.g., scran and scater, etc. You can then collapse cell types and do pseudobulk analysis if you want.

ADD COMMENT
0
Entering edit mode

Thank you for your response!
I am aware that this is not entirely sensible in terms of results and computational efficiency, I mainly need this for comparison. Before I start to fiddle around with the implementation myself, I'll try to get the method as close as the paper's implementation. There's a couple more steps involved as this is also UMI data which the script I linked wasn't really designed for, this is just the minimal example I managed to reproduce the segfault on.

ADD REPLY

Login before adding your answer.

Traffic: 519 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6