cellCounts() function floating point error
0
0
Entering edit mode
J-E • 0
@82bf0bc5
Last seen 8 hours ago
United States

I tested the R function 'cellCounts' [Rsubread package] on a simplisitc 2 samples scRNAseq data from a Mouse experiment. The function inputs reads from FastQ files. The cellCOunts() function requested 24 threads on a 24-CPU node with 90Gb of RAM. The job does not appear to have run out of memory and there has not been an alert about quotas in the last 3 days, so it does not appear disk space was an issue. In addition, the 'cellCounts' function seemed to have finished: the output says that it could map and count 554,000,000 reads in 41.5 mins. However, upon closer inspection of the log file and R output, 'cellCounts' did not return any value as expected, and the log file actually indicates that the job seems to have aborted at the very end due to a floating point exception (core dumped). See R output log below (edited for conciseness and privacy).

Floating point exceptions are caused by things like division by zero or overflow (trying to store a number too big for the data type). I believe R handles division by 0 gracefully, but the underlying C/C++ code would not.

Hopefully the developers/maintainers or someone can suggest some troubleshooting steps.

# 'M29.index' is the reference sequence index, succesfully built
# using Mouse Genome Reference (Release M29 - GRCm39,
# 'gencode.vM29.transcripts.fa.gz') from the GENCODE database.

# 'sample.sheet' is the dataframe referring to fastQ files
# (properly formatted - see output below).

fc <- cellCounts(index=file.path("M29.index", fsep=.Platform$file.sep), sample=sample.sheet, input.mode="FASTQ", cell.barcode=NULL, nsubreads=15, minVotes=1, maxMismatches=10, minMappedLength=1, annot.inbuilt="mm39", nBestLocations=1, useMetaFeatures=TRUE, umi.cutoff=NULL, nthreads=24) # Results of running the following in an R session: sessionInfo( ) R version 4.0.2 (2020-06-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Red Hat Enterprise Linux Matrix products: default BLAS: /usr/local/gcc-6_3_0/lapack/3.7.0/lib/libblas.so.3.7.0 LAPACK: /usr/local/gcc-6_3_0/lapack/3.7.0/lib/liblapack.so.3.7.0 Random number generation: RNG: L'Ecuyer-CMRG Normal: Inversion Sample: Rejection locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Rsubread_2.10.0 edgeR_3.36.0 limma_3.50.3 loaded via a namespace (and not attached): [1] compiler_4.0.2 Matrix_1.2-18 Rcpp_1.0.8.3 grid_4.0.2 [5] locfit_1.5-9.4 lattice_0.20-41 # R output log: ------------------------------- My job will use the following nodes: ------------------------------- compt253 ------------------------------- My job will use the following number of tasks: ------------------------------- 24 ------------------------------- Temporarily Directory ($PFSDIR):
-------------------------------
/scratch/pbsjobs/job.16967554.hpc
-------------------------------

> library("limma")
> library("edgeR")
> fc <- cellCounts(index=file.path("M29.index", fsep=.Platform\$file.sep),
+                  sample=sample.sheet,
+                  input.mode="FASTQ",
+                  cell.barcode=NULL,
+                  maxMismatches=10,
+                  minMappedLength=1,
+                  annot.inbuilt="mm39",
+                  nBestLocations=1,
+                  useMetaFeatures=TRUE,
+                  umi.cutoff=NULL,

NCBI RefSeq annotation for mm39 (build 39) is used.
SampleName
1 GC-CC-1081_1
2 GC-CC-1081_1
3 GC-CC-1081_2
4 GC-CC-1081_2
BarcodeUMIFile
1   /home/.../GC-CC-1081_1_S2_L001_R1_001.fastq.gz
2   /home/.../GC-CC-1081_1_S2_L002_R1_001.fastq.gz
3 /home/.../GC-CC-1081_2_S3_L001_R1_001.fastq.gz
4 /home/.../GC-CC-1081_2_S3_L002_R1_001.fastq.gz
1   /home/.../GC-CC-1081_1_S2_L001_R2_001.fastq.gz
2   /home/.../GC-CC-1081_1_S2_L002_R2_001.fastq.gz
3 /home/.../GC-CC-1081_2_S3_L001_R2_001.fastq.gz
4 /home/.../GC-CC-1081_2_S3_L002_R2_001.fastq.gz

Found 3 known cell barcode sets.
Testing the cell barcodes in 3M-february-2018.txt.gz.
Cell barcode supporting rate : 97.5%.
Found cell-barcode list '3M-february-2018.txt.gz' for the input data: supported by 97.5% reads.
Param #0 = 'R_cellCounts'
Param #1 = '--inputMode'
Param #2 = 'FASTQ'
Param #3 = '--cellBarcodeFile'
Param #5 = '--dataset'
Param #6 = '/home/.../GC-CC-1081_1_S2_L001_R1_001.fastq.gz|Rsd:cCounts:1mFQ|.|Rsd:cCounts:1mFQ|/home/.../GC-CC-1081_1_S2_L001_R2_001.fastq.gz|Rsd:cCounts:mFQs|/home/.../GC-CC-1081_1_S2_L002_R1_001.fastq.gz|Rsd:cCounts:1mFQ|.|Rsd:cCounts:1mFQ|/home/.../GC-CC-1081_1_S2_L002_R2_001.fastq.gz|Rsd:cCounts:mFQs|/home/.../GC-CC-1081_2_S3_L001_R1_001.fastq.gz|Rsd:cCounts:1mFQ|.|Rsd:cCounts:1mFQ|/home/.../GC-CC-1081_2_S3_L001_R2_001.fastq.gz|Rsd:cCounts:mFQs|/home/.../GC-CC-1081_2_S3_L002_R1_001.fastq.gz|Rsd:cCounts:1mFQ|.|Rsd:cCounts:1mFQ|/home/.../GC-CC-1081_2_S3_L002_R2_001.fastq.gz'
Param #7 = '--sampleSheetFile'
Param #9 = '--index'
Param #10 = '/home/.../M29.index'
Param #11 = '--annotation'
Param #13 = '--geneIdColumn'
Param #14 = 'gene_id'
Param #15 = '--annotationType'
Param #16 = 'exon'
Param #18 = '24'
Param #19 = '--output'
Param #21 = '--maxMismatch'
Param #22 = '10'
Param #24 = '1'
Param #26 = '15'
Param #28 = '1'
Param #30 = '2'
Param #31 = '--minMappedLength'
Param #32 = '1'
Param #33 = '--umiCutoff'
Param #34 = '-999'
UMI_CUT=-999.00
INIT LOCKER 1 at 0x7f957c9b9610
|| Load the 1-th index block...                                               ||
|| The index block has been loaded.                                           ||
Mapping and counting: 0  ; 2.0 mins
Mapping and counting: 1000000  ; 2.6 mins
Mapping and counting: 2000000  ; 2.7 mins
Mapping and counting: 3000000  ; 2.7 mins
Mapping and counting: 4000000  ; 2.8 mins
Mapping and counting: 5000000  ; 2.9 mins
Mapping and counting: 6000000  ; 2.9 mins
Mapping and counting: 7000000  ; 3.0 mins
Mapping and counting: 8000000  ; 3.1 mins
Mapping and counting: 9000000  ; 3.1 mins
Mapping and counting: 10000000  ; 3.2 mins
Mapping and counting: 11000000  ; 3.3 mins
Mapping and counting: 12000000  ; 3.3 mins
Mapping and counting: 13000000  ; 3.4 mins
Mapping and counting: 14000000  ; 3.5 mins
Mapping and counting: 15000000  ; 3.5 mins
Mapping and counting: 16000000  ; 3.6 mins
Mapping and counting: 17000000  ; 3.7 mins
Mapping and counting: 18000000  ; 3.7 mins
Mapping and counting: 19000000  ; 3.8 mins
Mapping and counting: 20000000  ; 3.9 mins
Mapping and counting: 21000000  ; 3.9 mins
Mapping and counting: 22000000  ; 4.0 mins
Mapping and counting: 23000000  ; 4.1 mins
Mapping and counting: 24000000  ; 4.1 mins
Mapping and counting: 25000000  ; 4.2 mins
Mapping and counting: 26000000  ; 4.3 mins
Mapping and counting: 27000000  ; 4.3 mins
Mapping and counting: 28000000  ; 4.4 mins
Mapping and counting: 29000000  ; 4.5 mins

[...]

HICONF MAPPING (SIMPLE) = 0, LOWCONF MAPPING (ALL SUBREADS, NOT SIMPLE) = 0
Mapping and counting: 540000000  ; 40.3 mins
Mapping and counting: 541000000  ; 40.3 mins
Mapping and counting: 542000000  ; 40.4 mins
Mapping and counting: 543000000  ; 40.5 mins
Mapping and counting: 544000000  ; 40.6 mins
Mapping and counting: 545000000  ; 40.6 mins
Mapping and counting: 546000000  ; 40.7 mins
Mapping and counting: 547000000  ; 41.0 mins
Mapping and counting: 548000000  ; 41.1 mins
Mapping and counting: 549000000  ; 41.2 mins
Mapping and counting: 550000000  ; 41.2 mins
Mapping and counting: 551000000  ; 41.3 mins
Mapping and counting: 552000000  ; 41.4 mins
Mapping and counting: 553000000  ; 41.4 mins
Mapping and counting: 554000000  ; 41.5 mins
HICONF MAPPING (SIMPLE) = 0, LOWCONF MAPPING (ALL SUBREADS, NOT SIMPLE) = 0

After processing batches, 0 UMIs were removed in step2 of UMI merging.
/tmp/slurm/job16967554/slurm_script: line 40: 22098 Floating point exception(core dumped) Rscript APOE_hpc.r

0
Entering edit mode

Your R is almost 2 years old. You will need to update it to its latest version (4.2) to run it with latest Rsubread.

0
Entering edit mode

1) I initially did not think it was going to be an issue. I actually just tried under a newer version of R already installed on the cluster R-4.1.1. I experienced the same floating point error (running the same code, same machine). This apparently happens during feature counting because the "bam" files are generated. I'm honestly surprised that R.4.2.0 is absolutely required to run Rsubread 2.10.0 and that treatment of scRNAseq by cellCounts() did not work until R 4.2.0 was released approx 2 weeks ago.

2) Can align() and featureCounts() also be used for scRNAseq data?

1
Entering edit mode

You can use align() to map FASTQ-format scRNAseq reads, however this is not recommended. cellCounts implements a read mapping algorithm that is similar to align() but is more sensitive (we found that more sensitive mapping works better for 10x reads). So you should use cellCounts for 10x read mapping instead of align().

featureCounts() generates read counts, but it is unable to generate UMI counts. So you cant use it for scRNAseq data. However, featureCounts() is used in cellCounts to assign reads to genes and then cellCounts will create UMI counts based on the read assignment results and other information.

0
Entering edit mode

Wei, could you provide a link to a dummy synthetic example (e.g. as in the Rsubread vignette) or a small real example (as in the Rsubread UserGuide Case Studies) of scRNAseq that you know works with cellCounts() with Rsubread 2.10.0 under R.4.2.0. That way, I can test whether the error I encounter comes from my data or not. Thanks!

1
Entering edit mode

Yes we are currently working on generating a small example for cellCounts.

0
Entering edit mode

I tested the cellCounts() function again on the new example provided in version 2.10.3, using R 4.2.0 and Biocionductor 3.15. The example runs successfully until after the internal call to featureCounts. However, it fails at the end when it tries to re-collect the results using 'readMM()' from dependecy 'Matrix'. See error message :

I think you should prefix 'readMM' with 'Matrix::readMM()' in internal function .read.sparse.mat(). And no need to attach the library 'Matrix' since it's already in the namespace.

0
Entering edit mode

Thanks for the report but the error cannot be reproduced. We tested Rsubread 2.10.3 in R 4.2.0 and Bioconductor 3.15, and cellCounts didn't report any error messages. We always test our programs before every release of Rsubread in many platforms, and we haven't had similar errors before. Given that the readMM function had been in the Matrix package for many years (at least in 2015), we always assume that this function is available when cellCounts loaded the Matrix package.

Can you provide the sessionInfo() output from R 4.2.0 and Rsubread 2.10.3 after cellCounts reports the error? Also, can you try running this in your R environment:

> library(Matrix)


to see if you can have the sourcecode of the readMM function shown on screen?