I tested the R function 'cellCounts' [Rsubread package] on a simplisitc 2 samples scRNAseq data from a Mouse experiment. The function inputs reads from FastQ files. The cellCOunts() function requested 24 threads on a 24-CPU node with 90Gb of RAM. The job does not appear to have run out of memory and there has not been an alert about quotas in the last 3 days, so it does not appear disk space was an issue. In addition, the 'cellCounts' function seemed to have finished: the output says that it could map and count 554,000,000 reads in 41.5 mins. However, upon closer inspection of the log file and R output, 'cellCounts' did not return any value as expected, and the log file actually indicates that the job seems to have aborted at the very end due to a floating point exception (core dumped). See R output log below (edited for conciseness and privacy).
Floating point exceptions are caused by things like division by zero or overflow (trying to store a number too big for the data type). I believe R handles division by 0 gracefully, but the underlying C/C++ code would not.
Hopefully the developers/maintainers or someone can suggest some troubleshooting steps.
# 'M29.index' is the reference sequence index, succesfully built
# using Mouse Genome Reference (Release M29 - GRCm39,
# 'gencode.vM29.transcripts.fa.gz') from the GENCODE database.
# 'sample.sheet' is the dataframe referring to fastQ files
# (properly formatted - see output below).
fc <- cellCounts(index=file.path("M29.index", fsep=.Platform$file.sep),
sample=sample.sheet,
input.mode="FASTQ",
cell.barcode=NULL,
nsubreads=15,
minVotes=1,
maxMismatches=10,
minMappedLength=1,
annot.inbuilt="mm39",
nBestLocations=1,
useMetaFeatures=TRUE,
umi.cutoff=NULL,
nthreads=24)
# Results of running the following in an R session:
sessionInfo( )
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS: /usr/local/gcc-6_3_0/lapack/3.7.0/lib/libblas.so.3.7.0
LAPACK: /usr/local/gcc-6_3_0/lapack/3.7.0/lib/liblapack.so.3.7.0
Random number generation:
RNG: L'Ecuyer-CMRG
Normal: Inversion
Sample: Rejection
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rsubread_2.10.0 edgeR_3.36.0 limma_3.50.3
loaded via a namespace (and not attached):
[1] compiler_4.0.2 Matrix_1.2-18 Rcpp_1.0.8.3 grid_4.0.2
[5] locfit_1.5-9.4 lattice_0.20-41
# R output log:
-------------------------------
My job will use the following nodes:
-------------------------------
compt253
-------------------------------
My job will use the following number of tasks:
-------------------------------
24
-------------------------------
Temporarily Directory ($PFSDIR):
-------------------------------
/scratch/pbsjobs/job.16967554.hpc
-------------------------------
> library("limma")
> library("edgeR")
> library("Rsubread")
> fc <- cellCounts(index=file.path("M29.index", fsep=.Platform$file.sep),
+ sample=sample.sheet,
+ input.mode="FASTQ",
+ cell.barcode=NULL,
+ nsubreads=15,
+ minVotes=1,
+ maxMismatches=10,
+ minMappedLength=1,
+ annot.inbuilt="mm39",
+ nBestLocations=1,
+ useMetaFeatures=TRUE,
+ umi.cutoff=NULL,
+ nthreads=24)
NCBI RefSeq annotation for mm39 (build 39) is used.
SampleName
1 GC-CC-1081_1
2 GC-CC-1081_1
3 GC-CC-1081_2
4 GC-CC-1081_2
BarcodeUMIFile
1 /home/.../GC-CC-1081_1_S2_L001_R1_001.fastq.gz
2 /home/.../GC-CC-1081_1_S2_L002_R1_001.fastq.gz
3 /home/.../GC-CC-1081_2_S3_L001_R1_001.fastq.gz
4 /home/.../GC-CC-1081_2_S3_L002_R1_001.fastq.gz
ReadFile
1 /home/.../GC-CC-1081_1_S2_L001_R2_001.fastq.gz
2 /home/.../GC-CC-1081_1_S2_L002_R2_001.fastq.gz
3 /home/.../GC-CC-1081_2_S3_L001_R2_001.fastq.gz
4 /home/.../GC-CC-1081_2_S3_L002_R2_001.fastq.gz
Found 3 known cell barcode sets.
Testing the cell barcodes in 3M-february-2018.txt.gz.
Loaded 6794880 cell barcodes from /home/.../.Rsubread/cellCounts/3M-february-2018.txt.gz
Cell barcode supporting rate : 97.5%.
Found cell-barcode list '3M-february-2018.txt.gz' for the input data: supported by 97.5% reads.
Param #0 = 'R_cellCounts'
Param #1 = '--inputMode'
Param #2 = 'FASTQ'
Param #3 = '--cellBarcodeFile'
Param #4 = '/home/.../.Rsubread/cellCounts/3M-february-2018.txt.gz'
Param #5 = '--dataset'
Param #6 = '/home/.../GC-CC-1081_1_S2_L001_R1_001.fastq.gz|Rsd:cCounts:1mFQ|.|Rsd:cCounts:1mFQ|/home/.../GC-CC-1081_1_S2_L001_R2_001.fastq.gz|Rsd:cCounts:mFQs|/home/.../GC-CC-1081_1_S2_L002_R1_001.fastq.gz|Rsd:cCounts:1mFQ|.|Rsd:cCounts:1mFQ|/home/.../GC-CC-1081_1_S2_L002_R2_001.fastq.gz|Rsd:cCounts:mFQs|/home/.../GC-CC-1081_2_S3_L001_R1_001.fastq.gz|Rsd:cCounts:1mFQ|.|Rsd:cCounts:1mFQ|/home/.../GC-CC-1081_2_S3_L001_R2_001.fastq.gz|Rsd:cCounts:mFQs|/home/.../GC-CC-1081_2_S3_L002_R1_001.fastq.gz|Rsd:cCounts:1mFQ|.|Rsd:cCounts:1mFQ|/home/.../GC-CC-1081_2_S3_L002_R2_001.fastq.gz'
Param #7 = '--sampleSheetFile'
Param #8 = './.Rsubread_cCounts_Tmp_for_Pid_22098_Rproc.samplesheet'
Param #9 = '--index'
Param #10 = '/home/.../M29.index'
Param #11 = '--annotation'
Param #12 = '/home/.../.usr/local/R_packages/R-4.0.2/Rsubread/annot/mm39_RefSeq_exon.txt.gz'
Param #13 = '--geneIdColumn'
Param #14 = 'gene_id'
Param #15 = '--annotationType'
Param #16 = 'exon'
Param #17 = '--threads'
Param #18 = '24'
Param #19 = '--output'
Param #20 = './.Rsubread_cCounts_Tmp_for_Pid_22098_Rproc'
Param #21 = '--maxMismatch'
Param #22 = '10'
Param #23 = '--minVotesPerRead'
Param #24 = '1'
Param #25 = '--subreadsPerRead'
Param #26 = '15'
Param #27 = '--reportedAlignmentsPerRead'
Param #28 = '1'
Param #29 = '--maxDiffToTopVotes'
Param #30 = '2'
Param #31 = '--minMappedLength'
Param #32 = '1'
Param #33 = '--umiCutoff'
Param #34 = '-999'
Param #35 = '--reportMultiMappingReads'
UMI_CUT=-999.00
INIT LOCKER 1 at 0x7f957c9b9610
|| Load the 1-th index block... ||
|| The index block has been loaded. ||
Mapping and counting: 0 ; 2.0 mins
Mapping and counting: 1000000 ; 2.6 mins
Mapping and counting: 2000000 ; 2.7 mins
Mapping and counting: 3000000 ; 2.7 mins
Mapping and counting: 4000000 ; 2.8 mins
Mapping and counting: 5000000 ; 2.9 mins
Mapping and counting: 6000000 ; 2.9 mins
Mapping and counting: 7000000 ; 3.0 mins
Mapping and counting: 8000000 ; 3.1 mins
Mapping and counting: 9000000 ; 3.1 mins
Mapping and counting: 10000000 ; 3.2 mins
Mapping and counting: 11000000 ; 3.3 mins
Mapping and counting: 12000000 ; 3.3 mins
Mapping and counting: 13000000 ; 3.4 mins
Mapping and counting: 14000000 ; 3.5 mins
Mapping and counting: 15000000 ; 3.5 mins
Mapping and counting: 16000000 ; 3.6 mins
Mapping and counting: 17000000 ; 3.7 mins
Mapping and counting: 18000000 ; 3.7 mins
Mapping and counting: 19000000 ; 3.8 mins
Mapping and counting: 20000000 ; 3.9 mins
Mapping and counting: 21000000 ; 3.9 mins
Mapping and counting: 22000000 ; 4.0 mins
Mapping and counting: 23000000 ; 4.1 mins
Mapping and counting: 24000000 ; 4.1 mins
Mapping and counting: 25000000 ; 4.2 mins
Mapping and counting: 26000000 ; 4.3 mins
Mapping and counting: 27000000 ; 4.3 mins
Mapping and counting: 28000000 ; 4.4 mins
Mapping and counting: 29000000 ; 4.5 mins
[...]
HICONF MAPPING (SIMPLE) = 0, LOWCONF MAPPING (ALL SUBREADS, NOT SIMPLE) = 0
Mapping and counting: 540000000 ; 40.3 mins
Mapping and counting: 541000000 ; 40.3 mins
Mapping and counting: 542000000 ; 40.4 mins
Mapping and counting: 543000000 ; 40.5 mins
Mapping and counting: 544000000 ; 40.6 mins
Mapping and counting: 545000000 ; 40.6 mins
Mapping and counting: 546000000 ; 40.7 mins
Mapping and counting: 547000000 ; 41.0 mins
Mapping and counting: 548000000 ; 41.1 mins
Mapping and counting: 549000000 ; 41.2 mins
Mapping and counting: 550000000 ; 41.2 mins
Mapping and counting: 551000000 ; 41.3 mins
Mapping and counting: 552000000 ; 41.4 mins
Mapping and counting: 553000000 ; 41.4 mins
Mapping and counting: 554000000 ; 41.5 mins
HICONF MAPPING (SIMPLE) = 0, LOWCONF MAPPING (ALL SUBREADS, NOT SIMPLE) = 0
After processing batches, 0 UMIs were removed in step2 of UMI merging.
/tmp/slurm/job16967554/slurm_script: line 40: 22098 Floating point exception(core dumped) Rscript APOE_hpc.r
Your R is almost 2 years old. You will need to update it to its latest version (4.2) to run it with latest Rsubread.
1) I initially did not think it was going to be an issue. I actually just tried under a newer version of R already installed on the cluster R-4.1.1. I experienced the same floating point error (running the same code, same machine). This apparently happens during feature counting because the "bam" files are generated. I'm honestly surprised that R.4.2.0 is absolutely required to run Rsubread 2.10.0 and that treatment of scRNAseq by cellCounts() did not work until R 4.2.0 was released approx 2 weeks ago.
2) Can align() and featureCounts() also be used for scRNAseq data?
You can use align() to map FASTQ-format scRNAseq reads, however this is not recommended. cellCounts implements a read mapping algorithm that is similar to align() but is more sensitive (we found that more sensitive mapping works better for 10x reads). So you should use cellCounts for 10x read mapping instead of align().
featureCounts() generates read counts, but it is unable to generate UMI counts. So you cant use it for scRNAseq data. However, featureCounts() is used in cellCounts to assign reads to genes and then cellCounts will create UMI counts based on the read assignment results and other information.
Wei, could you provide a link to a dummy synthetic example (e.g. as in the Rsubread vignette) or a small real example (as in the Rsubread UserGuide Case Studies) of scRNAseq that you know works with cellCounts() with Rsubread 2.10.0 under R.4.2.0. That way, I can test whether the error I encounter comes from my data or not. Thanks!
Yes we are currently working on generating a small example for cellCounts.
I tested the cellCounts() function again on the new example provided in version 2.10.3, using R 4.2.0 and Biocionductor 3.15. The example runs successfully until after the internal call to featureCounts. However, it fails at the end when it tries to re-collect the results using 'readMM()' from dependecy 'Matrix'. See error message :
Error in readMM(paste0(fn, ".spmtx")) : could not find function "readMM"
I think you should prefix 'readMM' with 'Matrix::readMM()' in internal function .read.sparse.mat(). And no need to attach the library 'Matrix' since it's already in the namespace.
Thanks for the report but the error cannot be reproduced. We tested
Rsubread
2.10.3 inR
4.2.0 andBioconductor
3.15, andcellCounts
didn't report any error messages. We always test our programs before every release ofRsubread
in many platforms, and we haven't had similar errors before. Given that thereadMM
function had been in theMatrix
package for many years (at least in 2015), we always assume that this function is available whencellCounts
loaded theMatrix
package.Can you provide the
sessionInfo()
output fromR
4.2.0 andRsubread
2.10.3 aftercellCounts
reports the error? Also, can you try running this in yourR
environment:to see if you can have the sourcecode of the
readMM
function shown on screen?I tested again the cellCounts() function on the new scNAseq dataset and example provided in Rsubread 2.10.4, using R 4.2.0 and Biocionductor 3.15.
1) Even though the namespace correctly includes the line 'importFrom("Matrix", "readMM")', Matrix::readMM would import only if the package 'Matrix' is manually attached!
2) cellCounts() successfully runs the Rsubread example up to the building of the index and the read mapping part, but fails catastrophically in the featureCounts part. The R session crashes! In version 2.10.0, it was due to a floating point error. In this version, it's due to a memory allocation overflow. I reproduced the problem in a Windows and Linux machine, each with Rsubread 2.10.4, R 4.2.0 and Biocionductor 3.15. See attached sesionInfo() output and crash screenshot below. So, these tests conclusively prove that the problem was neither due to the data nor the R version that I was using as previously suspected by the developper.
Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale: 1 LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: 1 parallel stats graphics grDevices utils datasets methods base
other attached packages: 1 BiocParallel_1.30.3 Matrix_1.4-1 Rsubread_2.10.4
loaded via a namespace (and not attached): 1 Rcpp_1.0.8.3 codetools_0.2-18 lattice_0.20-45 grid_4.2.0 R6_2.5.1
[6] lifecycle_1.0.1 scales_1.2.0 rlang_1.0.3 cli_3.3.0 rstudioapi_0.13
[11] tools_4.2.0 munsell_0.5.0 compiler_4.2.0 colorspace_2.0-3 BiocManager_1.30.18
SampleName BarcodeUMIFile ReadFile 1 Example Data/Real2/reads_R1.fastq.gz Data/Real2/reads_R2.fastq.gz readfile1 (b4 check): .../reads_R1.fastq.gz|Rsd:cCounts:1mFQ|.|Rsd:cCounts:1mFQ|.../reads_R2.fastq.gz readfile1 (after check): .../reads_R1.fastq.gz|Rsd:cCounts:1mFQ|.|Rsd:cCounts:1mFQ|.../reads_R2.fastq.gz
//================================= setting ==================================\ || || || Function : Read alignment (RNA-Seq) || || Input file : 1 samples from scRNA-seq || || Output file : .Rsubread_cCounts_Tmp_for_Pid_623021_Rproc (BAM) || || Index name : chr1.index || || || || ------------------------------------ || || || || Threads : 18 || || Phred offset : 33 || || Min votes : 3 / 10 || || Max mismatches : 3 || || Max indel length : 5 || || Report multi-mapping reads : yes || || Max alignments per multi-mapping read : 1 || || Annotations : inbuilt (hg38) || || || \============================================================================//
//================ Running (30-Jun-2022 13:19:47, pid=623021) ================\ || || ...
|| || || Completed successfully. || || || \==================================== ====================================//
//================================ Summary =================================\ || || || Total reads : 403,390 || || Mapped : 401,740 (99.6%) || || Uniquely mapped : 382,378 || || Multi-mapping : 19,362 || || || || Unmapped : 1,650 || || || || Indels : 11 || || || || Running time : 0.6 minutes || || || \============================================================================//
NCBI RefSeq annotation for hg38 (build 38.2) is used.
//========================== featureCounts setting ===========================\ || || || Input files : 1 BAM file || || || || .Rsubread_cCounts_Tmp_for_Pid_623021_Rproc || || || || Paired-end : no || || Count read pairs : no || || Annotation : inbuilt (hg38) || || Dir for temp files : . || || || || scRNA count table : <input_file>.scRNA.table || || scRNA sample sheet : .Rsubread_cCounts_Tmp_for_Pid_623021_Rproc.s ... || || scRNA barcode list : Barcodes || || Threads : 18 || || Level : meta-feature level || || Multimapping reads : counted || || Multi-overlapping reads : not counted || || Min overlapping bases : 1 || || || \============================================================================//
//================================= Running ==================================\ || || || Load annotation file hg38_RefSeq_exon.txt ... || || Features : 261752 || || Meta-features : 28395 || || Chromosomes/contigs : 55 || || || || Load scRNA-related files... || || scRNA samples : 1 || || scRNA cell barcodes : 0 || || || || Process BAM file .Rsubread_cCounts_Tmp_for_Pid_623021_Rproc... || || Strand specific : stranded || || scRNA quality control in first 20,000 reads: || || 100.0 pct reads have valid sample indices. || || 0.0 pct reads have valid cell barcodes. || || ||