Question

Rsubread featureCounts segmentation fault with lots of read groups

0

Entering edit mode

mruffalo • 0

@abae30b8

Last seen 2.5 years ago

United States

Hello-

I'm trying to use Rsubread's featureCounts function in a single-cell ATAC-seq data set, to count the number of reads in each cell that map to each peak. My input is a set of peaks called by MACS2 and a BAM file that initially contained cell barcodes in the CB tag of each read -- then I found that featureCounts has no mechanism to use anything but the read group to separate reads, so I copied the barcodes to each read's RG tag and added @RG header lines for each.

featureCounts is segfaulting when I try to read this file with byReadGroup = TRUE:


        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 2.8.0

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                                                                            ||
||                           aligned_hg38_rg.bam                              ||
||                                                                            ||
||              Paired-end : yes                                              ||
||        Count read pairs : yes                                              ||
||              Annotation : R data.frame                                     ||
||      Dir for temp files : .                                                ||
||                 Threads : 24                                               ||
||                   Level : meta-feature level                               ||
||      Multimapping reads : counted                                          ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||          Read reduction : to 5' end                                        ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid8 ...             ||
||    Features : 84305                                                        ||
||    Meta-features : 84305                                                   ||
||    Chromosomes/contigs : 280                                               ||
||                                                                            ||
|| Process BAM file aligned_hg38_rg.bam...                                    ||

 *** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
 1: Rsubread::featureCounts(opt$bam_file, annot.ext = regions_frame,     read2pos = 5, isPairedEnd = TRUE, byReadGroup = TRUE, nthreads = opt$threads)

 Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:

Selecting any option then results in xmalloc: out of virtual memory being printed on the same line as the Selection: prompt.

The BAM file has quite a few read groups:

$ samtools view -H aligned_hg38_rg.bam | grep '^@RG' | wc -l
2278644

This would be a large matrix, but I was expecting a lot of CPU and memory usage, not a segmentation fault. I'm running on a machine with 3TB memory, and segfaults happen after a little more than 50GB allocated. I also converted to SAM and tried to read that, with the same result.

As suggested:

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_2.8.0

loaded via a namespace (and not attached):
[1] compiler_4.1.1  Matrix_1.3-4    grid_4.1.1      lattice_0.20-44

Is there anything I can do to work around this failure? (I'm actually using another package (cisTopic) which uses Rsubread::featureCounts, so it isn't trivial to switch to htseq-count or something like that.)

Thank you!

Rsubread • 1.7k views

ADD COMMENT • link updated 2.5 years ago by Yang Liao ▴ 380 • written 2.5 years ago by mruffalo • 0

score 0 · Answer 1 · 2021-11-03

featureCounts() normally requires relatively little memory. Counting reads by MACS peaks should be a quick and fairly lightweight operation requiring nothing remotely like 50GB of RAM. So the unusual structure of your BAM file must be somehow causing the problem.

Note that featureCounts is not designed to demultiplex single cell data. It does only what it is documented to do.

score 0 · Answer 2 · 2021-11-04

Yes, I can reproduce this problem. When I created a BAM file that has > 1 million read groups, featureCounts reported the same error.

It was because featureCounts creates a read-count table for each read group when it runs on the "byReadGroup" mode. This uses a huge amount of virtual memory space when there are millions of read groups. In my case, it used 2TB of virtual memory becore crashing. It indeed needed this much of memory to store the per-gene read counts for each of the 1 million read groups.

Even if the computer has a sufficient amount of memory to let featureCounts run on the byReadGroup mode with millions of read groups in the BAM input, the result data object will be hundreds of gigabytes, or even more than one tarabyte. It is very hard to do further analysis using this result object.

For your analysis, I assume that you have around 2.3 millions of distinct cell barcodes and around 84 thousands of ATAC-seq peaks. The analysis was to assign these cell-barcodes to the peaks. For this purpose, I suggest not to use the read group mode in featureCounts, but specify reportReads='BAM' to featureCounts. This option will create a new BAM file (rather quickly and not much larger than the input BAM file). This BAM file will contain the original "CB" tag in each alignment, accompanied by the peak(s) where this alignment is assigned to. You may then use a string-friendly scripting language (e.g. Python or AWK) to process this new BAM file and retrieve the relationship between the cell barcodes and peaks from it.