Hello,
Does anyone know if there's a way for featureCounts to analyse BAM files in parallel? I have been running featureCounts on my BAM files and it's taking very long (currently on hour 22 and it hasn't gone through half the files yet!). I tried increasing the threads and CPUs, but this made no improvement over previous attempts. Any help would be greatly appreciated.
Note: I'm using v2.0.1. I tried updating to Subread v2.0.3, but conda doesn't have the latest version.
Since you mentioned version 2.0.1 of Subread, it seems you are using the command-line version of Subread rather than the Rsubread package. Subread v2.0.3 differs very little from v2.0.1; only a few bugs were fixed and some parameters for paired-end read counting were changed. These changes don't affect its efficiency.
FeatureCounts is generally very efficient; 22 hours of running should be enough to process tens of terabytes of BAM files on a high-performance computer, or at least terabytes of BAM files on a laptop. You can use multiple threads to make it faster (e.g., with a "-T 10" option, assuming you want to use 10 CPU cores). If it is still very slow, please give more details (e.g., the command line, the operating system, the hardware), so we can investigate the reason for the slow running speed.
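As a sketch, a multi-threaded run could look like the following (the file names are placeholders, not from this thread; `-T`, `-a` and `-o` are standard featureCounts options):

```shell
# Build the command first so it can be inspected before a long run.
# File names below are placeholders, not from the thread.
THREADS=10
CMD="featureCounts -T $THREADS -a genes.gtf -o counts.txt sample1.bam sample2.bam"
echo "$CMD"
# $CMD   # uncomment once the paths point at real files
```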
I have access to an HPC, and I submit jobs on a Linux operating system. It took 33 hours to process a BAM file that is 300 MB. I have a large GTF file and I don't know if that is the cause; however, a file that was 150 MB took 2 hours to process.
Command line
featureCounts -T 8 -s 1 -a <gtf> -g gene_id -M -R BAM --fracOverlap 0.8 -o counts <bam>
Thanks for the details. I used the same settings on a Linux server with many (>8) CPU cores. It took featureCounts 18 seconds to process a 2.3 GB BAM file and generate the per-alignment result BAM file. I used the Ensembl human annotation.
It is hard to say why featureCounts ran so slowly on the HPC. An HPC environment with a task-management system usually uses a network file system, and many configurations can make disk access extremely slow (e.g., if the NFS is configured to keep synchronisation between servers).
Because featureCounts is extremely efficient and uses very little memory in a typical setting, you can try running the task on a local computer (say, a laptop). The Subread package has Windows, macOS and Linux binary builds for download at https://sourceforge.net/projects/subread/files/subread-2.0.3/ .
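A sketch of fetching the Linux build from the SourceForge page above; the exact archive name follows Subread's usual naming pattern and is an assumption, so verify it on the download page first:

```shell
# Archive name assumed from Subread's usual naming convention; check the
# SourceForge page before running the printed command.
URL="https://sourceforge.net/projects/subread/files/subread-2.0.3/subread-2.0.3-Linux-x86_64.tar.gz"
echo "wget $URL && tar xzf $(basename "$URL")"
# After unpacking, featureCounts sits under subread-2.0.3-Linux-x86_64/bin/
```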
If you would like to use R, the Rsubread package also contains the featureCounts function. It behaves the same as the CLI version of featureCounts and is easy to install: https://bioconductor.org/packages/release/bioc/html/Rsubread.html
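For example, a minimal Rsubread call could be driven from the shell like this (file names are placeholders; it requires R with the Rsubread package installed, so the command is only printed here):

```shell
# Hypothetical one-liner: run featureCounts via Rsubread.
# "sample.bam" and "genes.gtf" are placeholder paths.
RCODE='library(Rsubread); fc <- featureCounts("sample.bam", annot.ext="genes.gtf", isGTFAnnotationFile=TRUE, nthreads=8); write.csv(fc$counts, "counts.csv")'
echo "Rscript -e '$RCODE'"
# Rscript -e "$RCODE"   # uncomment when R and Rsubread are available
```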
OK, I'll try that. Thank you for your help!