I'm trying to use processAmplicons to generate sgRNA counts for a CRISPR screen. I have ~200 million reads per FASTQ file, with 4-5 barcodes per file and 180,000 guides. I'm running this on my school's cluster, which has ~30 GB of RAM per node. When I submit one FASTQ for analysis, I request 30 GB of memory and an entire node. After a full day, the output still says " -- Processing 10 million reads", yet when I check memory usage, the node is using just under 1 GB of RAM.
I'm not sure why the memory usage is so low; I would expect the script to need all 30 GB given the size of the job. Is there some option in Bioconductor or edgeR that may be throttling my memory usage? Any tips to speed this up? I have already tried "lazy parallelization", but because of this memory issue it doesn't run any faster.
This function breaks the data into smaller chunks that it processes serially, so you shouldn't need 30 GB of memory. It also shouldn't take that long to process the first 10 million reads. Can you provide your example code and
sessionInfo()
and a small sample of the sequences from your FASTQ file (perhaps a few thousand reads) so that we can test further?
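For example, since each FASTQ record is four lines, you can take the first couple of thousand reads with `head` (file names below are placeholders; the toy file is only there to make the snippet self-contained):

```shell
# Tiny toy FASTQ purely for illustration -- substitute your real file.
printf '@read%d\nACGTACGT\n+\nIIIIIIII\n' 1 2 3 > reads.fastq

# Each FASTQ record is 4 lines, so the first 2000 reads are the first 8000 lines.
head -n 8000 reads.fastq > sample.fastq

# If your file is gzip-compressed:
# zcat reads.fastq.gz | head -n 8000 > sample.fastq
```

Taking reads from the top of the file (rather than a random subsample) keeps complete 4-line records and is usually enough to check that the barcode and guide coordinates are being matched correctly.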