Hey,
I am trying to read a paired-end BAM file in chunks using BiocParallel::bpiterate, but the iteration never stops. Here is what I do:
bamfile <- Rsamtools::BamFile(file = "myBamFile", index = "myBamFile",
                              asMates = TRUE, yieldSize = 1000000)
snow <- BiocParallel::SnowParam(workers = 4, type = "SOCK")
BiocParallel::register(snow, default = TRUE)
open(bamfile)
Done <<- FALSE

bamIterator <- function() {
    chunk <- NULL
    if (!Done) {
        FlagsParam <- Rsamtools::scanBamFlag(isPaired = TRUE,
                                             isUnmappedQuery = FALSE,
                                             hasUnmappedMate = FALSE,
                                             isSecondaryAlignment = FALSE,
                                             isNotPassingQualityControls = FALSE,
                                             isDuplicate = FALSE)
        ReadParam <- Rsamtools::ScanBamParam(flag = FlagsParam)
        chunk <- GenomicAlignments::readGAlignmentPairs(file = bamfile,
                                                        use.names = FALSE,
                                                        with.which_label = FALSE,
                                                        strandMode = 1,
                                                        param = ReadParam)
        if (length(chunk) == 0L) {  # no more records left to yield
            close(bamfile)          # close the connection
            Done <<- TRUE
            chunk <- NULL           # returning NULL should end the iteration
        }
    }
    return(chunk)
}

PETdata <- suppressMessages(
    BiocParallel::bpiterate(ITER = bamIterator, FUN = function(x, ...) x)
)
If I run the above code, the iteration never ends, although my file is not big at all and should be finished quickly. If I replace FUN = function(x, ...) x with FUN = c, it no longer runs in parallel for some reason, but it still does not stop. If I instead replace length(chunk) == 0L with length(chunk) < 1000000, it does stop, but it does not read the whole BAM file (presumably because the last, partial chunk is the one that triggers the test and gets discarded). I have also tried running the example from the bpiterate help page, and it runs forever too.
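For reference, the man-page pattern I mean boils down to something like the following (my own toy sketch, iterating over a numeric vector instead of a BAM file; makeIter and its arguments are just names I made up for illustration):

library(BiocParallel)
snow <- SnowParam(workers = 2, type = "SOCK")

makeIter <- function(n, chunkSize) {
    i <- 0L
    function() {
        if (i >= n) return(NULL)  # NULL signals the end of iteration
        chunk <- seq(i + 1L, min(i + chunkSize, n))
        i <<- min(i + chunkSize, n)
        chunk
    }
}

res <- bpiterate(ITER = makeIter(10L, 3L), FUN = sum, BPPARAM = snow)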
Thanks.
Update:
The function works perfectly if it is NOT in parallel mode. If I set up the parallel mode as above, it loops forever. I can't figure out why that happens; I don't have any such problems with other BiocParallel functions.
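One thing I wondered about is the global Done flag: the bpiterate examples keep the iterator state in a closure instead. For my code that would look roughly like this (an untested sketch; bamIteratorFactory is my name for it), although as far as I understand ITER is called on the manager, so the global flag should be visible there anyway:

bamIteratorFactory <- function(bamfile, param) {
    done <- FALSE  # state lives in the closure, not in the global environment
    function() {
        if (done)
            return(NULL)
        chunk <- GenomicAlignments::readGAlignmentPairs(
            file = bamfile, use.names = FALSE, with.which_label = FALSE,
            strandMode = 1, param = param)
        if (length(chunk) == 0L) {
            close(bamfile)
            done <<- TRUE  # <<- updates the closure's copy of `done`
            return(NULL)   # NULL tells bpiterate to stop
        }
        chunk
    }
}

PETdata <- BiocParallel::bpiterate(ITER = bamIteratorFactory(bamfile, ReadParam),
                                   FUN = function(x, ...) x)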
Another Update:
I changed the code to use the GenomicFiles::reduceByYield function. The problem is the same: if I run it serially it works fine, but if I register a parallel backend and run it in parallel, it never stops. This does not happen for every dataset, though. I have one paired-end BAM file for which it does run in parallel, and another one, a little bigger but still not large, for which the function gets stuck. I can load that whole BAM into R without using the yieldSize option, but I have bigger BAM files as well, so I need the parallelism.
I have also tried to see where the function gets stuck in parallel mode by writing a log file that records the total number of PETs loaded. It seems to stall once all the workers are busy: from that point the log file is no longer updated (whereas it keeps being updated for the other BAM file, for which parallel mode is not a problem). The code is the following:
bamfile <- Rsamtools::BamFile(file = "myBAMfile", asMates = TRUE,
                              yieldSize = 500000)
snow <- BiocParallel::SnowParam(workers = 4, type = "SOCK")
BiocParallel::register(snow, default = TRUE)

yield <- function(x) {
    FlagsParam <- Rsamtools::scanBamFlag(isPaired = TRUE,
                                         isUnmappedQuery = FALSE,
                                         hasUnmappedMate = FALSE,
                                         isSecondaryAlignment = FALSE,
                                         isNotPassingQualityControls = FALSE,
                                         isDuplicate = FALSE)
    ReadParam <- Rsamtools::ScanBamParam(flag = FlagsParam)
    chunk <- GenomicAlignments::readGAlignmentPairs(file = x,
                                                    use.names = FALSE,
                                                    with.which_label = FALSE,
                                                    strandMode = 1,
                                                    param = ReadParam)
    if (length(chunk) == 0) chunk <- NULL  # NULL signals the end of input
    return(chunk)
}

PETdata <- GenomicFiles::reduceByYield(X = bamfile, YIELD = yield,
                                       REDUCE = c, MAP = identity,
                                       parallel = TRUE)
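For comparison, the serial version that does finish is the same call with parallel = FALSE (same yield function as above):

PETdata <- GenomicFiles::reduceByYield(X = bamfile, YIELD = yield,
                                       REDUCE = c, MAP = identity,
                                       parallel = FALSE)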
And another update:
So the one BAM file that worked in parallel works only because it is small: while the other workers are still starting up, the first "default" worker has enough time to do all the work itself. The whole problem appears when the remaining workers actually have to do their share; then it never finishes. However, I don't have any such problems with, for example, the bplapply function in BiocParallel.
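For instance, a trivial bplapply call on the same registered backend (a toy example of what I mean) works fine for me:

BiocParallel::bplapply(1:8, function(i) Sys.getpid(), BPPARAM = snow)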