Question: Problem with BiocParallel::bpiterate
20 months ago by
Norway/Trondheim/NTNU
ioannis.vardaxis30 wrote:

Hey,

I am trying to read a paired-end BAM file in chunks using BiocParallel::bpiterate, but the iterations never stop. Here is what I do:

bamfile <- Rsamtools::BamFile(file = "myBamFile", index = "myBamFile", asMates = TRUE, yieldSize = 1000000)

snow <- BiocParallel::SnowParam(workers = 4, type = "SOCK")
BiocParallel::register(snow, default=TRUE)

open(bamfile)
Done <- FALSE
bamIterator <- function() {
    chunk <- NULL
    if (!Done) {
        FlagsParam <- Rsamtools::scanBamFlag(isPaired = TRUE,
                                             isUnmappedQuery = FALSE,
                                             hasUnmappedMate = FALSE,
                                             isSecondaryAlignment = FALSE,
                                             isNotPassingQualityControls = FALSE,
                                             isDuplicate = FALSE)
        # The actual read call was truncated in the original post; a
        # GenomicAlignments::readGAlignmentPairs() call matching the
        # surviving arguments is assumed here:
        chunk <- GenomicAlignments::readGAlignmentPairs(bamfile,
                     use.names = FALSE, with.which_label = FALSE,
                     param = Rsamtools::ScanBamParam(flag = FlagsParam))
        if (length(chunk) == 0L) {
            close(bamfile)  # close connection
            Done <<- TRUE
            chunk <- NULL
        }
    }
    return(chunk)
}

PETdata <- suppressMessages(BiocParallel::bpiterate(ITER = bamIterator, FUN = function(x, ...) x))

If I run the above code, the iterations never end. My file is not big at all, so they should stop quite fast. If I replace FUN = function(x, ...) x with FUN = c, it no longer runs in parallel for some reason, but it still does not stop. If I replace == 0L with < 1000000, it stops, but it does not read the whole BAM file. I have also tried the example on the bpiterate help page, and it runs forever too.
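For reference, bpiterate() keeps calling ITER() until it returns NULL, so the iterator must eventually reach a NULL return on its own. Below is a minimal base-R sketch of that contract (no Bioconductor needed; the data and chunk size are made up for illustration), driven serially the way bpiterate() would drive it:

```r
# Sketch of the ITER contract bpiterate() relies on: each call returns
# the next chunk, and NULL once the data is exhausted.
makeIterator <- function(data, chunkSize) {
    pos <- 0L
    function() {
        if (pos >= length(data)) return(NULL)  # NULL signals "no more chunks"
        chunk <- data[(pos + 1L):min(pos + chunkSize, length(data))]
        pos <<- pos + chunkSize  # advance the cursor kept in the closure
        chunk
    }
}

ITER <- makeIterator(1:10, chunkSize = 4L)

# Drive the iterator the way bpiterate() would (serially here):
chunks <- list()
repeat {
    chunk <- ITER()
    if (is.null(chunk)) break
    chunks[[length(chunks) + 1L]] <- chunk
}
# chunks now holds list(1:4, 5:8, 9:10)
```

The state lives in the closure rather than in a global, which avoids relying on `<<-` into the global environment.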

Thanks.

Update:

The function works perfectly if it is NOT in parallel mode. If I set parallel mode as above, it loops for eternity. I can't figure out why that happens. I don't have any such problems with other BiocParallel functions.

Another Update:

I changed the code and used the GenomicFiles::reduceByYield function. The problem is the same: if I run the function linearly, it runs fine. If I initiate a parallel backend and run it in parallel, it never stops. This does not happen for every data set I have, though. I have one paired-end BAM file for which it runs in parallel, and another one that is a little bigger (but not big) for which the function gets stuck. The thing is that I can load this BAM into R without using the yield option, but I have bigger BAM files and therefore need the parallelism.

I have also tried to see where the function gets stuck in parallel by creating a log file and printing the total number of PETs loaded. It seems to get stuck once all the workers are busy: the log file is then no longer updated (while it should be, as it is for the other BAM file for which parallel is not a problem). The code is the following:

bamfile <- Rsamtools::BamFile(file = "myBAMfile", asMates = TRUE, yieldSize = 500000)

snow <- BiocParallel::SnowParam(workers=4,type="SOCK")
BiocParallel::register(snow,default=TRUE)

yield <- function(x) {
    FlagsParam <- Rsamtools::scanBamFlag(isPaired = TRUE,
                                         isUnmappedQuery = FALSE,
                                         hasUnmappedMate = FALSE,
                                         isSecondaryAlignment = FALSE,
                                         isNotPassingQualityControls = FALSE,
                                         isDuplicate = FALSE)
    # The read call was truncated in the original post; a
    # GenomicAlignments::readGAlignmentPairs() call matching the
    # surviving arguments is assumed here:
    chunk <- GenomicAlignments::readGAlignmentPairs(x,
                 use.names = FALSE, with.which_label = FALSE,
                 param = Rsamtools::ScanBamParam(flag = FlagsParam))
    if (length(chunk) == 0) chunk <- NULL
    return(chunk)
}

PETdata <- GenomicFiles::reduceByYield(X = bamfile, YIELD = yield, REDUCE = c, MAP = identity, parallel = TRUE)
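As a sanity check when the parallel run hangs, the serial behaviour of reduceByYield() can be imitated with a plain loop: call YIELD until it returns NULL, MAP each chunk, and fold the results with REDUCE. A minimal base-R sketch with a toy yielder standing in for the BamFile-based yield() (names and chunk sizes are made up for illustration):

```r
# Serial stand-in for reduceByYield(X, YIELD, MAP = identity, REDUCE = c);
# the X argument is folded into the yielder closure for simplicity.
reduceByYieldSerial <- function(YIELD, MAP = identity, REDUCE = c) {
    result <- NULL
    repeat {
        chunk <- YIELD()
        if (is.null(chunk)) break      # yielder exhausted: stop iterating
        mapped <- MAP(chunk)
        result <- if (is.null(result)) mapped else REDUCE(result, mapped)
    }
    result
}

# Toy yielder standing in for the BAM reader: yields 1:10 in chunks of 4.
makeToyYield <- function() {
    pos <- 0L
    function() {
        if (pos >= 10L) return(NULL)
        chunk <- (pos + 1L):min(pos + 4L, 10L)
        pos <<- pos + 4L
        chunk
    }
}

res <- reduceByYieldSerial(makeToyYield(), MAP = identity, REDUCE = c)
# res is 1:10, i.e. all chunks concatenated by REDUCE = c
```

If this serial loop finishes on the problematic BAM file (with the real yield() plugged in), that points the finger at the parallel backend rather than at the yielder itself.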

Another update:

So the one BAM file that worked in parallel only works because it is small: while the other workers are still loading, the one "default" worker has enough time to do all the work itself. The whole problem appears when the workers actually have to do their jobs; it then never finishes. However, I don't have any such problems with, for example, the bplapply function in BiocParallel.