BiocParallel: bplapply invents data
robmaz77 ▴ 20
@dc2bd3d5
Last seen 18 months ago

I observe some weird behavior when trying to read paired FASTQ files in parallel. First, the expected behavior without parallel execution (file1 and file2 each contain exactly 1000 reads):

> fqStream <- FastqStreamerList(c(Fw=file1,Rv=file2),n=1000)
> fqChunk1 <- lapply(fqStream,yield)
> fqChunk1
$Fw
class: ShortReadQ
length: 1000 reads; width: 150 cycles

$Rv
class: ShortReadQ
length: 1000 reads; width: 150 cycles

> fqChunk2 <- lapply(fqStream,yield)
> fqChunk2
$Fw
class: ShortReadQ
length: 0 reads; width: 0 cycles

$Rv
class: ShortReadQ
length: 0 reads; width: 0 cycles


Now with bplapply instead of lapply:

> fqStream <- FastqStreamerList(c(Fw=file1,Rv=file2),n=1000)
> fqChunk3 <- bplapply(fqStream,yield)
> fqChunk3
$Fw
class: ShortReadQ
length: 1000 reads; width: 150 cycles

$Rv
class: ShortReadQ
length: 1000 reads; width: 150 cycles

> fqChunk4 <- bplapply(fqStream,yield)
> fqChunk4
$Fw
class: ShortReadQ
length: 146 reads; width: 150 cycles

$Rv
class: ShortReadQ
length: 137 reads; width: 150 cycles


As it turns out, those 146 and 137 reads are again the first reads of each file, and this can apparently be repeated indefinitely without ever yielding 0 reads:

> identical(list(Fw=compact(fqChunk1$Fw[1:146]),Rv=compact(fqChunk1$Rv[1:137])),compact(fqChunk4))
[1] TRUE

> fqChunk5 <- bplapply(fqStream,yield)
> fqChunk5
$Fw
class: ShortReadQ
length: 146 reads; width: 150 cycles

$Rv
class: ShortReadQ
length: 137 reads; width: 150 cycles

> identical(compact(fqChunk4),compact(fqChunk5))
[1] TRUE


Reading from the same stream with lapply now also yields the same chunk over and over again, but with a warning:

> fqChunk6 <- lapply(fqStream,yield)
Warning messages:
1: IncompleteFinalRecord
FastqStreamer yield() incomplete final record:
test/Pool_517_1000_1.fq.gz
2: IncompleteFinalRecord
FastqStreamer yield() incomplete final record:
test/Pool_517_1000_2.fq.gz


What is going on here?

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_AT.UTF-8
[4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_AT.UTF-8    LC_MESSAGES=en_US.UTF-8
[10] LC_TELEPHONE=C             LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] yaml_2.2.1                  txtplot_1.0-4               crayon_1.4.0
[7] SummarizedExperiment_1.20.0 Biobase_2.50.0              MatrixGenerics_1.2.1
[10] matrixStats_0.58.0          Rsamtools_2.6.0             GenomicRanges_1.42.0
[13] GenomeInfoDb_1.26.2         Biostrings_2.58.0           XVector_0.30.0
[16] IRanges_2.24.1              S4Vectors_0.28.1            BiocParallel_1.24.1
[19] BiocGenerics_0.36.0         docopt_0.7.1

loaded via a namespace (and not attached):
[1] rstudioapi_0.13        zlibbioc_1.36.0        lattice_0.20-41        jpeg_0.1-8.1
[5] hwriter_1.3.2          tools_4.0.4            grid_4.0.4             png_0.1-7
[9] latticeExtra_0.6-29    Matrix_1.3-2           GenomeInfoDbData_1.2.4 RColorBrewer_1.1-2
[13] codetools_0.2-17       bitops_1.0-6           RCurl_1.98-1.2         DelayedArray_0.16.1
[17] compiler_4.0.4

@martin-morgan-1513
Last seen 13 days ago
United States

The file is opened on the 'manager', but the internal information about how many reads have been read, etc., is updated on the worker without the manager knowing. The manager is left in a confused state. Similar problems occur with, e.g., a database connection opened on the manager but manipulated on the workers.

The solution is to open, manipulate, and close the file on the worker only.
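A minimal sketch of that pattern (file names and the per-chunk processing are placeholders, not from the thread): only file paths cross the manager/worker boundary, and each worker opens, reads, and closes its own FastqStreamer.

```r
library(BiocParallel)
library(ShortRead)

## Hypothetical file paths; substitute your own.
files <- c(Fw = "sample_1.fq.gz", Rv = "sample_2.fq.gz")

chunks <- bplapply(files, function(f) {
    ## The streamer's entire lifecycle stays on this worker.
    fq <- FastqStreamer(f, n = 1000)
    on.exit(close(fq))
    reads <- list()
    while (length(chunk <- yield(fq)))
        reads[[length(reads) + 1L]] <- chunk   # or process 'chunk' here
    reads
})
```

Note that each file is then consumed entirely on its worker, so this pattern does not allow chunk-wise paired processing back on the manager between yields.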


Hm. But how can that be implemented so that corresponding chunks are read from the two files in parallel and then processed together? Obviously you have to return from the workers for processing, at which point the stream state is lost before the next chunk can be read. Is there no secret way to tell yield() the offset at which it should resume, or something like that?

Also, some information does seem to come back from the workers, since subsequent calls do not return the same result as the first one. Does this mean there would be a way to update the manager's object correctly, or should this be considered a bug, with the actually expected behavior under these conditions being to always return just the first chunk?


I guess you are processing two FASTQ files from each of several samples, so the strategy would be to process each sample in a separate worker. The partial update of the object in the manager is really an artifact; there is no way to correctly retrieve the results of the partial update from the workers.
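A sketch of that per-sample strategy, assuming a hypothetical sample sheet `samples` with columns `fw` and `rv` (names invented for illustration): each worker streams one sample's forward/reverse pair sequentially, so the paired chunks stay synchronized on that worker.

```r
library(BiocParallel)
library(ShortRead)

## Hypothetical sample sheet; paths and column names are placeholders.
samples <- data.frame(fw = c("s1_1.fq.gz", "s2_1.fq.gz"),
                      rv = c("s1_2.fq.gz", "s2_2.fq.gz"))

bplapply(seq_len(nrow(samples)), function(i) {
    ## Both streamers live and die on this worker.
    fw <- FastqStreamer(samples$fw[i], n = 1000)
    rv <- FastqStreamer(samples$rv[i], n = 1000)
    on.exit({ close(fw); close(rv) })
    repeat {
        fwChunk <- yield(fw)
        rvChunk <- yield(rv)
        if (length(fwChunk) == 0L) break
        ## process the paired chunk for this sample here
    }
})
```

The parallelism is then across samples, not within a pair, which sidesteps the shared-stream problem entirely.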


I was thinking specifically about forward and reverse reads of read pairs that need to be processed together. Since they happen to come in two corresponding FASTQ files, I was wondering whether reading the two files could be sped up by doing it in parallel. It would be worth it because of the implied decompression step, I think. But alas, it looks like this is a no-go. Many thanks anyway for the explanation.