Error using R msa::msaClustalW in parallel

Dear all,

I have found that msa::msaClustalW fails when it is run in parallel with foreach::foreach or BiocParallel::bplapply. Below is a small script that reproduces the error.

library(Biostrings)
library(msa)
library(doParallel)
library(foreach)
library(dplyr)

seqs <- DNAStringSetList(c("A", "AT", "T")) %>%
  rep(100)

registerDoParallel(cores=2)
res <- foreach(seqs_i = seqs) %dopar%
  msaClustalW(seqs_i)
# ERROR: Cannot open output file [internalRsequence.dnd]
# ERROR: Wrong format in tree file internalRsequence.dnd
# Error in msaClustalW(seqs_i) :
#   task 40 failed - "There is an invalid aln file!"


However, without parallelization it works fine:

res <- foreach(seqs_i = seqs) %do%
  msaClustalW(seqs_i)

sessionInfo()
# R version 3.6.2 (2019-12-12)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 18.04.3 LTS
#
# Matrix products: default
# BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
# LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#
# locale:
# [1] LC_CTYPE=en_US.UTF-8         LC_NUMERIC=C
# [3] LC_TIME=en_GB.UTF-8          LC_COLLATE=en_US.UTF-8
# [5] LC_MONETARY=en_GB.UTF-8      LC_MESSAGES=en_US.UTF-8
# [7] LC_PAPER=en_GB.UTF-8         LC_NAME=C
# [11] LC_MEASUREMENT=en_GB.UTF-8  LC_IDENTIFICATION=C
#
# attached base packages:
# [1] stats4    parallel  stats     graphics  grDevices utils     datasets
# [8] methods   base
#
# other attached packages:
# [1] dplyr_0.8.3          doParallel_1.0.15   iterators_1.0.12
# [4] foreach_1.4.7        msa_1.16.0          Biostrings_2.52.0
# [7] XVector_0.24.0       IRanges_2.18.3      S4Vectors_0.22.1
# [10] BiocGenerics_0.30.0
#
# loaded via a namespace (and not attached):
# [1] Rcpp_1.0.2           rstudioapi_0.10     magrittr_1.5
# [4] zlibbioc_1.30.0      tidyselect_0.2.5    BiocParallel_1.18.1
# [7] R6_2.4.0             rlang_0.4.0         tools_3.6.2
# [10] assertthat_0.2.1    tibble_2.1.3        crayon_1.3.4
# [13] purrr_0.3.2         codetools_0.2-16    glue_1.3.1
# [16] compiler_3.6.2      pillar_1.4.2        pkgconfig_2.0.3


I should add that the parallel run sometimes succeeds. The longer the seqs object is, the more likely the error seems to be.

I could use msa(method="Muscle") instead, which works in parallel but causes memory leaks.

Could you give me any tips on how to run msaClustalW in parallel, or tell me what I am doing wrong?

Thank you in advance. Best wishes.

software error msa • 228 views
UBodenhofer ▴ 290
@ubodenhofer-5425
University of Applied Sciences Upper Au…

Thanks for identifying this issue! I had a look at the source code and I am sure I have found the reason: msaClustalW() internally writes to temporary files with fixed file names and erases those files after completion. If multiple instances run in parallel, the worker processes create and delete files with the same path in an undetermined order.

Unfortunately, fixing this issue is not straightforward, and I cannot make any promises as to when and how this will be done. At the very least, I will add a note to the documentation in the upcoming release (Apr/May 2020).

However, there might be a workaround that you can use immediately: write the DNAStringSet objects to different files and pass the file names to msaClustalW() instead of the DNAStringSet objects. Give this a try and please come back to this forum to report whether it worked.
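
A minimal sketch of the suggested workaround, reusing the toy data from the question. It has not been confirmed in this thread that this avoids the file collision, so treat it as something to try rather than a verified fix. The scratch directory, the file naming scheme, and the sequence names (seq1, seq2, ...) are assumptions made for illustration. When msaClustalW() is given a file name, the msa documentation requires the suffix ".fa" or ".fasta", and the sequence type is passed explicitly here because it cannot be taken from an XStringSet object.

```r
library(Biostrings)
library(msa)
library(doParallel)
library(foreach)

# Same toy input as in the question: 100 small DNAStringSet objects
seqs <- rep(DNAStringSetList(c("A", "AT", "T")), 100)

# Hypothetical scratch directory holding one FASTA file per sequence set
fasta_dir <- tempfile("msa_input_")
dir.create(fasta_dir)
fasta_files <- file.path(fasta_dir,
                         sprintf("seqs_%03d.fasta", seq_along(seqs)))

for (i in seq_along(seqs)) {
  s <- seqs[[i]]
  # Give every sequence a name so that each FASTA record has a header
  names(s) <- paste0("seq", seq_along(s))
  writeXStringSet(s, fasta_files[i])
}

# Each worker receives a distinct file name instead of a DNAStringSet
registerDoParallel(cores = 2)
res <- foreach(f = fasta_files, .packages = "msa") %dopar%
  msaClustalW(f, type = "dna")
```

If this runs without the "Cannot open output file" error, res should hold one alignment per input file, in the same order as fasta_files. The input files can be removed afterwards with unlink(fasta_dir, recursive = TRUE).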