Question: running msa (msa) aborts R session
0
11 months ago by
ans740
ans740 wrote:

Hello there,

I'm trying to run msa (msa package, version 1.10.0) in R version 3.4.3 (2017-11-30) -- "Kite-Eating Tree" with ~71,000 260-400bp sequences.

However, the R session aborts every time I run the following:

alignment <- msa(dna, method="ClustalW")

No extra info is given, since a new R session is started.

dna looks normal by the way:


> dna
  A DNAStringSet instance of length 70937
        width seq                                                                        names
    [1]   402 TGGGGAATATTACACAATGGAGGAAACTCTGATGTA...CTGACGCTCAGATGCGAAAGCGTGGGTAGCAAACA SV_1
    [2]   427 TGGGGAATTTTGGACAATGGGCGCAAGCCTGATCCA...CTGACGCTCATGCACGAAAGCGTGGGGAGCAAACA SV_2
    [3]   402 TGAGGAATATTGCACAATGGAGGAAACTCTGATGCA...CTGACGCTGAGGCACGAAAGCGTGGGGAGCAAACA SV_3
    [4]   427 TGGGGAATTTTGGACAATGGGCGCAAGCCTGATCCA...CTGACGCTCATGCACGAAAGCGTGGGGAGCAAACA SV_4
    [5]   427 TGGGGAATTTTGGACAATGGACGAAAGTCTGATCCA...CTGACGCTCATGCACGAAAGCGTGGGGAGCAAACA SV_5
    ...   ... ...
[70933]   428 TGGGGAATATTGCGCAATGGCCGAAAGGCTGACGCA...CTGACGCTCATGCACGAAAGCGTGGGGAGCAAACA SV_70933
[70934]   428 TGGGGAATATTGCGCAATGGCCGAAAGGCTGACGCA...CTGACGCTCATGCACGAAAGCGTGGGGAGCAAACA SV_70934
[70935]   403 ACGAGAATATTCGACAATGCACGAAAGTGTGATCGA...CTGACGGTCAATCACTAAAGCGTGGGGATCAAAAA SV_70935
[70936]   402 TGGGGAATATTGGACAATGGGCGCAAGCCTGATCCA...TTGACGCTCATGCACGAAAGCGTGGGGAGCAAACA SV_70936
[70937]   429 TGGGGAATTTTGGACAATGGGCGAAAGCCTGACGCA...CTGACGCTCATGCACGAAAGCGTGGGGAGCAAACA SV_70937

Any help will be appreciated!

Thanks,

André

modified 11 months ago by UBodenhofer250 • written 11 months ago by ans740
1
11 months ago by
UBodenhofer250
Johannes Kepler University, Linz, Austria
UBodenhofer250 wrote:

I'm sorry you are encountering difficulties with our package! It is hard to guess the source of the problem from this description alone. Can you provide the sequences for debugging, or are they confidential? In any case, there is one thing you can try yourself first: run msa on a subset of your sequences and gradually increase the subset size to find out at what size the problem first appears.
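
A minimal sketch of that subset test, in case it helps. The helper name time_subsets, its arguments, and the subset sizes in the usage lines are made up for illustration; only msa() and its method argument come from the package. The size is printed before each run, so if the session aborts, the last message shows which size triggered it.

```r
# Hypothetical helper: align random subsets of increasing size and time each run.
# `align` defaults to ClustalW via msa; it is only evaluated when actually called.
time_subsets <- function(seqs, sizes,
                         align = function(x) msa::msa(x, method = "ClustalW")) {
  timings <- numeric(0)
  for (n in sizes) {
    # Report the size first, so a crash can be attributed to it
    message("Aligning ", n, " sequences ...")
    sub <- seqs[sample(length(seqs), n)]
    elapsed <- system.time(align(sub))[["elapsed"]]
    timings[as.character(n)] <- elapsed
  }
  timings  # named vector: subset size -> elapsed seconds
}

# Usage with the DNAStringSet from the question (sizes are arbitrary):
# library(msa)
# set.seed(1)
# time_subsets(dna, c(100, 500, 1000, 2000))
```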

1
11 months ago by
UBodenhofer250
Johannes Kepler University, Linz, Austria
UBodenhofer250 wrote:

In this case, André, I agree that it is a memory issue. If you insist on ClustalW, you may have to resort to the command-line version (though I am not convinced that this will work). You may be better off giving ClustalOmega a try, since it is explicitly designed for handling larger data sets. Sorry that I cannot say more for now.
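
For reference, switching algorithms in msa only requires changing the method argument. The sketch below uses a tiny toy DNAStringSet (the object name toy and its three short sequences are placeholders, not the real data); with the real data, the call would take dna instead. Whether ClustalOmega copes with ~71,000 sequences inside R is not something this example can show.

```r
library(Biostrings)
library(msa)

# Toy input: three short placeholder sequences standing in for the real `dna`
toy <- DNAStringSet(c(
  SV_a = "TGGGGAATATTACACAATGGAGG",
  SV_b = "TGGGGAATTTTGGACAATGGGCG",
  SV_c = "TGAGGAATATTGCACAATGGAGG"
))

# Same call as in the question, but with ClustalOmega instead of ClustalW
aln <- msa(toy, method = "ClustalOmega")
print(aln)
```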

0
11 months ago by
ans740
ans740 wrote:

The sequences are confidential indeed, sorry.

I tried subsampling my dataset to 500 sequences and it ran perfectly in ~15 min. However, after subsampling it to 5000 sequences, it is still running after 2 h and has taken up to ~15 GB of RAM. I have much more than that available, but this might be a memory issue, I guess...
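
A rough back-of-the-envelope check of why the jump from 500 to 5000 hurts so much: ClustalW starts by aligning every pair of sequences to build its guide tree, so the amount of work grows with the number of pairs, n(n-1)/2. The sketch below only counts pairs for the three sizes mentioned in this thread. It assumes the pairwise stage dominates and says nothing about the actual memory layout inside msa.

```r
# Sizes from this thread: the two subsamples and the full data set
n_seqs  <- c(500, 5000, 70937)

# Number of sequence pairs, n * (n - 1) / 2
n_pairs <- choose(n_seqs, 2)

# Pair counts and growth relative to the 500-sequence run
print(data.frame(
  sequences       = n_seqs,
  pairs           = n_pairs,
  relative_to_500 = n_pairs / n_pairs[1]
))
```

Going from 500 to 5000 sequences multiplies the pair count by roughly 100, and the full set of 70937 sequences has roughly 20,000 times as many pairs as the 500-sequence run.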

Cheers,
André