Hi Bioconductor,
I've just run into an issue with a fastafile that was generated using the carriage return <cr> to demarcate end of lines. When loading this file only the first record is discovered/parsed. I believe that this is a bug and the Biostrings should either recognize <cr> characters and parse the file, or trip an error. I can replicate this behavior as follows:
1. save the following fasta using the "CR" encoding (on OSX with TextMate you can Save-As and choose CR):
>test1 AAAAAAAAAAAAAAAAA >test2 AAAAAAAAAAAAAAAAA >test3 AAAAAAAAAAAAAAAAA
2. open in Biostrings
readDNAStringSet("~/Desktop/test.fna") A DNAStringSet instance of length 1 width seq names AAAAAAAAAAA...
3. Resave using LF and you get the expected behavior:
> readDNAStringSet("~/Desktop/test2.fna") A DNAStringSet instance of length 3 width seq names [1] 17 AAAAAAAAAAAAAAAAA test1 [2] 17 AAAAAAAAAAAAAAAAA test2 [3] 17 AAAAAAAAAAAAAAAAA test3
Thanks,
zach cp
sessionInfo() R version 3.3.2 (2016-10-31) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: macOS Sierra 10.12.2 locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] Biostrings_2.42.1 XVector_0.14.0 IRanges_2.8.1 S4Vectors_0.12.1 [5] BiocGenerics_0.20.0 loaded via a namespace (and not attached): [1] zlibbioc_1.20.0 tools_3.3.2
Thanks Herve. I will check the files upstream but perhaps a warning is in order? In my case it took while for me to realize what was happening because readDNAStringSet passes on a useable object that will fail later in the code.