Question: Error in Biostrings with Fastafiles using the carriage return <cr>
0
2.7 years ago by
zach.charlop.powers0 wrote:

Hi Bioconductor,

I've just run into an issue with a fastafile that was generated using the carriage return <cr> to demarcate end of lines. When loading this file only the first record is discovered/parsed.  I believe that this is a bug and the Biostrings should either recognize <cr> characters and parse the file, or trip an error. I can replicate this behavior as follows:

1. save the following fasta using the "CR" encoding (on OSX with TextMate you can Save-As and choose CR):

>test1
AAAAAAAAAAAAAAAAA
>test2
AAAAAAAAAAAAAAAAA
>test3
AAAAAAAAAAAAAAAAA

2. open in Biostrings

 readDNAStringSet("~/Desktop/test.fna")
A DNAStringSet instance of length 1
width seq                                                              names
AAAAAAAAAAA...  

3. Resave using LF and you get the expected behavior:

> readDNAStringSet("~/Desktop/test2.fna")
A DNAStringSet instance of length 3
width seq                                                              names
[1]    17 AAAAAAAAAAAAAAAAA                                                test1
[2]    17 AAAAAAAAAAAAAAAAA                                                test2
[3]    17 AAAAAAAAAAAAAAAAA                                                test3

Thanks,

zach cp

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.2

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biostrings_2.42.1   XVector_0.14.0      IRanges_2.8.1       S4Vectors_0.12.1
[5] BiocGenerics_0.20.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.20.0 tools_3.3.2    

biostrings software error bug • 588 views
modified 2.7 years ago by Hervé Pagès ♦♦ 14k • written 2.7 years ago by zach.charlop.powers0
Answer: Error in Biostrings with Fastafiles using the carriage return <cr>
1
2.7 years ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi Zach,

Using <cr> (a.k.a. \r) to terminate lines will break almost any tool, not just readDNAStringSet(). Even the wc command in Unix:

hpages@latitude:~/sandbox\$ wc test.fna
0  6 74 test.fna

That's because on all platforms I know, lines must be terminated with <LF> (a.k.a. \n), possibly preceded by \r on some platforms (e.g. Windows). So a FASTA file where lines are terminated with \r cannot be parsed correctly. Note that strictly speaking the file is still valid but, from a parser point of view, the entire content of the file is seen as a single line and thus is considered to contain a single (big) header with an empty sequence:

> width(readDNAStringSet("test.fna"))
[1] 0

Since AFAIK the FASTA spec says nothing about \r not being allowed in the header (see https://en.wikipedia.org/wiki/FASTA_format), the parser has no reason to reject this file. Furthermore, I wouldn't be surprised if some people/organisation actually put \r inside the FASTA header (the Bioinformatics community can be very imaginative ;-) ). That seems to be allowed and modifying the parser as you suggest would break it on these files.

Cheers,

H.

ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by Hervé Pagès ♦♦ 14k

Thanks Herve.  I will check the files upstream but perhaps a warning is in order? In my case it took while for me to realize what was happening because readDNAStringSet passes on a useable object that will fail later in the code.