Question

Error in Biostrings with Fastafiles using the carriage return <cr>

0

Entering edit mode

zach.charlop.powers • 0

@zachcharloppowers-12216

Last seen 5.8 years ago

Hi Bioconductor,

I've just run into an issue with a fastafile that was generated using the carriage return <cr> to demarcate end of lines. When loading this file only the first record is discovered/parsed. I believe that this is a bug and the Biostrings should either recognize <cr> characters and parse the file, or trip an error. I can replicate this behavior as follows:

1. save the following fasta using the "CR" encoding (on OSX with TextMate you can Save-As and choose CR):

>test1
AAAAAAAAAAAAAAAAA
>test2
AAAAAAAAAAAAAAAAA
>test3
AAAAAAAAAAAAAAAAA

2. open in Biostrings

 readDNAStringSet("~/Desktop/test.fna")
  A DNAStringSet instance of length 1
    width seq                                                              names               
AAAAAAAAAAA...

3. Resave using LF and you get the expected behavior:

> readDNAStringSet("~/Desktop/test2.fna")
  A DNAStringSet instance of length 3
    width seq                                                              names               
[1]    17 AAAAAAAAAAAAAAAAA                                                test1
[2]    17 AAAAAAAAAAAAAAAAA                                                test2
[3]    17 AAAAAAAAAAAAAAAAA                                                test3

Thanks,

zach cp

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.2

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.42.1   XVector_0.14.0      IRanges_2.8.1       S4Vectors_0.12.1   
[5] BiocGenerics_0.20.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.20.0 tools_3.3.2

biostrings software error bug • 1.4k views

ADD COMMENT • link updated 7.3 years ago by Hervé Pagès 16k • written 7.3 years ago by zach.charlop.powers • 0

score 1 · Answer 1 · 2017-02-01

Hi Zach,

Using <cr> (a.k.a. \r) to terminate lines will break almost any tool, not just readDNAStringSet(). Even the wc command in Unix:

hpages@latitude:~/sandbox$ wc test.fna 
 0  6 74 test.fna

That's because on all platforms I know, lines must be terminated with <LF> (a.k.a. \n), possibly preceded by \r on some platforms (e.g. Windows). So a FASTA file where lines are terminated with \r cannot be parsed correctly. Note that strictly speaking the file is still valid but, from a parser point of view, the entire content of the file is seen as a single line and thus is considered to contain a single (big) header with an empty sequence:

> width(readDNAStringSet("test.fna"))
[1] 0

Since AFAIK the FASTA spec says nothing about \r not being allowed in the header (see https://en.wikipedia.org/wiki/FASTA_format), the parser has no reason to reject this file. Furthermore, I wouldn't be surprised if some people/organisation actually put \r inside the FASTA header (the Bioinformatics community can be very imaginative ;-) ). That seems to be allowed and modifying the parser as you suggest would break it on these files.

Cheers,

H.