Question: Error in Biostrings with Fastafiles using the carriage return <cr>
gravatar for zach.charlop.powers
2.7 years ago by
zach.charlop.powers0 wrote:

Hi Bioconductor,

I've just run into an issue with a fastafile that was generated using the carriage return <cr> to demarcate end of lines. When loading this file only the first record is discovered/parsed.  I believe that this is a bug and the Biostrings should either recognize <cr> characters and parse the file, or trip an error. I can replicate this behavior as follows:

1. save the following fasta using the "CR" encoding (on OSX with TextMate you can Save-As and choose CR):


2. open in Biostrings

  A DNAStringSet instance of length 1
    width seq                                                              names               

3. Resave using LF and you get the expected behavior:

> readDNAStringSet("~/Desktop/test2.fna")
  A DNAStringSet instance of length 3
    width seq                                                              names               
[1]    17 AAAAAAAAAAAAAAAAA                                                test1
[2]    17 AAAAAAAAAAAAAAAAA                                                test2
[3]    17 AAAAAAAAAAAAAAAAA                                                test3


zach cp

R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.2

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.42.1   XVector_0.14.0      IRanges_2.8.1       S4Vectors_0.12.1   
[5] BiocGenerics_0.20.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.20.0 tools_3.3.2    


biostrings software error bug • 588 views
ADD COMMENTlink modified 2.7 years ago by Hervé Pagès ♦♦ 14k • written 2.7 years ago by zach.charlop.powers0
Answer: Error in Biostrings with Fastafiles using the carriage return <cr>
gravatar for Hervé Pagès
2.7 years ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi Zach,

Using <cr> (a.k.a. \r) to terminate lines will break almost any tool, not just readDNAStringSet(). Even the wc command in Unix:

hpages@latitude:~/sandbox$ wc test.fna 
 0  6 74 test.fna

That's because on all platforms I know, lines must be terminated with <LF> (a.k.a. \n), possibly preceded by \r on some platforms (e.g. Windows). So a FASTA file where lines are terminated with \r cannot be parsed correctly. Note that strictly speaking the file is still valid but, from a parser point of view, the entire content of the file is seen as a single line and thus is considered to contain a single (big) header with an empty sequence:

> width(readDNAStringSet("test.fna"))
[1] 0

Since AFAIK the FASTA spec says nothing about \r not being allowed in the header (see, the parser has no reason to reject this file. Furthermore, I wouldn't be surprised if some people/organisation actually put \r inside the FASTA header (the Bioinformatics community can be very imaginative ;-) ). That seems to be allowed and modifying the parser as you suggest would break it on these files.



ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by Hervé Pagès ♦♦ 14k

Thanks Herve.  I will check the files upstream but perhaps a warning is in order? In my case it took while for me to realize what was happening because readDNAStringSet passes on a useable object that will fail later in the code.

ADD REPLYlink written 2.6 years ago by zach.charlop.powers0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 270 users visited in the last hour