Question: Error in Biostrings with Fastafiles using the carriage return <cr>
gravatar for zach.charlop.powers
9 months ago by
zach.charlop.powers0 wrote:

Hi Bioconductor,

I've just run into an issue with a fastafile that was generated using the carriage return <cr> to demarcate end of lines. When loading this file only the first record is discovered/parsed.  I believe that this is a bug and the Biostrings should either recognize <cr> characters and parse the file, or trip an error. I can replicate this behavior as follows:

1. save the following fasta using the "CR" encoding (on OSX with TextMate you can Save-As and choose CR):


2. open in Biostrings

  A DNAStringSet instance of length 1
    width seq                                                              names               

3. Resave using LF and you get the expected behavior:

> readDNAStringSet("~/Desktop/test2.fna")
  A DNAStringSet instance of length 3
    width seq                                                              names               
[1]    17 AAAAAAAAAAAAAAAAA                                                test1
[2]    17 AAAAAAAAAAAAAAAAA                                                test2
[3]    17 AAAAAAAAAAAAAAAAA                                                test3


zach cp

R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.2

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.42.1   XVector_0.14.0      IRanges_2.8.1       S4Vectors_0.12.1   
[5] BiocGenerics_0.20.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.20.0 tools_3.3.2    


ADD COMMENTlink modified 9 months ago by Hervé Pagès ♦♦ 13k • written 9 months ago by zach.charlop.powers0
gravatar for Hervé Pagès
9 months ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Zach,

Using <cr> (a.k.a. \r) to terminate lines will break almost any tool, not just readDNAStringSet(). Even the wc command in Unix:

hpages@latitude:~/sandbox$ wc test.fna 
 0  6 74 test.fna

That's because on all platforms I know, lines must be terminated with <LF> (a.k.a. \n), possibly preceded by \r on some platforms (e.g. Windows). So a FASTA file where lines are terminated with \r cannot be parsed correctly. Note that strictly speaking the file is still valid but, from a parser point of view, the entire content of the file is seen as a single line and thus is considered to contain a single (big) header with an empty sequence:

> width(readDNAStringSet("test.fna"))
[1] 0

Since AFAIK the FASTA spec says nothing about \r not being allowed in the header (see, the parser has no reason to reject this file. Furthermore, I wouldn't be surprised if some people/organisation actually put \r inside the FASTA header (the Bioinformatics community can be very imaginative ;-) ). That seems to be allowed and modifying the parser as you suggest would break it on these files.



ADD COMMENTlink modified 9 months ago • written 9 months ago by Hervé Pagès ♦♦ 13k

Thanks Herve.  I will check the files upstream but perhaps a warning is in order? In my case it took while for me to realize what was happening because readDNAStringSet passes on a useable object that will fail later in the code.

ADD REPLYlink written 8 months ago by zach.charlop.powers0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 157 users visited in the last hour