Search
Question: Error in Biostrings with Fastafiles using the carriage return <cr>
0
gravatar for zach.charlop.powers
9 months ago by
zach.charlop.powers0 wrote:

Hi Bioconductor,

I've just run into an issue with a fastafile that was generated using the carriage return <cr> to demarcate end of lines. When loading this file only the first record is discovered/parsed.  I believe that this is a bug and the Biostrings should either recognize <cr> characters and parse the file, or trip an error. I can replicate this behavior as follows:

1. save the following fasta using the "CR" encoding (on OSX with TextMate you can Save-As and choose CR):

>test1
AAAAAAAAAAAAAAAAA
>test2
AAAAAAAAAAAAAAAAA
>test3
AAAAAAAAAAAAAAAAA

2. open in Biostrings

 readDNAStringSet("~/Desktop/test.fna")
  A DNAStringSet instance of length 1
    width seq                                                              names               
AAAAAAAAAAA...  

3. Resave using LF and you get the expected behavior:

> readDNAStringSet("~/Desktop/test2.fna")
  A DNAStringSet instance of length 3
    width seq                                                              names               
[1]    17 AAAAAAAAAAAAAAAAA                                                test1
[2]    17 AAAAAAAAAAAAAAAAA                                                test2
[3]    17 AAAAAAAAAAAAAAAAA                                                test3

Thanks,

zach cp

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.2

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.42.1   XVector_0.14.0      IRanges_2.8.1       S4Vectors_0.12.1   
[5] BiocGenerics_0.20.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.20.0 tools_3.3.2    

 

ADD COMMENTlink modified 9 months ago by Hervé Pagès ♦♦ 13k • written 9 months ago by zach.charlop.powers0
1
gravatar for Hervé Pagès
9 months ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Zach,

Using <cr> (a.k.a. \r) to terminate lines will break almost any tool, not just readDNAStringSet(). Even the wc command in Unix:

hpages@latitude:~/sandbox$ wc test.fna 
 0  6 74 test.fna

That's because on all platforms I know, lines must be terminated with <LF> (a.k.a. \n), possibly preceded by \r on some platforms (e.g. Windows). So a FASTA file where lines are terminated with \r cannot be parsed correctly. Note that strictly speaking the file is still valid but, from a parser point of view, the entire content of the file is seen as a single line and thus is considered to contain a single (big) header with an empty sequence:

> width(readDNAStringSet("test.fna"))
[1] 0

Since AFAIK the FASTA spec says nothing about \r not being allowed in the header (see https://en.wikipedia.org/wiki/FASTA_format), the parser has no reason to reject this file. Furthermore, I wouldn't be surprised if some people/organisation actually put \r inside the FASTA header (the Bioinformatics community can be very imaginative ;-) ). That seems to be allowed and modifying the parser as you suggest would break it on these files.

Cheers,

H.

ADD COMMENTlink modified 9 months ago • written 9 months ago by Hervé Pagès ♦♦ 13k

Thanks Herve.  I will check the files upstream but perhaps a warning is in order? In my case it took while for me to realize what was happening because readDNAStringSet passes on a useable object that will fail later in the code.

ADD REPLYlink written 8 months ago by zach.charlop.powers0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 157 users visited in the last hour