Dear Biostrings-Experts and Developers,
thank you for supporting the Bioinformatics community with Biostrings. It is highly useful and has not left my toolbox for some years now.
In a package that I am currently writing, I have encountered a possible bug, though:
When I use Biostrings::readAAStringSet on the latest UniprotKB/Swissprot FASTA file, many sequences are read in incorrectly. Only seemingly random subsequences remain after reading in the FASTA file.
I confirmed this manually and also with a different tool from the R package "seqinr". Its function seqinr::read.fasta reads in the AA sequences correctly.
Using Biostrings I get the warning:
In .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec, :
reading FASTA file /opt/share/blastdb/uniprotkb/FASTA/uniprot_sprot.fasta: ignored 68720968 invalid one-letter sequence codes
Could the bug be related to this warning?
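To check which characters the warning might be referring to, one option is to count every character that occurs on the sequence lines of the file. The following is only a sketch in base R; the function name tabulate_fasta_chars and its chunk_lines parameter are my own and not part of Biostrings, and I have not verified which characters Biostrings 2.38.4 treats as invalid.

```r
# Hypothetical helper (not part of Biostrings): count how often each
# character occurs on the non-header lines of a FASTA file.
# The file is read in chunks so that a large file such as Swissprot
# does not have to be split into single characters all at once.
tabulate_fasta_chars <- function(path, chunk_lines = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  counts <- numeric(0)  # named vector: character -> number of occurrences
  repeat {
    lines <- readLines(con, n = chunk_lines, warn = FALSE)
    if (length(lines) == 0L) break
    # Keep only non-empty lines that are not FASTA headers.
    seq_lines <- lines[nzchar(lines) & !grepl("^>", lines)]
    chars <- unlist(strsplit(seq_lines, "", fixed = TRUE), use.names = FALSE)
    if (length(chars) == 0L) next
    tab <- table(chars)
    # Add characters seen for the first time, then accumulate the counts.
    new_names <- setdiff(names(tab), names(counts))
    counts[new_names] <- 0
    counts[names(tab)] <- counts[names(tab)] + as.vector(tab)
  }
  sort(counts, decreasing = TRUE)
}
```

Calling tabulate_fasta_chars("/opt/share/blastdb/uniprotkb/FASTA/uniprot_sprot.fasta") should then show whether anything other than the expected amino acid letters occurs in the sequence lines.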
I cannot easily reproduce the bug in isolation, so I will try to explain it with an example:
Take the Uniprot/Swissprot gene accession Q5V0J7. The AA sequence read in with Biostrings is shorter than, and thus not identical to, the original one found in the FASTA file. As I said, I confirmed that manually and by comparison with the sequence read in with seqinr::read.fasta.
However, if I copy and paste the part of Swissprot containing only Q5V0J7 into a separate file, then Biostrings reads in the sequence correctly.
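The comparison between the two readers described above can be sketched roughly as follows. The helper compare_fasta_readers is hypothetical and only illustrates the idea; it assumes that both packages return the records in file order, so that they can be compared by position.

```r
library(Biostrings)

# Hypothetical helper: list the records whose sequence length differs
# between Biostrings::readAAStringSet and seqinr::read.fasta.
# Assumption: both readers return the records in file order.
compare_fasta_readers <- function(path) {
  bs <- readAAStringSet(path)
  sq <- seqinr::read.fasta(path, seqtype = "AA", as.string = TRUE)
  stopifnot(length(bs) == length(sq))
  bs_len <- width(bs)
  sq_len <- vapply(sq, function(s) nchar(as.character(s)),
                   integer(1), USE.NAMES = FALSE)
  differs <- bs_len != sq_len
  data.frame(header     = names(bs)[differs],
             biostrings = bs_len[differs],
             seqinr     = sq_len[differs],
             stringsAsFactors = FALSE)
}
```

Running this on the full Swissprot file and again on a file containing only the Q5V0J7 record would show whether the difference only appears in the context of the complete file.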
I am using Biostrings version 2.38.4 in R 3.2.0 on Debian Wheezy.
My package has the following DESCRIPTION file:
Title: Computes performance scores of AHRD and its competitors based on the F1-Score. DO READ THE README FILE FOR INSTRUCTIONS
Author: Dr. Asis Hallab
Maintainer: Dr. Asis Hallab <firstname.lastname@example.org>
Description: Computes performance scores of AHRD and its competitors based on
the F1-Score. DO READ THE README FILE FOR INSTRUCTIONS
R (>= 3.2.0),
Rcpp (>= 0.12.3),
RMySQL (>= 0.10.8),
data.table (>= 1.9.6),
Biostrings (>= 2.38.4)
I hope this helps you.
Thank you very much and have a nice day!