Possible bug in Biostrings::readAAStringSet
2
0
Entering edit mode
@asishallab-11502
Last seen 8.1 years ago

Dear Biostrings-Experts and Developers,

 

thank you for supporting the Bioinformatics community with Biostrings. It is highly useful and has not left my tool-box since some years now.

 

In a package that I am currently writing I have encountered a possible bug though: 

 

When using Biostrings::readAAStringSet on the latest UniprotKB/Swissprot FASTA-File many sequences are read in wrong. Only random sub-sequences remain after reading in the FASTA File.

I confirmed this manually and also with a different tool from the R package "seqinr". Its method seqinr::read.fasta reads in the AA-Sequences correctly.

 

Using Biostrings I get the warning: 

 

Warning message:
In .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec, :
reading FASTA file /opt/share/blastdb/uniprotkb/FASTA/uniprot_sprot.fasta: ignored 68720968 invalid one-letter sequence codes

 

Possibly the bug has to do with this message?

 

I cannot easily reproduce the bug. I try to explain with an example:

 

Take the Uniprot/Swissprot gene accession Q5V0J7. The AA-Sequence read in with Biostrings is both shorter and thus not identical with the original one, found in the Fasta-File. As I said, I confirmed that manually and by comparison with the sequence read in with seqinr::read.fasta. 

However, if I copy paste the part of Swissprot only containing Q5V0J7 thenBiostrings reads in the sequence correctly.

 

I am using Biostrings version 2.38.4 in R 3.2.0 on Debian Wheezy.

 

My package has the following DESCRIPTION file:

"

Package: evaluation
Type: Package
Title: Computes performance scores of AHRD and its competitors based on the F1-
Score. DO READ THE README FILE FOR INSTRUCTIONS
Version: 1.0
Date: 2016-07-01
Author: Dr. Asis Hallab
Maintainer: Dr. Asis Hallab <asis.hallab@gmail.com>
Description: Computes performance scores of AHRD and its competitors based on
the F1-Score. DO READ THE README FILE FOR INSTRUCTIONS
License:
Depends:
R (>= 3.2.0),
Rcpp (>= 0.12.3),
RMySQL (>= 0.10.8),
data.table (>= 1.9.6),
Biostrings (>= 2.38.4)
Imports: Rcpp
LinkingTo: Rcpp
RoxygenNote: 5.0.1

"

 

I hope this helps you. 

 

Thank you very much and have a nice day!

Cheers!

bug biostrings readAAStringSet • 5.4k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 9 hours ago
United States

We don't support outdated versions of Bioconductor. Please update before posting what you think are errors. I have no problems with this using the current release versions:

 > z <- readAAStringSet("/data/tmp2/blast_databases/uniprot/uniprot_sprot.fasta")
 > z
   A AAStringSet instance of length 545536
          width seq                                          names
      [1]   256 MAFSAEDVLKEYDRRRRMEAL...LYDDSFRKIYTDLGWKFTPL sp|Q6GZX4|001R_FR...
      [2]   320 MSIIGATRLQNDKSDTYSAGP...VMFFVAGAVLVAILISTVRW sp|Q6GZX3|002L_FR...
      [3]   458 MASNTVSAQGGSNRPVRDFSN...EDFEYDSDSEDDDSDSEDDC sp|Q197F8|002R_II...
      [4]   156 MYQAINPCPQSWYGSPQLERE...SKTRKNNYGTCRLEPLTYGI sp|Q197F7|003L_II...
      [5]   438 MARPLLGKTSSVRRRLESLSA...KRKAKIQEMFDNMVSRMVTS sp|Q6GZX2|003R_FR...
      ...   ... ...
 [545532]   100 MGNSKSKSKLSANQYEQQTVN...EELPTSIVVPIEPSAPPPED sp|Q6UY62|Z_SABVB...
 [545533]    79 MSSSLEITSFYSFIWTPHIGP...SLRLRVIVLFLAIFFPLLNI sp|P08105|Z_SHEEP...
 [545534]    95 MGNCNRTQKPSSSSNNLEKPP...CWKPLPTSIRVPLEASAPDL sp|Q88470|Z_TACVF...
 [545535]    95 MGLRYSKEVRDRHGDKDPEGR...EVLPKRLTFETTPTAPPYTP sp|A9JR22|Z_TAMVU...
 [545536]    95 MGLRYSKDVKDRYGDREPEGR...EVLPKKLVFENSPSAPPYEA sp|B2ZDY1|Z_WWAVU...
 
 > z[grep("Q5V0J7", names(z)),]
   A AAStringSet instance of length 1
     width seq                                               names
 [1]   173 MHVDIVPVGEVSAQVKREASDGL...SPTVRQVDVKEVSLCGSCQRNVL sp|Q5V0J7|AMZA_HA...
> as.character(z[grep("Q5V0J7", names(z)),])
                                         sp|Q5V0J7|AMZA_HALMA Archaemetzincin OS=Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809) GN=amzA PE=3 SV=2
 "MHVDIVPVGEVSAQVKREASDGLRSVYDCEVSMHEPQSIPAGAYDGDRDQYRAEEFIDLARRVGSGGKNIAITPKDLFYRRRNYVFGLAYLGGSGSVISTYRLQTSSDGGFSNRSAGEIFSQRVRKEVVHEVGHTLGLEHCDNKRCVMNFSPTVRQVDVKEVSLCGSCQRNVL"

Which is identical to what I get when I grep the FASTA file directly.

Edit:

> sessionInfo()
 R version 3.3.0 (2016-05-03)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Debian GNU/Linux 8 (jessie)

 locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats4    parallel  stats     graphics  grDevices utils     datasets
 [8] methods   base

 other attached packages:
 [1] Biostrings_2.40.2   XVector_0.12.1      IRanges_2.6.1
 [4] S4Vectors_0.10.3    BiocGenerics_0.18.0

 loaded via a namespace (and not attached):
 [1] zlibbioc_1.18.0 compiler_3.3.0  tools_3.3.0

 

ADD COMMENT
0
Entering edit mode
Guangchuang Yu ★ 1.2k
@guangchuang-yu-5419
Last seen 5 days ago
China/Guangzhou/Southern Medical Univer…

ignored 68720968 invalid one-letter sequence codes

 

I think you need to use readBStringSet for your fasta file contains invalid alphabet.

 

ADD COMMENT

Login before adding your answer.

Traffic: 786 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6