Question: Possible bug in Biostrings::readAAStringSet
0
gravatar for asis.hallab
3.1 years ago by
asis.hallab0 wrote:

Dear Biostrings-Experts and Developers,

 

thank you for supporting the Bioinformatics community with Biostrings. It is highly useful and has not left my tool-box since some years now.

 

In a package that I am currently writing I have encountered a possible bug though: 

 

When using Biostrings::readAAStringSet on the latest UniprotKB/Swissprot FASTA-File many sequences are read in wrong. Only random sub-sequences remain after reading in the FASTA File.

I confirmed this manually and also with a different tool from the R package "seqinr". Its method seqinr::read.fasta reads in the AA-Sequences correctly.

 

Using Biostrings I get the warning: 

 

Warning message:
In .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec, :
reading FASTA file /opt/share/blastdb/uniprotkb/FASTA/uniprot_sprot.fasta: ignored 68720968 invalid one-letter sequence codes

 

Possibly the bug has to do with this message?

 

I cannot easily reproduce the bug. I try to explain with an example:

 

Take the Uniprot/Swissprot gene accession Q5V0J7. The AA-Sequence read in with Biostrings is both shorter and thus not identical with the original one, found in the Fasta-File. As I said, I confirmed that manually and by comparison with the sequence read in with seqinr::read.fasta. 

However, if I copy paste the part of Swissprot only containing Q5V0J7 thenBiostrings reads in the sequence correctly.

 

I am using Biostrings version 2.38.4 in R 3.2.0 on Debian Wheezy.

 

My package has the following DESCRIPTION file:

"

Package: evaluation
Type: Package
Title: Computes performance scores of AHRD and its competitors based on the F1-
Score. DO READ THE README FILE FOR INSTRUCTIONS
Version: 1.0
Date: 2016-07-01
Author: Dr. Asis Hallab
Maintainer: Dr. Asis Hallab <asis.hallab@gmail.com>
Description: Computes performance scores of AHRD and its competitors based on
the F1-Score. DO READ THE README FILE FOR INSTRUCTIONS
License:
Depends:
R (>= 3.2.0),
Rcpp (>= 0.12.3),
RMySQL (>= 0.10.8),
data.table (>= 1.9.6),
Biostrings (>= 2.38.4)
Imports: Rcpp
LinkingTo: Rcpp
RoxygenNote: 5.0.1

"

 

I hope this helps you. 

 

Thank you very much and have a nice day!

Cheers!

biostrings bug readaastringset • 1.5k views
ADD COMMENTlink modified 3.1 years ago by Guangchuang Yu1.1k • written 3.1 years ago by asis.hallab0
Answer: Possible bug in Biostrings::readAAStringSet
0
gravatar for James W. MacDonald
3.1 years ago by
United States
James W. MacDonald51k wrote:

We don't support outdated versions of Bioconductor. Please update before posting what you think are errors. I have no problems with this using the current release versions:

 > z <- readAAStringSet("/data/tmp2/blast_databases/uniprot/uniprot_sprot.fasta")
 > z
   A AAStringSet instance of length 545536
          width seq                                          names
      [1]   256 MAFSAEDVLKEYDRRRRMEAL...LYDDSFRKIYTDLGWKFTPL sp|Q6GZX4|001R_FR...
      [2]   320 MSIIGATRLQNDKSDTYSAGP...VMFFVAGAVLVAILISTVRW sp|Q6GZX3|002L_FR...
      [3]   458 MASNTVSAQGGSNRPVRDFSN...EDFEYDSDSEDDDSDSEDDC sp|Q197F8|002R_II...
      [4]   156 MYQAINPCPQSWYGSPQLERE...SKTRKNNYGTCRLEPLTYGI sp|Q197F7|003L_II...
      [5]   438 MARPLLGKTSSVRRRLESLSA...KRKAKIQEMFDNMVSRMVTS sp|Q6GZX2|003R_FR...
      ...   ... ...
 [545532]   100 MGNSKSKSKLSANQYEQQTVN...EELPTSIVVPIEPSAPPPED sp|Q6UY62|Z_SABVB...
 [545533]    79 MSSSLEITSFYSFIWTPHIGP...SLRLRVIVLFLAIFFPLLNI sp|P08105|Z_SHEEP...
 [545534]    95 MGNCNRTQKPSSSSNNLEKPP...CWKPLPTSIRVPLEASAPDL sp|Q88470|Z_TACVF...
 [545535]    95 MGLRYSKEVRDRHGDKDPEGR...EVLPKRLTFETTPTAPPYTP sp|A9JR22|Z_TAMVU...
 [545536]    95 MGLRYSKDVKDRYGDREPEGR...EVLPKKLVFENSPSAPPYEA sp|B2ZDY1|Z_WWAVU...
 
 > z[grep("Q5V0J7", names(z)),]
   A AAStringSet instance of length 1
     width seq                                               names
 [1]   173 MHVDIVPVGEVSAQVKREASDGL...SPTVRQVDVKEVSLCGSCQRNVL sp|Q5V0J7|AMZA_HA...
> as.character(z[grep("Q5V0J7", names(z)),])
                                         sp|Q5V0J7|AMZA_HALMA Archaemetzincin OS=Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809) GN=amzA PE=3 SV=2
 "MHVDIVPVGEVSAQVKREASDGLRSVYDCEVSMHEPQSIPAGAYDGDRDQYRAEEFIDLARRVGSGGKNIAITPKDLFYRRRNYVFGLAYLGGSGSVISTYRLQTSSDGGFSNRSAGEIFSQRVRKEVVHEVGHTLGLEHCDNKRCVMNFSPTVRQVDVKEVSLCGSCQRNVL"

Which is identical to what I get when I grep the FASTA file directly.

Edit:

> sessionInfo()
 R version 3.3.0 (2016-05-03)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Debian GNU/Linux 8 (jessie)

 locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats4    parallel  stats     graphics  grDevices utils     datasets
 [8] methods   base

 other attached packages:
 [1] Biostrings_2.40.2   XVector_0.12.1      IRanges_2.6.1
 [4] S4Vectors_0.10.3    BiocGenerics_0.18.0

 loaded via a namespace (and not attached):
 [1] zlibbioc_1.18.0 compiler_3.3.0  tools_3.3.0

 

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by James W. MacDonald51k
Answer: Possible bug in Biostrings::readAAStringSet
0
gravatar for Guangchuang Yu
3.1 years ago by
Guangchuang Yu1.1k
China/Guangzhou/Southern Medical University
Guangchuang Yu1.1k wrote:

ignored 68720968 invalid one-letter sequence codes

 

I think you need to use readBStringSet for your fasta file contains invalid alphabet.

 

ADD COMMENTlink written 3.1 years ago by Guangchuang Yu1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 294 users visited in the last hour