Creating a Biostrings PDict object from amino-acid sequences
1
0
Entering edit mode
@rubi-6462

Hi,

I'm trying to match a vector of peptide sequences against an AAStringSet to get all perfect matches.

I thought the most straightforward way to do this would be to create a PDict object from the vector of peptide sequences using:

PDict(peptide.seq.vec)

Then I would use one of the matchPDict functions to match the PDict object against the AAStringSet reference and get all perfect matches.

However, just running PDict(peptide.seq.vec) already throws this error:

Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW),  : 
  key 73 (char 'I') not in lookup table

peptide.seq.vec[1] is:

"KNVSIGIVGKD"

Is it expecting DNA sequences only? The documentation of PDict says it accepts a character vector, which is not necessarily a DNA string.

Any ideas?

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8       LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
 [1] stats4    parallel  grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Biostrings_2.42.1    XVector_0.14.0       matrixStats_0.51.0   topGO_2.26.0         SparseM_1.72        
 [6] graph_1.50.0         fastcluster_1.1.22   cluster_2.0.5        GO.db_3.4.0          org.Hs.eg.db_3.4.0  
[11] AnnotationDbi_1.36.0 Biobase_2.34.0       gageData_2.12.0      gage_2.24.0          biomaRt_2.30.0      
[16] rtracklayer_1.34.1   GenomicRanges_1.26.2 GenomeInfoDb_1.10.0  IRanges_2.8.1        S4Vectors_0.12.1    
[21] BiocGenerics_0.20.0  doBy_4.5-15          yaml_2.1.14          doParallel_1.0.10    iterators_1.0.8     
[26] foreach_1.4.3        snpEnrichment_1.7.0  fgsea_1.0.2          Rcpp_0.12.8          data.tree_0.6.2     
[31] zoo_1.7-13           gplots_3.0.1         ggdendro_0.1-20      RColorBrewer_1.1-2   venneuler_1.1-0     
[36] rJava_0.9-8          scales_0.4.1         reshape2_1.4.2       plotrix_3.6-3        outliers_0.14       
[41] Hmisc_3.17-4         Formula_1.2-1        survival_2.40-1      lattice_0.20-34      data.table_1.9.6    
[46] edgeR_3.16.1         limma_3.30.2         ggpmisc_0.2.12       dplyr_0.5.0          plyr_1.8.4          
[51] magrittr_1.5         gridExtra_2.2.1      ggplot2_2.2.1        dendextend_1.3.0     ape_4.0             

loaded via a namespace (and not attached):
 [1] colorspace_1.2-7           class_7.3-14               modeltools_0.2-21          mclust_5.2                
 [5] rstudioapi_0.6             flexmix_2.3-13             mvtnorm_1.0-5              codetools_0.2-15          
 [9] splines_3.3.2              snpStats_1.24.0            robustbase_0.92-6          jsonlite_1.1              
[13] Rsamtools_1.26.1           kernlab_0.9-25             png_0.1-7                  DiagrammeR_0.9.0          
[17] httr_1.2.1                 assertthat_0.1             Matrix_1.2-7.1             lazyeval_0.2.0            
[21] acepack_1.4.1              visNetwork_1.0.3           htmltools_0.3.5            tools_3.3.2               
[25] igraph_1.0.1               gtable_0.2.0               fastmatch_1.0-4            rgexf_0.15.3              
[29] trimcluster_0.1-2          gdata_2.17.0               nlme_3.1-128               fpc_2.1-10                
[33] stringr_1.1.0              gtools_3.5.0               XML_3.98-1.4               DEoptimR_1.0-6            
[37] zlibbioc_1.20.0            MASS_7.3-45                SummarizedExperiment_1.2.3 rpart_4.1-10              
[41] latticeExtra_0.6-28        stringi_1.1.2              RSQLite_1.0.0              Rook_1.1-1                
[45] caTools_1.17.1             BiocParallel_1.8.1         chron_2.3-47               prabclus_2.2-6            
[49] bitops_1.0-6               GenomicAlignments_1.8.4    htmlwidgets_0.8            R6_2.2.0                  
[53] DBI_0.5-1                  whisker_0.3-2              foreign_0.8-67             KEGGREST_1.14.0           
[57] RCurl_1.95-4.8             nnet_7.3-12                tibble_1.2                 KernSmooth_2.23-15        
[61] viridis_0.3.4              locfit_1.5-9.1             influenceR_0.1.0           digest_0.6.11             
[65] diptest_0.75-7             brew_1.0-6                 munsell_0.4.3             

 

@herve-pages-1542

Hi Rubi,

PDict objects are for DNA sequences only. See the man page:

    The PDict class is a container for storing a preprocessed
    dictionary of DNA patterns...

There are other restrictions on what PDict() can preprocess. See the man page for details.
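
To illustrate the DNA-only restriction, here is a minimal sketch (it assumes Biostrings is installed; the variable names are only for illustration). The peptide from the question cannot be stored as DNA because 'I' is not a letter of the DNA alphabet, but the same string is fine as an AAStringSet:

```r
library(Biostrings)

peptide <- "KNVSIGIVGKD"  # first peptide from the question

# Coercion to DNA fails; capture the error message instead of stopping.
dna_error <- tryCatch({
  DNAStringSet(peptide)
  NULL
}, error = function(e) conditionMessage(e))

# The same string is a valid amino-acid sequence.
aa <- AAStringSet(peptide)

print(dna_error)
print(aa)
```

The exact wording of the error message may vary between Biostrings versions.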

If your set of patterns cannot be preprocessed, then don't preprocess it ;-) That is, use one of the matchPDict functions directly on your AAStringSet object. See section D, "USING A NON-PREPROCESSED DICTIONARY", in the examples section of ?matchPDict for some examples.
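
As a rough sketch of what the non-preprocessed route could look like for peptides: the sequences and names below (reference, protA, protB) are made up for illustration and are not from the question. The peptides are wrapped in an AAStringSet and passed where a PDict would normally go. Since matchPDict() and countPDict() take a single subject sequence, the sketch simply loops over the reference set:

```r
library(Biostrings)

# Hypothetical example data; replace with your own peptides and reference.
peptide.seq.vec <- c("KNVSIGIVGKD", "GIVG", "WWWW")
reference <- AAStringSet(c(protA = "MKNVSIGIVGKDLL",
                           protB = "AAGIVGAAGIVG"))

# Non-preprocessed dictionary: a plain AAStringSet, no PDict() call.
dict <- AAStringSet(peptide.seq.vec)

# Count exact matches of every peptide in every reference sequence.
# Result: one row per peptide, one column per reference sequence.
counts <- vapply(seq_along(reference),
                 function(i) countPDict(dict, reference[[i]]),
                 integer(length(dict)))
dimnames(counts) <- list(peptide.seq.vec, names(reference))
print(counts)

# Match positions for one reference sequence (an MIndex object);
# startIndex() gives the start positions for each peptide.
hits_protB <- matchPDict(dict, reference[["protB"]])
print(startIndex(hits_protB))
```

This will not be as fast as a preprocessed dictionary, so for a large number of peptides or reference sequences another approach may be preferable.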

Cheers,

H.


Also, please see the thread "matching of AAStringSet vs. another AAStringSet" for a similar question and an efficient solution for the exact-matching case, based on the CRAN package AhoCorasickTrie.

H.
