Question: Creating a Biostrings PDict object from amino-acid sequences
rubi70 wrote (9 months ago):

Hi,

I'm trying to match a vector of peptide sequences against an AAStringSet to get all perfect matches.

I thought the most straightforward way to do this is to create a PDict object from the vector of peptide sequences using:

PDict(peptide.seq.vec)

and then use one of the matchPDict functions to match that PDict object against the AAStringSet reference and get all perfect matches.

 

However, running:

PDict(peptide.seq.vec)

already throws this error:

Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW),  : 
  key 73 (char 'I') not in lookup table

peptide.seq.vec[1] is 

"KNVSIGIVGKD"

 

Is it expecting DNA sequences only? The documentation of PDict says it accepts a character vector, not necessarily a DNA string.

Any idea?

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8       LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
 [1] stats4    parallel  grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Biostrings_2.42.1    XVector_0.14.0       matrixStats_0.51.0   topGO_2.26.0         SparseM_1.72        
 [6] graph_1.50.0         fastcluster_1.1.22   cluster_2.0.5        GO.db_3.4.0          org.Hs.eg.db_3.4.0  
[11] AnnotationDbi_1.36.0 Biobase_2.34.0       gageData_2.12.0      gage_2.24.0          biomaRt_2.30.0      
[16] rtracklayer_1.34.1   GenomicRanges_1.26.2 GenomeInfoDb_1.10.0  IRanges_2.8.1        S4Vectors_0.12.1    
[21] BiocGenerics_0.20.0  doBy_4.5-15          yaml_2.1.14          doParallel_1.0.10    iterators_1.0.8     
[26] foreach_1.4.3        snpEnrichment_1.7.0  fgsea_1.0.2          Rcpp_0.12.8          data.tree_0.6.2     
[31] zoo_1.7-13           gplots_3.0.1         ggdendro_0.1-20      RColorBrewer_1.1-2   venneuler_1.1-0     
[36] rJava_0.9-8          scales_0.4.1         reshape2_1.4.2       plotrix_3.6-3        outliers_0.14       
[41] Hmisc_3.17-4         Formula_1.2-1        survival_2.40-1      lattice_0.20-34      data.table_1.9.6    
[46] edgeR_3.16.1         limma_3.30.2         ggpmisc_0.2.12       dplyr_0.5.0          plyr_1.8.4          
[51] magrittr_1.5         gridExtra_2.2.1      ggplot2_2.2.1        dendextend_1.3.0     ape_4.0             

loaded via a namespace (and not attached):
 [1] colorspace_1.2-7           class_7.3-14               modeltools_0.2-21          mclust_5.2                
 [5] rstudioapi_0.6             flexmix_2.3-13             mvtnorm_1.0-5              codetools_0.2-15          
 [9] splines_3.3.2              snpStats_1.24.0            robustbase_0.92-6          jsonlite_1.1              
[13] Rsamtools_1.26.1           kernlab_0.9-25             png_0.1-7                  DiagrammeR_0.9.0          
[17] httr_1.2.1                 assertthat_0.1             Matrix_1.2-7.1             lazyeval_0.2.0            
[21] acepack_1.4.1              visNetwork_1.0.3           htmltools_0.3.5            tools_3.3.2               
[25] igraph_1.0.1               gtable_0.2.0               fastmatch_1.0-4            rgexf_0.15.3              
[29] trimcluster_0.1-2          gdata_2.17.0               nlme_3.1-128               fpc_2.1-10                
[33] stringr_1.1.0              gtools_3.5.0               XML_3.98-1.4               DEoptimR_1.0-6            
[37] zlibbioc_1.20.0            MASS_7.3-45                SummarizedExperiment_1.2.3 rpart_4.1-10              
[41] latticeExtra_0.6-28        stringi_1.1.2              RSQLite_1.0.0              Rook_1.1-1                
[45] caTools_1.17.1             BiocParallel_1.8.1         chron_2.3-47               prabclus_2.2-6            
[49] bitops_1.0-6               GenomicAlignments_1.8.4    htmlwidgets_0.8            R6_2.2.0                  
[53] DBI_0.5-1                  whisker_0.3-2              foreign_0.8-67             KEGGREST_1.14.0           
[57] RCurl_1.95-4.8             nnet_7.3-12                tibble_1.2                 KernSmooth_2.23-15        
[61] viridis_0.3.4              locfit_1.5-9.1             influenceR_0.1.0           digest_0.6.11             
[65] diptest_0.75-7             brew_1.0-6                 munsell_0.4.3             

 

Hervé Pagès (United States) wrote (9 months ago):

Hi Rubi,

PDict objects are for DNA sequences only. See the man page:

    The PDict class is a container for storing a preprocessed
    dictionary of DNA patterns...

There are other restrictions on what PDict() can preprocess. See the man page for the details.

If your set of patterns cannot be preprocessed, then don't preprocess it ;-), i.e. pass your patterns as an AAStringSet object (rather than a PDict) directly to one of the matchPDict functions. See "D. USING A NON-PREPROCESSED DICTIONARY" in the Examples section of ?matchPDict for some examples.
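
As a minimal sketch of the non-preprocessed route (not taken from the man page): the toy sequences and the names reference, peptides, hits and positions below are made up for illustration, and only the first peptide comes from the question. Because countPDict() and matchPDict() take a single subject sequence, the sketch loops over the reference one sequence at a time.

```r
library(Biostrings)

# Toy data (hypothetical, apart from the peptide quoted in the question)
peptide.seq.vec <- c("KNVSIGIVGKD", "EGH")
reference <- AAStringSet(c(protA = "MKNVSIGIVGKDAAEGH",
                           protB = "KMFPRNDEGHSTTWTEE"))

# Wrap the patterns in an AAStringSet instead of calling PDict()
peptides <- AAStringSet(peptide.seq.vec)

# Count exact matches of every peptide in every reference sequence.
# reference[[i]] extracts the i-th sequence as an AAString.
hits <- do.call(cbind, lapply(seq_along(reference), function(i) {
  countPDict(peptides, reference[[i]])
}))
dimnames(hits) <- list(as.character(peptides), names(reference))

# Match positions within the first reference sequence (an MIndex object,
# one element per peptide)
positions <- matchPDict(peptides, reference[[1]])
```

Here hits has one row per peptide and one column per reference sequence, and a non-zero entry marks at least one perfect match. This will be slower than a preprocessed PDict, but the dictionary no longer has to be DNA or of constant width.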

Cheers,

H.


Also, please check the thread "matching of AAStringSet vs. another AAStringSet" for a similar question and an efficient solution for the exact matching case based on the CRAN package AhoCorasickTrie.
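
A rough sketch of what that could look like, assuming the package's AhoCorasickSearch() function and its keywords, text and alphabet arguments (the toy sequences and the names reference.seqs and ac.hits are made up here; check the package documentation for the exact shape of the return value):

```r
library(AhoCorasickTrie)

# Toy data (hypothetical, apart from the peptide quoted in the question)
peptide.seq.vec <- c("KNVSIGIVGKD", "EGH")
reference.seqs <- c(protA = "MKNVSIGIVGKDAAEGH",
                    protB = "KMFPRNDEGHSTTWTEE")

# Search all peptides in all reference sequences in a single pass.
# alphabet = "aminoacid" is assumed to restrict the trie to amino-acid letters.
ac.hits <- AhoCorasickSearch(keywords = peptide.seq.vec,
                             text = reference.seqs,
                             alphabet = "aminoacid")
```

If the reference is an AAStringSet, as.character() turns it into the plain character vector used for text above.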

H.
