Matching patterns in an AAStringSet
Cédric • 0

I have an AAStringSet containing the names and sequences of proteins of interest.

From this set, I am trying to find how many matches there are to the patterns DDVF, DEVF, EDVF or EEVF. I've tried the following, but I only get errors saying that the input needs to be a vector and not an AAStringSet object.

pattern <- c("DDVF", "DEVF", "EDVF", "EEVF")
# str_detect(string = prot_interest, pattern = pattern)


sapply(getSeq(prot_interest, names(prot_interest)), str_detect, pattern)

Any idea how I can do this?

Here is the session info, if needed:

R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Europe/Brussels
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] rWSBIM1322_0.3.2                     
 [2] lubridate_1.9.4                      
 [3] forcats_1.0.0                        
 [4] stringr_1.5.1                        
 [5] dplyr_1.1.4                          
 [6] purrr_1.0.2                          
 [7] readr_2.1.5                          
 [8] tidyr_1.3.1                          
 [9] tibble_3.2.1                         
[10] tidyverse_2.0.0                      
[11] ggplot2_3.5.1                        
[12] BSgenome.Dmelanogaster.UCSC.dm2_1.4.0
[13] BSgenome_1.70.2                      
[14] rtracklayer_1.62.0                   
[15] BiocIO_1.12.0                        
[16] GenomicRanges_1.54.1                 
[17] Biostrings_2.70.3                    
[18] GenomeInfoDb_1.38.8                  
[19] XVector_0.42.0                       
[20] IRanges_2.36.0                       
[21] S4Vectors_0.40.2                     
[22] BiocGenerics_0.48.1                  

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.32.0 gtable_0.3.6               
 [3] rjson_0.2.23                xfun_0.49                  
 [5] bslib_0.8.0                 Biobase_2.62.0             
 [7] lattice_0.21-9              tzdb_0.4.0                 
 [9] vctrs_0.6.5                 tools_4.3.2                
[11] bitops_1.0-8                generics_0.1.3             
[13] parallel_4.3.2              fansi_1.0.6                
[15] highr_0.11                  pkgconfig_2.0.3            
[17] Matrix_1.6-1.1              lifecycle_1.0.4            
[19] GenomeInfoDbData_1.2.11     compiler_4.3.2             
[21] Rsamtools_2.18.0            munsell_0.5.1              
[23] codetools_0.2-19            htmltools_0.5.8.1          
[25] sass_0.4.9                  RCurl_1.98-1.16            
[27] yaml_2.3.10                 pillar_1.9.0               
[29] crayon_1.5.3                jquerylib_0.1.4            
[31] BiocParallel_1.36.0         DelayedArray_0.28.0        
[33] cachem_1.1.0                abind_1.4-8                
[35] tidyselect_1.2.1            digest_0.6.37              
[37] stringi_1.8.4               restfulr_0.0.15            
[39] fastmap_1.2.0               grid_4.3.2                 
[41] colorspace_2.1-1            cli_3.6.2                  
[43] SparseArray_1.2.4           magrittr_2.0.3             
[45] S4Arrays_1.2.1              XML_3.99-0.17              
[47] utf8_1.2.4                  withr_3.0.2                
[49] scales_1.3.0                timechange_0.3.0           
[51] rmarkdown_2.29              matrixStats_1.4.1          
[53] hms_1.1.3                   evaluate_1.0.1             
[55] knitr_1.49                  rlang_1.1.4                
[57] glue_1.7.0                  pkgload_1.4.0              
[59] rstudioapi_0.17.1           jsonlite_1.8.9             
[61] R6_2.5.1                    MatrixGenerics_1.14.0      
[63] GenomicAlignments_1.38.2    zlibbioc_1.48.2
Biostrings
Aidan ▴ 60

This question is tagged Biostrings, but str_detect isn't part of the Biostrings package. If you're trying to use the tidyverse str_detect (from stringr), it won't work as written, because getSeq returns an XString or XStringSet object, whereas str_detect expects a character vector. You could convert the sequences to character first, but that's more work than is necessary.
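For completeness, here is a minimal sketch of that character-conversion route. The three toy sequences and their names are made up for illustration; they only stand in for the real prot_interest object, which is assumed to be an AAStringSet.

```r
library(Biostrings)
library(stringr)

# Hypothetical stand-in for the poster's prot_interest object
prot_interest <- AAStringSet(c(
  protA = "MKDDVFLLA",
  protB = "MEEVFGDEVFK",
  protC = "MKLLAGG"
))
pattern <- c("DDVF", "DEVF", "EDVF", "EEVF")

# as.character() turns an XStringSet into a named character vector
seqs <- as.character(prot_interest)

# One row per sequence, one column per pattern;
# fixed() makes str_detect treat each pattern as a literal string
hits <- sapply(pattern, function(p) str_detect(seqs, fixed(p)))
rownames(hits) <- names(seqs)

hits
```

Each pattern is tested separately here because passing the whole pattern vector to str_detect in one call would pair patterns with sequences element by element (recycling) instead of testing every pattern against every sequence.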

You can do this with Biostrings alone using vmatchPattern, assuming you're looking for exact matches. If you just need counts, you can use vcountPattern. Both approaches should be faster than converting the XStringSet to character and then calling str_detect.

library(Biostrings)

# setting up some random sequences that each contain one of the patterns
set.seed(123L)
aas <- character(100L)
patterns <- c("DDVF", "DEVF", "EDVF", "EEVF")
for (i in seq_along(aas)) {
  len <- sample(100L, 1L) + 1000L
  pos <- sample(len, 1L)
  seqs <- sample(AA_STANDARD, len, replace = TRUE)
  # insert one randomly chosen pattern after position pos
  # (append() also handles pos == len, where seq(pos + 1, len) would count backwards)
  seqs <- append(seqs, sample(patterns, 1L), after = pos)
  aas[i] <- paste0(seqs, collapse = "")
}
aas <- AAStringSet(aas)

# getting the matches for each pattern (one MIndex object per pattern)
all_matches <- lapply(patterns, \(x) vmatchPattern(x, aas))
names(all_matches) <- patterns

# getting just counts (one integer vector per pattern, one entry per sequence)
all_counts <- lapply(patterns, \(x) vcountPattern(x, aas))
names(all_counts) <- patterns
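
If the goal is a single summary rather than a list per pattern, the counts can be collected into a sequence-by-pattern matrix. This is only a sketch on a small made-up AAStringSet (the toy sequences and the name count_mat are assumptions for illustration), but the same calls should apply to the real set.

```r
library(Biostrings)

# Hypothetical toy input standing in for the real AAStringSet
aas <- AAStringSet(c(
  protA = "MKDDVFLLA",
  protB = "MEEVFGDEVFK",
  protC = "MKLLAGG"
))
patterns <- c("DDVF", "DEVF", "EDVF", "EEVF")

# Rows are sequences, columns are patterns, entries are exact-match counts
count_mat <- sapply(patterns, function(p) vcountPattern(p, aas))
rownames(count_mat) <- names(aas)

# Total number of occurrences of each pattern across the set
colSums(count_mat)

# Number of sequences containing at least one of the patterns
sum(rowSums(count_mat) > 0)
```

Keeping the counts in a matrix makes it easy to answer both "how often does each pattern occur" and "which proteins carry any of the motifs" from the same object.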