Question

Biostrings multiple pattern match

0

Entering edit mode

fortytwo.teacups • 0

@31a52686

Last seen 18 months ago

Ireland

Hi,

Aim: I am trying to get the positions of all stop codons and type of the stop codon given a DNAstring object or a character string.

stops <- c("TAG","TAA","TGA")
vmatchPattern(stop, stringObj)

I also tried to define stops as "TAA|TAG|TGA" and I know non is supported by vmatchPattern function. Then I tried:

stop1 <-matchPattern(c("TAG"), as(trx, "character")) %>% 
  as.data.frame()
stop2<- matchPattern(c("TAA"), as(trx, "character")) %>% 
  as.data.frame()
stop3<-matchPattern(c("TGA"), as(trx, "character")) %>% 
  as.data.frame()
stops <- rbind(stop1,stop2,stop3)

Outcome below is very much satisfying, I wish I could find a much clever solution.

start  end width seq
1    178  180     3 TAG
2    400  402     3 TAG
3    427  429     3 TAG
4    574  576     3 TAG
5    344  346     3 TAA
6    443  445     3 TAA
7    692  694     3 TAA
8     48   50     3 TGA
9     88   90     3 TGA
10   437  439     3 TGA
11   455  457     3 TGA
12   496  498     3 TGA
13   509  511     3 TGA
14   538  540     3 TGA
15   649  651     3 TGA
16   746  748     3 TGA

Can we find another solution to this problem of mine?

sessionInfo( )
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_IE.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_IE.UTF-8        LC_COLLATE=en_IE.UTF-8    
 [5] LC_MONETARY=en_IE.UTF-8    LC_MESSAGES=en_IE.UTF-8   
 [7] LC_PAPER=en_IE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Biostrings_2.64.1   GenomeInfoDb_1.32.4 XVector_0.36.0      IRanges_2.30.1     
 [5] S4Vectors_0.34.0    BiocGenerics_0.42.0 gridExtra_2.3       forcats_0.5.2      
 [9] stringr_1.4.1       dplyr_1.0.10        purrr_0.3.4         readr_2.1.3        
[13] tidyr_1.2.1         tibble_3.1.8        ggplot2_3.3.6       tidyverse_1.3.2    

loaded via a namespace (and not attached):
 [1] lubridate_1.8.0        assertthat_0.2.1       digest_0.6.29         
 [4] utf8_1.2.2             R6_2.5.1               cellranger_1.1.0      
 [7] backports_1.4.1        reprex_2.0.2           evaluate_0.16         
[10] httr_1.4.4             pillar_1.8.1           zlibbioc_1.42.0       
[13] rlang_1.0.6            googlesheets4_1.0.1    readxl_1.4.1          
[16] rstudioapi_0.14        rmarkdown_2.16         labeling_0.4.2        
[19] googledrive_2.0.0      bit_4.0.4              RCurl_1.98-1.8        
[22] munsell_0.5.0          broom_1.0.1            compiler_4.2.1        
[25] modelr_0.1.9           xfun_0.33              pkgconfig_2.0.3       
[28] htmltools_0.5.3        tidyselect_1.1.2       GenomeInfoDbData_1.2.8
[31] fansi_1.0.3            crayon_1.5.2           tzdb_0.3.0            
[34] dbplyr_2.2.1           withr_2.5.0            bitops_1.0-7          
[37] grid_4.2.1             jsonlite_1.8.2         gtable_0.3.1          
[40] lifecycle_1.0.2        DBI_1.1.3              magrittr_2.0.3        
[43] scales_1.2.1           vroom_1.6.0            cli_3.4.1             
[46] stringi_1.7.8          farver_2.1.1           fs_1.5.2              
[49] xml2_1.3.3             ellipsis_0.3.2         generics_0.1.3        
[52] vctrs_0.4.2            tools_4.2.1            bit64_4.0.5           
[55] glue_1.6.2             hms_1.1.2              parallel_4.2.1        
[58] fastmap_1.1.0          yaml_2.3.5             colorspace_2.0-3      
[61] gargle_1.2.1           rvest_1.0.3            knitr_1.40            
[64] haven_2.5.1

pattern Biostrings • 1.4k views

ADD COMMENT • link updated 18 months ago by Hervé Pagès 16k • written 18 months ago by fortytwo.teacups • 0

0

Entering edit mode

Cross-posted: https://www.biostars.org/p/9540421/

ADD REPLY • link 18 months ago Kevin Blighe ★ 3.9k

score 2 · Accepted Answer · 2022-10-03

Hi,

You can use matchPDict() to match a set of patterns against a reference sequence (see ?matchPDict):

> library(BSgenome.Hsapiens.UCSC.hg19)
> chr1 <- BSgenome.Hsapiens.UCSC.hg19$chr1
> stops <- c("TAG","TAA","TGA")
> pdict <- PDict(stops)
> mindex <- matchPDict(pdict, chr1)
> mindex
MIndex object of length 3
[[1]]
IRanges object with 2893538 ranges and 0 metadata columns:
                start       end     width
            <integer> <integer> <integer>
        [1]     10808     10810         3
        [2]     10958     10960         3
        [3]     11133     11135         3
        [4]     11243     11245         3
        [5]     11269     11271         3
        ...       ...       ...       ...
  [2893534] 249240586 249240588         3
  [2893535] 249240592 249240594         3
  [2893536] 249240598 249240600         3
  [2893537] 249240611 249240613         3
  [2893538] 249240617 249240619         3

[[2]]
IRanges object with 4503830 ranges and 0 metadata columns:
                start       end     width
            <integer> <integer> <integer>
        [1]     10001     10003         3
        [2]     10007     10009         3
        [3]     10013     10015         3
        [4]     10019     10021         3
        [5]     10025     10027         3
        ...       ...       ...       ...
  [4503826] 249239520 249239522         3
  [4503827] 249239548 249239550         3
  [4503828] 249239702 249239704         3
  [4503829] 249239727 249239729         3
  [4503830] 249240604 249240606         3

[[3]]
IRanges object with 4387595 ranges and 0 metadata columns:
                start       end     width
            <integer> <integer> <integer>
        [1]     10503     10505         3
        [2]     10508     10510         3
        [3]     10600     10602         3
        [4]     11564     11566         3
        [5]     11569     11571         3
        ...       ...       ...       ...
  [4387591] 249239544 249239546         3
  [4387592] 249239611 249239613         3
  [4387593] 249239713 249239715         3
  [4387594] 249239752 249239754         3
  [4387595] 249239758 249239760         3

If you prefer all the matches in a single IRanges object:

> matches <- unlist(mindex)
> mcols(matches)$seq <- rep(stops, lengths(mindex))
> matches
IRanges object with 11784963 ranges and 1 metadata column:
                 start       end     width |         seq
             <integer> <integer> <integer> | <character>
         [1]     10808     10810         3 |         TAG
         [2]     10958     10960         3 |         TAG
         [3]     11133     11135         3 |         TAG
         [4]     11243     11245         3 |         TAG
         [5]     11269     11271         3 |         TAG
         ...       ...       ...       ... .         ...
  [11784959] 249239544 249239546         3 |         TGA
  [11784960] 249239611 249239613         3 |         TGA
  [11784961] 249239713 249239715         3 |         TGA
  [11784962] 249239752 249239754         3 |         TGA
  [11784963] 249239758 249239760         3 |         TGA

Finally use as.data.frame(matches) if you need the result in a data.frame.

Hope this helps,

H.