Biostrings multiple pattern match
1
0
Entering edit mode
@31a52686
Last seen 2.2 years ago
Ireland

Hi,

Aim: I am trying to get the positions of all stop codons and type of the stop codon given a DNAstring object or a character string.

stops <- c("TAG","TAA","TGA")
vmatchPattern(stop, stringObj)

I also tried to define stops as "TAA|TAG|TGA" and I know non is supported by vmatchPattern function. Then I tried:

stop1 <-matchPattern(c("TAG"), as(trx, "character")) %>% 
  as.data.frame()
stop2<- matchPattern(c("TAA"), as(trx, "character")) %>% 
  as.data.frame()
stop3<-matchPattern(c("TGA"), as(trx, "character")) %>% 
  as.data.frame()
stops <- rbind(stop1,stop2,stop3)

Outcome below is very much satisfying, I wish I could find a much clever solution.

start  end width seq
1    178  180     3 TAG
2    400  402     3 TAG
3    427  429     3 TAG
4    574  576     3 TAG
5    344  346     3 TAA
6    443  445     3 TAA
7    692  694     3 TAA
8     48   50     3 TGA
9     88   90     3 TGA
10   437  439     3 TGA
11   455  457     3 TGA
12   496  498     3 TGA
13   509  511     3 TGA
14   538  540     3 TGA
15   649  651     3 TGA
16   746  748     3 TGA

Can we find another solution to this problem of mine?

sessionInfo( )
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_IE.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_IE.UTF-8        LC_COLLATE=en_IE.UTF-8    
 [5] LC_MONETARY=en_IE.UTF-8    LC_MESSAGES=en_IE.UTF-8   
 [7] LC_PAPER=en_IE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Biostrings_2.64.1   GenomeInfoDb_1.32.4 XVector_0.36.0      IRanges_2.30.1     
 [5] S4Vectors_0.34.0    BiocGenerics_0.42.0 gridExtra_2.3       forcats_0.5.2      
 [9] stringr_1.4.1       dplyr_1.0.10        purrr_0.3.4         readr_2.1.3        
[13] tidyr_1.2.1         tibble_3.1.8        ggplot2_3.3.6       tidyverse_1.3.2    

loaded via a namespace (and not attached):
 [1] lubridate_1.8.0        assertthat_0.2.1       digest_0.6.29         
 [4] utf8_1.2.2             R6_2.5.1               cellranger_1.1.0      
 [7] backports_1.4.1        reprex_2.0.2           evaluate_0.16         
[10] httr_1.4.4             pillar_1.8.1           zlibbioc_1.42.0       
[13] rlang_1.0.6            googlesheets4_1.0.1    readxl_1.4.1          
[16] rstudioapi_0.14        rmarkdown_2.16         labeling_0.4.2        
[19] googledrive_2.0.0      bit_4.0.4              RCurl_1.98-1.8        
[22] munsell_0.5.0          broom_1.0.1            compiler_4.2.1        
[25] modelr_0.1.9           xfun_0.33              pkgconfig_2.0.3       
[28] htmltools_0.5.3        tidyselect_1.1.2       GenomeInfoDbData_1.2.8
[31] fansi_1.0.3            crayon_1.5.2           tzdb_0.3.0            
[34] dbplyr_2.2.1           withr_2.5.0            bitops_1.0-7          
[37] grid_4.2.1             jsonlite_1.8.2         gtable_0.3.1          
[40] lifecycle_1.0.2        DBI_1.1.3              magrittr_2.0.3        
[43] scales_1.2.1           vroom_1.6.0            cli_3.4.1             
[46] stringi_1.7.8          farver_2.1.1           fs_1.5.2              
[49] xml2_1.3.3             ellipsis_0.3.2         generics_0.1.3        
[52] vctrs_0.4.2            tools_4.2.1            bit64_4.0.5           
[55] glue_1.6.2             hms_1.1.2              parallel_4.2.1        
[58] fastmap_1.1.0          yaml_2.3.5             colorspace_2.0-3      
[61] gargle_1.2.1           rvest_1.0.3            knitr_1.40            
[64] haven_2.5.1
pattern Biostrings • 1.8k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
2
Entering edit mode
@herve-pages-1542
Last seen 7 days ago
Seattle, WA, United States

Hi,

You can use matchPDict() to match a set of patterns against a reference sequence (see ?matchPDict):

> library(BSgenome.Hsapiens.UCSC.hg19)
> chr1 <- BSgenome.Hsapiens.UCSC.hg19$chr1
> stops <- c("TAG","TAA","TGA")
> pdict <- PDict(stops)
> mindex <- matchPDict(pdict, chr1)
> mindex
MIndex object of length 3
[[1]]
IRanges object with 2893538 ranges and 0 metadata columns:
                start       end     width
            <integer> <integer> <integer>
        [1]     10808     10810         3
        [2]     10958     10960         3
        [3]     11133     11135         3
        [4]     11243     11245         3
        [5]     11269     11271         3
        ...       ...       ...       ...
  [2893534] 249240586 249240588         3
  [2893535] 249240592 249240594         3
  [2893536] 249240598 249240600         3
  [2893537] 249240611 249240613         3
  [2893538] 249240617 249240619         3

[[2]]
IRanges object with 4503830 ranges and 0 metadata columns:
                start       end     width
            <integer> <integer> <integer>
        [1]     10001     10003         3
        [2]     10007     10009         3
        [3]     10013     10015         3
        [4]     10019     10021         3
        [5]     10025     10027         3
        ...       ...       ...       ...
  [4503826] 249239520 249239522         3
  [4503827] 249239548 249239550         3
  [4503828] 249239702 249239704         3
  [4503829] 249239727 249239729         3
  [4503830] 249240604 249240606         3

[[3]]
IRanges object with 4387595 ranges and 0 metadata columns:
                start       end     width
            <integer> <integer> <integer>
        [1]     10503     10505         3
        [2]     10508     10510         3
        [3]     10600     10602         3
        [4]     11564     11566         3
        [5]     11569     11571         3
        ...       ...       ...       ...
  [4387591] 249239544 249239546         3
  [4387592] 249239611 249239613         3
  [4387593] 249239713 249239715         3
  [4387594] 249239752 249239754         3
  [4387595] 249239758 249239760         3

If you prefer all the matches in a single IRanges object:

> matches <- unlist(mindex)
> mcols(matches)$seq <- rep(stops, lengths(mindex))
> matches
IRanges object with 11784963 ranges and 1 metadata column:
                 start       end     width |         seq
             <integer> <integer> <integer> | <character>
         [1]     10808     10810         3 |         TAG
         [2]     10958     10960         3 |         TAG
         [3]     11133     11135         3 |         TAG
         [4]     11243     11245         3 |         TAG
         [5]     11269     11271         3 |         TAG
         ...       ...       ...       ... .         ...
  [11784959] 249239544 249239546         3 |         TGA
  [11784960] 249239611 249239613         3 |         TGA
  [11784961] 249239713 249239715         3 |         TGA
  [11784962] 249239752 249239754         3 |         TGA
  [11784963] 249239758 249239760         3 |         TGA

Finally use as.data.frame(matches) if you need the result in a data.frame.

Hope this helps,

H.

ADD COMMENT
0
Entering edit mode

Hi,

This helped, and that is beautiful, thanks a million! There is only one complication. I subset the individual transcript by name using trx <- genome[grepl("GABP01113048.1", names(genome))]. If I understand correctly this subsetting does not make the object a single transcript, as it is still a XStringSet object (multiple sequence). If I subset by index using obj[[2]] then I can successfullly apply your steps above.

Now, the question becomes, how can I subset individual transcript by name? subseq does not allow it as far as my search goes.

Hope you can help in this too.

ADD REPLY
1
Entering edit mode

Maybe genome[[grep("GABP01113048.1", names(genome))]] is what you are after? I don't know what your genome object is, where it's coming from, or what it looks like, so I'm just shooting in the dark here.

More generally speaking an XStringSet object of length 1 x can always be turned into an XString object with x[[1]].

Finally note that if you're only interested in counting the number of occurrences of each of your stop codons in each transcript then you can use vcountPDict(pdict, genome). This will return a matrix of counts with 1 row per stop codon and 1 column per transcript. Will be by far the most efficient solution.

Best,

H.

ADD REPLY
0
Entering edit mode

I found a solution but please let me know if there is a better way.

I had to change my names(StringObj) to transcipt ids(names was FASTA header with additional info(i.e. GABP01000993.1 TSA: Lingulodinium polyedrum comp101457_c0_seq1 mRNA sequence"). So, I convert them to trx ids like GABP01000993.1 using names(StringObj)<-str_split(names(StringObj)," ") %>% map_chr(`[`, 1). Then I subset by transcript id by StringObj$GABP01000993.1.

ADD REPLY

Login before adding your answer.

Traffic: 364 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6