Question: vmatchpattern with indexed fasta files
0
gravatar for giuliobarth
3.7 years ago by
Germany
giuliobarth0 wrote:

Hey there,

I am new to Biostrings - any help appreciated.

Target:

I would like to scan all WIPO patent sequences from lens.org (https://www.lens.org/lens/bio/patseqdata#globe/WO/, aa-all.fa)  and check which patent is using a given sequence.

Begin of fasta file:

gnl|patseq|WO_2012_007458-2 Sequence 2 from pre-grant Patent WO_2012_007458

GCAGCTGCGCGCTCGCTCGCTCACTGAGGCCGCCCGGGCAAAGCCCGGGCGTCGGGCGACCTTTGGTCGCCCGGCCTCAGTGAGCGAGCGAGCGCGCAGAGAGGGAGTGGCCAACTCCATCACTAGGGGTTCCTTGTAGTTAATGATTAACCCGCCATGCTACTTATCTACGTAGCCATGCTCTAGACATGGCTCGACAGATCTCAATATTGGCCATTAGCCATATTATTCATTGGTTATATAGCATAAATCAATATTGGCTATTGGCCATTGCATACGTTGTATCTATATCATAATATGTACATTTATATTGGCTCATGTCCAATATGACCGCCATGTTGGCATTGATTATTGACTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCATAGCCCATATATGGAGTTCCGCGTTACATAACTTACGGTAAATGGCCCGCCTGGCTGACCGCCCAACGACCCCCGCCCATTGACGTCAATAATGACGTATGTTCCCATAGTAACGCCAATAGGGACTTTCCATTGACGTCAATGGGTGGAGTATTTACGGTAAACTGCCCACTTGGCAGTACATCAAGTGTATCATATGCCAAGTCCGCCCCCTATTGACGTCAATGACGGTAAATGGCCCGCCTGGCATTATGCCCAGTACATGACCTTACGGGACTTTCCTACTTGGCAGTACATCTACGTATTAGTCATCGCTATTACCATGGTGATGCGGTTTTGGCAGTACACCAATGGGCGTGGATAGCGGTTTGACTCACGGGGATTTCCAAGTCTCCACCCCATTGACGTCAATGGGAGTTTGTTTTGGCACCAAAATCAACGGGACTTTCCAAAATGTCGTAACAACTGCGATCGCCCGCCCCGTTGACGCAAATGGGCGGTAGGCGTGTACGGTGGGAGGTCTATATAAGCAGAGCTCGTTTAGTGAACCGTCAGATCACTAGAAGCTTTATTGCGGTAGTTTATCACAGTTAAATTGCTAACGCAGTCAGTGCTTCTGACACAACAGTCTCGAACTTAAGCTGCAGTGACTCTCTTAAGGTAGCCTTGCAGAAGTTGGTCGTGAGGCACTGGGCAGGTAAGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCTTGTCGAGACAGAGAAGACTCTTGCGTTTCTGATAGGCACCTATTGGTCTTACTGACATCCACTTTGCCTTTCTCTCCACAGGTGTCCACTCCCAGTTCAATTACAGCTCTTAAGGCTAGAGTACTTAATACGACTCACTATAGGCTAGCCTCGAGAATTCCCTCAGCCAGACAGTCCTTACCTGCAACAGGTGGCCTCAGGAGTCAGGAACATCTCTACTTCCCCAACGACCCCTGGGTTGTCCTCTCAGAGATGGCTATGGATACTACAAGGTGTGGAGCCCAGTTGTTGACTCTGGTCGAGCAGATCCTGGCAGAGTTCCAGCTGCAGGAGGAAGACCTGAAGAAGGTGATGAGCCGGATGCAGAAGGAGATGGACCGTGGCCTGAGGCTGGAGACCCACGAGGAGGCCAGTGTAAAGATGTTACCCACCTACGTGCGTTCCACCCCAGAAGGCTCAGAAGTCGGAGACTTTCTCTCCTTAGACCTGGGAGGAACCAACTTCAGAGTGATGCTGGTCAAAGTGGGAGAGGGGGAGGCAGGGCAGTGGAGCGTGAAGACAAAACACCAGATGTACTCCATCCCCGAGGACGCCATGACGGGCACTGCCGAGATGCTCTTTGACTACATCTCTGAATGCATCTCTGACTTCCTTGACAAGCATCAGATGAAGCACAAGAAACTGCCCCTGGGCTTCACCTTCTCCTTCCCTGTGAGGCACGAAGACCTAGACAAGGGCATCCTCCTCAATTGGACCAAGGGCTTCAAGGCCTCTGGAGCAGAAGGGAACAACATCGTAGGACTTCTCCGAGATGCTATCAAGAGGAGAGGGGACTTTGAGATGGATGTGGTGGCAATGGTGAACGACACAGTGGCCACAATGATCTCCTGCTACTATGAAGACCGCCAATGTGAGGTCGGCATGATTGTGGGCACTGGCTGCAATGCCTGCTACATGGAGGAAATGCAGAATGTGGAGCTGGTGGAAGGGGATGAGGGACGCATGTGCGTCAACACGGAGTGGGGCGCCTTCGGGGACTCGGGCGAGCTGGATGAGTTCCTACTGGAGTATGACCGGATGGTGGATGAAAGCTCAGCGAACCCCGGTCAGCAGCTGTACGAGAAGATCATCGGTGGGAAGTATATGGGCGAGCTGGTACGACTTGTGCTGCTTAAGCTGGTGGACGAGAACCTTCTGTTCCACGGAGAGGCCTCGGAGCAGCTGCGCACGCGTGGTGCTTTTGAGACCCGTTTCGTGTCACAAGTGGAGAGCGACTCCGGGGACCGAAAGCAGATCCACAACATCCTAAGCACTCTGGGGCTTCGACCCTCTGTCACCGACTGCGACATTGTGCGCCGTGCCTGTGAAAGCGTGTCCACTCGCGCCGCCCATATGTGCTCCGCAGGACTAGCTGGGGTCATAAATCGCATGCGCGAAAGCCGCAGTGAGGACGTGATGCGCATCACTGTGGGCGTGGATGGCTCCGTGTACAAGCTGCACCCGAGCTTCAAGGAGCGGTTTCACGCCAGTGTGCGCAGGCTGACACCCAACTGCGAAATCACCTTCATCGAATCAGAGGAGGGCAGCGGCAGGGGAGCCGCACTGGTCTCTGCGGTGGCCTGCAAGAAGGCTTGCATGCTGGCCCAGTGAAATCCAGGTCATATGGACCGGGACCTGGGTTCCACGGGGACTCCACACACCACAAATGCTCCCAGCCCACCGGGGCAGGAGACCTATTCTGCTGCTACCCCTGGAAAATGGGGAGAGGCCCCTGCAAGCCGAGTCGGCCAGTGGGACAGCCCTAGGGCTCTCAGCCTGGGGCAGGGGGCTGGGAGGAAGAAGAGGATCAGAGGCGCCAAGGCCTTTCTTGCTAGAATCAACTACAGAAAATGGCGGAAAATACTCAGGACTTGCACTTTCACGATTCTTGCTTCCCAAGCGTGGGTCTGGCCTCCCAAGGGAATGCTTCCTGGACCTTGCAATGGCCTGGCTTCCCTGGGGGGGACACACCTTCATGGGGAGGTAACTTCAGCAGTTCGGCCAGACCAGACCCCAGGAGAGTAAGGGCTGCTAGTCACCCAGACCTGGCTGTTTTCTTGTCTGTGGCTGAAGAGGCCGGGGAGCCATGAGAGACTGACTATCCGGCTACATGGAGAGGACTTTCCAGGCATGAACATGCCAGAGACTGTTGCCTTCATATACCTCCACCCGAGTGGCTTACAGTTCTGGGATGAACCCTCCCAGGAGATGCCAGAGGTTAGAGCCCCAGAGTCCTTGCTCTAAGGGGACCAGAAAGGGGAGGCCTCACTCTGCACTATTCAAGCAGGAATCATCTCCAACACTCAGGTCCCTGACCCAGGAGGAAGAAGCCACCCTCAGTGTCCCTCCAAGAGACCACCCAGGTCCTTCTCTCCCTCGTTCCCAAATGCCAGCCTCTCTACCTGGGACTGTGGGGGAGTTTTTAATTAAATATTTAAAACTACTTCAAAAAAAAAAAAAGGAATTCACGCGTGGTACCTCTAGAGTCGACCCGGGCGGCCGCTTCCCTTTAGTGAGGGTTAATGCTTCGAGCAGACATGATAAGATACATTGATGAGTTTGGACAAACCACAACTAGAATGCAGTGAAAAAAATGCTTTATTTGTGAAATTTGTGATGCTATTGCTTTATTTGTAACCATTATAAGCTGCAATAAACAAGTTAACAACAACAATTGCATTCATTTTATGTTTCAGGTTCAGGGGGAGATGTGGGAGGTTTTTTAAAGCAAGTAAAACCTCTACAAATGTGGTAAAATCCGATAAGGGACTAGAGCATGGCTACGTAGATAAGTAGCATGGCGGGTTAATCATTAACTACAAGGAACCCCTAGTGATGGAGTTGGCCACTCCCTCTCTGCGCGCTCGCTCGCTCACTGAGGCCGGGCGACCAAAGGTCGCCCGACGCCCGGGCTTTGCCCGGGCGGCCTCAGTGAGCGAGCGAGCGCGC
 

As my computer cannot handle the whole fasta.file (8GB RAM) I tried to use an indexed fasta file.

What I have done so far:

library(Rsamtools)

indexFa("Lens PatSeq/aa-all.fa")   # create an index of file 'aa-all.fa'
fa = FaFile("Lens PatSeq/aa-all.fa")  # reference the fasta file and it's index
gr = as(seqinfo(fa), "GRanges")

searchpattern <- DNAString("GGGCCCAAGTTCACTTAAAAAGGAGATCAACAATGAAAGCAATTTTCGTACTGAAACATCTTAATCATGCACAGGAGACTTTCTAATG")

sourcepattern <- getSeq(fa, gr[2:1])
mindex <- vmatchPattern(searchpattern, sourcepattern)
nmatch_per_seq <- elementLengths(mindex)  # Get the number of matches per # subject element.
sum(nmatch_per_seq)  # Total number of matches.
table(nmatch_per_seq)

My Question:

Is this the correct code to use vmatchPattern in order to compare strings with an indexed file?

Many thanks in advance!

 

Warm regards,

Giulio

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsamtools_1.22.0     Biostrings_2.38.0    XVector_0.10.0       GenomicRanges_1.22.1 GenomeInfoDb_1.6.1   IRanges_2.4.1        S4Vectors_0.8.1      BiocGenerics_0.16.1 

loaded via a namespace (and not attached):
[1] zlibbioc_1.16.0      futile.logger_1.4.1  tools_3.2.2          lambda.r_1.1.7       futile.options_1.0.0 BiocParallel_1.4.0   bitops_1.0-6        
 
 
ADD COMMENTlink modified 3.5 years ago by Hervé Pagès ♦♦ 14k • written 3.7 years ago by giuliobarth0
Answer: vmatchpattern with indexed fasta files
0
gravatar for Hervé Pagès
3.5 years ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi Giulio,

Sorry for the late answer. If you only want to compare strings (i.e. see whether they're the same or not), vmatchPattern() is overkill and not the most efficient. Just use == instead:

searchpattern == sourcepattern

That will return you a logical vector of the length of the longest of the 2 operands (sourcepattern in your case).

Otherwise (i.e. if you want to find the strings in sourcepattern that contain searchpattern) yes vmatchPattern() works and the way you're using it looks fine to me. However if you're not interested in the exact location of the match within each string in sourcepattern, then vcountPattern() is a slightly better choice (more efficient and more user-friendly returned value).

Cheers,

H.

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Hervé Pagès ♦♦ 14k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 219 users visited in the last hour