vmatchpattern with indexed fasta files
1
0
Entering edit mode
@giuliobarth-9178
Last seen 8.5 years ago
Germany

Hey there,

I am new to Biostrings - any help appreciated.

Target:

I would like to scan all WIPO patent sequences from lens.org (https://www.lens.org/lens/bio/patseqdata#globe/WO/, aa-all.fa)  and check which patent is using a given sequence.

Begin of fasta file:

gnl|patseq|WO_2012_007458-2 Sequence 2 from pre-grant Patent WO_2012_007458

GCAGCTGCGCGCTCGCTCGCTCACTGAGGCCGCCCGGGCAAAGCCCGGGCGTCGGGCGACCTTTGGTCGCCCGGCCTCAGTGAGCGAGCGAGCGCGCAGAGAGGGAGTGGCCAACTCCATCACTAGGGGTTCCTTGTAGTTAATGATTAACCCGCCATGCTACTTATCTACGTAGCCATGCTCTAGACATGGCTCGACAGATCTCAATATTGGCCATTAGCCATATTATTCATTGGTTATATAGCATAAATCAATATTGGCTATTGGCCATTGCATACGTTGTATCTATATCATAATATGTACATTTATATTGGCTCATGTCCAATATGACCGCCATGTTGGCATTGATTATTGACTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCATAGCCCATATATGGAGTTCCGCGTTACATAACTTACGGTAAATGGCCCGCCTGGCTGACCGCCCAACGACCCCCGCCCATTGACGTCAATAATGACGTATGTTCCCATAGTAACGCCAATAGGGACTTTCCATTGACGTCAATGGGTGGAGTATTTACGGTAAACTGCCCACTTGGCAGTACATCAAGTGTATCATATGCCAAGTCCGCCCCCTATTGACGTCAATGACGGTAAATGGCCCGCCTGGCATTATGCCCAGTACATGACCTTACGGGACTTTCCTACTTGGCAGTACATCTACGTATTAGTCATCGCTATTACCATGGTGATGCGGTTTTGGCAGTACACCAATGGGCGTGGATAGCGGTTTGACTCACGGGGATTTCCAAGTCTCCACCCCATTGACGTCAATGGGAGTTTGTTTTGGCACCAAAATCAACGGGACTTTCCAAAATGTCGTAACAACTGCGATCGCCCGCCCCGTTGACGCAAATGGGCGGTAGGCGTGTACGGTGGGAGGTCTATATAAGCAGAGCTCGTTTAGTGAACCGTCAGATCACTAGAAGCTTTATTGCGGTAGTTTATCACAGTTAAATTGCTAACGCAGTCAGTGCTTCTGACACAACAGTCTCGAACTTAAGCTGCAGTGACTCTCTTAAGGTAGCCTTGCAGAAGTTGGTCGTGAGGCACTGGGCAGGTAAGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCTTGTCGAGACAGAGAAGACTCTTGCGTTTCTGATAGGCACCTATTGGTCTTACTGACATCCACTTTGCCTTTCTCTCCACAGGTGTCCACTCCCAGTTCAATTACAGCTCTTAAGGCTAGAGTACTTAATACGACTCACTATAGGCTAGCCTCGAGAATTCCCTCAGCCAGACAGTCCTTACCTGCAACAGGTGGCCTCAGGAGTCAGGAACATCTCTACTTCCCCAACGACCCCTGGGTTGTCCTCTCAGAGATGGCTATGGATACTACAAGGTGTGGAGCCCAGTTGTTGACTCTGGTCGAGCAGATCCTGGCAGAGTTCCAGCTGCAGGAGGAAGACCTGAAGAAGGTGATGAGCCGGATGCAGAAGGAGATGGACCGTGGCCTGAGGCTGGAGACCCACGAGGAGGCCAGTGTAAAGATGTTACCCACCTACGTGCGTTCCACCCCAGAAGGCTCAGAAGTCGGAGACTTTCTCTCCTTAGACCTGGGAGGAACCAACTTCAGAGTGATGCTGGTCAAAGTGGGAGAGGGGGAGGCAGGGCAGTGGAGCGTGAAGACAAAACACCAGATGTACTCCATCCCCGAGGACGCCATGACGGGCACTGCCGAGATGCTCTTTGACTACATCTCTGAATGCATCTCTGACTTCCTTGACAAGCATCAGATGAAGCACAAGAAACTGCCCCTGGGCTTCACCTTCTCCTTCCCTGTGAGGCACGAAGACCTAGACAAGGGCATCCTCCTCAATTGGACCAAGGGCTTCAAGGCCTCTGGAGCAGAAGGGAACAACATCGTAGGACTTCTCCGAGATGCTATCAAGAGGAGAGGGGACTTTGAGATGGATGTGGTGGCAATGGTGAACGACACAGTGGCCACAATGATCTCCTGCTACTATGAAGACCGCCAATGTGAGGTCGGCATGATTGTGGGCACTGGCTGCAATGCCTGCTACATGGAGGAAATGCAGAATGTGGAGCTGGTGGAAGGGGATGAGGGACGCATGTGCGTCAACACGGAGTGGGGCGCCTTCGGGGACTCGGGCGAGCTGGATGAGTTCCTACTGGAGTATGACCGGATGGTGGATGAAAGCTCAGCGAACCCCGGTCAGCAGCTGTACGAGAAGATCATCGGTGGGAAGTATATGGGCGAGCTGGTACGACTTGTGCTGCTTAAGCTGGTGGACGAGAACCTTCTGTTCCACGGAGAGGCCTCGGAGCAGCTGCGCACGCGTGGTGCTTTTGAGACCCGTTTCGTGTCACAAGTGGAGAGCGACTCCGGGGACCGAAAGCAGATCCACAACATCCTAAGCACTCTGGGGCTTCGACCCTCTGTCACCGACTGCGACATTGTGCGCCGTGCCTGTGAAAGCGTGTCCACTCGCGCCGCCCATATGTGCTCCGCAGGACTAGCTGGGGTCATAAATCGCATGCGCGAAAGCCGCAGTGAGGACGTGATGCGCATCACTGTGGGCGTGGATGGCTCCGTGTACAAGCTGCACCCGAGCTTCAAGGAGCGGTTTCACGCCAGTGTGCGCAGGCTGACACCCAACTGCGAAATCACCTTCATCGAATCAGAGGAGGGCAGCGGCAGGGGAGCCGCACTGGTCTCTGCGGTGGCCTGCAAGAAGGCTTGCATGCTGGCCCAGTGAAATCCAGGTCATATGGACCGGGACCTGGGTTCCACGGGGACTCCACACACCACAAATGCTCCCAGCCCACCGGGGCAGGAGACCTATTCTGCTGCTACCCCTGGAAAATGGGGAGAGGCCCCTGCAAGCCGAGTCGGCCAGTGGGACAGCCCTAGGGCTCTCAGCCTGGGGCAGGGGGCTGGGAGGAAGAAGAGGATCAGAGGCGCCAAGGCCTTTCTTGCTAGAATCAACTACAGAAAATGGCGGAAAATACTCAGGACTTGCACTTTCACGATTCTTGCTTCCCAAGCGTGGGTCTGGCCTCCCAAGGGAATGCTTCCTGGACCTTGCAATGGCCTGGCTTCCCTGGGGGGGACACACCTTCATGGGGAGGTAACTTCAGCAGTTCGGCCAGACCAGACCCCAGGAGAGTAAGGGCTGCTAGTCACCCAGACCTGGCTGTTTTCTTGTCTGTGGCTGAAGAGGCCGGGGAGCCATGAGAGACTGACTATCCGGCTACATGGAGAGGACTTTCCAGGCATGAACATGCCAGAGACTGTTGCCTTCATATACCTCCACCCGAGTGGCTTACAGTTCTGGGATGAACCCTCCCAGGAGATGCCAGAGGTTAGAGCCCCAGAGTCCTTGCTCTAAGGGGACCAGAAAGGGGAGGCCTCACTCTGCACTATTCAAGCAGGAATCATCTCCAACACTCAGGTCCCTGACCCAGGAGGAAGAAGCCACCCTCAGTGTCCCTCCAAGAGACCACCCAGGTCCTTCTCTCCCTCGTTCCCAAATGCCAGCCTCTCTACCTGGGACTGTGGGGGAGTTTTTAATTAAATATTTAAAACTACTTCAAAAAAAAAAAAAGGAATTCACGCGTGGTACCTCTAGAGTCGACCCGGGCGGCCGCTTCCCTTTAGTGAGGGTTAATGCTTCGAGCAGACATGATAAGATACATTGATGAGTTTGGACAAACCACAACTAGAATGCAGTGAAAAAAATGCTTTATTTGTGAAATTTGTGATGCTATTGCTTTATTTGTAACCATTATAAGCTGCAATAAACAAGTTAACAACAACAATTGCATTCATTTTATGTTTCAGGTTCAGGGGGAGATGTGGGAGGTTTTTTAAAGCAAGTAAAACCTCTACAAATGTGGTAAAATCCGATAAGGGACTAGAGCATGGCTACGTAGATAAGTAGCATGGCGGGTTAATCATTAACTACAAGGAACCCCTAGTGATGGAGTTGGCCACTCCCTCTCTGCGCGCTCGCTCGCTCACTGAGGCCGGGCGACCAAAGGTCGCCCGACGCCCGGGCTTTGCCCGGGCGGCCTCAGTGAGCGAGCGAGCGCGC
 

As my computer cannot handle the whole fasta.file (8GB RAM) I tried to use an indexed fasta file.

What I have done so far:

library(Rsamtools)

indexFa("Lens PatSeq/aa-all.fa")   # create an index of file 'aa-all.fa'
fa = FaFile("Lens PatSeq/aa-all.fa")  # reference the fasta file and it's index
gr = as(seqinfo(fa), "GRanges")

searchpattern <- DNAString("GGGCCCAAGTTCACTTAAAAAGGAGATCAACAATGAAAGCAATTTTCGTACTGAAACATCTTAATCATGCACAGGAGACTTTCTAATG")

sourcepattern <- getSeq(fa, gr[2:1])
mindex <- vmatchPattern(searchpattern, sourcepattern)
nmatch_per_seq <- elementLengths(mindex)  # Get the number of matches per # subject element.
sum(nmatch_per_seq)  # Total number of matches.
table(nmatch_per_seq)

My Question:

Is this the correct code to use vmatchPattern in order to compare strings with an indexed file?

Many thanks in advance!

 

Warm regards,

Giulio

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsamtools_1.22.0     Biostrings_2.38.0    XVector_0.10.0       GenomicRanges_1.22.1 GenomeInfoDb_1.6.1   IRanges_2.4.1        S4Vectors_0.8.1      BiocGenerics_0.16.1 

loaded via a namespace (and not attached):
[1] zlibbioc_1.16.0      futile.logger_1.4.1  tools_3.2.2          lambda.r_1.1.7       futile.options_1.0.0 BiocParallel_1.4.0   bitops_1.0-6        
 
 
fasta fai vmatchpattern biostrings • 1.5k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 18 hours ago
Seattle, WA, United States

Hi Giulio,

Sorry for the late answer. If you only want to compare strings (i.e. see whether they're the same or not), vmatchPattern() is overkill and not the most efficient. Just use == instead:

searchpattern == sourcepattern

That will return you a logical vector of the length of the longest of the 2 operands (sourcepattern in your case).

Otherwise (i.e. if you want to find the strings in sourcepattern that contain searchpattern) yes vmatchPattern() works and the way you're using it looks fine to me. However if you're not interested in the exact location of the match within each string in sourcepattern, then vcountPattern() is a slightly better choice (more efficient and more user-friendly returned value).

Cheers,

H.

ADD COMMENT

Login before adding your answer.

Traffic: 1058 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6