Question

Splitting a fasta file based on specific Amino acid for plotting

Assa Yeroslaviz ★ 1.5k

@assa-yeroslaviz-1597

Germany

Hi all, after a few days of searching and trial and error, I would like to ask for your help.

I have a protein sequence, let's say this one (one line):

>protein1
MKLSVNEAQLGFYLGSIDPRSSEDQPESLKTGQMMDESDEDFKELCASFFQRVKKHGIKE
VSGERKTQKAASNGTQIRSKLKRTKQTATKTKTLQGPAEKKPPSGSQAPRTKQRVTKWQ

I would like to split the protein after each occurrence of a specific amino acid, say "K" (the cleavage point of trypsin), so that I get a list, or an IRanges object with the start and end positions, containing these elements:

MK
LSVNEAQLGFYLGSIDPRSSEDQPESLK
TGQMMDESDEDFK
ELCASFFQRVK
...
PPSGSQAPRTK
QRVTK
WQ...

Using IRanges and matchPattern(), I was only able to create an object for the pattern I'm looking for, but not for the sub-sequences between the matches.

Then I would like to plot these subsequences onto the complete sequence of the protein.

The end goal of my analysis is to plot the protein sequence (x-axis) against all cleavage patterns (y-axis), which should then look like the attached image.

The bottom line represents the protein; each of the rows above stands for one specific peptide. The idea is to calculate which protease, or combination of proteases, gives the highest coverage of the protein in question.

I would really like to know if there are any packages out there dealing with this kind of problem, as I have not found any.

I hope I have made myself clear enough. Thanks a lot in advance for any help.

Assa

iranges biostrings fasta protein readaastringset • 1.4k views

ADD COMMENT • link updated 7.5 years ago by Michael Lawrence ★ 11k • written 7.5 years ago by Assa Yeroslaviz ★ 1.5k

score 0 · Answer 1 · 2016-10-20

Michael Lawrence ★ 11k

@michael-lawrence-3846

United States

matchPattern() returns a Views object, which is the combination of the underlying sequence and the matching ranges. We need to convert the ranges so that the sequence is partitioned, where the position of the match is the end of each partition.

library(Biostrings)
protein1 <- AAString("MKLSVNEAQLGFYLGSIDPRSSEDQPESLKTGQMMDESDEDFKELCASFFQRVKKHGIKEVSGERKTQKAASNGTQIRSKLKRTKQTATKTKTLQGPAEKKPPSGSQAPRTKQRVTKWQ")
# Views on protein1, one per occurrence of "K"
hits <- matchPattern("K", protein1)
# Partition the sequence so that each piece ends at a matched "K"
subseqs <- PartitioningByEnd(end(hits))
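
One thing to watch for: the partition above stops at the last "K", so the trailing residues ("WQ" in the example) are not covered. A minimal sketch of one way to include them, assuming it is acceptable to treat the end of the sequence as a final cut point (the name `ends` is only illustrative):

```r
library(Biostrings)

protein1 <- AAString(paste0(
  "MKLSVNEAQLGFYLGSIDPRSSEDQPESLKTGQMMDESDEDFKELCASFFQRVKKHGIKE",
  "VSGERKTQKAASNGTQIRSKLKRTKQTATKTKTLQGPAEKKPPSGSQAPRTKQRVTKWQ"
))
hits <- matchPattern("K", protein1)

# Add the sequence length as a last end position so the residues after the
# final K form their own partition; unique() avoids a duplicated end when
# the sequence itself ends in K.
ends <- unique(c(end(hits), length(protein1)))
subseqs <- PartitioningByEnd(ends)

# Start, end and width of every peptide, including the C-terminal one
data.frame(start = start(subseqs), end = end(subseqs), width = width(subseqs))
```

With this change the widths add up to the full length of the protein, which is a quick sanity check that nothing was dropped.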

As for visualizing it, I don't have any good solutions. I tried using ggbio and Gviz but both are pretty disappointing.
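
If a rough picture is enough, base R graphics might do. The following is only a sketch under my own assumptions, not a tested recipe from any package: it finds the cut positions with base R's gregexpr() instead of Biostrings, draws the protein as a bar at y = 0, and draws each peptide as a rectangle on its own row. All variable names and the colours are arbitrary.

```r
protein1 <- paste0(
  "MKLSVNEAQLGFYLGSIDPRSSEDQPESLKTGQMMDESDEDFKELCASFFQRVKKHGIKE",
  "VSGERKTQKAASNGTQIRSKLKRTKQTATKTKTLQGPAEKKPPSGSQAPRTKQRVTKWQ"
)
n <- nchar(protein1)

# Positions of every literal "K"; gregexpr() returns -1 when nothing matches
ends <- as.integer(gregexpr("K", protein1, fixed = TRUE)[[1]])
# Drop the no-match marker and close the last peptide at the sequence end
ends <- unique(c(ends[ends > 0], n))
starts <- c(1L, head(ends, -1L) + 1L)
rows <- seq_along(starts)

# Empty canvas: residue position on x, one row per peptide on y
plot(NA, xlim = c(0, n + 1), ylim = c(-1, length(rows) + 1),
     xlab = "Residue position", ylab = "Peptide", yaxt = "n")
# Bottom bar: the complete protein
rect(0.5, -0.4, n + 0.5, 0.4, col = "black")
# One rectangle per peptide, spanning its first to its last residue
rect(starts - 0.5, rows - 0.4, ends + 0.5, rows + 0.4,
     col = "steelblue", border = NA)
```

To compare proteases, the same start and end vectors could be computed once per enzyme and drawn in separate groups of rows.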

ADD COMMENT • link 7.5 years ago Michael Lawrence ★ 11k

Thanks for the response. That works very nicely. I do have a follow-up question, though.

Is it possible to use multiple patterns in this case?

I have tried this code, but I get NONE as a result:

> toMatch <- c("K", "R")
> hits <- matchPattern(pattern=paste(toMatch, collapse="|"), protein1)
> hits
  Views on a 119-letter AAString subject
subject: MKLSVNEAQLGFYLGSIDPRSSEDQPESLKTGQM...QTATKTKTLQGPAEKKPPSGSQAPRTKQRVTKWQ
views: NONE

I have also tried to convert the two letters into an AAStringSet object, but matchPattern() takes only an XString object and not a set.

Is there a different way to do it?

thanks again

ADD REPLY • link 7.5 years ago Assa Yeroslaviz ★ 1.5k

Do you want to know where each individual pattern matches, or is there actual ambiguity? It looks like Biostrings does not have a lot of support for matching amino acid strings. For example, neither matchPDict() nor matchPattern(fixed=FALSE) supports AAStringSet. I think the latter would be easy to support?

ADD REPLY • link 7.5 years ago Michael Lawrence ★ 11k

Some proteases cleave the sequence after two different amino acids; in my case these are "K" and "R". So I basically want to find all the positions where either K or R occurs and cut there.

Is there a way to also extract the sequences themselves when doing it (for now for one pattern)? I would like to have not only the IRanges object with the positions, but also the sub-sequences if possible.

ADD REPLY • link 7.5 years ago Assa Yeroslaviz ★ 1.5k

Getting the subsequences is just:

extractList(protein1, subseqs)
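
On the earlier K-or-R question: matchPattern() searches for its pattern as a literal string, not as a regular expression, so "K|R" is looked up as those three characters and nothing matches. One possible workaround, sketched here in base R rather than with any Biostrings function, is a regular-expression character class. The variable names are illustrative, and the sequence is handled as a plain character string (as.character() would convert an AAString).

```r
protein1 <- paste0(
  "MKLSVNEAQLGFYLGSIDPRSSEDQPESLKTGQMMDESDEDFKELCASFFQRVKKHGIKE",
  "VSGERKTQKAASNGTQIRSKLKRTKQTATKTKTLQGPAEKKPPSGSQAPRTKQRVTKWQ"
)

# "[KR]" matches a single K or a single R; gregexpr() returns the start
# position of every match, or -1 if there is none
ends <- as.integer(gregexpr("[KR]", protein1)[[1]])
ends <- ends[ends > 0]

# Close the last peptide at the end of the sequence
ends <- unique(c(ends, nchar(protein1)))
# Each peptide starts one residue after the previous cut
starts <- c(1L, head(ends, -1L) + 1L)

# Cut the string at those positions
peptides <- substring(protein1, starts, ends)
head(data.frame(start = starts, end = ends, peptide = peptides))
```

The `ends` vector could also be handed to PartitioningByEnd() and then to extractList() as above, if an IRanges-style result is preferred over plain strings.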

ADD REPLY • link 7.5 years ago Michael Lawrence ★ 11k