Question

Match peptide within protein sequence fasta and extract position information based on protein sequence

rabalski • 0

I would like to match one string (a peptide) within another string (a protein sequence) and, using the sequence information in the peptide, report positions relative to the protein. I have a data frame of peptide (amino acid) sequences annotated with chemical modifications, which occur at M or C residues. I would like to match these peptides against the file they originally came from, which holds all of the protein sequences that were searched with spectral matching algorithms, and output each modified amino acid together with its position in that protein.

I've used the seqinr package to read in a .fasta file that contains 20320 entries. The entries look like this:

   $`sp|Q9Y478|AAKB1_HUMAN` [1]"MGNTSSERAALERHGGHKTPRRDSSGGTKDGDRPKILMDSPEDADLFHSEEIKAPEKEEFLAWQHDLEVNDKAPAQARPTVFRWTGGGKEVYLSGSFNNWSKLPLTRSHNNFVAILDLPEGEHQYKFFVDGQWTHDPSEPIVTSQLGTVNNIIQVKKTDFEVFDALMVDSQKCSDVSELSSSPPGPYHQEPYVCKPEERFRAPPILPPHLLQVILNKDTGISCDPALLPEPNHVMLNHLYALSIKDGVMVLSATHRYKKKYVTTLLYKPI"

I have a separate data frame containing a list of peptides, for example:

               ptm_probability                    ptm_peptide            protein_ID protein_description
    1 C(1.000)SDFTEEIC(1.000)R K.C[478.99]SDFTEEIC[478.99]R.R sp|P50213|IDH3A_HUMAN Isocitrate dehydrogenase [NAD] subunit alpha, mitochondrial OS=Homo sapiens GN=IDH3A PE=1 SV=1

The amino acid sequence in ptm_probability shows, in parentheses after each candidate residue, the score (likelihood) that the modification is located there. The sequence in ptm_peptide includes the amino acids immediately before and after the peptide, separated from it by ".", and gives each modification in square brackets, e.g. [478.99]. The number inside the brackets can differ from one modification to another.

Ideally, I would like the output to contain a column that shows, for each peptide, the one-letter code of each modified amino acid followed by its numerical position within the protein:
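To make the ptm_peptide format concrete, here is a minimal base R sketch of how such a string could be split into its parts. The helper name `parse_ptm_peptide` is made up for illustration, and the sketch assumes that flanking residues are single characters (a letter or "-") and that modifications always appear as a bracketed number directly after an uppercase residue.

```r
# Hypothetical helper (not from seqinr or Biostrings): split a ptm_peptide
# string into the peptide with modifications and the bare sequence.
parse_ptm_peptide <- function(x) {
  # Drop the flanking residue and "." at each end,
  # e.g. "K.C[478.99]SDFTEEIC[478.99]R.R" -> "C[478.99]SDFTEEIC[478.99]R"
  core <- sub("^[A-Z-]\\.", "", x)
  core <- sub("\\.[A-Z-]$", "", core)
  # Remove every bracketed mass to get the sequence as it appears in the FASTA
  bare <- gsub("\\[[^]]*\\]", "", core)
  list(core = core, bare = bare)
}

parsed <- parse_ptm_peptide("K.C[478.99]SDFTEEIC[478.99]R.R")
parsed$bare
```

The bare sequence is what would be searched for in the protein, while the bracketed form keeps track of which residues carry a modification.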

   position
    C32
    C16, C20

Which packages/functions would enable me to do this? Can I match the sequence as it is and tell the matching function to ignore the modification (e.g. [478.99]), so that it fits the format of the sequences in the fasta file? Or should I strip the modifications first and then work out each modified residue's position from the start/end position of the peptide within the protein? What is a fast way to do this if I have to match several hundred or several thousand peptide sequences against a list of 20k proteins? Any suggestions would be greatly appreciated.
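The second option (strip the modifications, find where the peptide starts, then offset) might look something like the base R sketch below. Everything here is an assumption for illustration: `ptm_positions` is a made-up helper, the protein entry "sp|TOY|TOY_HUMAN" and its sequence are toy data rather than a real record, every protein_ID is assumed to be present as a name in the sequence list (as with the seqinr list names shown above), and only the first occurrence of the peptide in the protein is used.

```r
# Hypothetical helper: returns e.g. "C3, C11" for one peptide/protein pair,
# or NA if the peptide is not found or carries no modification.
ptm_positions <- function(ptm_peptide, protein_seq) {
  # Drop the flanking residues and their "." separators
  core <- sub("^[A-Z-]\\.", "", ptm_peptide)
  core <- sub("\\.[A-Z-]$", "", core)
  # One token per residue, with its bracketed modification if present
  tokens <- regmatches(core, gregexpr("[A-Z](\\[[^]]*\\])?", core))[[1]]
  residues <- substr(tokens, 1, 1)
  modified <- grepl("[", tokens, fixed = TRUE)
  # Literal search for the bare peptide; first occurrence only
  bare <- paste(residues, collapse = "")
  start <- as.integer(regexpr(bare, protein_seq, fixed = TRUE))
  if (start == -1L || !any(modified)) return(NA_character_)
  # Protein position = peptide start + index within the peptide - 1
  paste0(residues[modified], start + which(modified) - 1L, collapse = ", ")
}

# Toy data (made up): a named list of sequences and a one-row peptide table
proteins <- list("sp|TOY|TOY_HUMAN" = "MKCSDFTEEICRR")
peptides <- data.frame(
  ptm_peptide = "K.C[478.99]SDFTEEIC[478.99]R.R",
  protein_ID  = "sp|TOY|TOY_HUMAN",
  stringsAsFactors = FALSE
)

# Look each protein up by ID instead of scanning all 20k sequences
peptides$position <- mapply(
  ptm_positions,
  peptides$ptm_peptide,
  unlist(proteins[peptides$protein_ID]),
  USE.NAMES = FALSE
)
peptides$position
```

Because each peptide is only compared against the one protein named in its protein_ID, the work grows with the number of peptides rather than with peptides times proteins, which is the part I would expect to matter for speed.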

R seqinr biostrings • 1.1k views

Posted 5.6 years ago by rabalski