I would like to find one string (a peptide) inside another (a protein sequence) and, using the modification information carried by the peptide, report the corresponding positions in the protein. I have a dataframe of peptide (amino acid) sequences annotated with chemical modifications, which occur at M or C residues. I would like to match these peptides against the file of origin, which contains all of the protein sequences that were searched using spectral matching algorithms, and output each modified amino acid together with its position in the protein.
I've used the seqinr package to read in a .fasta file which contains 20320 entries and the entries look like this:
$`sp|Q9Y478|AAKB1_HUMAN`
[1] "MGNTSSERAALERHGGHKTPRRDSSGGTKDGDRPKILMDSPEDADLFHSEEIKAPEKEEFLAWQHDLEVNDKAPAQARPTVFRWTGGGKEVYLSGSFNNWSKLPLTRSHNNFVAILDLPEGEHQYKFFVDGQWTHDPSEPIVTSQLGTVNNIIQVKKTDFEVFDALMVDSQKCSDVSELSSSPPGPYHQEPYVCKPEERFRAPPILPPHLLQVILNKDTGISCDPALLPEPNHVMLNHLYALSIKDGVMVLSATHRYKKKYVTTLLYKPI"
I have a separate dataframe containing a list of peptides, for example:
  ptm_probability          ptm_peptide                    protein_ID            protein_description
1 C(1.000)SDFTEEIC(1.000)R K.C[478.99]SDFTEEIC[478.99]R.R sp|P50213|IDH3A_HUMAN Isocitrate dehydrogenase [NAD] subunit alpha, mitochondrial OS=Homo sapiens GN=IDH3A PE=1 SV=1
The sequence in ptm_probability gives, in parentheses after each modified residue, the score (likelihood) that the modification is located there. The sequence in ptm_peptide includes the amino acids before and after the peptide, separated from it by ".", and the modification is given within square brackets, e.g. [478.99]. The number inside the brackets can differ between modifications.
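To make the ptm_peptide format concrete, here is a minimal base R sketch that splits one such string into its parts. The function name `parse_ptm_peptide` is made up for illustration, and it assumes exactly one flanking character plus "." on each side of the peptide, as in the example row.

```r
# Hypothetical helper: split a ptm_peptide string such as
# "K.C[478.99]SDFTEEIC[478.99]R.R" into its parts.
# Assumes one flanking character and a "." at each end.
parse_ptm_peptide <- function(x) {
  # Drop the flanking residue and "." at the start, then at the end.
  core <- sub("^.\\.", "", x)
  core <- sub("\\..$", "", core)
  # Modification masses are the values inside the square brackets.
  masses <- regmatches(
    core,
    gregexpr("(?<=\\[)[^]]+(?=\\])", core, perl = TRUE)
  )[[1]]
  # Removing every "[...]" tag leaves the bare amino acid sequence.
  bare <- gsub("\\[[^]]*\\]", "", core)
  list(core = core, bare = bare, masses = as.numeric(masses))
}

parsed <- parse_ptm_peptide("K.C[478.99]SDFTEEIC[478.99]R.R")
parsed$bare    # "CSDFTEEICR"
parsed$masses  # 478.99 478.99
```

The bare sequence is what would be comparable to the entries read from the fasta file, since those contain no modification tags.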
Ideally, I would like the output to contain a column, with one row per peptide, that shows the one-letter code of each modified amino acid followed by its numerical position within the protein:
position
C32
C16, C20
Which packages/functions would enable me to do this? Can I match the sequence as is and tell the matcher to ignore the modification tags such as [478.99], so that it fits the format of the fasta file? Or should I strip the modifications first and then calculate the position of each modified residue from the start/end positions of the peptide within the protein? What is a fast way to do this if I have to match several hundred to several thousand peptide sequences against a list of 20k proteins? Any suggestions would be greatly appreciated.
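For illustration, the second option (strip the modifications, then offset by the peptide's start position) could be sketched in base R roughly as follows. This is only a sketch under assumptions: the function name `modified_positions` is hypothetical, the protein sequence is assumed to be a single upper-case string, only the first exact occurrence of the peptide is used, and the toy protein in the usage lines is made up rather than taken from the fasta file.

```r
# Hypothetical helper: label each modified residue with its position
# in the protein, e.g. "C3, C11".
# `ptm_peptide` looks like "K.C[478.99]SDFTEEIC[478.99]R.R";
# `protein_seq` is the unmodified protein sequence as one string.
modified_positions <- function(ptm_peptide, protein_seq) {
  # Remove the flanking residues and their "." separators.
  core <- sub("\\..$", "", sub("^.\\.", "", ptm_peptide))
  # Bare peptide without any "[...]" modification tags.
  bare <- gsub("\\[[^]]*\\]", "", core)

  # Locate every "[...]" tag in the tagged peptide.
  tags <- gregexpr("\\[[^]]*\\]", core)[[1]]
  if (tags[1] == -1) return(NA_character_)  # no modification present
  tag_len <- attr(tags, "match.length")

  # The modified residue sits just before each tag; subtract the
  # characters used by earlier tags to get its index in `bare`.
  offsets <- as.integer(tags) - 1L - cumsum(c(0L, head(tag_len, -1)))

  # First exact (non-regex) occurrence of the bare peptide.
  start <- regexpr(bare, protein_seq, fixed = TRUE)[1]
  if (start == -1) return(NA_character_)    # peptide not found

  residues <- substring(bare, offsets, offsets)
  paste0(residues, start + offsets - 1L, collapse = ", ")
}

# Usage with a made-up toy protein (not a real fasta entry):
toy_protein <- "MKCSDFTEEICRRA"
modified_positions("K.C[478.99]SDFTEEIC[478.99]R.R", toy_protein)
# "C3, C11"
```

If the names of the list read from the fasta file are identical to the protein_ID values (which the examples above suggest, but which is an assumption), each peptide only needs to be searched in its own protein, e.g. via something like `mapply(modified_positions, df$ptm_peptide, fasta[df$protein_ID])`, instead of scanning all 20k entries.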