I have in a fasta file with reference sequences for strains of bacteria and I want to align sequences I have from another source, in a data frame, to their corresponding reference sequences in the fasta. I know in theory I could put them all in the same file (fasta or data frame) but wondered if there was a more elegant way to do it in order to keep the reference sequences separate?
Specific questions in bullet points:
My fasta with reference sequences was read in using
> allemmDNA <- readDNAStringSet("trimmed.fasta")
and looks like this (with a couple of hundred more sequences):
>EMM1.0 emm1.0. (emm cluster A-C3) usually T1, opacity factor negative.
- Just a small question here, is it possible to import these with just the ">EMM1.0" as the name rather than the whole line?
I then translate this:
> allemmAA <- translate(allemmDNA)
My sequences are in a data frame like this (in the "50aa_HVR_peptide", column) again with hundreds of sequences.
id emm CLINICAL_2 `50aa_HVR_peptide`
1 GAS05134 emm1.0 RF NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
2 GAS12252 emm1.0 iGAS NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
3 GAS12271 emm1.0 iGAS NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
4 GAS06220 emm1.0 iGAS NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
5 GAS12233 emm1.0 iGAS NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
6 GAS12243 emm1.0 Pharyngitis NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
So I can tell DECIPHER where the sequences and names are found:
seq <- NZ_50_fullseq$`50aa_HVR_peptide`
names(seq) <- NZ_50_fullseq$id
As you can see there are several sequences for "emm1.0" strains in this data.frame, which should be identical to the translation of the reference sequence for >"EMM1.0" above.
- Is it possible to do an alignment with all "emm1.0" strains in the data.frame, using the "EMM1.0" from the reference fasta file as the reference sequence to align to?
Thank you so much for any suggestions to make my life easier.