Background:
I have in a fasta file with reference sequences for strains of bacteria and I want to align sequences I have from another source, in a data frame, to their corresponding reference sequences in the fasta. I know in theory I could put them all in the same file (fasta or data frame) but wondered if there was a more elegant way to do it in order to keep the reference sequences separate?
Specific questions in bullet points:
My fasta with reference sequences was read in using
> allemmDNA <- readDNAStringSet("trimmed.fasta")
and looks like this (with a couple of hundred more sequences):
>EMM1.0 emm1.0. (emm cluster A-C3) usually T1, opacity factor negative.
GGTTTTGCGAATCAAACAGAGGTTAAGGCTAACGGTGATGGTAATCCTAGGGAAGTTATA
GAAGATCTTGCAGCAAACAATCCCGCAATACAAAATATACGTTTACGTCACGAAAACAAG
GACTTAAAAGCGAGATTAGAGAATGCAATGGAAGTTGCAGGAAGAGATTTTAAGAGAGCT
- Just a small question here, is it possible to import these with just the ">EMM1.0" as the name rather than the whole line?
I then translate this:
> allemmAA <- translate(allemmDNA)
My sequences are in a data frame like this (in the "50aa_HVR_peptide", column) again with hundreds of sequences.
id emm CLINICAL_2 `50aa_HVR_peptide`
1 GAS05134 emm1.0 RF NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
2 GAS12252 emm1.0 iGAS NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
3 GAS12271 emm1.0 iGAS NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
4 GAS06220 emm1.0 iGAS NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
5 GAS12233 emm1.0 iGAS NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
6 GAS12243 emm1.0 Pharyngitis NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
So I can tell DECIPHER where the sequences and names are found:
seq <- NZ_50_fullseq$`50aa_HVR_peptide`
names(seq) <- NZ_50_fullseq$id
As you can see there are several sequences for "emm1.0" strains in this data.frame, which should be identical to the translation of the reference sequence for >"EMM1.0" above.
- Is it possible to do an alignment with all "emm1.0" strains in the data.frame, using the "EMM1.0" from the reference fasta file as the reference sequence to align to?
Thank you so much for any suggestions to make my life easier.
Thank you Erik, as always this is a very comprehensive and useful answer, and I can see I will have to get familiar with this kind of looping in order to cut down on time.
Whilst I can see in theory what you are suggesting whenever I run the code:
combAA <- c(allemmAA[1], seq[w])
I get the error:
Error in seq[w] : object of type 'closure' is not subsettable
I have looked up the possible reasons but can not identify the cause?
The error essentially says that you are trying to index a function. Probably your sequences are not in a variable called seq. Since seq() is a base function in R, it thinks you are trying to subset this function rather than an XStringSet. Note that I inherited the variable named seq from your question, where you state:
Ah yes, I had a bit of code further down my R script re-assinging the "seq" variable. I re -ran as above and now get the error:
Error in unlist_list_of_XVectorList(class(x), args, use.names = FALSE, :
all elements in 'x' must be AAStringSet objects (or NULLs)
As mentioned above I got the AAStringSet from the translate function in DECIPHER so am pretty sure they are all AAStringSet?
Thanks again and sorry for the indecent questioning
This error means that you are trying to combine disparate classes. Probably because the variable seq needs to be converted to an AAStringSet:
Yes this has now worked,
I now have a slight issue that when I run the loop I get the error:
Error in AlignSeqs(combAA) : At least two sequences are required.
There are some sequences in the "allemmAA" that are nor present in my list of sequences, and I am guessing the error is coming from there as it will br trying to align the reference sequence with one not present in my data.frame of sequences and thus only has one sequence. Is there a away to bypass this and ask DECIPHER to skip these instances?
Thanks
Of course, simply add a conditional above the AlignSeqs() call:
if (length(combAA) > 1) aligned_list[[i]] <- AlignSeqs(combAA)