Is it possible to do an alignment to a chosen reference sequence in DECIPHER?
1
0
Entering edit mode
@reubenmcgregor88-13722
Last seen 18 months ago

Background:

I have in a fasta file with reference sequences for strains of bacteria and I want to align sequences I have from another source, in a data frame, to their corresponding reference sequences in the fasta. I know in theory I could put them all in the same file (fasta or data frame) but wondered if there was a more elegant way to do it in order to keep the reference sequences separate?

Specific questions in bullet points:

My fasta with reference sequences was read in using

> allemmDNA <- readDNAStringSet("trimmed.fasta")

and looks like this (with a couple of hundred more sequences):

>EMM1.0 emm1.0. (emm cluster A-C3) usually T1, opacity factor negative.
GGTTTTGCGAATCAAACAGAGGTTAAGGCTAACGGTGATGGTAATCCTAGGGAAGTTATA
GAAGATCTTGCAGCAAACAATCCCGCAATACAAAATATACGTTTACGTCACGAAAACAAG
GACTTAAAAGCGAGATTAGAGAATGCAATGGAAGTTGCAGGAAGAGATTTTAAGAGAGCT

• Just a small question here, is it possible to import these with just the ">EMM1.0" as the name rather than the whole line?

I then translate this:

> allemmAA <- translate(allemmDNA)

My sequences are in a data frame like this (in the "50aa_HVR_peptide", column) again with hundreds of sequences.

id    emm  CLINICAL_2                                 50aa_HVR_peptide
1 GAS05134 emm1.0          RF NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
2 GAS12252 emm1.0        iGAS NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
3 GAS12271 emm1.0        iGAS NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
4 GAS06220 emm1.0        iGAS NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
5 GAS12233 emm1.0        iGAS NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
6 GAS12243 emm1.0 Pharyngitis NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA

So I can tell DECIPHER where the sequences and names are found:

seq <- NZ_50_fullseq$50aa_HVR_peptide names(seq) <- NZ_50_fullseq$id

As you can see there are several sequences for "emm1.0" strains in this data.frame, which should be identical to the translation of the reference sequence for >"EMM1.0" above.

• Is it possible to do an alignment with all "emm1.0" strains in the data.frame, using the "EMM1.0" from the reference fasta file as the reference sequence to align to?

Thank you so much for any suggestions to make my life easier.

R decipher biostrings • 868 views
2
Entering edit mode
Erik Wright ▴ 150
@erik-wright-14386
Last seen 11 months ago
United States

Regarding your first bulleted question, you can alter the names of your XStringSet after import:

names(allemmAA) <- gsub("(.*?) .*", "\\1", names(allemmAA))

Regarding your second bulleted question, you can make a new XStringSet with the sequences that you want to align, and provide that to AlignSeqs().

w <- which(NZ_50_fullseqemm=="emm1.0") combAA <- c(allemmAA[1], seq[w]) aligned_aa <- AlignSeqs(combAA) Then you could put this into a loop to go through all of the names: u_ns <- unique(names(allemmAA)) aligned_list <- setNames(vector("list", length(u_ns)), u_ns) for (i in seq_along(u_ns)) { w1 <- which(names(allemmAA)==u_ns[i]) w2 <- which(NZ_50_fullseqemm==tolower(u_ns[i]))

combAA <- c(allemmAA[w1], seq[w2])

aligned_list[[i]] <- AlignSeqs(combAA)

}

I hope that helps.

Erik

0
Entering edit mode

Thank you Erik, as always this is a very comprehensive and useful answer, and I can see I will have to get familiar with this kind of looping in order to cut down on time.

Whilst I can see in theory what you are suggesting whenever I run the code:

combAA <- c(allemmAA[1], seq[w])

I get the error:

Error in seq[w] : object of type 'closure' is not subsettable

I have looked up the possible reasons but can not identify the cause?

1
Entering edit mode

The error essentially says that you are trying to index a function.  Probably your sequences are not in a variable called seq.  Since seq() is a base function in R, it thinks you are trying to subset this function rather than an XStringSet.  Note that I inherited the variable named seq from your question, where you state:

seq <- NZ_50_fullseq$50aa_HVR_peptide names(seq) <- NZ_50_fullseq$id​
0
Entering edit mode

Ah yes, I had a bit of code further down my R script re-assinging the "seq" variable. I re -ran as above and now get the error:

Error in unlist_list_of_XVectorList(class(x), args, use.names = FALSE,  :
all elements in 'x' must be AAStringSet objects (or NULLs)

As mentioned above I got the AAStringSet from the translate function in DECIPHER so am pretty sure they are all AAStringSet?

Thanks again and sorry for the indecent questioning

1
Entering edit mode

This error means that you are trying to combine disparate classes.  Probably because the variable seq needs to be converted to an AAStringSet:

seq <- AAStringSet(seq)
0
Entering edit mode

Yes this has now worked,

I now have a slight issue that when I run the loop I get the error:

Error in AlignSeqs(combAA) : At least two sequences are required.

There are some sequences in the "allemmAA" that are nor present in my list of sequences, and I am guessing the error is coming from there as it will br trying to align the reference sequence with one not present in my data.frame of sequences and thus only has one sequence. Is there a away to bypass this and ask DECIPHER to skip these instances?

Thanks

1
Entering edit mode

Of course, simply add a conditional above the AlignSeqs() call:

if (length(combAA) > 1)
aligned_list[[i]] <- AlignSeqs(combAA)