Question

Is it possible to do an alignment to a chosen reference sequence in DECIPHER?

0

Entering edit mode

reubenmcgregor88 • 0

@reubenmcgregor88-13722

Last seen 5.0 years ago

Background:

I have in a fasta file with reference sequences for strains of bacteria and I want to align sequences I have from another source, in a data frame, to their corresponding reference sequences in the fasta. I know in theory I could put them all in the same file (fasta or data frame) but wondered if there was a more elegant way to do it in order to keep the reference sequences separate?

Specific questions in bullet points:

My fasta with reference sequences was read in using

> allemmDNA <- readDNAStringSet("trimmed.fasta")

and looks like this (with a couple of hundred more sequences):

>EMM1.0 emm1.0. (emm cluster A-C3) usually T1, opacity factor negative.
GGTTTTGCGAATCAAACAGAGGTTAAGGCTAACGGTGATGGTAATCCTAGGGAAGTTATA
GAAGATCTTGCAGCAAACAATCCCGCAATACAAAATATACGTTTACGTCACGAAAACAAG
GACTTAAAAGCGAGATTAGAGAATGCAATGGAAGTTGCAGGAAGAGATTTTAAGAGAGCT

Just a small question here, is it possible to import these with just the ">EMM1.0" as the name rather than the whole line?

I then translate this:

> allemmAA <- translate(allemmDNA)

My sequences are in a data frame like this (in the "50aa_HVR_peptide", column) again with hundreds of sequences.

id emm CLINICAL_2 `50aa_HVR_peptide`
1 GAS05134 emm1.0 RF NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
2 GAS12252 emm1.0 iGAS NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
3 GAS12271 emm1.0 iGAS NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
4 GAS06220 emm1.0 iGAS NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
5 GAS12233 emm1.0 iGAS NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA
6 GAS12243 emm1.0 Pharyngitis NGDGNPREVIEDLAANNPAIQNIRLRHENKDLKARLENAMEVAGRDFKRA

So I can tell DECIPHER where the sequences and names are found:

seq <- NZ_50_fullseq$`50aa_HVR_peptide`
names(seq) <- NZ_50_fullseq$id

As you can see there are several sequences for "emm1.0" strains in this data.frame, which should be identical to the translation of the reference sequence for >"EMM1.0" above.

Is it possible to do an alignment with all "emm1.0" strains in the data.frame, using the "EMM1.0" from the reference fasta file as the reference sequence to align to?

Thank you so much for any suggestions to make my life easier.

R decipher biostrings • 2.5k views

ADD COMMENT • link updated 8.1 years ago by Erik Wright ▴ 150 • written 8.1 years ago by reubenmcgregor88 • 0

score 2 · Accepted Answer · 2018-01-06

2

Entering edit mode

Erik Wright ▴ 150

@erik-wright-14386

Last seen 6 weeks ago

United States

Regarding your first bulleted question, you can alter the names of your XStringSet after import:

names(allemmAA) <- gsub("(.*?) .*", "\\1", names(allemmAA))

Regarding your second bulleted question, you can make a new XStringSet with the sequences that you want to align, and provide that to AlignSeqs().

w <- which(NZ_50_fullseq$emm=="emm1.0")

combAA <- c(allemmAA[1], seq[w])

aligned_aa <- AlignSeqs(combAA)

Then you could put this into a loop to go through all of the names:

u_ns <- unique(names(allemmAA))

aligned_list <- setNames(vector("list", length(u_ns)), u_ns)

for (i in seq_along(u_ns)) {

    w1 <- which(names(allemmAA)==u_ns[i])

    w2 <- which(NZ_50_fullseq$emm==tolower(u_ns[i]))

    combAA <- c(allemmAA[w1], seq[w2])

    aligned_list[[i]] <- AlignSeqs(combAA)

}

I hope that helps.

Erik

ADD COMMENT • link 8.1 years ago Erik Wright ▴ 150

0

Entering edit mode

Thank you Erik, as always this is a very comprehensive and useful answer, and I can see I will have to get familiar with this kind of looping in order to cut down on time.

Whilst I can see in theory what you are suggesting whenever I run the code:

combAA <- c(allemmAA[1], seq[w])

I get the error:

Error in seq[w] : object of type 'closure' is not subsettable

I have looked up the possible reasons but can not identify the cause?

ADD REPLY • link 8.0 years ago reubenmcgregor88 • 0

1

Entering edit mode

The error essentially says that you are trying to index a function. Probably your sequences are not in a variable called seq. Since seq() is a base function in R, it thinks you are trying to subset this function rather than an XStringSet. Note that I inherited the variable named seq from your question, where you state:

seq <- NZ_50_fullseq$`50aa_HVR_peptide`
names(seq) <- NZ_50_fullseq$id

ADD REPLY • link 8.0 years ago Erik Wright ▴ 150

0

Entering edit mode

Ah yes, I had a bit of code further down my R script re-assinging the "seq" variable. I re -ran as above and now get the error:

Error in unlist_list_of_XVectorList(class(x), args, use.names = FALSE, :
all elements in 'x' must be AAStringSet objects (or NULLs)

As mentioned above I got the AAStringSet from the translate function in DECIPHER so am pretty sure they are all AAStringSet?

Thanks again and sorry for the indecent questioning

ADD REPLY • link 8.0 years ago reubenmcgregor88 • 0

1

Entering edit mode

This error means that you are trying to combine disparate classes. Probably because the variable seq needs to be converted to an AAStringSet:

seq <- AAStringSet(seq)

ADD REPLY • link 8.0 years ago Erik Wright ▴ 150

0

Entering edit mode

Yes this has now worked,

I now have a slight issue that when I run the loop I get the error:

Error in AlignSeqs(combAA) : At least two sequences are required.

There are some sequences in the "allemmAA" that are nor present in my list of sequences, and I am guessing the error is coming from there as it will br trying to align the reference sequence with one not present in my data.frame of sequences and thus only has one sequence. Is there a away to bypass this and ask DECIPHER to skip these instances?

Thanks

ADD REPLY • link 8.0 years ago reubenmcgregor88 • 0

1

Entering edit mode

Of course, simply add a conditional above the AlignSeqs() call:

if (length(combAA) > 1)
    aligned_list[[i]] <- AlignSeqs(combAA)

ADD REPLY • link 8.0 years ago Erik Wright ▴ 150