Replace a specific nucleotide sequence, at a specific position, with another in Biostrings
Benjamin • 20
@Benjamin-24571

I have a dataframe with a column 'sequences'. It contains stretches of nucleotides in this column. I would like to create a new column which essentially introduces a mutation in the sequence, for example replacing "ACA" with "ATA". Importantly, I would like to do this at a specific position, for example, position 2. Therefore, the sequence: "ACAACA" would become "ATAACA". If the sequence did not contain the pattern "ACA" I would like the sequence to remain unchanged.

I can see that with replaceAt() you can specify x (in this case a DNAStringSet object built from the 'sequences' column), the position (IRanges(1, 3) covers positions 1 to 3 of the sequence) and the replacement ("ATA"), but it will replace whatever sequence is at this position. Any idea how to make this specific to a sequence? I think I could possibly write an ifelse statement with a grep/regex in it to achieve this, but eventually I would like to build a loop and replace the static "ACA" and "ATA" with either vectors or dataframe columns containing lists of mutations to iterate through. Any help would be very welcome!

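# Example data for convenience
library(Biostrings)
df <- data.frame(Sequences = c("ATACCCACG", "AAAGGGAAT", "GCCGATGCG", "ACCAAATCC"))

# Almost works, but replaces positions 1-3 in every sequence, whatever they contain
df$Mut <- as.character(replaceAt(DNAStringSet(df$Sequences), IRanges(1, 3), "ATA"))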

#sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

DNASeqData Biostrings

@konstantinos-yeles-8961
University of Salerno, Salerno, Italy

Hello Benjamin, for your example I can suggest this code:

library(Biostrings)

Read <- c("1", "2", "3", "4")
# I changed the sequences a bit here so that the example contains ACA -> ATA
Sequences <- c("ACACCCACG", "AAAACAAAT", "GCCGATACA", "AACAAATCC")
df <- data.frame(Read, Sequences)
DS_df <- DNAStringSet(df$Sequences)

## Replace bases 1:3 "ACA" with "ATA":
at <- subseq(DS_df, start = 1, width = 3)           # first three bases of each sequence
midx1_3 <- vmatchPattern("ACA", at, fixed = FALSE)  # matches only where bases 1:3 are "ACA"
DS_df2 <- replaceAt(DS_df, midx1_3, value = "ATA")
df$Mut <- as.character(DS_df2)
df
  Read Sequences       Mut
1    1 ACACCCACG ATACCCACG
2    2 AAAACAAAT AAAACAAAT
3    3 GCCGATACA GCCGATACA
4    4 AACAAATCC AACAAATCC


As you can see, it changed only the first sequence.
I just followed the "(C) ADVANCED EXAMPLES" section of ?replaceAt.
In case you need to perform this with different kinds of "mutations" and lists,
I suggest you read ?matchPattern, as there are many related functions there that may help you.
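
For the case where the mutations come from a table, here is a minimal sketch of one way to iterate over them. The helper mutate_at() and the columns start, from and to are hypothetical names chosen for illustration, not part of Biostrings. It assumes every sequence is long enough to contain the window and that from and to have the same length.

```r
library(Biostrings)

# Hypothetical helper: replace `from` with `to` only in sequences where
# `from` occurs exactly at position `start`; other sequences are unchanged.
mutate_at <- function(x, start, from, to) {
  stopifnot(nchar(from) == nchar(to))
  w <- nchar(from)
  window  <- subseq(x, start = start, width = w)      # bases at the target position
  hit     <- as.character(window) == from             # TRUE where the pattern is present
  mutated <- replaceAt(x, IRanges(start = start, width = w), to)
  DNAStringSet(ifelse(hit, as.character(mutated), as.character(x)))
}

df <- data.frame(
  Sequences = c("ACACCCACG", "AAAACAAAT", "GCCGATACA", "AACAAATCC"),
  stringsAsFactors = FALSE
)

# Illustrative table of mutations (assumed layout: one row per mutation)
muts <- data.frame(
  start = c(1, 4, 7),
  from  = c("ACA", "ACA", "ACA"),
  to    = c("ATA", "ATA", "ATA"),
  stringsAsFactors = FALSE
)

x <- DNAStringSet(df$Sequences)
for (i in seq_len(nrow(muts))) {
  # One new column per mutation: Mut1, Mut2, ...
  df[[paste0("Mut", i)]] <- as.character(
    mutate_at(x, muts$start[i], muts$from[i], muts$to[i])
  )
}
df
```

Each row of muts is applied to the original sequences independently, so the fourth sequence, which contains "ACA" only at positions 2 to 4, stays unchanged in all three new columns.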

If you can't find a solution there, there is always dplyr::mutate() with str_replace().
It depends on how you want your code to perform the iterations, but in the end you should use whatever works best for you.
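
As a rough sketch of the plain-string route, the same conditional replacement can be written without Biostrings. This version uses base R substr() rather than str_replace(), because substr() addresses a fixed position directly. The function name mutate_at_chr() is hypothetical, and it assumes from and to have the same length.

```r
# Hypothetical helper: position-specific replacement on a character vector.
mutate_at_chr <- function(seqs, start, from, to) {
  stopifnot(nchar(from) == nchar(to))
  end <- start + nchar(from) - 1
  # TRUE only where the bases at start..end equal `from` (NA treated as no match)
  hit <- !is.na(seqs) & substr(seqs, start, end) == from
  substr(seqs[hit], start, end) <- to
  seqs
}

df <- data.frame(
  Sequences = c("ACAACA", "GGGACA", "ACACCCACG"),
  stringsAsFactors = FALSE
)
df$Mut <- mutate_at_chr(df$Sequences, start = 1, from = "ACA", to = "ATA")
df
```

Here "ACAACA" becomes "ATAACA", while "GGGACA" is left alone because its "ACA" is not at position 1.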


Thank you, this achieves exactly what I needed. I really appreciate your time!