Replace a specific nucleotide sequence, at a specific position, with another in Biostrings
Benjamin
@Benjamin-24571

I have a data frame with a column 'Sequences' that contains stretches of nucleotides. I would like to create a new column that introduces a mutation into each sequence, for example replacing "ACA" with "ATA". Importantly, I would like to do this at a specific position, for example, position 2. Therefore, the sequence "ACAACA" would become "ATAACA". If the sequence does not contain the pattern "ACA" at that position, I would like it to remain unchanged.

I can see that with replaceAt() you can specify x (in this case a DNAStringSet object built from the 'Sequences' column), the position (IRanges(1, 3) is the range from position 1 to 3 in the sequence) and the replacement ("ATA"), but it will replace whatever sequence is at this position. Any idea how to make this specific to a given sequence? I think I could write an ifelse statement with a grep/regex in it to achieve this, but eventually I would like to build a loop and replace the static "ACA" and "ATA" with vectors or data frame columns holding lists of mutations to iterate through. Any help would be very welcome!

library(Biostrings)

# Example data for convenience
Read <- c("1","2","3","4")
Sequences <- c("ATACCCACG", "AAAGGGAAT", "GCCGATGCG", "ACCAAATCC")
df <- data.frame(Read, Sequences)

# Almost works, but replaces whatever is at positions 1-3
df$Mut <- replaceAt(DNAStringSet(df$Sequences), IRanges(1, 3), "ATA")

#sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)


@konstantinos-yeles-8961

Hello Benjamin, for your example I can suggest this code:

library(Biostrings)

Read <- c("1","2","3","4")
# I changed the sequences a bit so that the example covers ACA -> ATA
Sequences <- c("ACACCCACG", "AAAACAAAT", "GCCGATACA", "AACAAATCC")
df <- data.frame(Read, Sequences)
DS_df <- DNAStringSet(df$Sequences)
## Replace "ACA" at bases 1:3 with "ATA":
at <- subseq(DS_df, start = 1, width = 3)
midx1_3 <- vmatchPattern("ACA", at, fixed = FALSE)
DS_df2 <- replaceAt(DS_df, midx1_3, value = "ATA")

# Store the result as a plain character column
df$Mut <- as.character(DS_df2)
df
  Read Sequences       Mut
1    1 ACACCCACG ATACCCACG
2    2 AAAACAAAT AAAACAAAT
3    3 GCCGATACA GCCGATACA
4    4 AACAAATCC AACAAATCC

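As you can see, only the first sequence changed, because it is the only one with "ACA" at positions 1 to 3. Note that the match positions are relative to the subseq() window, so this works as written only because the window starts at position 1.
I just followed the "(C) ADVANCED EXAMPLES" section of ?replaceAt. If you need to do this with different kinds of "mutations" and lists,
I suggest you read ?matchPattern, as it documents many related functions that may help you.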
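
For the loop over several mutations mentioned in the question, here is a rough sketch of one possible approach. It is my own assumption about how this could be wired together, not something taken from ?replaceAt: the helper name mutate_at_position, the mutations table and the Mut1/Mut2 column names are all hypothetical. It compares the window returned by subseq() with the pattern, applies replaceAt() to every sequence as in the original attempt, and then keeps the mutated version only where the pattern was actually present. It assumes every sequence is long enough to contain the requested window.

```r
library(Biostrings)

# Hypothetical helper (not part of Biostrings): replace `pattern` with
# `replacement` only where `pattern` occurs exactly at position `start`.
# Assumes every sequence has at least start + nchar(pattern) - 1 bases;
# otherwise subseq() raises an error.
mutate_at_position <- function(x, pattern, replacement, start) {
  width <- nchar(pattern)
  # Which sequences carry the pattern at the requested position?
  hit <- as.character(subseq(x, start = start, width = width)) == pattern
  # Mutate every sequence at that position, as in the original attempt ...
  mutated <- replaceAt(x, IRanges(start, width = width), replacement)
  # ... then keep the mutated version only where the pattern was present.
  ifelse(hit, as.character(mutated), as.character(x))
}

Sequences <- c("ACACCCACG", "AAAACAAAT", "GCCGATACA", "AACAAATCC")
df <- data.frame(Read = c("1", "2", "3", "4"), Sequences,
                 stringsAsFactors = FALSE)
x <- DNAStringSet(df$Sequences)

# Hypothetical table of mutations to iterate over: one row per mutation.
mutations <- data.frame(pattern     = c("ACA", "ACA"),
                        replacement = c("ATA", "AGA"),
                        start       = c(1L, 4L),
                        stringsAsFactors = FALSE)

# Add one new column per mutation (Mut1, Mut2, ...).
for (i in seq_len(nrow(mutations))) {
  df[[paste0("Mut", i)]] <- mutate_at_position(x,
                                               mutations$pattern[i],
                                               mutations$replacement[i],
                                               mutations$start[i])
}
df
```

Each new column is a plain character vector; wrap it in DNAStringSet() again if you need to continue with Biostrings functions.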

If you can't find a solution there, there is always dplyr::mutate() with stringr::str_replace().
It depends on how you want your code to perform the iterations, but in the end you should use whatever works best for you.
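
If you prefer to stay with plain character vectors, the same idea can be sketched in base R without dplyr or stringr. This is only an illustration under my own assumptions: mutate_if_match is a hypothetical helper name, and it checks the pattern with substr() instead of a regular expression.

```r
# Hypothetical helper: replace `pattern` with `replacement` only when
# `pattern` occurs exactly at position `start`; other sequences are
# returned unchanged.
mutate_if_match <- function(seqs, pattern, replacement, start) {
  end <- start + nchar(pattern) - 1L
  # TRUE where the bases at start..end are exactly `pattern`
  hit <- substr(seqs, start, end) == pattern
  # Rebuild each sequence as: prefix + replacement + suffix
  rebuilt <- paste0(substr(seqs, 1L, start - 1L),
                    replacement,
                    substr(seqs, end + 1L, nchar(seqs)))
  ifelse(hit, rebuilt, seqs)
}

seqs <- c("ACACCCACG", "AAAACAAAT", "GCCGATACA", "AACAAATCC")
mutate_if_match(seqs, "ACA", "ATA", start = 1L)
mutate_if_match("ACAACA", "ACA", "ATA", start = 1L)
```

Because the sequence is rebuilt from a prefix, the replacement and a suffix, the replacement does not have to be the same length as the pattern.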


Thank you, this achieves exactly what I needed. I really appreciate your time!
