Dereplicate DNA sequences while taking into account the corresponding taxonomy
0
0
Entering edit mode
Gilles • 0
@84bd9038
Last seen 21 months ago
Belgium

I have a reference database of barcode sequences with a species name/taxonomy associated with each sequence.
I would like to remove any duplicated sequence but only when the duplicates are associated with the same taxon.

The minimal reproducible example below shows how to do that with a very naïve approach (simply pasting the sequence and species names then looking for duplicates...).

My question is : are there better ways to do that with any bioconductor or R packages ?
I know that I could do that for example with qiime rescript dereplicate --p-mode 'unique' but I'm looking for an R solution

# Create a temporary file for a toy reference database
temp_sequences <- tempfile("temp_sequences", fileext = ".fasta")

# Add a few sequences to this temporary file
writeLines(
">SpeciesA
TTTTTTTTTTAAAAAAAAAA
>SpeciesA
TTTTTTTTTTAAAAAAAAAA
>SpeciesB
TTTTTTTTTTAAAAAAAAAA
>SpeciesC
GGGGGGGGGGCCCCCCCCCC",
temp_sequences)

# read the sequences
my_fasta <- Biostrings::readDNAStringSet(temp_sequences)
my_fasta

# dereplication without taking into account the taxon 
# --> not what I want
my_fasta[!duplicated(my_fasta)]

# dereplication taking into account the taxon 
# --> this is what I want but probably sub-optimal solution
my_fasta[!duplicated(paste(as.vector(my_fasta), names(my_fasta)))]

The same code chunck run with output :

> # Add a few sequences to this temporary file
> writeLines(
+ ">SpeciesA
+ TTTTTTTTTTAAAAAAAAAA
+ >SpeciesA
+ TTTTTTTTTTAAAAAAAAAA
+ >SpeciesB
+ TTTTTTTTTTAAAAAAAAAA
+ >SpeciesC
+ GGGGGGGGGGCCCCCCCCCC",
+ temp_sequences)
> 
> # read the sequences
> my_fasta <- Biostrings::readDNAStringSet(temp_sequences)
> my_fasta
DNAStringSet object of length 4:
    width seq                                                                           names               
[1]    20 TTTTTTTTTTAAAAAAAAAA                                                          SpeciesA
[2]    20 TTTTTTTTTTAAAAAAAAAA                                                          SpeciesA
[3]    20 TTTTTTTTTTAAAAAAAAAA                                                          SpeciesB
[4]    20 GGGGGGGGGGCCCCCCCCCC                                                          SpeciesC
> 
> # dereplication without taking into account the taxon 
> # --> not what I want
> my_fasta[!duplicated(my_fasta)]
DNAStringSet object of length 2:
    width seq                                                                           names               
[1]    20 TTTTTTTTTTAAAAAAAAAA                                                          SpeciesA
[2]    20 GGGGGGGGGGCCCCCCCCCC                                                          SpeciesC
> 
> # dereplication taking into account the taxon 
> # --> this is what I want but probably sub-optimal solution
> my_fasta[!duplicated(paste(as.vector(my_fasta), names(my_fasta)))]
DNAStringSet object of length 3:
    width seq                                                                           names               
[1]    20 TTTTTTTTTTAAAAAAAAAA                                                          SpeciesA
[2]    20 TTTTTTTTTTAAAAAAAAAA                                                          SpeciesB
[3]    20 GGGGGGGGGGCCCCCCCCCC                                                          SpeciesC
Biostrings • 542 views
ADD COMMENT

Login before adding your answer.

Traffic: 971 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6