Question

Dereplicate DNA sequences while taking into account the corresponding taxonomy

0

Entering edit mode

Gilles • 0

@84bd9038

Last seen 2.4 years ago

Belgium

I have a reference database of barcode sequences with a species name/taxonomy associated with each sequence.
I would like to remove any duplicated sequence but only when the duplicates are associated with the same taxon.

The minimal reproducible example below shows how to do that with a very naïve approach (simply pasting the sequence and species names then looking for duplicates...).

My question is : are there better ways to do that with any bioconductor or R packages ?
I know that I could do that for example with qiime rescript dereplicate --p-mode 'unique' but I'm looking for an R solution

# Create a temporary file for a toy reference database
temp_sequences <- tempfile("temp_sequences", fileext = ".fasta")

# Add a few sequences to this temporary file
writeLines(
">SpeciesA
TTTTTTTTTTAAAAAAAAAA
>SpeciesA
TTTTTTTTTTAAAAAAAAAA
>SpeciesB
TTTTTTTTTTAAAAAAAAAA
>SpeciesC
GGGGGGGGGGCCCCCCCCCC",
temp_sequences)

# read the sequences
my_fasta <- Biostrings::readDNAStringSet(temp_sequences)
my_fasta

# dereplication without taking into account the taxon 
# --> not what I want
my_fasta[!duplicated(my_fasta)]

# dereplication taking into account the taxon 
# --> this is what I want but probably sub-optimal solution
my_fasta[!duplicated(paste(as.vector(my_fasta), names(my_fasta)))]

The same code chunck run with output :

> # Add a few sequences to this temporary file
> writeLines(
+ ">SpeciesA
+ TTTTTTTTTTAAAAAAAAAA
+ >SpeciesA
+ TTTTTTTTTTAAAAAAAAAA
+ >SpeciesB
+ TTTTTTTTTTAAAAAAAAAA
+ >SpeciesC
+ GGGGGGGGGGCCCCCCCCCC",
+ temp_sequences)
> 
> # read the sequences
> my_fasta <- Biostrings::readDNAStringSet(temp_sequences)
> my_fasta
DNAStringSet object of length 4:
    width seq                                                                           names               
[1]    20 TTTTTTTTTTAAAAAAAAAA                                                          SpeciesA
[2]    20 TTTTTTTTTTAAAAAAAAAA                                                          SpeciesA
[3]    20 TTTTTTTTTTAAAAAAAAAA                                                          SpeciesB
[4]    20 GGGGGGGGGGCCCCCCCCCC                                                          SpeciesC
> 
> # dereplication without taking into account the taxon 
> # --> not what I want
> my_fasta[!duplicated(my_fasta)]
DNAStringSet object of length 2:
    width seq                                                                           names               
[1]    20 TTTTTTTTTTAAAAAAAAAA                                                          SpeciesA
[2]    20 GGGGGGGGGGCCCCCCCCCC                                                          SpeciesC
> 
> # dereplication taking into account the taxon 
> # --> this is what I want but probably sub-optimal solution
> my_fasta[!duplicated(paste(as.vector(my_fasta), names(my_fasta)))]
DNAStringSet object of length 3:
    width seq                                                                           names               
[1]    20 TTTTTTTTTTAAAAAAAAAA                                                          SpeciesA
[2]    20 TTTTTTTTTTAAAAAAAAAA                                                          SpeciesB
[3]    20 GGGGGGGGGGCCCCCCCCCC                                                          SpeciesC

Biostrings • 704 views

ADD COMMENT • link 2.4 years ago Gilles • 0