I have a reference database of barcode sequences with a species name/taxonomy associated with each sequence.
I would like to remove any duplicated sequence but only when the duplicates are associated with the same taxon.
The minimal reproducible example below shows how to do that with a very naïve approach (simply pasting the sequence and species names then looking for duplicates...).
My question is : are there better ways to do that with any bioconductor or R packages ?
I know that I could do that for example with qiime rescript dereplicate --p-mode 'unique'
but I'm looking for an R solution
# Create a temporary file for a toy reference database
temp_sequences <- tempfile("temp_sequences", fileext = ".fasta")
# Add a few sequences to this temporary file
writeLines(
">SpeciesA
TTTTTTTTTTAAAAAAAAAA
>SpeciesA
TTTTTTTTTTAAAAAAAAAA
>SpeciesB
TTTTTTTTTTAAAAAAAAAA
>SpeciesC
GGGGGGGGGGCCCCCCCCCC",
temp_sequences)
# read the sequences
my_fasta <- Biostrings::readDNAStringSet(temp_sequences)
my_fasta
# dereplication without taking into account the taxon
# --> not what I want
my_fasta[!duplicated(my_fasta)]
# dereplication taking into account the taxon
# --> this is what I want but probably sub-optimal solution
my_fasta[!duplicated(paste(as.vector(my_fasta), names(my_fasta)))]
The same code chunck run with output :
> # Add a few sequences to this temporary file
> writeLines(
+ ">SpeciesA
+ TTTTTTTTTTAAAAAAAAAA
+ >SpeciesA
+ TTTTTTTTTTAAAAAAAAAA
+ >SpeciesB
+ TTTTTTTTTTAAAAAAAAAA
+ >SpeciesC
+ GGGGGGGGGGCCCCCCCCCC",
+ temp_sequences)
>
> # read the sequences
> my_fasta <- Biostrings::readDNAStringSet(temp_sequences)
> my_fasta
DNAStringSet object of length 4:
width seq names
[1] 20 TTTTTTTTTTAAAAAAAAAA SpeciesA
[2] 20 TTTTTTTTTTAAAAAAAAAA SpeciesA
[3] 20 TTTTTTTTTTAAAAAAAAAA SpeciesB
[4] 20 GGGGGGGGGGCCCCCCCCCC SpeciesC
>
> # dereplication without taking into account the taxon
> # --> not what I want
> my_fasta[!duplicated(my_fasta)]
DNAStringSet object of length 2:
width seq names
[1] 20 TTTTTTTTTTAAAAAAAAAA SpeciesA
[2] 20 GGGGGGGGGGCCCCCCCCCC SpeciesC
>
> # dereplication taking into account the taxon
> # --> this is what I want but probably sub-optimal solution
> my_fasta[!duplicated(paste(as.vector(my_fasta), names(my_fasta)))]
DNAStringSet object of length 3:
width seq names
[1] 20 TTTTTTTTTTAAAAAAAAAA SpeciesA
[2] 20 TTTTTTTTTTAAAAAAAAAA SpeciesB
[3] 20 GGGGGGGGGGCCCCCCCCCC SpeciesC