Hi,
I am trying to get taxonomic assignments for amplicon sequences of a COI barcode (insect-specific but quite degenerate primers: BF3, BR2; Elbrecht et al., 2019).
As there are not really any well-curated reference databases for COI, I am trying to adapt existing databases for use with DECIPHER::IdTaxa and/or dada2::assignTaxonomy.
I am trying two approaches:
1) I downloaded all BINs for Arthropods from BOLD.
2) I downloaded this database that mines COI genes from GenBank.
Both databases have a lot of sequences (~4M for BOLD, 1.2M for GenBankDB) but far fewer unique species (400K for BOLD, 110K for GenBankDB), where a species is defined as a BIN in BOLD and as an identical taxonomy in GenBankDB.
My plan was thus to:
1) Align the sequences within each cluster / BIN / "species" (or a random subsample of 200 sequences if there are more)
2) Find a majority consensus sequence (DECIPHER::ConsensusSequence( , threshold = 0.5))
3) Assign the consensus taxonomy to the consensus sequence
4) Use this as the reference file for taxonomic assignment.
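To make the plan above concrete, here is a minimal, library-free Python sketch of the subsampling and majority-consensus steps. It is only an illustration of the idea, not a reimplementation of DECIPHER::ConsensusSequence, whose `threshold` argument has its own documented semantics. The function names `subsample` and `majority_consensus` and the `min_fraction` parameter are hypothetical, and the input is assumed to be already aligned (equal-length strings with `-` for gaps). It also shows where ambiguous bases come from: a column without a strict majority is written as an IUPAC ambiguity code.

```python
import random
from collections import Counter

# IUPAC nucleotide codes, keyed by the alphabetically sorted set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def subsample(seqs, max_n=200, seed=0):
    """Return at most max_n sequences, drawn at random without replacement."""
    seqs = list(seqs)
    if len(seqs) <= max_n:
        return seqs
    return random.Random(seed).sample(seqs, max_n)


def majority_consensus(aligned, min_fraction=0.5):
    """Column-wise consensus of aligned, equal-length sequences.

    Assumption for this sketch: a base is called only if its share of the
    non-gap bases in the column is strictly greater than min_fraction;
    otherwise the IUPAC code for all bases seen in that column is used.
    Columns containing no A/C/G/T at all are written as a gap.
    """
    aligned = [s.upper() for s in aligned]
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to equal length")
    out = []
    for column in zip(*aligned):
        counts = Counter(b for b in column if b in "ACGT")
        total = sum(counts.values())
        if total == 0:
            out.append("-")
            continue
        base, n = counts.most_common(1)[0]
        if n / total > min_fraction:
            out.append(base)
        else:
            out.append(IUPAC["".join(sorted(counts))])
    return "".join(out)


# Column 2 is a 2:2 tie between C and T, so it becomes the ambiguity code Y.
print(majority_consensus(["ACGT", "ACGA", "ATGA", "ATG-"]))  # AYGA
```

In this toy example the tie in the second column is what produces an ambiguous base in the consensus, which is the situation the second question below is about.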
While trying this, some questions arose:
1) Is that approach useful / legitimate?
2) Even the majority consensus sequence can have ambiguous bases. Can a reference database contain ambiguous bases for dada2::assignTaxonomy / DECIPHER::IdTaxa?
3) For DECIPHER::LearnTaxa, is there any information about how it scales (time-wise) with the size of the database?
Thank you for your help!
Fabian
Thank you! This is really helpful.