I am trying to get taxonomic assignments for amplicon sequences of a COI barcode (insect-specific but quite degenerate primers: BF3, BR2; Elbrecht et al., 2019).
As there are not really any well-curated reference databases available for COI, I am trying to adapt existing databases for use with
DECIPHER::IdTaxa and/or
I am trying two approaches:
1) I downloaded all BINs for Arthropods from BOLD.
2) I downloaded this database that mines COI genes from GenBank.
Both databases have a lot of sequences (~4M for BOLD, 1.2M for GenBankDB) but far fewer unique species (400K for BOLD, 110K for GenBankDB), where a species is defined as a BIN in BOLD and as the same taxonomy in GenBankDB.
My plan was thus to
1) Align the seqs within each cluster / BIN / "species" (or a random subsample of 200 seqs if there are more)
2) Find a majority consensus sequence ( DECIPHER::ConsensusSequence( , threshold = 0.5) )
3) Assign consensus taxonomy to consensus seq
and then use the result as the reference file for taxonomic assignment.
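To make the plan concrete, here is a minimal sketch of the per-cluster consensus step. It uses toy sequences and made-up BIN labels (`BIN_A`, `BIN_B`) instead of the real BOLD/GenBank exports, and the helper `consensus_for` and the variable names are only illustrative. Gaps are stripped from the consensus with `RemoveGaps`, which is my own assumption about what a reference file should look like.

```r
# Sketch only: toy data standing in for the real BOLD / GenBank sequences.
library(DECIPHER)  # Bioconductor; also attaches Biostrings

set.seed(1)
rand_seq <- function(n) {
  paste(sample(c("A", "C", "G", "T"), n, replace = TRUE), collapse = "")
}
a <- rand_seq(80)
b <- rand_seq(80)
a_var <- a  # one-base variant of 'a'
substr(a_var, 40, 40) <- if (substr(a, 40, 40) == "A") "C" else "A"

seqs <- DNAStringSet(c(a, a, a_var, b, b))
bin  <- c("BIN_A", "BIN_A", "BIN_A", "BIN_B", "BIN_B")  # hypothetical labels
max_per_bin <- 200  # subsample cap from step 1

consensus_for <- function(idx) {
  # Step 1: random subsample if the cluster is larger than the cap
  if (length(idx) > max_per_bin) idx <- sample(idx, max_per_bin)
  s <- seqs[idx]
  # Nothing to align if the cluster holds a single distinct sequence
  if (length(unique(s)) == 1L) return(as.character(s[1]))
  aln <- AlignSeqs(s, verbose = FALSE)
  # Step 2: majority consensus, as in the plan
  cons <- ConsensusSequence(aln, threshold = 0.5)
  as.character(RemoveGaps(cons))
}

groups <- split(seq_along(seqs), bin)
ref <- DNAStringSet(vapply(groups, consensus_for, character(1)))
ref  # one consensus sequence per BIN, named by BIN (step 3 would attach taxonomy)
```

For the real data the loop would run over ~400K clusters, so it would probably need to be parallelised or run in batches.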
While trying this, some questions arose:
1) Is that approach useful / legitimate?
2) Even the majority consensus seq can have ambiguous bases. Can a reference database for DECIPHER::LearnTaxa contain ambiguous bases?
3) Is there any information about how DECIPHER::LearnTaxa scales (timewise) with the size of the database?
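For the scaling question, the only thing I could think of is to time `LearnTaxa` on subsets of increasing size and extrapolate. Below is a rough sketch of that, assuming the small example FASTA file that ships with DECIPHER (`50S_ribosomal_protein_L2.fas`) and a genus/species taxonomy parsed from its headers. The subset fractions are arbitrary, and timings on such a small set say nothing reliable about hundreds of thousands of references; it only shows how the measurement could be set up.

```r
# Sketch only: times LearnTaxa on random subsets of DECIPHER's bundled example data.
library(DECIPHER)

fas <- system.file("extdata", "50S_ribosomal_protein_L2.fas", package = "DECIPHER")
dna <- readDNAStringSet(fas)

# Build a simple "Root; Genus; Species" taxonomy from the sequence headers
parts <- strsplit(names(dna), " ")
genus <- vapply(parts, function(x) x[1], character(1))
species <- vapply(parts, function(x) x[2], character(1))
taxonomy <- paste("Root", genus, species, sep = "; ")

set.seed(1)
sizes <- round(length(dna) * c(0.25, 0.5, 1))  # arbitrary subset sizes
timings <- data.frame(n = sizes, seconds = NA_real_)

for (i in seq_along(sizes)) {
  keep <- sample(length(dna), sizes[i])
  elapsed <- system.time(
    LearnTaxa(dna[keep], taxonomy[keep], verbose = FALSE)
  )["elapsed"]
  timings$seconds[i] <- elapsed
}

timings  # training time (seconds) per subset size
```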
Thank you for your help!