Do DECIPHER::IdTaxa / dada2::assignTaxonomy work with ambiguous bases in the reference training set?
@fabianroger-13931
Last seen 2.3 years ago

Hi,

I am trying to get taxonomic assignments for amplicon sequences of a COI barcode (insect-specific but quite degenerate primers, BF3/BR2, Elbrecht et al., 2019).

As there are not really any well-curated reference databases for COI, I am trying to adapt existing databases for use with DECIPHER::IdTaxa and/or dada2::assignTaxonomy.

I am trying two approaches:

Both databases have a lot of sequences (~4M for BOLD, 1.2M for GenBankDB) but far fewer unique species (defined as a BIN in BOLD and as identical taxonomy in GenBankDB): 400K for BOLD and 110K for GenBankDB.

My plan was thus to:

1) Align the seqs within each cluster / BIN / "species" (or a random subsample of 200 seqs if there are more)
2) Find a majority consensus sequence (DECIPHER::ConsensusSequence( , threshold = 0.5))
3) Assign the consensus taxonomy to the consensus seq

and then use this as the reference file for taxonomic assignment.
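
To make the second step concrete, here is a minimal Python sketch of a plain per-column majority consensus. It is an illustration only and does not reproduce what DECIPHER::ConsensusSequence does internally (its `threshold` argument has its own definition); the function name `majority_consensus` and the `min_fraction` parameter are made up for this example. It shows why ambiguous bases can survive even a majority rule: when no base exceeds the cutoff in a column, an IUPAC ambiguity code is emitted.

```python
from collections import Counter

# Standard IUPAC nucleotide codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def majority_consensus(aligned, min_fraction=0.5):
    """Consensus of equal-length aligned sequences (hypothetical helper).

    A base is called only if it makes up MORE than min_fraction of the
    unambiguous bases in its column; otherwise the IUPAC code covering
    all observed bases is used.
    """
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")

    out = []
    for column in zip(*aligned):
        # Ignore gaps and pre-existing ambiguity codes in the input.
        bases = [b for b in column if b in "ACGT"]
        if not bases:
            out.append("-")
            continue
        counts = Counter(bases)
        base, n = counts.most_common(1)[0]
        if n / len(bases) > min_fraction:
            out.append(base)
        else:
            out.append(IUPAC[frozenset(counts)])
    return "".join(out)


cluster = ["ACGTAC", "ACGTAC", "ACGAAT", "ACGAAT"]
print(majority_consensus(cluster))  # ACGWAY
```

In this toy cluster, two columns are split exactly 50/50, so the consensus carries `W` (A or T) and `Y` (C or T) at those positions.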

While trying this, some questions arose:

1) Is that approach useful / legitimate?

2) Even the majority consensus seq can have ambiguous bases. Can a reference database have ambiguous bases for dada2::assignTaxonomy / DECIPHER::IdTaxa?

3) for DECIPHER::LearnTaxa, is there any information about how it scales (timewise) with the size of the database?

Fabian

@benjaminjcallahan-9771
Last seen 2.3 years ago

1) Is that approach useful / legitimate?

Yes. In fact, the naive Bayesian classifier algorithm that DADA2 implements in assignTaxonomy has largely been evaluated on reference databases that have been subsetted in a similar fashion, i.e. by "clustering" identical or highly similar sequences and choosing a representative from the cluster to be in the reference database. Raw databases with large numbers of identical sequences have the potential to induce inaccurate taxonomic assignment by overwhelming the bootstrap-based confidence evaluation step with sheer numeric replication.

That said, the details of your clustering method and thresholds deserve careful consideration, and ideally some evaluation.

2) Even the majority consensus seq can have ambiguous bases. Can a reference database have ambiguous bases for dada2::assignTaxonomy

Yes. The reference sequences are shredded into kmers as part of the assignTaxonomy method, and the kmers with ambiguous nucleotides are simply ignored.
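
The idea of shredding a reference into k-mers and dropping the ambiguous ones can be sketched in a few lines of Python. This is a conceptual illustration, not the dada2 source code; the function name `unambiguous_kmers` is hypothetical, and the default word size of 8 is an assumption based on the usual choice for RDP-style naive Bayesian classifiers.

```python
def unambiguous_kmers(seq, k=8):
    """Return the set of k-mers in seq that contain only A, C, G, T.

    Any window that overlaps an ambiguity code (N, R, Y, ...) or a gap
    is skipped, so one ambiguous base removes up to k k-mers.
    """
    seq = seq.upper()
    allowed = set("ACGT")
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= allowed:
            kmers.add(kmer)
    return kmers


# Small k only to keep the example readable.
print(sorted(unambiguous_kmers("ACGTNACGA", k=4)))  # ['ACGA', 'ACGT']
```

Here the single `N` knocks out four of the six possible 4-mers. The practical consequence is that a reference with many ambiguous positions still works, but contributes fewer k-mers to the classifier.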


Thank you! This is really helpful.

Erik Wright ▴ 150
@erik-wright-14386
Last seen 8 months ago
United States

1) Is that approach useful / legitimate?

It is legitimate to cluster sequences and select a representative. However, I suggest clustering sequences and selecting a subset of each cluster for building the reference database. There is no need to only input one consensus sequence, and this would likely work worse than having a limited number (10-100) of representatives of a group. Note that all k-mer based algorithms (RDP and IDTAXA included) ignore ambiguous k-mers.
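
As a rough illustration of keeping a limited number of representatives per group instead of a single consensus, here is a Python sketch. The function name `pick_representatives`, the `(group_id, sequence)` input shape, and the choice to collapse identical sequences before sampling are all assumptions made for this example; it is not part of DECIPHER.

```python
import random
from collections import defaultdict


def pick_representatives(records, max_per_group=100, seed=1):
    """Keep at most max_per_group distinct sequences per group.

    records: iterable of (group_id, sequence) pairs, where group_id
    could be a BIN or a taxonomy string (hypothetical input format).
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    groups = defaultdict(set)
    for group_id, seq in records:
        groups[group_id].add(seq)  # identical sequences are collapsed

    picked = {}
    for group_id, seqs in groups.items():
        unique = sorted(seqs)  # sort for a deterministic sampling order
        if len(unique) > max_per_group:
            unique = rng.sample(unique, max_per_group)
        picked[group_id] = unique
    return picked


records = [
    ("BIN1", "ACGTAA"), ("BIN1", "ACGTAA"), ("BIN1", "ACGTAC"),
    ("BIN1", "ACGTAG"), ("BIN1", "ACGTAT"), ("BIN2", "TTGACA"),
]
reps = pick_representatives(records, max_per_group=2)
print({group: len(seqs) for group, seqs in reps.items()})  # {'BIN1': 2, 'BIN2': 1}
```

The selected sequences would then each be written to the training set with their group's taxonomy, so within-group variation is retained without letting large groups dominate.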

2) Even the majority consensus seq can have ambiguous bases. Can a reference database have ambiguous bases for dada2::assignTaxonomy / DECIPHER::IdTaxa?

Yes, both programs allow ambiguous bases.

3) for DECIPHER::LearnTaxa, is there any information about how it scales (timewise) with the size of the database?

LearnTaxa() scales in time roughly with the size of the reference taxonomy (i.e., taxonomic tree). Note that you only need to run LearnTaxa() once per reference set, and the output can be reused with IdTaxa() for classification.


Thank you! This is really helpful.

@fabianroger08-11956
Last seen 11 months ago

Hi again,

Sorry for coming back to this 15 months later, but I have a follow-up question related to the one above. When I have a training set where some species have missing taxonomy at the Species / Genus level, how should this be formatted?

Should it be NA, or just left empty (;genus;; for a missing species), or must all sequences have a name at all ranks?