Why does the IDTaxa trainingSet for SILVA contain "_2" suffixes in the taxonomy names?
Korneel • 0

Hi,

I am using the IDTaxa function to classify some 16S sequences (ASVs that I got out of dada2), with the SILVA trainingSet that is available on the DECIPHER website ("SILVA SSU r138.2 (modified) (299 MB)").

In the results I see names like "Pseudomonas_2" and "Acinetobacter_2", but when I search SILVA these "_2" suffixes are not there. Why are these suffixes added? Is this documented somewhere?

I am considering trimming them because they are causing me some trouble in my pipeline.

Thanks!


This has to do with the fact that some taxonomies reuse names at the same rank level. Common examples in the SILVA taxonomy are "Incertae Sedis" and "uncultured". If you consider only a single rank level (e.g., genus), these names would incorrectly collapse into the same taxon even though they belong to different taxonomic lineages. To avoid this, a unique number is appended to such taxa.
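To make the idea concrete, here is a minimal sketch of this kind of disambiguation. It is not DECIPHER's actual code: the function name `disambiguate`, the simplified three-rank lineages, and the exact numbering rule (first occurrence keeps the plain name, later parents get `_2`, `_3`, ...) are assumptions for illustration. Python is used only for the example; the same logic carries over to R.

```python
def disambiguate(lineages):
    """Append _2, _3, ... to a name reused under a different parent at the same rank.

    `lineages` is a list of lists of taxon names, ordered from highest rank down.
    Illustrative only; the numbering rule is an assumption, not DECIPHER's.
    """
    # (rank depth, name) -> distinct parent paths, in first-seen order
    seen = {}
    result = []
    for lineage in lineages:
        parent = ()  # path of already-labelled ancestors
        labelled = []
        for depth, name in enumerate(lineage):
            parents = seen.setdefault((depth, name), [])
            if parent not in parents:
                parents.append(parent)
            k = parents.index(parent) + 1
            label = name if k == 1 else f"{name}_{k}"
            labelled.append(label)
            parent = parent + (label,)
        result.append(labelled)
    return result


example = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
    ["Bacteria", "Pseudomonadota", "Gammaproteobacteria"],
]
for row in disambiguate(example):
    print(";".join(row))
# Bacteria;Proteobacteria;Gammaproteobacteria
# Bacteria;Pseudomonadota;Gammaproteobacteria_2
```

Without the suffix, a table summarized at the class level alone would merge the two "Gammaproteobacteria" entries even though their parents differ.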

The latest SILVA (v138.2) taxonomy contains many similarly named taxa (and likely sequences) belonging to different taxonomic lineages. I am guessing these are due to major taxonomic reassignments that left behind some sequences with the previous taxonomic name. For example, "Proteobacteria" and "Pseudomonadota" both contain "Gammaproteobacteria".

Some redundancies could also be due to alternative spellings (e.g., "Cyanobacteriia" or "Halobacterota") that result in bifurcating the same taxonomic lineage. SILVA has a lot of these, unfortunately.

You can trim the appended numbers as you suggested, but be careful not to collapse taxa from distinct taxonomic lineages when you do.
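One way to trim safely is to strip the suffix from every name while keeping the full lineage, then check which trimmed names occur under more than one parent path. The sketch below assumes the classifications have already been exported as tuples of names from highest rank down. The helper names (`trim_suffix`, `trim_lineage`, `find_collisions`) and the `_<digits>` pattern are hypothetical, and Python is used only for illustration. Note that the pattern would also strip a genuine name that happens to end in an underscore plus digits, so it is worth checking the output against the SILVA names you expect.

```python
import re

# Assumed suffix format: underscore followed by digits at the end of the name.
SUFFIX = re.compile(r"_\d+$")


def trim_suffix(name):
    """Remove a trailing _<digits> from a taxon name."""
    return SUFFIX.sub("", name)


def trim_lineage(lineage):
    """Trim every name in a lineage, keeping the lineage itself intact."""
    return tuple(trim_suffix(name) for name in lineage)


def find_collisions(lineages, rank_index):
    """Map each trimmed name at `rank_index` to its parent paths,
    keeping only names found under more than one parent path."""
    parents = {}
    for lineage in lineages:
        trimmed = trim_lineage(lineage)
        if len(trimmed) <= rank_index:
            continue  # lineage not classified down to this rank
        parents.setdefault(trimmed[rank_index], set()).add(trimmed[:rank_index])
    return {name: sorted(paths) for name, paths in parents.items() if len(paths) > 1}


lineages = [
    ("Bacteria", "Proteobacteria", "Gammaproteobacteria"),
    ("Bacteria", "Pseudomonadota", "Gammaproteobacteria_2"),
]
print(trim_suffix("Pseudomonas_2"))  # Pseudomonas
print(find_collisions(lineages, 2))
```

Any name reported by `find_collisions` would merge distinct lineages if it were summarized by that rank alone, so those are the cases where the full trimmed lineage should be used as the key instead of the single name.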


Thanks for the quick reply!

If I understand correctly, as long as I keep track of the full lineage there should be no risk of collapsing distinct taxa. So I will go ahead and trim the numbers.

I did some more investigation, and although Proteobacteria is still listed in the SILVA taxonomy, it is not actually used in any of the lineages in the fasta file headers. So it feels to me like you are "contaminating" a lot of your database with these suffixes for no reason.


You are correct. Thank you for noticing that some of this 'pollution' with "_2" suffixes was unnecessary.

I posted an updated SILVA classifier on the DECIPHER website. Note that some "_2" suffixes are still required for the reasons mentioned above.

I hope that helps. Please let me know how it goes.
