edgeR/Deseq2 for differential OTU abundance in a mock 16s community?
1
0
Entering edit mode
Nick N ▴ 60
@nick-n-6370
Last seen 8.6 years ago
United Kingdom

We have a bacterial mock community consisting of 22 bacterial taxa. The mock community exists in 2 basic versions - one in which each taxon is (supposed to be) equally abundant, and one in which one of the species has inflated abundance (at several different levels). 

We would like to use the mock community datasets to test our wet lab and bioinformatics pipeline. Our idea was to construct the OTU abundance table (this is, IMHO, the equivalent of the gene counts in RNA-Seq) and to use edgeR to perform an exact test or to do the equivalent analysis in Deseq2. When the comparison is made between samples which contain equally distributed taxons we expect that no taxon would be flagged as being differentially expressed. In this way we hope to verify the precision of our pipeline. In a next series of tests we plan to compare the equally distributed samples to the ones containing inflated abundance for one of the taxons. In this way we hope to establish the sensitivity threshold for our pipeline. 

Question 1: Does such an experimental setup make sense methodologically? Please feel free to offer suggestions.

Question 2 (related): Are Deseq2/edgeR suitable to test low diversity samples?

I was told by someone with a statistical background that neither edgeR nor Deseq2 are suitable to perform the differential abundance test in our experimental design. The explanation was something along the lines that: "Neither Deseq2, nor edgeR are appropriate for extremely low diversity microbial communities like the above mock communities because they require at least a conditional independence of each feature. This is not the case where you have just 20 or so different features (OTUs, genes, etc.). Changes in the true proportion of any of these features 'causes' changes in the proportion of the others, even if there is no biologically meaningful change."

This comment leaves us in a methodological limbo. Shall we go for massive expansion of our mock community in order to simulate "conditional independence"? And, if so, how many shall we include - 100, 500, 1000?

Isn't there still a chance to make meaningful inference based on the mock community which we already have?

edger deseq2 • 2.5k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 2 minutes ago
The city by the bay

The proportion issue is not, in and of itself, a problem; TMM normalization in edgeR (and its equivalent in DESeq2) will remove the composition bias that is introduced by differential abundances of a subset of features. The real problem is that these normalization strategies assume that most features are not differentially abundant. Is this true for your data? Maybe, if there's only one feature (i.e., species) in your mock community that is differentially abundant.

The next question is whether the remaining features have counts that are precise enough to estimate the normalization factor. For RNA-seq data, low-precision observations aren't too problematic, because we have lots of genes and we make up for the lack of quality with quantity. That isn't an option when you've only got 21 (excluding the known differential feature) species to play with. If you make a MA plot between the two versions of your community, and find that the M-values for the 21 "non-differential" species are all over the place, then you might have a problem.

You could explore other options like spike-ins for normalization - but, if I understand correctly, your community is artificially constructed, so your 21 non-differential species are effectively acting as spike-ins already.

ADD COMMENT
0
Entering edit mode

Yes, the mock community is a sort of spike-in. We were thinking of adding it to each run but we don't know what probabilistic framework to incorporate it. If I remember correctly, there were control probes on the microarrays which could be used for normalisation, We were thinking of using the mock community in the same way.

Back to the lack of conditional independence - the thing is the same problem exists, in principle, with RNA-Seq, too. But it is moderated by the fact that we have tens of thousands of genes. Here we have just 22 "genes". What I would like to know is at what level we shall consider the "non-independence" issue to be a non-issue - do we need a mock community of 100, 500 or 1000? What level is sufficient?

ADD REPLY
0
Entering edit mode

Well, if the variability of your counts is low enough, then ~20 species may well be sufficient. If your counts are more variable, then you'll need more features to estimate the normalization factor precisely. It all depends on your data.

ADD REPLY
0
Entering edit mode

The problem is that there is a technical variability between the mock replicates.

ps. apologies for posting multiple times the same comment - somehow the submit button did not seem to respond.

ADD REPLY
0
Entering edit mode

Well, there's always going to be variability. The question is whether that variability is too large for effective normalization. If we only had to consider technical variability (i.e., Poisson variance from repeated sequencing runs), there wouldn't be a problem because that's quite small. However, because you're comparing between different communities, you need to be more concerned with the biological variability. This could be quite large, depending on how reproducible your procedures are for constructing the community and its corresponding sequencing library.

Look, long story short; just do it and see whether the 21 non-differential species have consistent M-values between your two communities. If they do, then that's good enough. If not, then you'll have to reconsider the community's structure.

ADD REPLY

Login before adding your answer.

Traffic: 743 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6