I'm analysing a 16S microbial community dataset and am using DESeq2 to test for differential abundance. I supply raw counts to DESeq(), following the vignette's rationale that the model fitting assumes raw count data. If I later want to try, e.g., ordination of my samples, I might first apply the rlog transformation or VST to stabilise the variance of my data.
The PICRUSt package provides inferred gene content for microbial communities, by referencing 16S taxon abundances against known (or extrapolated) genome content, and providing a table of likely microbial gene abundances for your 16S dataset. The accuracy of this prediction is reflected by a Nearest Sequenced Taxon Index (NSTI), with scores below 0.05 being 'good' and above 0.15 being 'undesirable'.
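To make the NSTI cut-offs concrete, here is a minimal Python sketch of how I would bin samples by score. The 0.05 and 0.15 thresholds are the ones quoted above; the "intermediate" label for scores in between, the function name and the sample values are my own assumptions for illustration and are not part of PICRUSt.

```python
# Thresholds as quoted above; everything else here is hypothetical.
GOOD_MAX = 0.05          # scores below this are treated as 'good'
UNDESIRABLE_MIN = 0.15   # scores above this are treated as 'undesirable'


def classify_nsti(score: float) -> str:
    """Bin a single NSTI score using the two thresholds above."""
    if score < 0:
        raise ValueError("NSTI cannot be negative")
    if score < GOOD_MAX:
        return "good"
    if score > UNDESIRABLE_MIN:
        return "undesirable"
    # Assumed label: the thresholds above do not name this middle range.
    return "intermediate"


# Made-up per-sample NSTI values, for illustration only.
sample_nsti = {"sample_A": 0.03, "sample_B": 0.09, "sample_C": 0.21}
labels = {name: classify_nsti(score) for name, score in sample_nsti.items()}
print(labels)
```

Samples falling in the 'undesirable' bin could then be flagged or dropped before any downstream testing.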
I would like to use DESeq2 to test for differential abundance of PICRUSt-inferred genes/gene pathways, but:
- DESeq2 was intended for RNA expression data, although it is often extended to 16S analysis. Under what conditions does DESeq2's suitability for a data type become questionable? How broadly suitable is it for large, sparse count datasets?
- DESeq2 takes raw counts. This conflicts with PICRUSt, where the original 16S dataset is normalised by copy number before the predicted gene set is calculated. Are there any thoughts on how this should be dealt with?
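To illustrate the conflict in the second point, here is a minimal Python sketch of the arithmetic as I understand it: read counts are divided by 16S copy number, then multiplied by per-genome gene copy numbers and summed. All names and numbers are hypothetical, and this is not PICRUSt's actual code. Rounding at the end is shown only as one possible workaround, not as a recommendation.

```python
# Hypothetical inputs for a single sample and a single gene family.
otu_counts = {"otu_1": 120, "otu_2": 45, "otu_3": 7}   # raw 16S read counts
copies_16s = {"otu_1": 4, "otu_2": 7, "otu_3": 2}      # 16S copies per genome
gene_copies = {"otu_1": 2, "otu_2": 1, "otu_3": 1}     # gene copies per genome


def predicted_gene_abundance(counts, copies_16s, gene_copies):
    """Sum over OTUs of (reads / 16S copy number) * gene copy number."""
    total = 0.0
    for otu, reads in counts.items():
        organisms = reads / copies_16s[otu]   # copy-number-normalised abundance
        total += organisms * gene_copies[otu]
    return total


abundance = predicted_gene_abundance(otu_counts, copies_16s, gene_copies)
print(abundance)            # generally not an integer

# One possible workaround (an assumption, not a recommendation):
# round to the nearest integer before handing the table to a count model.
rounded = round(abundance)
print(rounded)
```

The predicted abundance is fractional even though every input read count was an integer, which is why the table no longer looks like the raw counts DESeq2 expects.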
A DESeq2 16S copy number correction has been discussed elsewhere, but I'm not sure that it addresses this issue in the same way.
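For context on the sparsity concern in my first point, here is a minimal Python sketch of the median-of-ratios idea that, as I understand it, underlies DESeq2's default size factor estimation. It is a pure-Python reimplementation for illustration only, not DESeq2 code, and the function name and toy tables are hypothetical. Features with a zero in any sample have a geometric mean of zero and drop out of the calculation, so a sufficiently sparse table can leave nothing to estimate from.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: list of features, each a list of per-sample counts."""
    n_samples = len(counts[0])
    # Log geometric mean per feature; undefined if any sample has a zero.
    log_geo_means = []
    for row in counts:
        if all(c > 0 for c in row):
            log_geo_means.append(sum(math.log(c) for c in row) / n_samples)
        else:
            log_geo_means.append(None)
    usable = [i for i, m in enumerate(log_geo_means) if m is not None]
    if not usable:
        raise ValueError("every feature contains at least one zero")
    factors = []
    for j in range(n_samples):
        # Median over usable features of log(count / geometric mean).
        log_ratios = [math.log(counts[i][j]) - log_geo_means[i] for i in usable]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy tables (features x samples), made up for illustration.
dense = [[10, 20], [30, 60], [5, 10]]    # second sample sequenced twice as deep
sparse = [[10, 0], [0, 60], [5, 0]]      # every feature has a zero somewhere

dense_factors = size_factors(dense)
print(dense_factors)

try:
    size_factors(sparse)
except ValueError as err:
    print("sparse table:", err)
```

In the dense table the second sample's size factor is twice the first's, as expected, while the sparse table has no feature left to base an estimate on.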