I am interested in quantifying the microbial content of an RNAseq experiment in a way that allows for statistically comparing abundances of species between samples. There are a few tools that have been built for this that I am using, but I was curious to hear if anyone has thoughts on using pseudo-alignment (Salmon) and DESeq2 for this. The workflow I've tried is as follows:
1) Map reads to the model organism that was used in the experiment and extract all unmapped reads 2) De novo assemble unmapped reads (with Trinity) 3) Build index for Salmon with contigs assembled from unmapped reads 4) Estimate counts with Salmon 5) Use blastn to identify prospective species ids for each contig. Then make a contig-to-species map to use with tximport and DESeq2 such that counts from each contig are combined at the species level.
So far the results from this approach agree with other methods I have tried, but I'm concerned that I might be grossly abusing the assumptions of the DESeq2 models. For example, is treating all transcripts (contigs) from a species as a single gene problematic? Or is this perhaps even a feature that minimizes the potential impact of zero-inflated distributions arising from low sequencing depth of microbial samples mixed in with a much deeper sampling of host sequence, and also a more conservative measurement of dispersion?