I am interested in quantifying the microbial content of an RNAseq experiment in a way that allows for statistically comparing abundances of species between samples. There are a few tools that have been built for this that I am using, but I was curious to hear if anyone has thoughts on using pseudo-alignment (Salmon) and DESeq2 for this. The workflow I've tried is as follows:
1) Map reads to the model organism that was used in the experiment and extract all unmapped reads 2) De novo assemble unmapped reads (with Trinity) 3) Build index for Salmon with contigs assembled from unmapped reads 4) Estimate counts with Salmon 5) Use blastn to identify prospective species ids for each contig. Then make a contig-to-species map to use with tximport and DESeq2 such that counts from each contig are combined at the species level.
So far the results from this approach agree with other methods I have tried, but I'm concerned that I might be grossly abusing the assumptions of the DESeq2 models. For example, is treating all transcripts (contigs) from a species as a single gene problematic? Or is this perhaps even a feature that minimizes the potential impact of zero-inflated distributions arising from low sequencing depth of microbial samples mixed in with a much deeper sampling of host sequence, and also a more conservative measurement of dispersion?
Thank you for the input, Michael. Here are some rows of the counts matrix:
There are ~6,000 dimensions to the matrix (~900 species and 6 samples). Most of the counts correspond to zebrafish (the host species in these experiments), and the resulting baseMean for this species is much larger than the rest (~12 million compared to an average of 600 for all others with > 10, or a median of ~11).
So these can be rounded and provided to DESeq2. The caveat is that you will be finding differences after in silico normalization of the counts. The best we can do is attempt to minimize the differences, so any systematic differences are removed because they are indistinguishable from library size.