Question: Using Salmon and DESeq2 to compare metagenomic content between samples
kballa0 wrote:

I am interested in quantifying the microbial content of an RNAseq experiment in a way that allows for statistically comparing abundances of species between samples. There are a few tools that have been built for this that I am using, but I was curious to hear if anyone has thoughts on using pseudo-alignment (Salmon) and DESeq2 for this. The workflow I've tried is as follows:

1) Map reads to the model organism that was used in the experiment and extract all unmapped reads
2) De novo assemble unmapped reads (with Trinity)
3) Build index for Salmon with contigs assembled from unmapped reads
4) Estimate counts with Salmon
5) Use blastn to identify prospective species ids for each contig. Then make a contig-to-species map to use with tximport and DESeq2 such that counts from each contig are combined at the species level.
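Step 5 can be sketched in base R as follows. This is only an illustration under stated assumptions: the `hits` table, its column names, and all counts are hypothetical, and it assumes each blastn hit has already been annotated with a species name. It keeps the best-scoring hit per contig, builds a two-column contig-to-species map (the same layout that tximport's `tx2gene` argument takes), and sums contig counts within each species.

```r
# Hypothetical blastn hits, one row per contig/hit pair (made-up values).
hits <- data.frame(
  contig   = c("c1", "c1", "c2", "c3", "c3"),
  species  = c("Danio_rerio", "Cyprinus_carpio", "Danio_rerio",
               "Acetobacter_ghanensis", "Acetobacter_pasteurianus"),
  bitscore = c(950, 610, 400, 120, 180),
  stringsAsFactors = FALSE
)

# Keep the highest-bitscore hit per contig.
hits <- hits[order(hits$contig, -hits$bitscore), ]
best <- hits[!duplicated(hits$contig), ]

# Two-column map: contig ID first, species second.
contig2species <- best[, c("contig", "species")]

# Hypothetical contig-level estimated counts (contigs x samples).
contig_counts <- matrix(
  c(10, 12,
     5,  3,
     0,  2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("c1", "c2", "c3"), c("Run1", "Run2"))
)

# Look up the species of each contig, then sum counts within species.
species_of <- contig2species$species[
  match(rownames(contig_counts), contig2species$contig)
]
species_counts <- rowsum(contig_counts, group = species_of)
print(species_counts)
```

In the actual workflow the summation would be done by tximport when it is given the map; the `rowsum()` call here only shows what "combined at the species level" means for the counts.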

So far the results from this approach agree with other methods I have tried, but I'm concerned that I might be grossly abusing the assumptions of the DESeq2 models. For example, is treating all transcripts (contigs) from a species as a single gene problematic? Or is this perhaps even a feature that minimizes the potential impact of zero-inflated distributions arising from low sequencing depth of microbial samples mixed in with a much deeper sampling of host sequence, and also a more conservative measurement of dispersion?
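To make the zero-inflation point concrete, here is a toy base R check with made-up numbers (the contig and species names are hypothetical). It compares the fraction of zero entries before and after summing contigs per species. It says nothing about whether the DESeq2 assumptions hold; it only shows how aggregation changes sparsity.

```r
# Hypothetical contig-level counts: 4 contigs from 2 species across 3 samples.
contig_counts <- matrix(
  c(0, 3, 0,
    2, 0, 0,
    0, 0, 4,
    1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("contig", 1:4), paste0("Run", 1:3))
)
species <- c("speciesA", "speciesA", "speciesB", "speciesB")

# Sum contig counts within each species.
species_counts <- rowsum(contig_counts, group = species)

# Fraction of matrix entries that are exactly zero.
zero_fraction <- function(m) mean(m == 0)

zero_fraction(contig_counts)   # 8 of 12 entries are zero
zero_fraction(species_counts)  # 2 of 6 entries are zero
```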

Tags: deseq2, salmon
modified 9 weeks ago by Michael Love • written 9 weeks ago by kballa0
Answer: Using Salmon and DESeq2 to compare metagenomic content between samples
Michael Love wrote:

I can't speak to the validity of the method for counting species, but if you have counts across samples and want to find differences, then it sounds like the DESeq2 model applies. Can you give an example of some rows of this counts matrix? What are the dimensions of the resulting count matrix?

Thank you for the input, Michael. Here are some rows of the counts matrix:

> head(txi_sp$counts)
                                               Run1   Run2  Run3   Run4   Run5   Run6
[Polyangium]_brachysporum                     2.000  0.000  2.00  0.000  0.000  0.000
Acacia_frigescens                             0.000  0.000  0.00  0.000  0.000  6.066
Acanthochromis_polyacanthus                  22.268 20.497 20.45 32.866 43.226 14.000
Acetobacter_ghanensis                         0.000  0.000  0.00  0.000  0.000  0.000
Acetobacter_pasteurianus_subsp._pasteurianus  0.000  0.000  0.00  0.000  0.000  1.490
Achlya_hypogyna                               0.000  0.000  0.00  0.000  1.000  8.000


The matrix has ~6,000 entries (~900 species by 6 samples). Most of the counts correspond to zebrafish (the host species in these experiments), and the resulting baseMean for this species is much larger than the rest (~12 million, compared with a mean of ~600 for all other species with baseMean > 10, or a median of ~11).
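For anyone wanting to reproduce this kind of summary, a minimal base R sketch follows. It re-enters the six rows printed above by hand as a stand-in for `txi_sp$counts`, so the zebrafish row is not included. It uses raw row means as a rough proxy; DESeq2's baseMean is the mean of normalized counts, so the numbers will not match exactly.

```r
# The six rows printed above, typed in by hand (stand-in for txi_sp$counts).
counts <- matrix(
  c( 2.000,  0.000,  2.00,  0.000,  0.000,  0.000,
     0.000,  0.000,  0.00,  0.000,  0.000,  6.066,
    22.268, 20.497, 20.45, 32.866, 43.226, 14.000,
     0.000,  0.000,  0.00,  0.000,  0.000,  0.000,
     0.000,  0.000,  0.00,  0.000,  0.000,  1.490,
     0.000,  0.000,  0.00,  0.000,  1.000,  8.000),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("[Polyangium]_brachysporum", "Acacia_frigescens",
      "Acanthochromis_polyacanthus", "Acetobacter_ghanensis",
      "Acetobacter_pasteurianus_subsp._pasteurianus", "Achlya_hypogyna"),
    paste0("Run", 1:6)
  )
)

# Dimensions: rows are species, columns are samples.
dim(counts)

# Raw mean count per species (a proxy, not DESeq2's normalized baseMean).
row_means <- rowMeans(counts)

# Mean and median among species whose mean count exceeds 10.
abundant <- row_means[row_means > 10]
mean(abundant)
median(abundant)

# Number of species with no counts in any sample.
sum(rowSums(counts) == 0)
```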