Question: Using Salmon and DESeq2 to compare metagenomic content between samples
gravatar for kballa
4 months ago by
kballa0 wrote:

I am interested in quantifying the microbial content of an RNAseq experiment in a way that allows for statistically comparing abundances of species between samples. There are a few tools that have been built for this that I am using, but I was curious to hear if anyone has thoughts on using pseudo-alignment (Salmon) and DESeq2 for this. The workflow I've tried is as follows:

1) Map reads to the model organism that was used in the experiment and extract all unmapped reads 2) De novo assemble unmapped reads (with Trinity) 3) Build index for Salmon with contigs assembled from unmapped reads 4) Estimate counts with Salmon 5) Use blastn to identify prospective species ids for each contig. Then make a contig-to-species map to use with tximport and DESeq2 such that counts from each contig are combined at the species level.

So far the results from this approach agree with other methods I have tried, but I'm concerned that I might be grossly abusing the assumptions of the DESeq2 models. For example, is treating all transcripts (contigs) from a species as a single gene problematic? Or is this perhaps even a feature that minimizes the potential impact of zero-inflated distributions arising from low sequencing depth of microbial samples mixed in with a much deeper sampling of host sequence, and also a more conservative measurement of dispersion?

deseq2 salmon • 170 views
ADD COMMENTlink modified 4 months ago by Michael Love23k • written 4 months ago by kballa0
Answer: Using Salmon and DESeq2 to compare metagenomic content between samples
gravatar for Michael Love
4 months ago by
Michael Love23k
United States
Michael Love23k wrote:

I can't speak to the validity of the method for counting species, but if you have counts across samples and want to find differences, then it sounds like the DESeq2 model applies. Can you give an example of some rows of this counts matrix? What are the dimensions of the resulting count matrix?

ADD COMMENTlink written 4 months ago by Michael Love23k

Thank you for the input, Michael. Here are some rows of the counts matrix:

> head(txi_sp$counts)
                                               Run1   Run2  Run3   Run4   Run5   Run6
[Polyangium]_brachysporum                     2.000  0.000  2.00  0.000  0.000  0.000
Acacia_frigescens                             0.000  0.000  0.00  0.000  0.000  6.066
Acanthochromis_polyacanthus                  22.268 20.497 20.45 32.866 43.226 14.000
Acetobacter_ghanensis                         0.000  0.000  0.00  0.000  0.000  0.000
Acetobacter_pasteurianus_subsp._pasteurianus  0.000  0.000  0.00  0.000  0.000  1.490
Achlya_hypogyna                               0.000  0.000  0.00  0.000  1.000  8.000

There are ~6,000 dimensions to the matrix (~900 species and 6 samples). Most of the counts correspond to zebrafish (the host species in these experiments), and the resulting baseMean for this species is much larger than the rest (~12 million compared to an average of 600 for all others with > 10, or a median of ~11).

ADD REPLYlink modified 4 months ago • written 4 months ago by kballa0

So these can be rounded and provided to DESeq2. The caveat is that you will be finding differences after in silico normalization of the counts. The best we can do is attempt to minimize the differences, so any systematic differences are removed because they are indistinguishable from library size.

ADD REPLYlink written 4 months ago by Michael Love23k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 287 users visited in the last hour