Question

Using Salmon and DESeq2 to compare metagenomic content between samples

0

Entering edit mode

kballa • 0

@kballa-19347

Last seen 6.0 years ago

I am interested in quantifying the microbial content of an RNAseq experiment in a way that allows for statistically comparing abundances of species between samples. There are a few tools that have been built for this that I am using, but I was curious to hear if anyone has thoughts on using pseudo-alignment (Salmon) and DESeq2 for this. The workflow I've tried is as follows:

1) Map reads to the model organism that was used in the experiment and extract all unmapped reads 2) De novo assemble unmapped reads (with Trinity) 3) Build index for Salmon with contigs assembled from unmapped reads 4) Estimate counts with Salmon 5) Use blastn to identify prospective species ids for each contig. Then make a contig-to-species map to use with tximport and DESeq2 such that counts from each contig are combined at the species level.

So far the results from this approach agree with other methods I have tried, but I'm concerned that I might be grossly abusing the assumptions of the DESeq2 models. For example, is treating all transcripts (contigs) from a species as a single gene problematic? Or is this perhaps even a feature that minimizes the potential impact of zero-inflated distributions arising from low sequencing depth of microbial samples mixed in with a much deeper sampling of host sequence, and also a more conservative measurement of dispersion?

salmon deseq2 • 1.8k views

ADD COMMENT • link updated 6.0 years ago by Michael Love 43k • written 6.0 years ago by kballa • 0

score 1 · Answer 1 · 2019-01-11

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 1 day ago

United States

I can't speak to the validity of the method for counting species, but if you have counts across samples and want to find differences, then it sounds like the DESeq2 model applies. Can you give an example of some rows of this counts matrix? What are the dimensions of the resulting count matrix?

ADD COMMENT • link 6.0 years ago Michael Love 43k

0

Entering edit mode

Thank you for the input, Michael. Here are some rows of the counts matrix:

> head(txi_sp$counts)
                                               Run1   Run2  Run3   Run4   Run5   Run6
[Polyangium]_brachysporum                     2.000  0.000  2.00  0.000  0.000  0.000
Acacia_frigescens                             0.000  0.000  0.00  0.000  0.000  6.066
Acanthochromis_polyacanthus                  22.268 20.497 20.45 32.866 43.226 14.000
Acetobacter_ghanensis                         0.000  0.000  0.00  0.000  0.000  0.000
Acetobacter_pasteurianus_subsp._pasteurianus  0.000  0.000  0.00  0.000  0.000  1.490
Achlya_hypogyna                               0.000  0.000  0.00  0.000  1.000  8.000

There are ~6,000 dimensions to the matrix (~900 species and 6 samples). Most of the counts correspond to zebrafish (the host species in these experiments), and the resulting baseMean for this species is much larger than the rest (~12 million compared to an average of 600 for all others with > 10, or a median of ~11).

ADD REPLY • link 6.0 years ago kballa • 0

0

Entering edit mode

So these can be rounded and provided to DESeq2. The caveat is that you will be finding differences after in silico normalization of the counts. The best we can do is attempt to minimize the differences, so any systematic differences are removed because they are indistinguishable from library size.

ADD REPLY • link 6.0 years ago Michael Love 43k