Question

humann2 output appropriate for EdgeR or DESeq2 analysis

0

Entering edit mode

nyoungb2 • 0

@nyoungb2-14667

Last seen 6.2 years ago

humann2 is a comprehensive pipeline for metagenome profiling. The method is similar to standard eukaryotic transcriptomics approaches (map reads to genome/genes), but humann2 calculates gene family abundances as weighted sums of the alignments from each read, normalized by gene length and alignment quality.

There's a lot of posts stating that EdgeR and DESeq2 should (usually) only be used on raw counts, but that is in the context of standard transcriptomics workflows, not metagenomics.

In the context of metagenomics, would it be appropriate to use the humann2 output for differential abundance analysis with EdgeR or DESeq2? What if the data was just normalized by alignment quality but not gene length?

I would like to use humann2 if possible, but not if it limits downstream analysis options such as DESeq2 & EdgeR.

edger deseq2 • 2.0k views

ADD COMMENT • link updated 8 months ago by marjim • 0 • written 6.3 years ago by nyoungb2 • 0

0

Entering edit mode

Hello!

I find myself in a situation very similar to yours. I have conducted a metatranscriptomic analysis of a set of RNA samples using the humann2 tool. My primary objective is to determine whether the administration of a treatment affects the gene expression of the microbiome in my samples. Consequently, I aim to perform a differential expression analysis in R. However, as you are aware, humann2 provides normalized data (RPK or CPM), rendering it unsuitable for utilization with standard analysis tools such as EdgeR or DESeq2. Could you please guide me on which tool I can employ to conduct the analysis using normalized data? Is there a specific option within humann2 that offers raw counts? Given my objectives, should I contemplate using an alternative tool instead of humann2?

Thank you very much!

ADD REPLY • link 8 months ago marjim • 0

score 2 · Answer 1 · 2017-12-20

I've used edgeR in a variety of genomics contexts - RNA-seq, ChIP-seq, Hi-C - and for mass cytometry data. In all cases, the NB model only makes sense when applied to count(-like) data. This is because the absolute value matters when modeling counts. For example, a count of 1 in library A is quite different, in terms of its relative variance, to a count of 100 in library B that has been sequenced 100-fold deeper than A. (Namely, under Poisson sampling, a mean count of 1 has a standard deviation of 1, while a mean count of 100 has a standard deviation of 10.) So, if you give edgeR values that aren't counts, the results won't make a lot of sense. I expect this to be true regardless of the origin of the counts, at least for sequencing data that is minimally Poisson-distributed.

Any normalization based on gene length yields even less suitable values for edgeR, as this breaks the mean-variance model. Long genes with large counts and low variances get their means scaled down, while short genes with small counts and high variances get their means scaled up. This effectively shuffles the points along the x-axis of the BCV plot and prevents us from doing any sensible modeling. As for alignment quality normalization; I don't know what this entails, but if the results are interpretable as counts, it might be okay. For example, RSEM output is often used as input to edgeR.