Is edgeR suitable for differential abundance of UNIref features from metatranscriptomics from HUMANnN2?
0
Entering edit mode
Nick • 0
@nick-24226
Last seen 3 months ago

Hi,

My question is whether the statistical methods used by edgeR are suitable for detecting significant differences in abundance of UNIref protein family features as generated by HUMAnN2?

HUMAnN2 maps all the microbial (non-host) reads from my metatranscriptome samples onto the uniref protein families, and outputs the counts for each family, normalised to average gene family length (in RPK units).

I have a continuous metadata variable which I would like to correlate with gene family abundance using edgeR/Limma-Voom. I can't see why this wouldn't be a suitable method, but would welcome outside input on this.

HUMAnN also recommends additional tools for normalising data to library size. I wouldn't normally do any normalisation before importing read counts into edgeR though, so I'm thinking this isn't necessary, as it's handled by edgeR normalisation. Would anyone agree with this?

Finally, the data from HUMMAnN comes normalised to "gene length". Would it be better to try to recalculate read counts from the RPK output (i.e. multiple by mean length of gene family) before importing into edgeR, as edgeR input is assumed to be non-normalised, or would it be possible to get away without doing this (and just don't try any kind of gene length normalisation in edgeR.

All input on this is much appreciated

edgeR HUMAnN2.0 • 189 views
1
Entering edit mode
@gordon-smyth
Last seen 4 minutes ago
WEHI, Melbourne, Australia

I'm the senior author of the edgeR project. The HUMAnN2.0 data is not suitable for input to edgeR as it comes out. However I've been in discussions with the HUMANnN2 Lab recently about how we might extract data suitable for edgeR from the HUMANnN2 calculations. So it is something we are exploring, an area of current research.

While not ideal, the best approach for the time being might be a limma-trend (not limma-voom) analysis on log(RPK + 0.01) where RPK is the HUMANnN2 relative abundance values.

0
Entering edit mode

Thanks so much for this qualified answer! Very promising to hear that this is being discussed. I would certainly think HUMANnN2 -> edgeR would be a very intuitive and powerful way to analyse this type of data.

Thanks for your suggestion, I will try this approach

0
Entering edit mode

Would you include a library size (RPKM) normalisation in this, or just log(RPK + 0.01) as you suggest?

My code is either:

# load packages

library(edgeR)
library(limma)

# No lib size normalisation

# normalise data

data = log2(data + 0.01)

# generate model matrix

mmp <- model.matrix(~meta$Parab) # fit linear model lmfitp <- lmFit(data, mmp) lmfitp <- eBayes(lmfitp, trend=TRUE)  OR # read data into edgeR dds <- DGEList(data) # normalisation with library size logCPM = cpm(dds, log=T, prior.count = 0.01) # generate model matrix mmp <- model.matrix(~meta$Parab)

# fit linear model

lmfitp <- lmFit(logCPM, mmp)
lmfitp <- eBayes(lmfitp, trend=TRUE)


It seems more sensible to normalise to library size

0
Entering edit mode

I mean the first. As far as I know, whatever comes out of HUMAnN2 is already normalized for library size. Try computing column sums to see. In any case, you can't use cpm or DGEList (or any edgeR functions) because you don't have counts.

0
Entering edit mode

I think the HUMAnN manual states that by default abundances are returned in RPK units, not normalised for library size, and ColSums of the output varies between samples. I think there is a humann2 accessory script to normalise for library size though, so in this case the best approach is.

1. normalise to library size using accessory script
2. take log using logData <- log(data + 0.01)
3. fit linear model with
lmfit <- lmFit(logData, mmp)
lmfit <- eBayes(lmfit, trend=TRUE)