Hi,
My question is whether the statistical methods used by edgeR are suitable for detecting significant differences in abundance of UNIref protein family features as generated by HUMAnN2?
HUMAnN2 maps all the microbial (non-host) reads from my metatranscriptome samples onto the uniref protein families, and outputs the counts for each family, normalised to average gene family length (in RPK units).
I have a continuous metadata variable which I would like to correlate with gene family abundance using edgeR/Limma-Voom. I can't see why this wouldn't be a suitable method, but would welcome outside input on this.
HUMAnN also recommends additional tools for normalising data to library size. I wouldn't normally do any normalisation before importing read counts into edgeR though, so I'm thinking this isn't necessary, as it's handled by edgeR normalisation. Would anyone agree with this?
Finally, the data from HUMMAnN comes normalised to "gene length". Would it be better to try to recalculate read counts from the RPK output (i.e. multiple by mean length of gene family) before importing into edgeR, as edgeR input is assumed to be non-normalised, or would it be possible to get away without doing this (and just don't try any kind of gene length normalisation in edgeR.
All input on this is much appreciated
Thanks so much for this qualified answer! Very promising to hear that this is being discussed. I would certainly think HUMANnN2 -> edgeR would be a very intuitive and powerful way to analyse this type of data.
Thanks for your suggestion, I will try this approach
Would you include a library size (RPKM) normalisation in this, or just log(RPK + 0.01) as you suggest?
My code is either:
OR
It seems more sensible to normalise to library size
I mean the first. As far as I know, whatever comes out of HUMAnN2 is already normalized for library size. Try computing column sums to see. In any case, you can't use cpm or DGEList (or any edgeR functions) because you don't have counts.
I think the HUMAnN manual states that by default abundances are returned in RPK units, not normalised for library size, and ColSums of the output varies between samples. I think there is a humann2 accessory script to normalise for library size though, so in this case the best approach is.
logData <- log(data + 0.01)