My question is whether the statistical methods used by edgeR are suitable for detecting significant differences in the abundance of UniRef protein family features as generated by HUMAnN2.
HUMAnN2 maps all the microbial (non-host) reads from my metatranscriptome samples onto the UniRef protein families and outputs the abundance of each family, normalised to gene family length (in RPK units, i.e. reads per kilobase).
I have a continuous metadata variable which I would like to test for association with gene family abundance using edgeR/limma-voom. I can't see why this wouldn't be a suitable method, but would welcome outside input on this.
HUMAnN2 also recommends additional tools for normalising the data to library size. I wouldn't normally do any normalisation before importing read counts into edgeR, though, so I'm thinking this step isn't necessary, as library size is handled by edgeR's own normalisation. Would anyone agree with this?
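To be clear about what I mean by normalising to library size, here is a minimal Python sketch of simple total-sum scaling to counts per million. The function name `to_cpm` and the abundance values are made up for illustration; this is not HUMAnN2's or edgeR's actual implementation.

```python
def to_cpm(sample_values):
    """Scale one sample's feature abundances so that they sum to one million."""
    total = sum(sample_values.values())
    return {feature: value / total * 1e6 for feature, value in sample_values.items()}


# Hypothetical abundances for a single sample (illustration only).
sample = {"famA": 30.0, "famB": 10.0, "famC": 60.0}
cpm = to_cpm(sample)

for feature, value in cpm.items():
    print(feature, value)
```

This is the step I would skip before edgeR, on the assumption that edgeR accounts for library size itself.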
Finally, the data from HUMAnN2 comes normalised to "gene length". Would it be better to try to recalculate read counts from the RPK output (i.e. multiply by the mean length of each gene family, in kilobases) before importing into edgeR, as edgeR input is assumed to be non-normalised? Or would it be possible to get away without doing this (and just not attempt any kind of gene length normalisation in edgeR)?
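The back-calculation I have in mind would look roughly like the Python sketch below, which assumes RPK = reads / (length in kb), so reads = RPK × length in bp / 1000. The family names, RPK values, and mean lengths are hypothetical, and rounding to whole reads is my own assumption, since the recovered values may not be exact integers.

```python
def rpk_to_counts(rpk, mean_length_bp):
    """Approximate read counts from RPK: counts = RPK * length_bp / 1000."""
    return rpk * mean_length_bp / 1000.0


# Hypothetical RPK values and mean gene family lengths (illustration only).
rpk_values = {"UniRef90_A": 50.0, "UniRef90_B": 12.5}
mean_lengths_bp = {"UniRef90_A": 1200, "UniRef90_B": 800}

# Round to whole reads; this rounding is an assumption, not a HUMAnN2 step.
estimated_counts = {
    family: round(rpk_to_counts(rpk_values[family], mean_lengths_bp[family]))
    for family in rpk_values
}

print(estimated_counts)
```

This would need the mean length of every gene family, which is the part I am unsure is worth the effort.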
All input on this is much appreciated.