Hello,
I am undertaking a metatranscriptomic analysis on a set of soil samples. These samples are from the same original soil, which was split into containers and underwent a wetting and drying experiment with one sample per condition per timepoint.
By mapping the metatranscriptome reads to co-assembled metagenomes, I will have a table of read counts per sample for predicted open reading frames, e.g.
Sample 1 | Sample 2 |
orf1 104 | 56 |
orf2 21 | 3 |
My goal here is to descriptively show how gene expression for certain genes of interest changes across the samples, by describing fold-changes or by plotting normalised units of gene expression. I am interested in eyeballing relative change without formal differential expression analysis as there are no replicates here.
I would like to use the GeTMM method described here - essentially edgeR's TMM normalisation but using reads per kilobase (RPK) instead of raw read counts as the input. This is because I need a unit of gene expression comparable across samples that accounts not only for differences in library size and library composition, but also gene length. In a single species, the length of a particular gene is the same in every sample; but for a microbial community of several different species this is not the case.
I have two questions regarding this analysis:
Should the library size that I provide to edgeR be the number of reads sequenced, or the sum of all the RPK values in a sample? (i.e. the difference between 'per million reads sequenced' and 'per million reads mapped')
Once I have an ORFs x samples table where the values have been adjusted by TMM normalisation, with:
dge <- calcNormFactors(dge, method = "TMM")
tmm <- cpm(dge)
Is it appropriate to add TMM values together for a set of ORFs? The tmm
data above will be TMM-adjusted counts per ORF per sample, but to compare the expression of gene X between Community A (Sample 1) and Community B (Sample 2), I need the total gene expression for the community (all ORFs annotated as gene X). By database alignment, I will have a list of which ORFs correspond to my gene X, but I am not annotating all ORFs.
Rachael
Thank you Gordon. It appears that the GeTMM method is simply "input RPK values into edgeR instead of raw counts". If this isn't compatible with the workings of edgeR then I won't use it.
So do you think the ideal approach here would be to :
These values represent gene expression per annotated gene, that is comparable across samples because it has been adjusted for gene length, library size and library composition.
Is this correct? To account for compositionality with TMM, is it possible to annotate and aggregate only the ORFs corresponding to my genes of interest, or is it essential to annotate all ORFs and have everything aggregated at the "annotated gene" level before calculating TMM?
If you use cqn, then you can't use TMM. You must only normalize once.
Also, if you want to consolidate ORFs into genes then you must do it before cqn, otherwise the cqn output will be useless to edgeR. I advised in my answer above how I would do that.
I have never tried adjusting for gene length myself so can't give more advice. I suggest following the cqn documentation. You might consider posting a new question with a cqn tag.