Question

TMM normalisation on metatranscriptome data: for visualisation/fold-change only

0

Entering edit mode

rachael.lappan • 0

@rachaellappan-22834

Last seen 5.2 years ago

Hello,

I am undertaking a metatranscriptomic analysis on a set of soil samples. These samples are from the same original soil, which was split into containers and underwent a wetting and drying experiment with one sample per condition per timepoint.

By mapping the metatranscriptome reads to co-assembled metagenomes, I will have a table of read counts per sample for predicted open reading frames, e.g.

       Sample 1 | Sample 2 |
 orf1     104   |   56     |
 orf2     21    |    3     |

My goal here is to descriptively show how gene expression for certain genes of interest changes across the samples, by describing fold-changes or by plotting normalised units of gene expression. I am interested in eyeballing relative change without formal differential expression analysis as there are no replicates here.

I would like to use the GeTMM method described here - essentially edgeR's TMM normalisation but using reads per kilobase (RPK) instead of raw read counts as the input. This is because I need a unit of gene expression comparable across samples that accounts not only for differences in library size and library composition, but also gene length. In a single species, the length of a particular gene is the same in every sample; but for a microbial community of several different species this is not the case.

I have two questions regarding this analysis:

Should the library size that I provide to edgeR be the number of reads sequenced, or the sum of all the RPK values in a sample? (i.e. the difference between 'per million reads sequenced' and 'per million reads mapped')
Once I have an ORFs x samples table where the values have been adjusted by TMM normalisation, with:

dge <- calcNormFactors(dge, method = "TMM")
tmm <- cpm(dge)

Is it appropriate to add TMM values together for a set of ORFs? The tmm data above will be TMM-adjusted counts per ORF per sample, but to compare the expression of gene X between Community A (Sample 1) and Community B (Sample 2), I need the total gene expression for the community (all ORFs annotated as gene X). By database alignment, I will have a list of which ORFs correspond to my gene X, but I am not annotating all ORFs.

Rachael

normalization edgeR metatranscriptome • 1.5k views

ADD COMMENT • link updated 5.2 years ago by Gordon Smyth 52k • written 5.2 years ago by rachael.lappan • 0

score 1 · Answer 1 · 2020-02-04

1

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 3 hours ago

WEHI, Melbourne, Australia

To apply GeTMM, you would have to follow the instructions in the GeTMM paper. It isn't an edgeR normalization method, so the edgeR authors can't advise you how to do it.

From the little of I know of GeTMM, however, I doubt it is compatible with edgeR. The calcNormFactors function only normalizes the library sizes, not anything else. There is no way to account for sample-specific gene lengths by normalizing the library sizes. If you want to allow for sample-specific gene lengths in edgeR, it has to be done with the offset matrix. The cqn package provides a normalization method that allows for sample-specific gene lengths and is compatible with edgeR.

Also, if you want to merge ORFs into genes, it is safer to define genes as meta-features when you count the reads in the first place rather than trying get ORF counts and summing later on. Summing the reads or CPMs over ORFs later on may double-count some reads if there are RNA-seq reads or read-pairs that overlap more than one ORF.

ADD COMMENT • link 5.2 years ago Gordon Smyth 52k

0

Entering edit mode

Thank you Gordon. It appears that the GeTMM method is simply "input RPK values into edgeR instead of raw counts". If this isn't compatible with the workings of edgeR then I won't use it.

So do you think the ideal approach here would be to :

Adjust for gene length first with the cqn package
Annotate all ORFs and sum those that have been annotated as the same gene (step 1 must happen first because the same gene in different organisms can be different lengths)
Bring these values into edgeR
Calculate TMM values in edgeR

These values represent gene expression per annotated gene, that is comparable across samples because it has been adjusted for gene length, library size and library composition.

Is this correct? To account for compositionality with TMM, is it possible to annotate and aggregate only the ORFs corresponding to my genes of interest, or is it essential to annotate all ORFs and have everything aggregated at the "annotated gene" level before calculating TMM?

ADD REPLY • link 5.2 years ago rachael.lappan • 0

0

Entering edit mode

If you use cqn, then you can't use TMM. You must only normalize once.

Also, if you want to consolidate ORFs into genes then you must do it before cqn, otherwise the cqn output will be useless to edgeR. I advised in my answer above how I would do that.

I have never tried adjusting for gene length myself so can't give more advice. I suggest following the cqn documentation. You might consider posting a new question with a cqn tag.

ADD REPLY • link 5.2 years ago Gordon Smyth 52k