Question: Can edgeR TMM normalization be used for other count data?
jol.espinoz10 wrote:

This question is split into 2 parts:

(1) Can TMM normalization through edgeR be used for other count data like OTU counts and contig counts?

(2) If the answer to (1) is yes: after TMM normalization, would converting to RPKM be more useful for contig count data, since the contig lengths span a pretty wide range? (Pseudo-ish code below.)

scalefactors = (normfactors * libsize)/1e6

df_tmm = df_counts / scalefactors # In Python

# scale(df_counts, center=FALSE, scale=(normfactors * libsize / 1e6)) # In R

df_tmm_rpkm = (df_tmm / seq_lengths)*1000
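The pseudo-code above can be sketched as a small, dependency-free Python example. The contig names, counts, lengths, and normalization factors here are all made up for illustration; in practice the factors would come from edgeR's `calcNormFactors`, and edgeR's own `cpm()` and `rpkm()` functions apply them for you. The layout (one list of per-sample counts per contig) is also an assumption.

```python
# Hypothetical toy data: one entry per contig, one value per sample.
counts = {
    "contig_a": [100, 200],
    "contig_b": [300, 600],
}
seq_lengths = {"contig_a": 500, "contig_b": 2000}  # contig lengths in bp (made up)
lib_sizes = [1_000_000, 2_000_000]                 # total counts per sample (made up)
norm_factors = [0.8, 1.25]                         # placeholder TMM factors, not computed here

# Effective library size per sample, expressed in millions.
scale_factors = [nf * ls / 1e6 for nf, ls in zip(norm_factors, lib_sizes)]

def tmm_cpm(row):
    """Counts per million of effective library size, per sample."""
    return [c / s for c, s in zip(row, scale_factors)]

def tmm_rpkm(row, length_bp):
    """CPM further divided by feature length in kilobases."""
    return [v / length_bp * 1000 for v in tmm_cpm(row)]

cpm = {name: tmm_cpm(row) for name, row in counts.items()}
rpkm = {name: tmm_rpkm(row, seq_lengths[name]) for name, row in counts.items()}
```

Note that what the question calls `df_tmm` is a CPM-like value: the counts themselves are not "TMM values", they are counts divided by the TMM-adjusted library size.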

modified 17 months ago by James W. MacDonald • written 17 months ago by jol.espinoz10
James W. MacDonald wrote:

TMM normalization is based on the idea that normalizing by library size makes sense. In other words, for RNA-Seq, the library size is correlated with the total amount of mRNA that you started with, so if you have two samples and one has 20M reads and the other only has 10M reads, we assume that the second library only had half as much mRNA to begin with, so we want to account for that technical difference.

Things get a bit more murky when you start talking about things like OTU or contig counts. Without knowing anything about your experiment it's difficult to make any recommendation (which I would be hesitant to do regardless), but the main thing to consider is whether the total number of counts is independent of your study design.

As a trivial example, say you were doing some microbiome study with gut bacteria and were comparing antibiotic treated vs control subjects. You expect far fewer total OTU counts in the antibiotic treated samples as a consequence of the treatment, so you wouldn't want to normalize by total counts, because that would erase much of your signal.
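A toy illustration of that antibiotic scenario, with invented OTU counts that are assumed (for the sake of the example) to reflect real abundance: every OTU drops ten-fold under treatment, and dividing by total counts hides the drop completely.

```python
# Hypothetical counts; the treated sample has 10x fewer of every OTU.
control = {"otu_1": 4000, "otu_2": 3000, "otu_3": 2000, "otu_4": 1000}
treated = {"otu_1": 400, "otu_2": 300, "otu_3": 200, "otu_4": 100}

def proportions(sample):
    """Scale each OTU count by the sample's total count."""
    total = sum(sample.values())
    return {otu: n / total for otu, n in sample.items()}

control_prop = proportions(control)
treated_prop = proportions(treated)

# Fold change (treated / control) before and after total-count scaling.
raw_fold = {otu: treated[otu] / control[otu] for otu in control}
norm_fold = {otu: treated_prop[otu] / control_prop[otu] for otu in control}
```

Here `raw_fold` is about 0.1 for every OTU, while `norm_fold` is about 1.0 for every OTU, so the treatment effect is gone after scaling.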


To add to James' answer: in addition to sequencing depth, TMM normalization tries to remove "composition biases" between samples. Consider two RNA-seq samples sequenced to the same depth. If a gene is strongly upregulated in the second sample, it will suppress the coverage of all other genes (as each gene competes for a fixed amount of sequencing resources), leading to spurious "downregulation" when compared to the first sample. TMM aims to get rid of this composition bias.
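A minimal numeric sketch of that composition effect, using made-up abundances and assuming reads are allocated exactly in proportion to abundance (no sampling noise): only `gene_1` changes, yet every other gene looks two-fold down at equal depth.

```python
DEPTH = 1000  # both samples sequenced to the same depth (hypothetical)

# True molecule counts; only gene_1 differs between the samples.
sample_a = {"gene_1": 100, "gene_2": 100, "gene_3": 100, "gene_4": 100, "gene_5": 100}
sample_b = dict(sample_a, gene_1=600)

def expected_reads(abundance, depth):
    """Split a fixed number of reads in proportion to abundance."""
    total = sum(abundance.values())
    return {gene: depth * n / total for gene, n in abundance.items()}

reads_a = expected_reads(sample_a, DEPTH)
reads_b = expected_reads(sample_b, DEPTH)

# Apparent fold change (B / A) seen in the read counts.
apparent_fold = {gene: reads_b[gene] / reads_a[gene] for gene in reads_a}
```

In this toy case `gene_1` shows an apparent 3-fold increase (the true change is 6-fold), and the four unchanged genes each show an apparent fold change of 0.5.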

However, this comes at the cost of assuming that most genes are not differentially expressed. So, in order to use TMM, you need to be sure that most of your features are not differentially expressed/covered/whatever between samples. For example, it would not be appropriate to use TMM in James' antibiotic example, as you can be fairly confident that most OTU counts will change upon treatment. You also need to be sure that composition biases are actually present (and not desirable) in the counts for your features. If the counts represent a summary statistic other than read coverage, it's not clear whether the bias is present or not.
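To make the "most features unchanged" assumption concrete, here is a deliberately simplified trimmed-mean-of-log-ratios sketch. It is not edgeR's implementation: as far as I know, `calcNormFactors` also trims on overall abundance and weights the remaining features, both of which are skipped here. The function name `simple_tmm_factor`, the counts, and the 30% trim are illustrative choices.

```python
import math

def simple_tmm_factor(sample, reference, trim=0.3):
    """Unweighted trimmed mean of log2 proportion ratios, returned as a scale factor."""
    n_s, n_r = sum(sample), sum(reference)
    # Log2 ratio of library-size-scaled counts, skipping features with a zero count.
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    k = int(len(m_values) * trim)           # number of values dropped from each end
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))

# Hypothetical counts: only the first feature truly differs between samples.
reference = [200, 200, 200, 200, 200]
sample = [600, 100, 100, 100, 100]

factor = simple_tmm_factor(sample, reference)
effective_lib_size = factor * sum(sample)

# After scaling by the effective library size, an unchanged feature matches the reference.
sample_scaled = sample[1] / effective_lib_size
reference_scaled = reference[1] / sum(reference)
```

The trimming discards the one extreme log-ratio, so the factor is driven by the unchanged features. If most features were changed instead, the values left after trimming would themselves be changed features and the factor would no longer reflect the composition bias alone.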

If these criteria are fulfilled, it's possible to use TMM normalization in a range of scenarios. For example, the csaw and DiffBind packages use it to normalize coverage in genomic windows/bins and binding peaks in ChIP-seq data.

These answers are great. Thank you James & Aaron. So could it be simplified to the idea that TMM normalization can be used if the counts from one attribute influence the counts for another attribute, and if most of the attributes should not be dramatically different between samples? For example, if one gene takes up all the probes in a transcriptome sample, then there are fewer probes available for the other genes. What would be the composition biases in the antibiotic OTU example? Would it be the presence or absence of antibiotic? In the antibiotic example, would it be informative to look at the ratio of genes to each other (i.e. scale by library size)?


The explanation I gave above was the simplest form. TMM was designed to handle composition biases; it is not guaranteed to remove arbitrary "influences" between attributes (I assume these refer to genomic features of some sort). I don't know what you consider to be a dramatic difference; all I can say is that TMM will not be accurate if the majority of attributes are different, though loss of accuracy may be tolerable for small differences.

For the antibiotic example, you're misunderstanding what composition biases are. The presence or absence of the antibiotic is not the bias, it's the experimental design. The composition bias would be something like the loss of the majority of susceptible OTUs resulting in a proportional increase of a handful of resistant OTUs, simply due to the fact that more sequencing resources are available when the competition dies out. However, the point of the antibiotic example is that TMM normalization will fail when most OTUs change. Scaling by library size won't help much either, as James has described.