Question

Method to get normalized counts in edgeR without cpm

2

Pauline ▴ 20

@pauline-18194

Dear all,

I am currently using the edgeR package for my research on 16S rRNA metabarcoding.

At the moment, I am focusing on TMM normalization, and I am quite confused by the way people use the calcNormFactors function.

The edgeR vignette states: "The normalization factors of all the libraries multiply to unity. A normalization factor below one indicates that a small number of high count genes are monopolizing the sequencing, causing the counts for other genes to be lower than would be usual given the library size. As a result, the library size will be scaled down, analogous to scaling the counts upwards in that library. Conversely, a factor above one scales up the library size, analogous to downscaling the counts."

From this passage, I understand that the raw counts of each sample should be multiplied by the normalization factor.

On the other hand, I found an article that also uses the calcNormFactors function: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4625728/

Its "Normalization methods" section says: "Scaling factors were calculated using the calcNormFactors function in the package, and then rescaled gene counts were obtained by dividing gene counts by each scaling factor for each run. TMM is the sum of rescaled gene counts of all runs"

From this article, I understand that the raw counts should be divided by the normalization factors.

Finally, this code from MetaLonDA (lines 22 to 26) seems to use yet another approach: https://github.com/aametwally/MetaLonDA/blob/master/R/Normalization.R

What is the appropriate way to obtain normalized counts with the edgeR package using TMM normalization?

I tried the cpm function, but I am not interested in counts-per-million values; I would like to have the normalized counts themselves.

Best,

Pauline

edger normalization • 5.4k views

ADD COMMENT • link updated 5.5 years ago by Gordon Smyth 50k • written 5.5 years ago by Pauline ▴ 20

score 3 · Answer 1 · 2018-11-05

I think you might be confused. The whole idea of normalization is to control for library size. TMM is just a method to do so without being affected by compositional bias. You cannot have a normalized value without computing CPM, because if you don't divide by the library size, you haven't normalized anything!

Put another way, let's say you have two samples, one with a total of 20M reads, and one with 10M reads. All else being equal, you expect about twice as many counts per gene for the first sample as for the second, because there are twice as many reads. If you divide all the read counts for the first sample by 20 and all the read counts in the second one by 10, then you have adjusted them so that any genes expressed at the same level should now have the same value in both samples (but do note that these will now be CPM values, because you have divided by the total count for each sample, in millions).
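The arithmetic in this example can be sketched in a few lines. This is a toy illustration in Python rather than R, and the gene counts (400 and 200) are invented; only the library sizes of 20M and 10M come from the text above.

```python
# Toy illustration of plain library-size scaling (counts per million).
# The gene counts are invented; the library sizes follow the example above.

def cpm(count, library_size):
    """Counts per million: the count divided by the library size in millions."""
    return count / (library_size / 1e6)

lib_a = 20_000_000  # first sample, 20M reads
lib_b = 10_000_000  # second sample, 10M reads

# A gene expressed at the same level in both samples gets about twice
# as many reads in the deeper library.
count_a = 400
count_b = 200

cpm_a = cpm(count_a, lib_a)  # 400 / 20
cpm_b = cpm(count_b, lib_b)  # 200 / 10
print(cpm_a, cpm_b)
```

After dividing by the library size in millions, the two values agree, which is all that "normalizing" means at this stage.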

This is a bit naive, as there may be some genes that are really highly expressed in one of the samples and end up hogging all the space on the lane (this is called compositional bias). You might not really have 2x the reads in the first sample vs the second; it only appears so because of those highly expressed genes. The TMM method is intended to filter those out, so you get a better estimate of the library size. But in the end, if you don't compute CPM, you haven't normalized.
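To see what compositional bias does to naive CPM values, here is a small Python sketch with invented data. Genes g1 to g4 are assumed to be expressed identically in both samples, while the hypothetical gene g5 is present only in sample B and takes half of its reads. The sketch shows the problem only; it does not implement TMM.

```python
# Toy data, invented for illustration. Genes g1..g4 are expressed identically
# in both samples; g5 is expressed only in sample B and takes half its reads.
sample_a = {"g1": 250, "g2": 250, "g3": 250, "g4": 250, "g5": 0}
sample_b = {"g1": 125, "g2": 125, "g3": 125, "g4": 125, "g5": 500}

def naive_cpm(counts):
    """CPM using the raw total count as the library size."""
    total = sum(counts.values())
    return {gene: c / total * 1e6 for gene, c in counts.items()}

cpm_a = naive_cpm(sample_a)
cpm_b = naive_cpm(sample_b)

# Both libraries contain 1000 reads, yet g1 looks two-fold lower in B,
# purely because g5 used up half of B's sequencing depth.
print(cpm_a["g1"], cpm_b["g1"])
```

The raw totals are identical here, so dividing by them cannot remove the artefact; this is the situation a compositional-bias-aware method such as TMM is meant to handle.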

score 1 · Answer 2 · 2018-11-05

1

Gordon Smyth 50k

@gordon-smyth

WEHI, Melbourne, Australia

I agree with James MacDonald's answer. I have long argued that the whole idea of a "normalized count" is not a meaningful concept, see for example:

http://seqanswers.com/forums/showthread.php?t=50935

When people talk of "normalized counts" they almost always really mean CPM or FPKM values, see for example

https://www.biostars.org/p/317701

In edgeR, we try our best to avoid this needless confusion by being specific about what is computed.

In edgeR, the calcNormFactors() function normalizes the library sizes, not the counts themselves. At no stage are the counts multiplied or divided by the normalization factors in an edgeR DE analysis.
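The multiply-or-divide question can be made concrete with a small sketch. It is written in Python with invented numbers, and the normalization factors are chosen by hand rather than computed by TMM. It assumes, as I understand the edgeR documentation, that the effective library size is the raw library size times the normalization factor and that CPM values are computed relative to it.

```python
import math

# Hypothetical inputs: one gene's raw counts, the raw library sizes, and
# normalization factors chosen by hand (not computed by TMM).
counts = {"A": 250, "B": 125}
lib_size = {"A": 1000, "B": 1000}
norm_factor = {"A": math.sqrt(2), "B": 1 / math.sqrt(2)}  # product is 1

# The factors rescale the library sizes; the counts are left untouched.
eff_lib_size = {s: lib_size[s] * norm_factor[s] for s in counts}

# CPM relative to the effective library size.
cpm = {s: counts[s] / eff_lib_size[s] * 1e6 for s in counts}
print(cpm)
```

Sample B has a factor below one, so its effective library size shrinks and its CPM goes up, which is what the User's Guide means by "analogous to scaling the counts upwards". The counts themselves are never changed.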

Have you read Section 2.7.6 "Model-based normalization, not transformation" in the edgeR User's Guide? I tried to make it clear in that section that counts themselves are never normalized, nor can they be.

The quote you give from the BMC Bioinformatics article describes a procedure that is not a valid part of a negative binomial based DE analysis.

Anyway, you don't explain why you want "normalized counts". What analysis are you trying to do? I think it is very likely that edgeR already provides what you need, if you can clarify what the end result is that you are after.

ADD COMMENT • link 5.5 years ago Gordon Smyth 50k

0

James and Gordon, thank you very much for your answers.

Here is why I wanted the normalized counts in the first place: in my metabarcoding study, I would like to compare two kinds of results:

- Results from DE analysis, run with edgeR, with TMM normalization

- Results from beta diversity analysis (through PCoA, by calculating a distance matrix)

In my mind, it would be much more justifiable to use the same count data as input for both. This is why I wanted the normalized data, so that I could calculate a distance matrix from it.

ADD REPLY • link 5.5 years ago Pauline ▴ 20

0

What you need for the PCoA analysis is logCPM. That's what edgeR does when you ask for PCoA by way of the plotMDS() function -- it automatically computes logCPM from the counts using cpm() and applies PCoA to them. And note that the TMM library size normalization is utilized when the logCPM are computed, so these are complementary rather than exclusive things.
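As a rough illustration of the idea, the sketch below computes a simplified log2-CPM and a distance matrix in Python. Everything in it is an assumption made for illustration: the counts and effective library sizes are invented, the constant prior count is a simplification of what cpm() does with log=TRUE, and plain Euclidean distance stands in for whatever distance plotMDS() or a PCoA tool would actually use.

```python
import math

def log_cpm(counts, eff_lib_size, prior_count=2.0):
    """Simplified log2-CPM; the prior count avoids taking log2 of zero."""
    return [
        math.log2((c + prior_count) / (eff_lib_size + 2 * prior_count) * 1e6)
        for c in counts
    ]

def euclidean(x, y):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Invented counts for three taxa, with hypothetical effective library sizes.
# s2 has the same composition as s1 at twice the depth; s3 differs.
samples = {
    "s1": ([500, 300, 200], 1000.0),
    "s2": ([1000, 600, 400], 2000.0),
    "s3": ([100, 800, 100], 1000.0),
}

logcpm = {name: log_cpm(c, n) for name, (c, n) in samples.items()}
names = sorted(logcpm)

# Pairwise distance matrix that a PCoA could take as input.
dist = {(a, b): euclidean(logcpm[a], logcpm[b]) for a in names for b in names}
for a in names:
    print(a, [round(dist[(a, b)], 3) for b in names])
```

On the log-CPM scale, s1 and s2 end up very close despite the two-fold difference in depth, while s3 stays clearly separated.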

You can't input the same data values to both edgeR and PCoA, because edgeR works on counts and PCoA does not.

ADD REPLY • link 5.5 years ago Gordon Smyth 50k