Question

[EdgeR] question about the y$samples$lib.size and y$samples$norm.factors

0

Entering edit mode

wbli2012 • 0

@wbli2012-11496

Last seen 9.3 years ago

Hi All,

When I am analyzing some RNA-seq data from noncoding regions, I bumped into this question.

I only focused on noncoding regions with ncRNA transcripts detected, which only constituted 5-10% of all the sequencing reads. So for examples, if there are 20million reads of a RNA-seq sample, the reads from noncoding region is perhaps 1million.

So now the question becomes, shall I manually input the total sequencing read number for a specific sample (e.g 20million) as the y$samples$lib.size when I am using edgeR, or shall I let edgeR do it automatically (as far as I understand, edgeR will automatically sum all reads in the DGElist of a sample, which in my case will be 1 million as only ncRNAs were calculated here).

The logic behind this is that - I assume the total RNA-seq reads of different samples are comparable, rather than the reads from noncoding region only. So I will consider that it is correct to manually input the specific reads number of a sample. e.g. (y$samples$lib.size <- c(12624935, 13276916, 20222212, 21997722, 16789741, 29073257))

This question will also affects the y$samples$norm.factors. (y <- calcNormFactors(y)). If I let the edgeR do it automatically, the norm.factors will be only take into consideration the variability of noncoding RNAs in different samples.

Thank you so much for and suggestions!

wenbo

edger • 3.4k views

ADD COMMENT • link 9.3 years ago wbli2012 • 0

score 2 · Answer 1 · 2016-09-16

It doesn't really matter as long as you run calcNormFactors afterwards. This is because the computed normalization factors will automatically scale themselves down if you manually supply larger library sizes for the same set of counts. To illustrate, here's an example with some mocked-up data:

y <- DGEList(matrix(rpois(100000, lambda=1:10*100), ncol=10, byrow=TRUE))
y1 <- calcNormFactors(y)
y2 <- y
y2$samples$lib.size <- mean(y$samples$lib.size) # nonsense library sizes
y2 <- calcNormFactors(y2)

Now, the relevant scaling factor that is used in downstream normalization is not actually the normalization factor itself, but rather, the effective library size (i.e., the product of the normalization factor and library size for each sample):

eff1 <- y1$samples$norm.factors*y1$samples$lib.size
eff2 <- y2$samples$norm.factors*y2$samples$lib.size
var(eff1/eff2) # almost zero

You can see that the two sets of effective library sizes only differ by a common scaling factor, which has no effect on the relative scaling between libraries (which is what normalization is all about). Thus, both sets of values will have almost the same effect during normalization, despite the fact that y2 has library sizes that are totally wrong. So, there's really no need for or benefit from putting in your own library sizes, unless you want to do something particular with the interpretation of the normalization factors.

I should also point out that calcNormFactors (and TMM normalization generally) assumes that most features are not DE between samples. Whether this is sensible for the non-coding regions in your experiment is something you'll have to consider.

score 0 · Answer 2 · 2016-10-04

0

Entering edit mode

wbli2012 • 0

@wbli2012-11496

Last seen 9.3 years ago

Thank you so much, Aaron! Sorry for late reply as I just came back from a vacation.

In theory I get you and agree with you. But one more question following that is that we usually perform filtering of the data, e.g. by cpm>0.5. This will remove many reads from the data from genes with low cpm, and therefore affect the y$samples$lib.size. Should it be right that after filtering reads, we should not run calcNormFactors anymore, but instead manually put the y$samples$lib.size and y$samples$norm.factors before filtering?

Again, appreciate your input!

wenbo

ADD COMMENT • link 9.3 years ago wbli2012 • 0

1

Entering edit mode

What you should be doing is:

Create a DGEList object from your count data.
Filter the object to remove low-abundance genes.
Run calcNormFactors on the filtered object.

There's no need to manually change lib.size or norm.factors. In fact, you shouldn't have any meaningful norm.factors before filtering, because you shouldn't be running calcNormFactors on the unfiltered data. As I said before, calcNormFactors will adjust for the library sizes that are present in the supplied DGEList object, so changes in the total count after filtering will automatically be considered without requiring any further intervention.

P.S. Add comments to answers via "add comment", unless you're answering your own question.

ADD REPLY • link 9.3 years ago Aaron Lun ★ 29k