EdgeR cpm: factor of 2 when adjusting library size?
Entering edit mode
mikaelhc ▴ 10
Last seen 6.5 years ago

In EdgeR, the 'log cpm' values are calculated as:
log2(t( (t(x)+prior.count.scaled) / lib.size ))

However, before that the library size is offset as:
lib.size <- lib.size+2*prior.count.scaled

I do not understand the factor of '2'. We add the prior to all counts, so I could understand it if we adjusted the library size by adding the number of genes times the prior, e.g.:
lib.size <- lib.size+(numberOfGenes)*prior.count.scaled

I noticed that voom does something similar:
t(log2(t(counts + 0.5)/(lib.size + 1) * 1e+06))
where we again have a library offset of '1', which is twice the prior.

Can anyone help me out here?

Best, Mikael Christensen

edger cpm • 2.4k views
Entering edit mode
Last seen 4 hours ago
WEHI, Melbourne, Australia

The voom offsets ensure that count/lib.size can never be exactly zero nor exactly one, regardless of the data. Same motivation for edgeR. The offsets are symmetric in that the minimum value for prop =(count+0.5)/(lib.size+1) (the proportion of the library from this gene) is the same as the minimum possible value for 1-prop.

It may help to think of it like this. Let count2 be the number of reads mapping to other genes, count2 = lib.size-count. We offset both count and count2 by the same amount, count.off = count+0.5 and count2.off = count2+0.5, then compute prop = count.off / (count.off + count2.off).

Hence the factor of 2 in edgeR -- this arises because the offset is being applied both to count and to count2.

If you want more statistical background, look up the "empirical logistic transformation".

From an inferential point of view, the prior count applies to each count observation separately. It does not translate into a global increase of the library size.

Entering edit mode
Aaron Lun ★ 27k
Last seen 8 hours ago
The city by the bay

That's a good question to which I once knew the answer, but that has since been lost to the mists of time. (In my defence, I didn't write the function in the first place.) I do remember a couple of salient points, though:

  • We don't use lib.size <- lib.size+prior.count.scaled in order to avoid the obviously paradoxical situation of every gene approaching a CPM of a million (i.e., containing all reads in the library) when the lib.size approaches zero. Now, while the sum of the CPMs across all genes is still greater than a million with twice the prior count added, it's not so overtly silly as before.
  • Regarding the previous point; if we wanted to be fully consistent, then I suppose we would use something like nrow(x)*prior.count.scaled. However, this really increases the strength of the prior. For example, a prior count of 0.5 would become something like 5000 added to the library size when you have 10000 genes. In standard RNA-seq applications this mightn't matter when your library sizes are in the millions, but in other situations this won't be the case. For example, in some single-cell RNA-seq data sets, we deal with library sizes below 10000. And if I were to use, say, genomic windows instead of genes, I could be talking about millions of rows in x.

So I guess, the choice of 2 is just the best compromise between the two concerns above. Of course, it's largely academic because if your data is any good you should be nowhere near single digits for your library size.

Entering edit mode
mikaelhc ▴ 10
Last seen 6.5 years ago

Thanks for replying so quickly - it makes more sense now! I will also try looking at the empirical logistic transformation.

Best, Mikael.


Login before adding your answer.

Traffic: 242 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6