Question

Should I calculate normalization factors in edgeR using all libraries or using only the compared libraries?

Peter • 0


I have 3 groups: untreated, negative control (mock treatment), and treated, with 3 replicates in each. I am looking for differential expression between the groups, most importantly negative control versus treated.

Which is a better approach:
- calculating the normalization factors using all 9 libraries, or
- calculating the normalization factors using only the 2×3 libraries that are compared at a time (loading three count tables entirely separately)?

Example for first option:

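library(edgeR)

# rnadata: count matrix (genes x 9 libraries); groups follow its column order
groups <- factor(c("A", "A", "B", "C", "C", "A", "B", "B", "C"))
dgedata <- DGEList(counts=rnadata, group=groups)
# Keep genes with CPM > 1 in at least 3 libraries, then recompute library sizes
keep <- rowSums(cpm(dgedata) > 1) >= 3
dgedata <- dgedata[keep, , keep.lib.sizes=FALSE]
# TMM normalization factors computed across all 9 libraries
dgedata <- calcNormFactors(dgedata, method="TMM")
dgedata <- estimateCommonDisp(dgedata)
dgedata <- estimateTagwiseDisp(dgedata)
# Exact test comparing group B against group A
dgedata.results <- exactTest(dgedata, pair=c("A", "B"))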

(This is mostly theoretical, as the two approaches differ by only about 20 DE genes (out of hundreds) in each comparison, but I would like to understand the justification.)

edgeR calcNormFactors RNA-seq • 1.3k views

updated 9.4 years ago by Aaron Lun ★ 28k • written 9.4 years ago by Peter • 0

score 2 · Accepted Answer · 2014-11-27

You should use all of the libraries in a dataset when running edgeR, as this provides more residual degrees of freedom for dispersion estimation (9 libraries in 3 groups leave 6 residual d.f., whereas 6 libraries in 2 groups leave only 4). This means you should calculate normalization factors for all 9 libraries at once, rather than separately analyzing a count table for each of the three pairwise comparisons.
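As a rough sketch of that workflow, the following normalizes and estimates dispersions once on all 9 libraries and then runs each pairwise comparison from the same object. The count matrix here is simulated purely as a stand-in for the real data, and the names `counts`, `y`, `comparisons` and `results` are illustrative assumptions rather than anything from the original post.

```r
library(edgeR)

set.seed(42)
# Simulated stand-in for the real count table: 1000 genes x 9 libraries
counts <- matrix(rnbinom(1000 * 9, mu = 50, size = 10), nrow = 1000)
rownames(counts) <- paste0("gene", seq_len(1000))
groups <- factor(rep(c("untreated", "mock", "treated"), each = 3))

y <- DGEList(counts = counts, group = groups)
y <- calcNormFactors(y)      # TMM factors computed from all 9 libraries
y <- estimateCommonDisp(y)   # dispersion estimated from all 9 libraries
y <- estimateTagwiseDisp(y)

# Each pairwise comparison reuses the same normalized object
comparisons <- list(c("mock", "treated"),
                    c("untreated", "mock"),
                    c("untreated", "treated"))
results <- lapply(comparisons, function(p) exactTest(y, pair = p))
names(results) <- sapply(comparisons, paste, collapse = "_vs_")
```

Only the `pair` argument changes between comparisons; the normalization factors and dispersions are shared.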

In any case, the actual normalization factors should not be very different. calcNormFactors picks a reference library and, for each other library, calculates a trimmed mean of the M-values (log-ratios) against that reference, i.e., a robust estimate of the systematic difference. If you change the input libraries, the only effect on the calculation is which library is chosen as the reference. The size of the systematic difference between two libraries should not change much, whether it is calculated directly between the libraries or through the reference (i.e., calculate A against the reference, then B against the reference, to get A against B).
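The "directly versus through the reference" point can be checked with a small base-R simulation. This is only a simplified sketch: it uses a plain median of log2-ratios as a stand-in for edgeR's trimmed mean, and the expression levels, the scaling factors and the helper `median_m` are hypothetical values chosen for illustration.

```r
set.seed(1)
n_genes <- 2000
# Hypothetical true expression levels shared by all libraries (no DE genes)
mu <- rgamma(n_genes, shape = 2, scale = 200)
# Hypothetical systematic scaling of three libraries: reference, A and B
scale_factor <- c(ref = 1.0, A = 1.5, B = 0.8)
counts <- sapply(scale_factor, function(s) rpois(n_genes, lambda = mu * s))

# Median log2-ratio between two libraries, using genes with nonzero counts in both
median_m <- function(x, y) {
  ok <- x > 0 & y > 0
  median(log2(x[ok] / y[ok]))
}

# A against B, computed directly
direct <- median_m(counts[, "A"], counts[, "B"])
# A against B, computed through the reference library
via_ref <- median_m(counts[, "A"], counts[, "ref"]) -
  median_m(counts[, "B"], counts[, "ref"])

print(c(direct = direct, via_ref = via_ref, expected = log2(1.5 / 0.8)))
```

Both estimates should land close to the simulated log2 difference between A and B, which is why changing the reference library has little effect on the relative factors.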