Why is an averaged reference used in DESeq2 normalisation?
@angelos-armen-21507

Hi,

I would like to normalise my bulk data using DESeq2. Since the samples are normalised against an averaged reference, at least 50% of the genes must be constant across samples [1]. I was wondering why the samples aren't normalised against one of the samples instead. This would only require the weaker assumption that, between any pair of samples (more specifically, the pairs that include the reference), fewer than 50% of the genes are differential (see [1], section "Clustering to weaken the non-DE assumption"). This approach is used for normalising single cells across clusters in the scran package and for normalising single cells across batches in the multiBatchNorm function of the batchelor package.

References: [1] Lun, Aaron TL, Karsten Bach, and John C. Marioni. "Pooling across cells to normalize single-cell RNA sequencing data with many zero counts." Genome biology 17.1 (2016): 75.

deseq2 scran batchelor

Why exactly are you arguing from single-cell methods when dealing with bulk data?


Because the principle (median-based normalisation) is the same. The single-cell methods I'm referring to normalise averages of batches of single cells, which are effectively bulk samples.

@mikelove

You can use whichever size factor estimation method you like in DESeq2:

sf <- ...
sizeFactors(dds) <- sf
dds <- DESeq(dds)


DESeq2 uses the median ratio method by default because it works well on most bulk datasets, where the samples are typically from the same tissue but the organisms or cells differ by treatment, etc.
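For intuition, the median ratio computation itself is only a few lines. Here is a simplified sketch of the idea, not DESeq2's exact internal code (which also handles all-zero genes via the "poscounts" type, controlGenes, etc.):

```r
# Sketch of median-of-ratios size factors: the per-gene geometric mean acts
# as the "averaged reference", and each sample's size factor is the median
# across genes of its ratios to that reference.
medianRatioSF <- function(cts) {
  logGeoMeans <- rowMeans(log(cts))   # log geometric mean per gene
  keep <- is.finite(logGeoMeans)      # drop genes with any zero count
  apply(cts, 2, function(col)
    exp(median(log(col[keep]) - logGeoMeans[keep])))
}

set.seed(1)
cts <- matrix(rpois(800, 1:400), ncol = 2)
cts[, 2] <- cts[, 2] * 2              # sample 2 at 2x depth
medianRatioSF(cts)                    # second factor roughly 2x the first
```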

It's not a requirement that 50% of the genes be constant for median ratio to work. For example, all of the genes could be DE and median ratio could still provide effective size factor estimation. It's more about the scale and balance of the DE. In this three-line simulation, all of the genes are DE, the second sample is sequenced at 2x depth, and median ratio correctly recovers the size factors:

set.seed(1)
cts <- matrix(rpois(800,1:400),ncol=2)
cts[,2] <- cts[,2] * 2^rnorm(400,0,.5) # DE
cts[,2] <- cts[,2] * 2 # seq depth
DESeq2::estimateSizeFactorsForMatrix(cts)

[1] 0.7123276 1.4038484

Aaron Lun (@alun)

From the scran perspective, it's just easier to use an averaged reference, because the single-cell profiles are too variable to use for pairwise normalization on the sample level.

As for the 50% requirement: it is easy to provide an example of what happens just above that threshold:

set.seed(0)

# 0% DE, no difference in the size factors is expected.
cts <- matrix(rpois(800,1:400),ncol=2)
DESeq2::estimateSizeFactorsForMatrix(cts)
## [1] 1 1

# 40% DE, begins to diverge but is still reasonable.
cts2 <- cts
cts2[1:160,1] <- cts2[1:160,1] * 10
DESeq2::estimateSizeFactorsForMatrix(cts2)
## [1] 1.0431312 0.9586522

# >50% DE, woah.
cts2 <- cts
cts2[1:201,1] <- cts[1:201,1]*100
DESeq2::estimateSizeFactorsForMatrix(cts2)
## [1] 5.5470020 0.1802776


As Mike says, it is possible to retain accuracy up to 100% DEGs if the number and effect size of the DEGs are the same in both directions. But this is a pretty strong assumption, especially for single-cell data. Besides, most normalization methods will do pretty well in such balanced cases - I could even use good ol' library size normalization, whose theoretical breakdown point of 0% effectively becomes 100% under perfect balance.
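To illustrate that last remark with a toy example of my own (not from the thread): when the counts added by upregulated genes cancel, in expectation, the counts removed by downregulated genes, plain library size normalization still recovers a 2x depth difference even though half the genes are DE:

```r
set.seed(42)
cts <- matrix(rpois(800, 100), ncol = 2)
# "Balanced" DE on the count scale: genes 1-200 gain ~50 counts each in
# sample 2, genes 201-400 lose ~50, so the total library size is unchanged.
cts[1:200, 2] <- cts[1:200, 2] + 50
cts[201:400, 2] <- pmax(cts[201:400, 2] - 50, 0)
cts[, 2] <- cts[, 2] * 2                     # sample 2 at 2x depth
colSums(cts) / exp(mean(log(colSums(cts))))  # roughly c(0.71, 1.41), ratio ~2
```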


Thank you both. I wanted to understand whether there's a particular reason DESeq2 makes a strong assumption (<50% of the genes are DE) when a weaker assumption could be made by normalising against one of the samples (<50% of the genes are DE in each direction for each pair of samples). Obviously those assumptions are sufficient but not necessary conditions for normalisation to work.

I also understand that, in the single-cell case, we have to normalise single cells against their summed profile (a single cell is too sparse to use as a reference). But for bulk data, we could normalise against one of the samples like scran and batchelor do for normalising summed single-cell profiles:

set.seed(0)
# 0% DE
cts <- matrix(rpois(1200,100),ncol=3)
DESeq2::estimateSizeFactorsForMatrix(cts)
[1] 1.0042141 1.0050203 0.9964212
cts2 <- cts
cts2[1:160,2] <- cts2[1:160,2]*10 # first 40% of genes DE
cts2[240:400,3] <- cts2[240:400,3]*10 # last ~40% of genes DE

# normalise against geometric means
DESeq2::estimateSizeFactorsForMatrix(cts2)

[1] 0.4771446 1.0015055 0.9937962
# normalise against sample 1
DESeq2::estimateSizeFactorsForMatrix(cts2, geoMeans = cts2[,1])

[1] 0.9113232 1.0656295 1.0297252

"whether there's a particular reason DESeq2 makes a strong assumption..."

Again, median ratio does not in fact require a particular percentage of genes to have no changes (real experiments likely do not have null genes where LFC = 0, as we discuss in the 2014 paper); it is more about the scale and balance of the changes. I would also point out that it tends to work well and robustly for the vast majority of bulk datasets.

When bulk datasets do have large shifts in one direction or another, it is still possible to use controlGenes as prior information to compute normalization over genes that are not believed to have large shifts across conditions. Or one can easily use external normalization tools like RUVSeq or svaseq.
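As a concrete sketch of the controlGenes route, in the style of the simulations above (the control indices are known here only by construction; in practice they would come from spike-ins or other prior knowledge):

```r
set.seed(0)
cts <- matrix(rpois(800, 100), ncol = 2)
cts[1:300, 2] <- cts[1:300, 2] * 10   # large one-directional shift in 75% of genes
# Plain median ratio is dragged away by the unbalanced shift:
DESeq2::estimateSizeFactorsForMatrix(cts)
# Restricting the median to genes believed to be stable recovers ~c(1, 1):
DESeq2::estimateSizeFactorsForMatrix(cts, controlGenes = 301:400)
```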

However, I agree that alternative methods are necessary when the "samples" are cells of different type and not bulk samples of the same tissue.