RNA-Seq count size factors are defined in formula 5 of Anders & Huber (2010).
With pre-specified geometric means, are size factors supposed to be the same for identical samples regardless of which other samples are in the count matrix?
That is, if I calculate the size factor for a single sample, or extract the size factor for that sample from a larger set of samples, shouldn't the two be identical if the geometric means are fixed?
For example:
library(DESeq2)
set.seed(353567)
ddsRaw <- makeExampleDESeqDataSet(n=1000, m=40)
gm <- exp(rowMeans(log(counts(ddsRaw))))
dds <- estimateSizeFactors(ddsRaw, geoMeans=gm)
ddsSubset <- estimateSizeFactors(ddsRaw[, 10:20], geoMeans=gm)
all.equal(sizeFactors(dds)[10:20], sizeFactors(ddsSubset)) # Size factors are not equal
I think the code below from estimateSizeFactorsForMatrix() is responsible for the dataset-dependent size factors, but I do not understand how it relates to formula 5, because the result no longer depends solely on the reference geometric means.
if (incomingGeoMeans) {
sf <- sf/exp(mean(log(sf)))
}
Thanks!
Thanks for the quick reply!
I was under the impression that using an external geometric mean reference meant that size factors become context-independent. So you could normalize a single sample to a reference and get the same size factor.
You do get the same scaling across samples, up to a single global scaling. So the relative scaling between samples is fixed by fixing the geometric means.
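To illustrate the point, here is a minimal base R sketch. It is a simplified re-implementation of median-of-ratios size factors against a fixed reference, not the DESeq2 source; the helper names rawSF and centerSF and the simulated counts are made up for this example. It suggests that the size factors before centering are context-independent, and that the centered ones for a subset differ from those of the full set only by one constant.

```r
# Sketch only: simplified re-implementation, not the DESeq2 source.
# rawSF and centerSF are hypothetical helper names.
set.seed(1)
# Simulated count matrix: 200 genes x 8 samples (made-up data)
cnts <- matrix(rnbinom(200 * 8, mu = 100, size = 5), nrow = 200)
# Fixed reference: per-gene log geometric means over all samples
loggm <- rowMeans(log(cnts))

# Median-of-ratios size factors against a fixed reference, no centering
rawSF <- function(m, loggm) {
  apply(m, 2, function(x) {
    keep <- is.finite(loggm) & x > 0
    exp(median((log(x) - loggm)[keep]))
  })
}
# Centering step: divide by the geometric mean of the size factors
centerSF <- function(sf) sf / exp(mean(log(sf)))

sub <- 3:6
rawFull <- rawSF(cnts, loggm)
rawSub <- rawSF(cnts[, sub], loggm)
# Without centering, a sample's size factor does not depend on context
all.equal(rawFull[sub], rawSub)

# With centering, subset and full set differ by a single constant
ratio <- centerSF(rawSub) / centerSF(rawFull)[sub]
all.equal(ratio, rep(ratio[1], length(ratio)))
```

Both all.equal() calls should return TRUE. If context-independent values are needed, one option under these assumptions is to compute the uncentered factors yourself, as rawSF does here.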
Ok. Would it be possible to make this an option in a future release (i.e., optionally disable the relative scaling of size factors) or mention it somehow in the documentation? When the size factors for the same samples differ despite using a reference, the reason may not be apparent.
Sure, I've added this to the documentation:
Thanks for the explanation and modification!