22 months ago by
Cambridge, United Kingdom
For starters, there's a problem with your code, in that
calcNormFactors returns normalization factors. This is subtly different from the size factors of DESeq. In edgeR, the equivalent to the size factor is actually the effective library size, i.e., the product of the normalization factor and the library size for each sample. That is, the normalization factor captures sample-specific bias beyond differences in library size. (Yes, this is confusing, even for people who should know better, e.g., Davis and I.) So, if you want size factors, you should do:
nf_tmm <- calcNormFactors(raw_counts, method = "TMM")
sf_tmm <- nf_tmm * colSums(raw_counts)
Anyway, in general, it doesn't seem appropriate to normalize spike-in counts with size/normalization factors computed from endogenous genes. The latter accounts for RNA content and will lead to over-normalization if applied to the former (which is not affected by RNA content, assuming you added the same amount of spike-in RNA to each cell). So, if you have a cell with lots of RNA, you'll have relatively higher coverage of endogenous genes, leading to a larger size factor to scale down the counts; but relatively lower coverage of the spike-in transcripts, whereby normalization scales down the spike-in counts further. It's usually better to compute a separate set of size factors for the spike-in transcripts, as was done in the Brennecke et al. paper.
If you persevere with applying gene-based factors to all of the features, you'll scale up the large spike-in counts and scale down the small spike-in counts, thus inflating the variance of the spike-in transcripts. This is what you observe after RLE normalization, though I'm puzzled by why it doesn't also occur after TMM normalization. I would suggest that you take a closer look at the normalized spike-in expression values across cells, and see how the profile changes with RLE and TMM normalization (especially for extreme factors). Both methods start behaving strangely when there's a lot of zeroes around, but they needn't be similar in their strangeness.
P.S. If you ever want to do the trend fitting and significance calculations of Brennecke et al. without digging up their code from the supplementaries, you can use the
technicalCV2 function in scran instead.
P.P.S. This just came out, for anyone who's interested: http://f1000research.com/articles/5-2122/v1.
modified 22 months ago
22 months ago by
Aaron Lun • 20k