Search
Question: Disagreement between TMM (edgeR) and RLE (DESeq) normalisation for single-cell data
1
22 months ago by
kieranrcampbell30 wrote:

Hi all,

I have an (admittedly poor quality) single-cell RNA-seq dataset. I was en-route to performing the "highly variable gene selection" from http://www.nature.com/nmeth/journal/v10/n11/extref/nmeth.2645-S2.pdf, and looked at the difference of the TMM normalisation factors from edgeR and RLE from DESeq2 (recommended in the above).

When the data is normalised there is good agreement on the mean expression of each gene, except for the highly expressed spike-ins, which I find totally bizarre. This has a profound effect on the spike-in's position in the CV2-mean plots we're looking for. All plots for this analysis can be found here:

http://rpubs.com/kieranrcampbell/single-cell-mysteries

Count estimates were made using Kallisto and collapsed to gene level using scater (which I believe uses tximport under-the-hood). Any help much appreciated,

Kieran

modified 22 months ago by Aaron Lun20k • written 22 months ago by kieranrcampbell30

I don't know why the two methods disagree, but you might want to extract the normalization factors and examine them directly. Perhaps plot the log ratio of normalization factors vs other covaraites like total count to see if there is any apparent pattern.

To my knowledge, the difference is mainly due to the fact that TMM has a pre-normalization by library size before the actual normalization by the calculated size factor. Personally, I don't recommend normalizing single cell data by library directly, meaning using CPM, RPKM/FPKM or TPM value for downstream quantitative analysis. This will definitely lead to misinterpretation of your data.

Personally, I will only normalize my single-cell data by the size factor derived from non-differentially expressed genes, i.e. ERCC, which obviously refutes the point raised in the Nature Methods paper you cited above. But this makes more sense to me, a biologist.

Gary

4
22 months ago by
Aaron Lun20k
Cambridge, United Kingdom
Aaron Lun20k wrote:

For starters, there's a problem with your code, in that calcNormFactors returns normalization factors. This is subtly different from the size factors of DESeq. In edgeR, the equivalent to the size factor is actually the effective library size, i.e., the product of the normalization factor and the library size for each sample. That is, the normalization factor captures sample-specific bias beyond differences in library size. (Yes, this is confusing, even for people who should know better, e.g., Davis and I.) So, if you want size factors, you should do:

nf_tmm <- calcNormFactors(raw_counts, method = "TMM")
sf_tmm <- nf_tmm * colSums(raw_counts)


Anyway, in general, it doesn't seem appropriate to normalize spike-in counts with size/normalization factors computed from endogenous genes. The latter accounts for RNA content and will lead to over-normalization if applied to the former (which is not affected by RNA content, assuming you added the same amount of spike-in RNA to each cell). So, if you have a cell with lots of RNA, you'll have relatively higher coverage of endogenous genes, leading to a larger size factor to scale down the counts; but relatively lower coverage of the spike-in transcripts, whereby normalization scales down the spike-in counts further. It's usually better to compute a separate set of size factors for the spike-in transcripts, as was done in the Brennecke et al. paper.

If you persevere with applying gene-based factors to all of the features, you'll scale up the large spike-in counts and scale down the small spike-in counts, thus inflating the variance of the spike-in transcripts. This is what you observe after RLE normalization, though I'm puzzled by why it doesn't also occur after TMM normalization. I would suggest that you take a closer look at the normalized spike-in expression values across cells, and see how the profile changes with RLE and TMM normalization (especially for extreme factors). Both methods start behaving strangely when there's a lot of zeroes around, but they needn't be similar in their strangeness.

P.S. If you ever want to do the trend fitting and significance calculations of Brennecke et al. without digging up their code from the supplementaries, you can use the technicalCV2 function in scran instead.

P.P.S. This just came out, for anyone who's interested: http://f1000research.com/articles/5-2122/v1.

Hi Aaron,

Thanks very much for this. Entirely makes sense that if we use endogenous genes then we'll disproportionately change the (presumably constant) ERCCs. I think I'm still somewhat confused as there seems to be conflicting advice as to whether ERCCs are stable enough for valid estimates, and I think the PDF I listed to above actually uses the endogenous genes for normalisation, unless I've read the R code wrong. Either way, using just spike-ins to normalise in the above example removes the effect...and now I've scrapped it and used scran.

Thanks again!

2

If you're looking at the mouse/ERCC example, on page 28 of the linked PDF, there's the two lines:

sfMmus <- estimateSizeFactorsForMatrix( countsMmus )
sfERCC <- estimateSizeFactorsForMatrix( countsERCC )

... followed later by:

nCountsERCC <- t( t(countsERCC) / sfERCC )
nCountsMmus <- t( t(countsMmus) / sfMmus )

So the size factors do get computed and applied separately to the genes and spike-ins.

And oops, I forgot to mention normalization with computeSumFactors in scran. Yes, that'll work too.

For exactly the reasons Aaron gives, I have calculated two sets of size factors, one for the endogenous genes (called sfMus in the mouse data in Section 4.1 or sfAt in the plant data in Section 3.2) and one for the spike ins (sfERCC and sfHeLa, respectively). In the paper, we have called them s_j^B and s_j, for the biological (endogeneous) and technical (spike-in) size factors, respectively.

The way how both sets of size factors are used together (as see in the last two equations in the Methods part of the paper) is a bit subtle and explained in Supplementary Note 6.

Hi Simon,

Why don't you just normalize the endogenous genes counts with size factor derived from spike-ins. This makes more sense to me.

Gary