Question: Disagreement between TMM (edgeR) and RLE (DESeq) normalisation for single-cell data
kieranrcampbell10 wrote, 14 months ago:

Hi all,

I have an (admittedly poor quality) single-cell RNA-seq dataset. I was en route to performing the "highly variable gene selection" from http://www.nature.com/nmeth/journal/v10/n11/extref/nmeth.2645-S2.pdf, and compared the TMM normalisation factors from edgeR with the RLE size factors from DESeq2 (recommended in the above).

When the data are normalised there is good agreement on the mean expression of each gene, except for the highly expressed spike-ins, which I find totally bizarre. This has a profound effect on the spike-ins' position in the CV2-vs-mean plots we're after. All plots for this analysis can be found here:

http://rpubs.com/kieranrcampbell/single-cell-mysteries

Count estimates were made using Kallisto and collapsed to gene level using scater (which I believe uses tximport under the hood). Any help much appreciated,

Kieran

written 14 months ago by kieranrcampbell10 • modified 14 months ago by Aaron Lun17k

I don't know why the two methods disagree, but you might want to extract the normalization factors and examine them directly. Perhaps plot the log-ratio of the normalization factors against other covariates like the total count, to see if there is any apparent pattern.
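That diagnostic might look something like this in R (a sketch only; `raw_counts` is an illustrative name for the raw count matrix with cells in columns):

```r
library(edgeR)   # calcNormFactors
library(DESeq2)  # estimateSizeFactorsForMatrix

## TMM normalization factors, converted to effective library sizes so
## they are on a comparable scale to DESeq's size factors
nf_tmm <- calcNormFactors(raw_counts, method = "TMM")
sf_tmm <- nf_tmm * colSums(raw_counts)

## RLE size factors
sf_rle <- estimateSizeFactorsForMatrix(raw_counts)

## centre both sets, then plot their log-ratio against total count
ratio <- log2((sf_tmm / mean(sf_tmm)) / (sf_rle / mean(sf_rle)))
plot(colSums(raw_counts), ratio,
     xlab = "Total count per cell", ylab = "log2(TMM / RLE) size factor")
```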

written 14 months ago by Ryan C. Thompson6.1k

To my knowledge, the difference is mainly due to the fact that TMM includes a pre-normalization by library size before the actual normalization by the calculated factor. Personally, I don't recommend normalizing single-cell data by library size directly, meaning using CPM, RPKM/FPKM or TPM values for downstream quantitative analysis. This can easily lead to misinterpretation of your data.

See 10.1016/j.cell.2012.10.012, and https://haroldpimentel.wordpress.com/2014/12/08/in-rna-seq-2-2-between-sample-normalization/ for more information.

Personally, I will only normalize my single-cell data by size factors derived from non-differentially expressed features, i.e. the ERCC spike-ins, which obviously contradicts the point raised in the Nature Methods paper you cited above. But this makes more sense to me, a biologist.

Gary

written 8 months ago by garyhokawai0
Aaron Lun17k (Cambridge, United Kingdom) wrote, 14 months ago:

For starters, there's a problem with your code, in that calcNormFactors returns normalization factors. This is subtly different from the size factors of DESeq. In edgeR, the equivalent to the size factor is actually the effective library size, i.e., the product of the normalization factor and the library size for each sample. That is, the normalization factor captures sample-specific bias beyond differences in library size. (Yes, this is confusing, even for people who should know better, e.g., Davis and me.) So, if you want size factors, you should do:

nf_tmm <- calcNormFactors(raw_counts, method = "TMM")  # normalization factors
sf_tmm <- nf_tmm * colSums(raw_counts)  # effective library sizes, i.e. DESeq-style size factors

Anyway, in general, it doesn't seem appropriate to normalize spike-in counts with size/normalization factors computed from endogenous genes. The latter accounts for RNA content and will lead to over-normalization if applied to the former (which is not affected by RNA content, assuming you added the same amount of spike-in RNA to each cell). So, if you have a cell with lots of RNA, you'll have relatively higher coverage of endogenous genes, leading to a larger size factor to scale down the counts; but relatively lower coverage of the spike-in transcripts, whereby normalization scales down the spike-in counts further. It's usually better to compute a separate set of size factors for the spike-in transcripts, as was done in the Brennecke et al. paper.
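A sketch of that separation (assuming the spike-in rows can be identified by an "ERCC" prefix; adapt the pattern to your annotation):

```r
is.spike <- grepl("^ERCC", rownames(raw_counts))

## one set of size factors from the endogenous genes, one from the spike-ins
sf_gene  <- estimateSizeFactorsForMatrix(raw_counts[!is.spike, ])
sf_spike <- estimateSizeFactorsForMatrix(raw_counts[is.spike, ])

## normalize each feature set with its own factors
norm_gene  <- t(t(raw_counts[!is.spike, ]) / sf_gene)
norm_spike <- t(t(raw_counts[ is.spike, ]) / sf_spike)
```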

If you persevere with applying gene-based factors to all of the features, you'll scale up the large spike-in counts and scale down the small spike-in counts, thus inflating the variance of the spike-in transcripts. This is what you observe after RLE normalization, though I'm puzzled by why it doesn't also occur after TMM normalization. I would suggest that you take a closer look at the normalized spike-in expression values across cells, and see how the profile changes with RLE and TMM normalization (especially for extreme factors). Both methods start behaving strangely when there are a lot of zeroes around, but they needn't be similar in their strangeness.

P.S. If you ever want to do the trend fitting and significance calculations of Brennecke et al. without digging up their code from the supplementaries, you can use the technicalCV2 function in scran instead.
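For example (a sketch; check `?technicalCV2` for the exact interface in your version of scran):

```r
library(scran)
is.spike <- grepl("^ERCC", rownames(raw_counts))
fit <- technicalCV2(raw_counts, is.spike = is.spike)
head(fit)  # per-gene statistics from the fitted technical trend
```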

P.P.S. This just came out, for anyone who's interested: http://f1000research.com/articles/5-2122/v1.

modified 14 months ago • written 14 months ago by Aaron Lun17k

Hi Aaron,

Thanks very much for this. It entirely makes sense that if we use endogenous genes then we'll disproportionately change the (presumably constant) ERCCs. I think I'm still somewhat confused, as there seems to be conflicting advice as to whether ERCCs are stable enough for valid estimates, and I think the PDF I linked to above actually uses the endogenous genes for normalisation, unless I've read the R code wrong. Either way, using just spike-ins to normalise in the above example removes the effect... and now I've scrapped it and used scran.

Thanks again! 

written 14 months ago by kieranrcampbell10

If you're looking at the mouse/ERCC example, on page 28 of the linked PDF, there's the two lines:

sfMmus <- estimateSizeFactorsForMatrix( countsMmus )
sfERCC <- estimateSizeFactorsForMatrix( countsERCC )

... followed later by:

nCountsERCC <- t( t(countsERCC) / sfERCC )
nCountsMmus <- t( t(countsMmus) / sfMmus )

So the size factors do get computed and applied separately to the genes and spike-ins.

And oops, I forgot to mention normalization with computeSumFactors in scran. Yes, that'll work too.
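Something along these lines (a sketch; see `?computeSumFactors` for the current arguments, and consider clustering cells first for heterogeneous populations):

```r
library(scran)
is.spike <- grepl("^ERCC", rownames(raw_counts))

## deconvolution-based size factors from the endogenous genes only
sf_deconv <- computeSumFactors(raw_counts[!is.spike, ])
```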

modified 14 months ago • written 14 months ago by Aaron Lun17k

For exactly the reasons Aaron gives, I have calculated two sets of size factors, one for the endogenous genes (called sfMmus in the mouse data in Section 4.1, or sfAt in the plant data in Section 3.2) and one for the spike-ins (sfERCC and sfHeLa, respectively). In the paper, we have called them s_j^B and s_j, for the biological (endogenous) and technical (spike-in) size factors, respectively.

The way both sets of size factors are used together (see the last two equations in the Methods part of the paper) is a bit subtle; it is explained in Supplementary Note 6.

modified 14 months ago • written 14 months ago by Simon Anders3.4k

Hi Simon,

Why don't you just normalize the endogenous gene counts with the size factors derived from the spike-ins? That makes more sense to me.

Gary

written 8 months ago by garyhokawai0

Unfortunately, this is where theoretical elegance runs into practical problems with interpretation. The issue, as I discuss in the F1000Research link above, is that spike-in normalisation preserves differences in total RNA content between cells, whereas normalisation by endogenous gene counts does not. Now, if you're interested in differences in total RNA content, then spike-in normalisation is the way to go, and I know of a few projects where this is an explicit priority. However, in most cases, changes in RNA content are not of particular interest, and interpretation becomes more difficult than it needs to be if you get a whole stack of highly variable/differentially expressed genes (or equivalently, the first few principal components in PCA) being driven by a global increase in expression. 
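A toy illustration of that difference (the numbers are made up):

```r
## two cells: same spike-in input, but cell 2 has twice the endogenous RNA
genes  <- c(cell1 = 100, cell2 = 200)  # endogenous coverage
spikes <- c(cell1 = 50,  cell2 = 50)   # spike-in coverage

## gene-based size factors remove the 2-fold RNA content difference
genes / (genes / mean(genes))    # both cells end up at 150

## spike-in size factors preserve it
genes / (spikes / mean(spikes))  # still 100 vs 200
```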

written 8 months ago by Aaron Lun17k