Spike-in controls are not the panacea that we sometimes imagine them to be.
The first problem is, how much spike-in RNA do we add to each sample? Well, we can't literally add the same amount of spike-in RNA to each tube because that won't adjust for differences in the amount of endogenous RNA we collected from each sample. It generally doesn't make sense to scale the amount of spike-in RNA by the amount of endogenous RNA either, because any widespread/global changes in transcription between conditions would also be expected to increase the latter; scaling the amount of control RNA to match would just eliminate such changes upon normalization. Probably the most sensible strategy is to add the same amount of RNA per cell, but this assumes that we have accurate cell counts for each sample, and that's a bit of a stretch in some applications.
(It is for this reason that spike-in normalization is one of the few things that is technically easier for single-cell RNA-seq compared to bulk RNA-seq. Also note that ChIP-seq applications of spike-in methods generally have an easier time of it, because the total amount of DNA doesn't exhibit biological changes between conditions in an experiment involving diploid cells, so an operator can just match the amount of spiked-in chromatin to the concentration of input DNA.)
The next problem is how to apply the spike-ins for normalization. The most obvious approach is to scale the expression values so that the spike-in coverage is equal across all samples; there are some more sophisticated strategies, but the idea of "making spike-ins equal" is generally the same. However... then what?
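As a rough illustration of "making spike-ins equal", here is a minimal sketch in plain Python. The counts, the two-sample layout, and the function names are all made up for the example, and centring the factors on their mean is just one convenient choice; this is not how any particular package implements it.

```python
from statistics import mean

def spike_in_size_factors(spike_totals):
    """One scaling factor per sample, centred so the factors average to 1."""
    centre = mean(spike_totals)
    return [t / centre for t in spike_totals]

def normalize(counts, factors):
    """Divide each sample's counts (columns) by that sample's factor."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]

# Hypothetical data: rows are transcripts, columns are two samples.
spike_counts = [[100, 200], [50, 100]]   # spike-in transcripts
gene_counts = [[10, 40], [30, 60]]       # endogenous genes

# Total spike-in coverage per sample, then one factor per sample.
spike_totals = [sum(col) for col in zip(*spike_counts)]
factors = spike_in_size_factors(spike_totals)

# After scaling, total spike-in coverage is the same in both samples.
normalized_spikes = normalize(spike_counts, factors)
normalized_genes = normalize(gene_counts, factors)
```

Whatever differences remain in `normalized_genes` after this step are then attributed to biology, including any global shift in transcription.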
Let's consider a situation with whole-genome duplications, and with some hand-waving we will assume that, on average, genome duplication doubles the transcriptional output of most genes. Done correctly, spike-in normalization will yield expression values that are twice as large in the cells with duplicated genomes compared to normal cells. Great, and now every gene in the genome is significant with a log-fold change of ~1... not very helpful. Rather, I would be more interested in what changes occur on top of the duplication, e.g., synergistic effects that cause particular genes to be upregulated more than expected from a doubling of gene copy number. In such cases, we want to normalize out the duplication effect before we perform our tests - say, by assuming that most genes have log-fold changes driven by duplication - and this is where TMM and friends come into play.
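To make the duplication scenario concrete, here is a small sketch in plain Python with invented expression values. It uses the median log-fold change as a crude stand-in for what TMM-like methods do (it is not TMM itself), under the assumption that most genes change only because of the duplication.

```python
from math import log2
from statistics import median

# Hypothetical spike-in-normalized expression values (arbitrary units).
normal = {"geneA": 10.0, "geneB": 20.0, "geneC": 40.0, "geneD": 80.0, "geneE": 5.0}
duplicated = {"geneA": 20.0, "geneB": 40.0, "geneC": 80.0, "geneD": 160.0, "geneE": 40.0}

# After spike-in normalization, every gene shows the doubling (log2FC of 1),
# and geneE shows the doubling plus an extra change.
lfc = {g: log2(duplicated[g] / normal[g]) for g in normal}

# Assumption: most genes are driven by duplication only, so the median
# log-fold change estimates the global shift to be normalized out.
global_shift = median(lfc.values())
adjusted = {g: v - global_shift for g, v in lfc.items()}

# Genes that still stand out after removing the global shift.
on_top_of_duplication = [g for g, v in adjusted.items() if abs(v) > 0.5]
```

With these made-up numbers, only `geneE` remains after the adjustment, which is the kind of "on top of the duplication" effect described above. The 0.5 cutoff is arbitrary and only there to keep the example short.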
There are specific cases where spike-in normalization can be helpful for interpretation; see the book for more details. But I would say that these cases require some thought about what biological effects you want to study and how you plan to interpret your normalized expression values. It's not a slam dunk all the time.
There is also a recent journal article analysing 2500 whole genomes of cancer metastases, which found that more than half of them have whole genome duplication. It's remarkable that people are still doing median normalisation for cancer samples in 2020.
DESeq2 has `estimateSizeFactors`, which has a `controlGenes` argument, but edgeR's `calcNormFactors` doesn't even allow control genes to be specified.
I wonder if there is a set of housekeeping genes that are stably expressed in cancer. To date, all definitions of housekeeping genes used healthy samples to define a gene set. I think whole genome duplication would make many of those invalid.
If you needed to use control genes for some reason (and I'm not convinced that's really a good idea), then you can just run `calcNormFactors()` on that subset of genes and transfer the normalization factors back to your original `DGEList`.
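The general pattern of estimating factors on a control subset and then reusing them for every gene can be sketched in plain Python. This is only an illustration: it uses a median-of-ratios estimator as a stand-in rather than edgeR's TMM, and the gene names, counts, and the `median_ratio_factors` helper are all hypothetical.

```python
from math import exp, log
from statistics import mean, median

def median_ratio_factors(rows):
    """Per-sample factors from gene rows of strictly positive counts.

    Each sample's factor is the median ratio of its counts to the
    per-gene geometric mean across samples.
    """
    n_samples = len(rows[0])
    log_ref = [mean([log(c) for c in row]) for row in rows]
    factors = []
    for j in range(n_samples):
        log_ratios = [log(row[j]) - ref for row, ref in zip(rows, log_ref)]
        factors.append(exp(median(log_ratios)))
    return factors

# Hypothetical counts for two samples; "ctrl*" are the chosen control genes.
counts = {
    "ctrl1": [100, 200],
    "ctrl2": [50, 100],
    "ctrl3": [80, 160],
    "geneX": [10, 80],
    "geneY": [30, 30],
}
control_genes = ["ctrl1", "ctrl2", "ctrl3"]

# Estimate factors on the control subset only...
factors = median_ratio_factors([counts[g] for g in control_genes])

# ...then apply the same factors to every gene in the full table.
normalized = {g: [c / f for c, f in zip(row, factors)] for g, row in counts.items()}
```

Here the controls end up equal across samples by construction, so any remaining difference in `geneX` or `geneY` is measured relative to the controls. That is exactly why the choice of control genes matters so much.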
Awesome, thanks again Aaron!
There is a bit of debate in our department about which normalization to use, especially when we do scRNA-seq with 10X Genomics (besides RNA-seq); we'd expect a global increase in transcription as cell development progresses.
As we know, 10X Genomics scRNA-seq does not include spike-in controls.
Hi Aaron, if I may add:
Should we test the approach and normalize to "control" housekeeping genes, what is the minimal number of genes that we could use? The database at http://www.housekeeping.unicamp.br/?download reports ~1130 housekeeping genes common to human and mouse. Thanks a lot!
I think that perhaps they wouldn't be suitable in cancer, with chromothripsis and all that. Below, Aaron explains why TMM gives a more biologically interpretable set of genes, even though the fold changes will be off by a factor of 2 from their true value.
Thank you Dario.
Talking about housekeeping genes, I just came across an article: https://www.biorxiv.org/content/10.1101/787150v1
"HT Atlas v1.0 database: redefining human and mouse housekeeping genes and candidate reference transcripts by mining massive RNA-seq datasets"
Are you sure you want to use that database to do normalisation?