normalization methods for scRNA-seq and RNA-seq in specific cases of increased global transcription
Bogdan ▴ 670
@bogdan-2367
Last seen 14 months ago
Palo Alto, CA, USA

Dear all,

after re-visiting some articles showing that C-MYC induces global changes in gene expression,

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3505597/pdf/nihms416894.pdf

https://www.sciencedaily.com/releases/2012/10/121025121841.htm

where spike-in controls were used for normalization, I thought I would ask for your advice:

-- if we have RNA-seq data collected from developing systems (where we expect a global increase in transcription between time0 and time1), would the TMM and DESeq2 normalization methods be advisable?

-- the same question applies to scRNA-seq (if we use a pseudo-bulk approach for differential expression with edgeR).

many thanks,

bogdan

RNA-seq scRNA-seq • 2.3k views

There is also a recent journal article analysing 2500 whole genomes of cancer metastases which found more than half of them have whole genome duplication. It's remarkable that people are still doing median normalisation for cancer samples in 2020.

DESeq2 has estimateSizeFactors which has a controlGenes argument but edgeR's calcNormFactors doesn't even allow control genes to be specified.

I wonder if there is a set of housekeeping genes that are stably expressed in cancer. To date, all definitions of housekeeping genes have used healthy samples to define the gene set. I think whole genome duplication would make many of those invalid.


If you needed to use control genes for some reason (and I'm not convinced that's really a good idea), then you can just run calcNormFactors() on that subset of genes and transfer the normalization factors back to your original DGEList:

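library(edgeR)
# y is the full DGEList; control.genes indexes the rows to use as controls.
# do not use keep.lib.sizes=FALSE when subsetting here:
y.con <- calcNormFactors(y[control.genes,])
# copy the control-derived normalization factors back to the full object:
y$samples$norm.factors <- y.con$samples$norm.factors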

awesome, thanks again Aaron !

There is a bit of a debate in our department about what normalization to use, especially if we do scRNA-seq by 10X Genomics (besides RNA-seq); we'd expect a global increase in transcription as cell development progresses.

As we know, 10X Genomics scRNA-seq does not include spike-in controls.


Hi Aaron, if I may add please :

if we test the approach and normalize to "control" housekeeping genes, what is the minimum number of genes that we could use? The database at http://www.housekeeping.unicamp.br/?download reports ~1130 housekeeping genes common to human and mouse. Thanks a lot!


I think that perhaps they wouldn't be suitable in cancer, with chromothripsis and all that. Below, Aaron explains why TMM gives more biologically interpretable results, even though the fold changes will be offset by a factor of 2 from their true value.


Thank you Dario.

talking about housekeeping genes, just came across an article : https://www.biorxiv.org/content/10.1101/787150v1 :

"HT Atlas v1.0 database: redefining human and mouse housekeeping genes and candidate reference transcripts by mining massive RNA-seq datasets"


... 82 human non-disease tissues/cells and 15 healthy tissues/cells of C57BL/6 wild type mouse ...

Are you sure you want to use that database to do normalisation?

Aaron Lun ★ 28k
@alun
Last seen 14 minutes ago
The city by the bay

Spike-in controls are not the panacea that we sometimes imagine them to be.

The first problem is, how much spike-in RNA do we add to each sample? Well, we can't literally add the same amount of spike-in RNA to each tube because that won't adjust for differences in the amount of endogenous RNA we collected from each sample. It generally doesn't make sense to scale the amount of RNA added by the amount of endogenous expression, because any wide-spread/global changes in transcription between conditions would also be expected to increase the latter; scaling the amount of control RNA to match would just eliminate such changes upon normalization. Probably the most sensible strategy is to add the same amount of RNA per cell, but this assumes that we have accurate cell counts for each sample, and that's a bit of a stretch in some applications.

(It is for this reason that spike-in normalization is one of the few things that is technically easier for single-cell RNA-seq compared to bulk RNA-seq. Also note that ChIP-seq applications of spike-in methods generally have an easier time of it, because the total amount of DNA doesn't exhibit biological changes between conditions in an experiment involving diploid cells, so an operator can just match the amount of spiked-in chromatin to the concentration of input DNA.)

The next problem is how to apply the spike-ins for normalization. The most obvious approach is to scale the expression values so that the spike-in coverage is equal across all samples; there are some more sophisticated strategies but the idea of "making spike-ins equal" is generally the same. However... then what?

Let's consider a situation with whole-genome duplications, and with some hand-waving we will assume that, on average, genome duplication doubles the transcriptional output of most genes. Done correctly, spike-in normalization will yield expression values that are twice as large in the cells with duplicated genomes compared to normal cells. Great, and now every gene in the genome is significant with a log-fold change of ~1... not very helpful. Rather, I would be more interested in what changes occur on top of the duplication, e.g., synergistic effects that cause particular genes to be upregulated more than expected by a doubling of gene copy. In such cases, we want to normalize out the duplication effect before we perform our tests - say, by assuming that most genes have log-fold changes driven by duplication - and this is where TMM and friends come into play.
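To make the duplication argument concrete, here is a minimal base R sketch with made-up numbers. The gene names, expression values and the extra 4-fold effect on g5 are all hypothetical, and centring on the median log-fold change is only a simplified stand-in for TMM (which uses a trimmed mean of M-values), not the edgeR implementation.

```r
# Hypothetical per-cell expression for five genes in normal cells.
normal <- c(g1 = 10, g2 = 20, g3 = 40, g4 = 80, g5 = 160)

# Whole-genome duplication: assume every gene doubles, and g5 is
# upregulated a further 4-fold on top of the duplication.
wgd <- normal * 2
wgd["g5"] <- wgd["g5"] * 4

# Idealized spike-in normalization keeps the per-cell values as they are,
# so every log-fold change includes the duplication effect.
lfc.spike <- log2(wgd / normal)

# Composition-style normalization (median centring as a stand-in for TMM):
# assume most genes share the same shift and remove it.
lfc.centred <- lfc.spike - median(lfc.spike)

print(rbind(spike.in = lfc.spike, centred = lfc.centred))
```

With the spike-in style values, all five genes show a change and g5 is just the largest; after centring, only g5 stands out, which is the "on top of the duplication" effect described above.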

There are specific cases where spike-in normalization can be helpful for interpretation; see the book for more details. But I would say that these cases require some thought about what biological effects you want to study and how you plan to interpret your normalized expression values. It's not a slam dunk all the time.
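For reference, the basic "making spike-ins equal" idea mentioned above can be sketched in a few lines of base R. This is only an illustration under simple assumptions: the count matrix is made up, the spike-in rows are identified by a hypothetical "spike" name prefix, and the size factors are plain spike-in totals centred to a mean of 1 rather than any of the more sophisticated strategies.

```r
# Toy count matrix: rows are features, columns are samples (made-up values).
counts <- rbind(
  geneA  = c(100, 400),
  geneB  = c( 50, 200),
  spike1 = c( 30,  60),
  spike2 = c( 70, 140)
)
colnames(counts) <- c("sample1", "sample2")

# Hypothetical naming convention: spike-in rows start with "spike".
is.spike <- grepl("^spike", rownames(counts))

# One size factor per sample: total spike-in count, centred to a mean of 1.
spike.totals <- colSums(counts[is.spike, , drop = FALSE])
size.factors <- spike.totals / mean(spike.totals)

# Dividing each column by its size factor equalizes the spike-in totals.
normalized <- sweep(counts, 2, size.factors, "/")

print(colSums(normalized[is.spike, , drop = FALSE]))
print(normalized[!is.spike, , drop = FALSE])
```

In this toy case the spike-in totals differ 2-fold between samples while the endogenous genes differ 4-fold, so the endogenous genes keep a 2-fold difference after scaling. Whether that remaining global difference is what you want to test is exactly the interpretation question raised above.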


Probably right about the total amount not being important, but perhaps a relative amount. I analysed qPCR data once and for it, the fold change to (a) 'housekeeping' gene(s) is calculated. I wonder if you can make a rough guess about which cells have whole genome duplication in single cell DNA-seq data by possibly finding a bimodal distribution in the total reads per cell. Would be convenient if this could be moved to Biostars forum.


Thank you, Dario, yes, i could post the question on Biostars too ...


Thanks a lot, Aaron for a very extensive answer.

I hope I can convince my colleagues to re-do the experiments using the ERCC SPIKE-IN probes.

If not, besides TMM and RLE normalization, we will also include normalization to a set of housekeeping genes (according to https://www.biorxiv.org/content/10.1101/787150v1 :

2,158 human HK transcripts from 2,176 HK genes and 3,024 mouse HK transcripts from 3,277 mouse HK genes)
