Question

normalization methods for scRNA-seq and RNA-seq in specific cases of increased global transcription

1

Bogdan ▴ 670

@bogdan-2367

Last seen 16 months ago

Palo Alto, CA, USA

Dear all,

after re-visiting some articles showing that C-MYC induces global changes in gene expression,

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3505597/pdf/nihms416894.pdf

https://www.sciencedaily.com/releases/2012/10/121025121841.htm

where they used spike-in controls for normalization, I thought I would ask for your advice:

-- if we have RNA-seq data collected from developing systems (where we expect a global increase in transcription between time0 and time1), would the TMM and DESeq2 normalization methods be advisable?

-- the same question for scRNA-seq (if we use a pseudo-bulk approach for differential expression that includes edgeR).

many thanks,

bogdan

RNA-seq scRNA-seq • 2.4k views

ADD COMMENT • link updated 5.1 years ago by Aaron Lun ★ 28k • written 5.1 years ago by Bogdan ▴ 670

0

There is also a recent journal article analysing 2500 whole genomes of cancer metastases which found more than half of them have whole genome duplication. It's remarkable that people are still doing median normalisation for cancer samples in 2020.

DESeq2 has estimateSizeFactors which has a controlGenes argument but edgeR's calcNormFactors doesn't even allow control genes to be specified.

I wonder if there is a set of housekeeping genes that are stably expressed in cancer. To date, all definitions of housekeeping genes have used healthy samples to define the gene set. I think whole genome duplication would make many of those invalid.

ADD REPLY • link 5.1 years ago Dario Strbenac ★ 1.6k

2

If you needed to use control genes for some reason (and I'm not convinced that's really a good idea), then you can just run calcNormFactors() on that subset of genes and transfer the normalization factors back to your original DGEList:

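library(edgeR)
# y is an existing DGEList; control.genes indexes the rows to use as controls.
# do not use keep.lib.sizes=FALSE when subsetting here, so that the
# original library sizes are retained:
y.con <- calcNormFactors(y[control.genes,])
y$samples$norm.factors <- y.con$samples$norm.factors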

ADD REPLY • link 5.1 years ago Aaron Lun ★ 28k

0

awesome, thanks again Aaron !

There is a bit of a debate in our department about which normalization to use, especially if we do scRNA-seq with 10X Genomics (besides RNA-seq); we'd expect a global increase in transcription as cell development progresses.

As we know, 10X Genomics scRNA-seq does not include spike-in controls.

ADD REPLY • link 5.1 years ago Bogdan ▴ 670

0

Hi Aaron, if I may add please :

if we test the approach and normalize to "control" housekeeping genes, what is the minimum number of genes that we could use? The database at http://www.housekeeping.unicamp.br/?download reports ~1130 housekeeping genes common to human and mouse. Thanks a lot!

ADD REPLY • link 5.1 years ago Bogdan ▴ 670

0

I think that perhaps they wouldn't be suitable in cancer, with chromothripsis and all that. Below, Aaron explains why TMM gives a more biologically interpretable set of genes, even though the fold changes will be offset by a factor of 2 from their true value.

ADD REPLY • link 5.0 years ago Dario Strbenac ★ 1.6k

0

Thank you Dario.

Talking about housekeeping genes, I just came across an article: https://www.biorxiv.org/content/10.1101/787150v1

"HT Atlas v1.0 database: redefining human and mouse housekeeping genes and candidate reference transcripts by mining massive RNA-seq datasets"

ADD REPLY • link 5.1 years ago Bogdan ▴ 670

0

... 82 human non-disease tissues/cells and 15 healthy tissues/cells of C57BL/6 wild type mouse ...

Are you sure you want to use that database to do normalisation?

ADD REPLY • link 5.1 years ago Dario Strbenac ★ 1.6k

score 0 · Answer 1 · 2020-01-30

Spike-in controls are not the panacea that we sometimes imagine them to be.

The first problem is, how much spike-in RNA do we add to each sample? Well, we can't literally add the same amount of spike-in RNA to each tube because that won't adjust for differences in the amount of endogenous RNA we collected from each sample. It generally doesn't make sense to scale the amount of RNA added by the amount of endogenous expression, because any widespread/global changes in transcription between conditions would also be expected to increase the latter; scaling the amount of control RNA to match would just eliminate such changes upon normalization. Probably the most sensible strategy is to add the same amount of RNA per cell, but this assumes that we have accurate cell counts for each sample, and that's a bit of a stretch in some applications.

(It is for this reason that spike-in normalization is one of the few things that is technically easier for single-cell RNA-seq compared to bulk RNA-seq. Also note that ChIP-seq applications of spike-in methods generally have an easier time of it, because the total amount of DNA doesn't exhibit biological changes between conditions in an experiment involving diploid cells, so an operator can just match the amount of spiked-in chromatin to the concentration of input DNA.)

The next problem is how to apply the spike-ins for normalization. The most obvious approach is to scale the expression values so that the spike-in coverage is equal across all samples; there are some more sophisticated strategies but the idea of "making spike-ins equal" is generally the same. However... then what?
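To make the "making spike-ins equal" idea concrete, here is a minimal base-R sketch. The count matrix, the gene and spike names, and the premise that sample2 was sequenced twice as deeply while also having twice the endogenous expression are all made up for illustration; this is not the procedure of any particular package.

```r
# Toy count matrix: rows are features, columns are samples.
# The last two rows are hypothetical spike-in transcripts.
# Assumption: sample2 has 2x the sequencing depth AND 2x the endogenous expression.
counts <- matrix(
    c(100, 400,
       50, 200,
       30, 120,
       20,  40,
       40,  80),
    ncol = 2, byrow = TRUE,
    dimnames = list(c("geneA", "geneB", "geneC", "spike1", "spike2"),
                    c("sample1", "sample2"))
)
is.spike <- grepl("^spike", rownames(counts))

# Size factor per sample: total spike-in coverage, scaled to have a mean of 1.
spike.totals <- colSums(counts[is.spike, , drop = FALSE])
size.factors <- spike.totals / mean(spike.totals)

# Dividing each column by its size factor equalizes spike-in coverage.
normalized <- sweep(counts, 2, size.factors, "/")
print(normalized)
```

After scaling, the spike-in rows are identical between the two samples, and the endogenous genes retain the 2-fold difference that was not due to sequencing depth.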

Let's consider a situation with whole-genome duplications, and with some hand-waving we will assume that, on average, genome duplication doubles the transcriptional output of most genes. Done correctly, spike-in normalization will yield expression values that are twice as large in the cells with duplicated genomes compared to normal cells. Great, and now every gene in the genome is significant with a log-fold change of ~1... not very helpful. Rather, I would be more interested in what changes occur on top of the duplication, e.g., synergistic effects that cause particular genes to be upregulated more than expected from a doubling of gene copy number. In such cases, we want to normalize out the duplication effect before we perform our tests - say, by assuming that most genes have log-fold changes driven by duplication - and this is where TMM and friends come into play.
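The duplication scenario can be sketched numerically in base R. The six genes and their values are invented, and subtracting the median log-ratio is only a crude stand-in for the "most genes are not differentially expressed" assumption; it is not the actual TMM algorithm.

```r
# Hypothetical per-cell expression for six genes in normal cells.
normal <- c(g1 = 10, g2 = 20, g3 = 40, g4 = 80, g5 = 160, g6 = 50)

# Assumption: duplication doubles every gene, and g6 is
# upregulated a further 4-fold on top of the duplication.
duplicated <- normal * 2
duplicated["g6"] <- duplicated["g6"] * 4

# Ideal spike-in normalization preserves the global doubling,
# so these are the log-fold changes it would report.
lfc.spike <- log2(duplicated / normal)

# Crude "most genes are not DE" normalization: remove the median log-ratio,
# which here corresponds to the duplication effect.
lfc.relative <- lfc.spike - median(lfc.spike)

print(rbind(spike = lfc.spike, relative = lfc.relative))
```

With the spike-in style values, every gene has a log-fold change of at least 1; once the median is removed, only g6 stands out, with a log-fold change of 2 relative to the duplication baseline.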

There are specific cases where spike-in normalization can be helpful for interpretation; see the book for more details. But I would say that these cases require some thought about what biological effects you want to study and how you plan to interpret your normalized expression values. It's not a slam dunk every time.