Question

Normalization of RNA-seq data: between samples and within samples

0

nickbern92 • 0

@nickbern92-8851

Last seen 9.6 years ago

United States

I've been looking into normalization more and more, and I was wondering about a few things that perhaps some of you might know the answer to or want to discuss.

So there exists within-sample normalization (TPM or others), i.e. relative abundances, and between-sample normalization (TMM or others). But is it ever necessary to do both, i.e. is it ever necessary to normalize relative abundances across a cohort?

I don't think it would be, but another scenario which seems to be quite common is filtering out isoforms that have no expression in 90% (or some other threshold) of the samples when working with a large cohort. But if you do this while working with TPM, then the per-subject sum of TPM over the remaining isoforms will no longer be equal across subjects. Would it make sense to then use TMM after such a filtration process? I think it would.
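To make the concern concrete, here is a minimal sketch in plain Python. The counts, isoform lengths, and the choice of which isoform to drop are all made-up toy values, not data from any real experiment. It only shows that TPM sums to one million per sample before filtering, and that the sums can differ between samples once an isoform is removed.

```python
def tpm(counts, lengths):
    """Convert one sample's raw counts to TPM given isoform lengths."""
    rates = [c / l for c, l in zip(counts, lengths)]  # reads per base
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

# Hypothetical toy data: 4 isoforms, 2 samples.
lengths = [1000, 2000, 500, 1500]
sample_a = [100, 400, 0, 300]   # isoform 2 is not expressed here
sample_b = [50, 100, 40, 600]   # isoform 2 is expressed here

tpm_a = tpm(sample_a, lengths)
tpm_b = tpm(sample_b, lengths)

# Pretend a cohort-level filter removed isoform 2 (index 2).
keep = [0, 1, 3]
sum_a = sum(tpm_a[i] for i in keep)
sum_b = sum(tpm_b[i] for i in keep)

print(sum(tpm_a), sum(tpm_b))  # both about 1e6 before filtering
print(sum_a, sum_b)            # no longer equal after filtering
```

In this toy case sample A keeps its full million because the dropped isoform was zero there, while sample B loses the share that isoform carried.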

Do you think such filtering out of isoforms is flawed in some manner?

My guess is that it is used because people are worried about the sensitivity of RNA-seq, and biologically most think that for a specific tissue type a good percentage of genes are not expressed. So I think it makes some sense.

It seems like all between-sample normalizations require raw counts as input, and leave it there. I read Harold Pimentel's blog post about it (https://haroldpimentel.wordpress.com/2014/12/08/in-rna-seq-2-2-between-sample-normalization/, very informative) but I haven't seen a follow-up about this problem, if it is a problem.

I'm new to this stuff, so I was wondering what others' thoughts are on the issue.

rnaseq normalization tpm tmm • 12k views

ADD COMMENT • link 9.5 years ago nickbern92 • 0

score 2 · Answer 1 · 2015-10-08

2

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 6 months ago

Icahn School of Medicine at Mount Sinai…

Any within-sample scalar normalization is going to be completely overridden by a scalar between-sample normalization like edgeR's TMM or DESeq's method. Also, I don't know if the TMM and DESeq normalizations actually assume raw counts (except that I believe the optional weighting scheme in TMM assumes raw counts). The goal of both of them is roughly to ensure that after normalization, the average gene's log fold change is zero between all samples, for varying robust definitions of "average gene". Scalar within-sample normalizations will have no effect on this, and independent filtering of genes should have minimal effect.
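As a rough illustration of the "overridden" point, here is a sketch of a median-of-ratios size factor in the style of DESeq's method, written in plain Python. It is a simplification and not the package's code: it assumes every value is positive, and the numbers and the per-sample constants in `scalars` are invented for the example. Multiplying each sample by its own arbitrary constant (which is all a scalar within-sample normalization does) leaves the normalized between-sample ratios unchanged.

```python
import math
from statistics import median

def size_factors(samples):
    """Median-of-ratios size factors; assumes all values are > 0."""
    n_genes = len(samples[0])
    # Log of the per-gene geometric mean across samples.
    log_geo = [
        sum(math.log(s[g]) for s in samples) / len(samples)
        for g in range(n_genes)
    ]
    return [
        math.exp(median(math.log(s[g]) - log_geo[g] for g in range(n_genes)))
        for s in samples
    ]

def normalize(samples):
    """Divide each sample by its size factor."""
    factors = size_factors(samples)
    return [[v / f for v in s] for s, f in zip(samples, factors)]

# Hypothetical toy data: 3 samples x 4 genes.
raw = [
    [10.0, 200.0, 35.0, 80.0],
    [22.0, 390.0, 75.0, 150.0],
    [9.0, 210.0, 30.0, 95.0],
]
# Apply a different arbitrary scalar to each sample.
scalars = [1.0, 0.37, 12.5]
rescaled = [[v * k for v in s] for s, k in zip(raw, scalars)]

norm_raw = normalize(raw)
norm_rescaled = normalize(rescaled)

# Between-sample ratios per gene, relative to the first sample.
ratios_raw = [[b / a for a, b in zip(norm_raw[0], s)] for s in norm_raw[1:]]
ratios_rescaled = [[b / a for a, b in zip(norm_rescaled[0], s)] for s in norm_rescaled[1:]]
```

The normalized values themselves shift by one constant shared by all samples (the geometric mean of the scalars), so fold changes between samples are identical either way.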

Once you move away from simple scalar normalizations, however, things get a lot more complicated.

ADD COMMENT • link 9.5 years ago Ryan C. Thompson ★ 7.9k

0

Another rookie here, sorry if I'm misunderstanding something obvious. I am a little confused by what constitutes a 'sample'. I'm doing single-cell RNA-seq with 100 cells, with cDNA for each made separately, then pooled for library enrichment and then sequenced together in a single lane. I get that between-sample normalization is necessary when comparing data across different lanes or sequencing runs. But what about the case where all samples (multiplexed, of course) were run in a single lane? My single cells have varying total read counts (0.8 to 1.2 million reads), so should I treat each single cell as a sample and do normalization (e.g., TPM) within that sample? Would between-sample normalization be necessary if I wanted to compare expression of gene A across the 100 cells I sequenced together?

ADD REPLY • link 9.4 years ago brs120c • 0

0

All my advice above is for "ordinary" RNA-seq. I have no experience with single-cell sequencing, but I know that it requires a quite different statistical approach, so you probably shouldn't try to apply anything from this question to it. If you want to know about single-cell sequencing, ask a separate question about that.

In any case, though, I can tell you that normalizing for the technical batch effects inherent in the sequencing technology is only one of the reasons that between-sample normalization is required, and you should read the TMM paper for a more detailed explanation of this topic (even though the TMM method itself may not be suitable for single-cell data): http://www.genomebiology.com/2010/11/3/R25

ADD REPLY • link 9.4 years ago Ryan C. Thompson ★ 7.9k

score 0 · Answer 2 · 2015-10-08

0

nickbern92 • 0

@nickbern92-8851

Last seen 9.6 years ago

United States

Interesting. I guess I'm still wondering whether there is any situation where it's necessary to do between-sample normalization after within-sample normalization.

So TMM and DESeq both expect raw counts, but to a rookie like me the assumptions behind their formulas don't seem to be violated by TPM.

ADD COMMENT • link 9.5 years ago nickbern92 • 0

1

If you want to compare expression between samples, you need to do between-sample normalization.

TMM and DESeq normalization are both quite simple scaling normalizations, and are probably applicable in almost any data where you expect that any global change across all genes is a normalization issue and not a true change. If your data are not on a raw count scale, though, you would probably want to use doWeighting=FALSE for TMM, since I believe the weighting scheme assumes a raw count scale.
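For intuition about what an unweighted trimmed mean of log ratios does, here is a heavily simplified sketch in plain Python. It is not edgeR's implementation: edgeR also trims on average abundance and, by default, applies precision weights, both of which are left out here. The function name `tmm_factor_unweighted`, the toy values, and the 30% trim are assumptions made for the example.

```python
import math

def tmm_factor_unweighted(sample, ref, trim=0.3):
    """Unweighted trimmed mean of log2 ratios against a reference sample.

    Simplified: trims only on the log ratios (M values) and uses a plain
    mean, so it works on any positive scale, not just raw counts.
    """
    n_s, n_r = sum(sample), sum(ref)
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, ref)
        if s > 0 and r > 0
    )
    cut = int(len(m_values) * trim)          # genes dropped at each end
    kept = m_values[cut:len(m_values) - cut]
    return 2 ** (sum(kept) / len(kept))

# Hypothetical toy data: same composition at twice the depth, except the
# last gene, which is strongly up in `sample`.
ref = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
sample = [20, 40, 60, 80, 100, 120, 140, 160, 180, 2000]

factor = tmm_factor_unweighted(sample, ref)
effective_size = sum(sample) * factor

# Scaling the whole sample by a constant does not change the factor,
# because the library size scales by the same constant.
rescaled = [v * 0.5 for v in sample]
factor_rescaled = tmm_factor_unweighted(rescaled, ref)

normalized_first_gene = sample[0] / effective_size
reference_first_gene = ref[0] / sum(ref)
```

Here the one strongly changed gene is trimmed away, so the nine unchanged genes come out with the same normalized value in both samples, which is the "global change is a normalization issue" assumption at work.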

ADD REPLY • link 9.5 years ago Ryan C. Thompson ★ 7.9k