Question: when to apply quantile normalization with voom in limma/voom framework with RNA-seq data?
lassancejm wrote, 21 months ago:

Hi all, 

I'm using voom and limma to analyse RNA-seq data. In general, I have followed the basic instructions regarding RNA-seq analysis presented in the limma manual. 

The matrix of read counts is filtered so that I keep genes where cpm>=1 in at least 3 samples. This is done using the following code:

keep=rowSums(cpm(rnaseqMatrix_parents)>=1)>=3
matrix=rnaseqMatrix_parents[keep,]

Next, I apply the TMM normalization and use the results as input for voom

DGE=DGEList(matrix)
DGE=calcNormFactors(DGE,method =c("TMM"))
v=voom(DGE,design,plot=T)

However, with one of my datasets, the plot produced by voom looks generally OK, but with a gap midway:

[voom plot with no normalization in voom]

Looking back at the manual for limma, I noticed the following note: 

If the data are very noisy, one can apply the same between-array normalization methods as would be used for microarrays, for example: v <- voom(counts,design,plot=TRUE,normalize="quantile")

Although I am not certain that my dataset qualifies as "very noisy" (happy to read what people have to say about it or to be pointed to examples), I decided to give it a whirl.

v2=voom(matrix,design,plot=T, normalize="quantile")

The plot looks better than before:

[voom plot with quantile normalization]

I should finish by saying that I went on and performed the differential expression analysis using the two outputs from voom. Things change marginally (1570 vs 1758 DE genes out of 15729 genes at 5% FDR when using the "TMM" vs "quantile" strategy). Not surprisingly, the genes differentially expressed in only one configuration have q-values close to the 0.05 cut-off.

I am a bit uncertain about taking the quantile normalization route (I did not find a reference where people had used it), so I am eager for input/guidelines on when to use TMM vs quantile normalization.

Input on this most welcome!

Thanks!

Jean-Marc

Gordon Smyth (Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia) wrote, 21 months ago:

You ask a good question. Personally I reserve quantile normalization for really extreme situations. On the other hand, one of my close colleagues, Wei Shi, likes to use quantile normalization almost routinely, see for example:

http://www.ncbi.nlm.nih.gov/pubmed/25894659

As you have seen, it doesn't make a spectacular difference in most cases; scale normalization is simpler but quantile normalization tends to give a few more DE genes.

To me, your voom plot with TMM looks fine. The plot shows that you have some clusters of transcripts with special characteristics, evidenced by the gap and by the blob of points in the middle left of the plot. These transcripts are probably outliers in a subset of samples, and this is something to explore as part of the downstream analysis, but to me it is not something for the normalization to solve. It doesn't cause any serious problems for voom.


Thank you for your answers, Gordon, and for pointing to a reference making use of quantile normalization.

I will be more careful about potential outliers (we expect some heterogeneity between the samples here) as well as interactions between the variables tested here, good point.

From what I see, I gain some DE genes, but I also lose some, which worries me a bit, as I find it arbitrary to favor one normalization route over the other at present.

Reply by lassancejm

Rather than worrying about the normalization too much, it is better to explore the data. E.g. try a BCV plot to look for dispersion outliers, or try robust=TRUE with eBayes() to downweight dispersion outliers. I like to do this:

dge <- estimateDisp(DGE, design, robust=TRUE)
plotBCV(dge)

then highlight genes for which dge$prior.df is small.
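A sketch of how one might do that highlighting (the cut-off here is arbitrary; DGE and design are the objects from the question, and the component names are those returned by edgeR's estimateDisp):

```r
library(edgeR)

# Robust dispersion estimation gives a genewise prior.df vector.
dge <- estimateDisp(DGE, design, robust=TRUE)
plotBCV(dge)

# Flag dispersion outliers: genes whose prior degrees of freedom
# were shrunk well below the typical value.
outliers <- dge$prior.df < median(dge$prior.df)/2   # arbitrary cut-off
points(dge$AveLogCPM[outliers],
       sqrt(dge$tagwise.dispersion[outliers]),
       col="red", pch=16)

# Inspect the flagged genes (assumes gene IDs are the rownames).
rownames(dge)[outliers]
```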

Reply by Gordon Smyth

Well, something is possibly going on here. As I wrote previously, I am expecting some heterogeneity in the dataset. More about the samples: they correspond to a particular brain region sampled in males and females of two different genetic backgrounds. So, there is probably some biological noise coming from sex, strain and their interaction, as well as from the sampling process. I have read that high levels of BCV are typically expected in brain samples.

The BCV plot linked here was produced using the code mentioned by Gordon (see his answer above):

[BCV plot]

So, to counter-balance the effect of outliers and based on the previous replies, the code should then be:

  DGE=DGEList(matrix)
  DGE=calcNormFactors(DGE, method="TMM")
  v=voom(DGE, design)
  fit=lmFit(v, design)
  fit=eBayes(fit, robust=TRUE)

Does that seem reasonable or do you recommend additional steps? Thanks again for your input!

Reply by lassancejm

Well, the above code may be enough. But you will learn more from the BCV plot if you look to see which are the most variable genes (those with small prior.df). Are they sex-linked genes on the Y and X chromosomes? Are they ribosomal genes? Are they mitochondrial? Are they clearly associated with a particular cell type? In my experience, the dispersion outliers almost always tell a story about your experiment.

Reply by Gordon Smyth

Very good points, thanks for your advice.

Reply by lassancejm

Just to add to Gordon's reply: quantile normalization was used along with limma/voom in the SEQC/MAQC-III study, and this approach has been demonstrated to perform very well for the SEQC RNA-seq data:

http://www.ncbi.nlm.nih.gov/pubmed/25150838

Reply by Wei Shi
Steve Lianoglou (Genentech) wrote, 21 months ago:

I believe the original voom paper (Law et al.) actually uses quantile normalization. Although I've never used it, it would be interesting to hear if the gurus have any pointers here.

Also, TMM and quantile normalization are separate things, and you should always ensure that you are using TMM instead of "simple" normalization by read depth. An easy way to ensure this works as expected is to create an edgeR::DGEList with your data and call calcNormFactors on it. Pass that data object to voom, not just a count matrix.
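A minimal sketch of that suggestion, with simulated counts standing in for real data:

```r
library(edgeR)
library(limma)

# Toy data: a genes-by-samples count matrix and a two-group design.
set.seed(1)
counts <- matrix(rnbinom(1000, mu=50, size=10), nrow=100)
group  <- rep(c("A","B"), each=5)
design <- model.matrix(~group)

dge <- DGEList(counts)
dge <- calcNormFactors(dge)   # TMM is the default method
v   <- voom(dge, design)      # voom then uses lib.size * norm.factors

# By contrast, passing the bare matrix normalizes by column totals only:
v.raw <- voom(counts, design)
```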

Update

In the event that someone skimming this thread doesn't read through the comments, I wanted to provide an update to this answer in light of Gordon's replies to this answer:

Take note that (contrary to what I said in my answer) Gordon suggests that you actually would not want to apply TMM followed by quantile normalization, but rather choose one or the other.


In the voom paper, we used TMM for the HapMap and D. melanogaster datasets and quantile for the SEQC and DBA/2J datasets. Wei Shi prepared the SEQC data and he likes to use quantile. I don't recall why Charity used quantile for the DBA/2J data.

Reply by Gordon Smyth

Thanks for chiming in, but I'm a bit confused that you are referencing TMM and quantile normalization as mutually exclusive entities.

Do you not consider these two separate things, as I mentioned in my answer? So, when you say that you used TMM for HapMap and fly, but quantile for SEQC, does this mean that when voom (essentially) calculates a cpm(counts, prior.count=0.5) of the count matrix, it's using straight-up colSums of the count matrix instead of an adjusted lib.size using TMM normalization factors in the SEQC (quantile normalized) case?

I'm pretty sure that's not what you mean, but that's how I'm reading what you're saying ... ? Even if I were going to quantile normalize my data, I'd still want to adjust the library size by the norm factors either way ...

Reply by Steve Lianoglou

Hi Steve,

No I wouldn't recommend TMM before quantile normalization, but it won't make much difference. If you apply scale normalization (such as TMM) then follow it up with quantile normalization of the logCPM values, then the scale normalization will be overwritten. Scale normalization becomes essentially an additive constant on the log-scale for each library, so it will be removed by quantile normalization.

We have never recommended using multiple normalization methods together as it confuses the issue somewhat.
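The point that scale normalization reduces to an additive constant on the log scale, which quantile normalization then overwrites, can be sketched numerically (simulated data; normalizeQuantiles is limma's between-array quantile normalization):

```r
library(limma)

set.seed(1)
logcpm <- matrix(rnorm(20), ncol=2)   # toy log-expression matrix, 2 samples

# "Scale-normalize" sample 2: on the log scale this is just a constant shift.
shifted <- logcpm
shifted[,2] <- shifted[,2] + 1

q1 <- normalizeQuantiles(logcpm)
q2 <- normalizeQuantiles(shifted)

# q2 differs from q1 only by a global constant (the shift averaged over
# samples), so the prior shift has no effect on between-sample differences.
range(q2 - q1)
```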

Reply by Gordon Smyth
Powered by Biostar version 2.2.0