Hi all,
I'm using voom and limma to analyse RNA-seq data. In general, I have followed the basic instructions regarding RNA-seq analysis presented in the limma manual.
The matrix of read counts is filtered so that I keep genes where cpm>=1 in at least 3 samples. This is done using the following code:
keep=rowSums(cpm(matrix)>=1)>=3 matrix=rnaseqMatrix_parents[keep,]
Next, I apply the TMM normalization and use the results as input for voom
DGE=DGEList(matrix) DGE=calcNormFactors(DGE,method =c("TMM")) v=voom(DGE,design,plot=T)
However, with one of my datasets, the plot produced by voom looks generally ok, but with a gap midway
Looking back at the manual for limma, I noticed the following note:
If the data are very noisy, one can apply the same between-array normalization methods as would be used for microarrays, for example: v <- voom(counts,design,plot=TRUE,normalize="quantile")
Although I am not certain that my dataset qualifies as "very noisy" (happy to read what people have to say about it or point me to examples) I decided to give it a whirl.
v2=voom(matrix,design,plot=T, normalize="quantile")
The plot looks better than before:
I should finish by saying that I went on and perform the differential expression analysis using the two outputs from voom. Things change marginally (1570 vs 1758 DEG out of 15729 genes at 5% FDR when using the "TMM" vs "quantile" strategy). Not surprisingly, the genes differentially expressed in one configuration only are associated with Q-values close to the 0.05 cut-off.
I am a bit uncertain about using the quantile normalization route (did not find a reference where people had used it), so I am eager to have inputs/guidelines about when to decide using TMM vs quantile norm.
Input on this most welcome!
Thanks!
Jean-Marc
Thank you for your answer(s) Gordon, and pointing to reference making use to the quantile normalization.
I will be more careful about potential outliers (we expect some heterogeneity between the samples here) as well as interactions between the variables tested here, good point.
From what I see, I gain some DE genes,but I also lose some, which worries me a bit as I find arbitrary to favor one normalization route over the other at present.
Rather than worry about the normalization too much, better to explore the data. E.g. try a BCV plot to look for dispersion outliers, or try robust=TRUE with eBayes() to downweight dispersion outliers. I like to do this:
then highlight genes for which dge$df.prior is small.
Well, something is possibly going on here. As I wrote previously, I am expecting some heterogeneity in the dataset. More about the samples: they correspond to a particular brain region sampled in males and females of two different genetic backgrounds. So, there is probably some biological noise coming from sex, strain and the interaction as well as due to the sampling process. I read previously that high level of BCV were typically expected in brain samples.
The BCV plot linked here was produced using the code mentioned by Gordon (see his answer above).
So, to counter-balance the effect of outliers and based on the previous replies, the code should then be:
Does that seem reasonable or do you recommend additional steps? Thanks again for your input!
Well, the above code may be enough. But you will learn more from the BCV plot if you look to see which are the most variable genes (with small df.prior). Are they sex-linked genes on the Y and X chromosomes? Are they ribosomal genes? Are they mitochondrial? Are they clearly associated with a particular cell type? In my experience, the dispersion outliers almost always tell a story that tells you something about your experiment.
Very good points, thanks for your advise.
Just add to Gordon's reply - the quantile normalization was used along with limma/voom in the SEQC/MAQC III study and this approach has been demonstrated to perform very well for the SEQC RNA-seq data:
http://www.ncbi.nlm.nih.gov/pubmed/25150838