Limma: log2 of TMM values for RNA-Seq Analysis and PlotMDS()?
Entering edit mode
Sara • 0
Last seen 6 weeks ago
United States


I am a Bioinformatics Research Assistant tasked with re-constructing our local sequencing centers analysis pipeline of RNA-Seq Data.

I am following several tutorials for an RNA-Seq pipeline using EdgeR and Limma. I have 6 samples - 3 mock and 3 infected - and I have already had the sequencing center perform analysis and they saw that one of our infected samples seemed to be an outlier.

Now, I am re-running the analysis.

I performed normalization:

d <- calcNormFactors(d, method = "TMM")

# use cpm() function to get normalized counts - applies nomalization factors so it is TMM when in combination with above function

And now was plotting MDS: (The group variable is where mock/infected is specified)


ltmm <- cpm(d, log = T)
tmm <- cpm(d)

plotMDS(ltmm, col = as.numeric(d$samples$group))
plotMDS(tmm, col = as.numeric(d$samples$group))

Now, plotting the log2 transformed TMM counts shows that the infected sample is plotted distant from the others and is also identical when I just plot "d" which is my DGElist object from above. When i plot the non-log transformed, there is no large gap between samples....... The first plot appears like that of the sequencing centers but I am unsure of which normalized counts to use for downstream DGE and why or why not I should apply log2 transformation?

Any advice or references on the correct normalization methods would be appreciated.

Thanks, Sara

Log2 edgeR RNA limma Normalization • 1.7k views
Entering edit mode

This image shows the MDS plot for using log2 transformed TMM counts.

This image shows the MDS plot for using just the TMM counts without log2 transform.

The first image shows the MDS plot for using log2 transformed TMM counts and the second shows the TMM counts plotted without log2 transform.

Entering edit mode
Last seen 9 hours ago
United States

For most Bioconductor packages, and for sure in the case of limma, you should use the object itself as input to the various functions, to get the reasonable defaults. In other words, you can extract the data from your DGEList object (you called 'd'), compute CPM and then use that as input to plotMDS. But then you have to decide if you should take logs or not.

As you already noticed, doing plotMDS(d) results in the same plot as (essentially) doing plotMDS(cpm(d, log = TRUE)), but without all that extra typing. Which tells you that the default method is to do cpm(d, log = TRUE), and which, being the default, is what you should be doing.

Entering edit mode

Thank you very much for clearing that up!

Entering edit mode

If you are interested in a little inside baseball, limma primarily uses what are called S3 methods, and if you type the function name (without parentheses) you get:

> plotMDS
function (x, ...) 
<bytecode: 0x000001c51d2a0e50>
<environment: namespace:limma>

And that UseMethod business indicates that it's an S3 method. So you can see the various methods by using (of all things) the methods function.

> methods(plotMDS)
[1] plotMDS.default              plotMDS.DGEList             
[3] plotMDS.MDS                  plotMDS.SummarizedExperiment
see '?methods' for accessing help and source code

Which for an end user isn't necessarily that useful, except there are often help pages for each of the methods (and Gordon Smyth is very good at providing extensive documentation, so you should expect the method-specific help pages to exist for any of his packages). You can access method-specific help page using e.g., ?plotMDS.DGEList, which will give you this:

plotMDS.DGEList             package:edgeR              R Documentation

Multidimensional scaling plot of distances between digital gene
expression profiles


     Plot samples on a two-dimensional scatterplot so that distances on
     the plot approximate the expression differences between the


     ## S3 method for class 'DGEList'
     plotMDS(x, top = 500, labels = NULL, pch = NULL, cex = 1,
             dim.plot = c(1,2), gene.selection = "pairwise", xlab = NULL, 
             ylab = NULL, method = "logFC", prior.count = 2, plot = TRUE, var.explained = TRUE, ...)


       x: a 'DGEList' or 'SummarizedExperiment' object.

     top: number of top genes used to calculate pairwise distances.

  labels: character vector of sample names or labels. If 'x' has no
          column names, then defaults the index of the samples.

     pch: plotting symbol or symbols. See 'points' for possible values.
          Ignored if 'labels' is non-'NULL'.

     cex: numeric vector of plot symbol expansions. See 'text' for
          possible values.

dim.plot: which two dimensions should be plotted, numeric vector of
          length two.

gene.selection: character, '"pairwise"' to choose the top genes
          separately for each pairwise comparison between the samples,
          or '"common"' to select the same genes for all comparisons.
          Only used when 'method="logFC"'.

    xlab: x-axis label

    ylab: y-axis label

  method: method used to compute distances. Possible values are
          '"logFC"' or '"bcv"'. Note the '"bcv"' method is scheduled to
          be removed in a future version of edgeR.

prior.count: average prior count to be added to observation to shrink
          the estimated log-fold-changes towards zero. Only used when

    plot: logical. If 'TRUE' then a plot is created on the current
          graphics device.

var.explained: logical. If 'TRUE' then the percentage variation
          explained is included in the axis labels.

     ...: any other arguments are passed to 'plot'.


     The default method ('method="logFC"') is to convert the counts to
     log-counts-per-million using 'cpm' and to pass these to the limma
     'plotMDS' function. This method calculates distances between
     samples based on log2 fold changes. See the 'plotMDS help page'
     for details.
Entering edit mode
Last seen 14 minutes ago
WEHI, Melbourne, Australia

I've upvoted James' answer but will add a few more comments.

The unlogged cpm values have a strong mean-variance relationship and are therefore unsuitable for input to PCA or MDS routines. The logged cpm values produced by cpm with log=TRUE on the other hand have relatively constant variances so that the distance calculations by performed by plotMDS make sense.

Your first (correct) MDS plot shows an outlier and also shows the mock samples separated from the infected samples on the second dimension. So you learn something about the samples which might guide you to improve the analysis or data quality. The second (incorrect) MDS plot on the other hand seems to have destroyed all the biological signal in the data so that all the samples are mixed up.

You say that you're following edgeR tutorials, but there are no edgeR tutorials that advise you to make MDS plots of unlogged cpms.

Entering edit mode

Thank you for the explanation. My confusion was coming from me not understanding that if you don't take prior log of normalized counts prior to plotMDS that the plotMDS function will do it for you of the DGElist object. Some tutorials show prior log normalized counts as input and some do not so yes, you're right; the tutorials didn't advise to made MDS plots of unlogged CPMs.


Login before adding your answer.

Traffic: 555 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6