Question: What problem does the MA plot diagnose? And how do you solve it?
ysdel (United States) wrote 8 months ago:

I have an RNASeq experiment, and I am using DESeq2. After I get the results, I plot the MA plot. This is the output of plotMA:

And this is my attempt at the MA plot:

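    library(ggplot2)
    # Convert the DESeqResults object to a plain data frame for ggplot2
    df <- as.data.frame(res)
    # Flag genes with padj < 0.05; genes with an NA padj count as not significant
    df$significant <- factor(!is.na(df$padj) & df$padj < 0.05, levels = c(FALSE, TRUE))
    ggplot(df, aes(x = log2(baseMean), y = log2FoldChange, color = significant)) +
        geom_point() +
        geom_hline(color = "blue3", yintercept = 0) +               # y = 0 reference line
        stat_smooth(se = FALSE, method = "loess", color = "red3") + # loess trend of M on A
        scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red"))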

 

  1. There is a slight bias at the right-hand end: genes with a high A tend to have a high M, and we are detecting more up-regulation than down-regulation. Is this a problem? What might be causing this, and more importantly, is there something we can do to fix it?
  2. Even if the effect is too slight to be a problem, what causes problems like this? Imbalanced sampling depth between the two conditions? Why doesn't normalization (sample size factors) fix this?

Also, is there a reason why DESeq2::plotMA doesn't plot the best fit line?

Disclaimer: Cross posted to BioStars

Ryan C. Thompson (The Scripps Research Institute, La Jolla, CA) wrote 8 months ago:

Depending on the biological effect you're studying, it might make perfect sense for more genes to be upregulated than downregulated, and if expression level is an indicator of importance to the tissue, then it might also make sense for many of the regulated genes to have high expression. If both of these assumptions are at least somewhat true, then your MA plot is exactly what you'd expect to see. Without knowing more about the experiment, I wouldn't say that your MA plot looks out of the ordinary. In other words, the non-zero correlation between M and A could very well be a real biological effect, in which case you would not want to normalize it away. Also consider that if you were to "center" the MA plot by subtracting from each gene's M value the M value of the loess curve at that gene's A value, you would still have many more up-regulated than down-regulated genes, so the imbalance cannot be explained purely by imperfect normalization.
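The centering check described above can be tried on simulated data. This is only a sketch under stated assumptions: the data are made up (not the poster's results), the 150 up versus 30 down split and the 1.5 cutoff are arbitrary, and `loess` is used with its default settings.

```r
# Simulated stand-in for a results table (made-up data, not the poster's).
set.seed(1)
n <- 2000
A <- runif(n, 0, 15)            # log2 mean expression
M <- rnorm(n, sd = 0.3)         # log2 fold change: mostly noise around 0

# Assumption: real regulation sits among highly expressed genes,
# with more genes going up (150) than down (30).
high <- which(A > 8)
up_idx <- sample(high, 150)
down_idx <- sample(setdiff(high, up_idx), 30)
M[up_idx] <- M[up_idx] + 2.5
M[down_idx] <- M[down_idx] - 2.5

# Fit the loess trend of M on A and subtract it gene by gene.
fit <- loess(M ~ A)
M_centered <- M - predict(fit, newdata = data.frame(A = A))

# Count genes beyond an arbitrary cutoff on each side after centering.
n_up <- sum(M_centered > 1.5)
n_down <- sum(M_centered < -1.5)
c(up = n_up, down = n_down)
```

In this toy setup the loess curve is pulled upward at high A by the regulated genes themselves, and the up-regulated genes still outnumber the down-regulated ones after the curve is subtracted.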

Size factor normalization doesn't remove this effect because dividing every gene in a sample by the same size factor is equivalent to shifting the entire MA plot up or down by a constant amount. Such a normalization cannot change the shape of the plot or remove a curve from it. If you are really convinced that this is a technical effect that you want to eliminate, you could certainly do so using a more heavy-handed method, such as quantile normalization.

DESeq2 doesn't plot a "best fit" line because, just as there is no one normalization that works for every case, there is no method for generating a "best fit" line that is the best fit for every situation. I believe DESeq2 doesn't actually fit any kind of line to the data; it just plots the line y=0, probably as a visual reminder that the log fold changes were squeezed toward this value.
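The size factor point can be illustrated with a simplified two-sample sketch. The counts and the size factor of 1.5 are made up, and the helper `ma()` is hypothetical; it uses the textbook definitions of M and A, not DESeq2's own model-based estimates.

```r
# Made-up counts for five genes in two samples.
x <- c(10, 50, 200, 1000, 5000)   # sample from condition 1
y <- c(12, 45, 260, 1500, 9000)   # sample from condition 2

# Hypothetical helper: textbook M and A after dividing by size factors.
ma <- function(x, y, sx = 1, sy = 1) {
  xn <- x / sx
  yn <- y / sy
  data.frame(A = (log2(xn) + log2(yn)) / 2,   # average log2 expression
             M = log2(yn) - log2(xn))         # log2 fold change
}

raw <- ma(x, y)
normalized <- ma(x, y, sx = 1, sy = 1.5)

# Every gene's M moves by the same constant, -log2(1.5);
# A moves by half of that. The shape of the cloud is unchanged.
normalized$M - raw$M
```

Because the shift is identical for every gene, the differences in M between genes are untouched, so a trend of M against A survives this kind of normalization.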


Thank you very much for explaining this! You're right, there are more up-regulated genes even after considering the Loess line, and there is plausible biological reason for this. I'm still curious though - the MA plot is supposed to detect some artifacts. When do those artifacts arise?

written 8 months ago by ysdel

"the MA plot is supposed to detect some artifacts"

I agree with Ryan. It's maybe counterproductive to consider the MA plot *only* as a tool for diagnosing problems.

And I wouldn't go down the path of trying to force the center of the LFCs to the y=0 line. That's definitely too heavy-handed in my opinion for RNA-seq data.

The MA plot is simply the log fold changes due to condition plotted over the mean expression.

And the y=0 line is drawn in simply to show what no change due to condition looks like.

written 8 months ago by Michael Love