Question: EdgeR differentially expressed genes vs normal boxplot visualization
0
22 months ago by
snowru0
snowru0 wrote:

I got a differentially expressed gene with log(mean CPM) = 2.2447, logFC = 11.2344, and adjusted p-value = 0.0016.

This looks neat. But the problem arises when I take the TPM (transcripts per million) values of these samples in the 2 groups and draw boxplots.

Attached is a boxplot.

It turns out that the medians of both groups are ZERO, and visually, these two groups should not be called different at all!

Here are the two arrays that I used for boxplot:

[0,0,0.0363,0,0,0,0,15.1621,0,0,0,0.091,13.1992,0,0.064,0,0,27.9052,15.4516,0,0,0,22.6814,0,0.0124,5.3274]

[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

Here is the boxplot picture: https://drive.google.com/open?id=0B0AM3r3EIYRUVl8zNFphWWJCbEk (somehow I can't attach it to this form)

Has anyone already encountered this problem? I would like to know how to justify this case (the statistical package edgeR calls the gene differentially expressed, but visually it clearly is not).

Thanks.

SnowRu

1
22 months ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

The statistical tests in edgeR don't care about the median. If you have two groups, then edgeR will test the null hypothesis that the mean count (normalized by effective library size) is equal between groups. In your case, the means are clearly different as one of the groups has all-zero expression and the other group has many samples with non-zero expression. The data provides evidence against the null hypothesis; ergo, you get a low p-value.
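To make the mean-versus-median point concrete, here is a minimal Python sketch that summarizes the two TPM arrays from the question. It is only an illustration: edgeR works on counts normalized by effective library size rather than on TPM values, and it fits a negative binomial model rather than comparing raw means, so the numbers below are a stand-in for intuition, not a reproduction of the test. The helper name `summarize` is made up for this example.

```python
from statistics import mean, median

# TPM values copied from the question.
group_a = [0,0,0.0363,0,0,0,0,15.1621,0,0,0,0.091,13.1992,0,0.064,0,0,
           27.9052,15.4516,0,0,0,22.6814,0,0.0124,5.3274]
group_b = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]


def summarize(values):
    """Return the mean, median, and fraction of samples with non-zero expression."""
    nonzero = sum(1 for v in values if v > 0)
    return {
        "mean": mean(values),
        "median": median(values),
        "frac_nonzero": nonzero / len(values),
    }


summary_a = summarize(group_a)
summary_b = summarize(group_b)

for name, s in (("A", summary_a), ("B", summary_b)):
    print(f"group {name}: mean={s['mean']:.4f} median={s['median']:.4f} "
          f"non-zero fraction={s['frac_nonzero']:.2f}")
```

Both medians come out as zero, which is why the boxplots look identical, but only the second group has a mean of zero. A boxplot is centered on the median, so it hides exactly the quantity being tested.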

0
22 months ago by
snowru0
snowru0 wrote:

Agreed, but if you present this boxplot to your audience, you will have a very hard time persuading them that this gene is differentially expressed.

Well, I'm not sure what you want edgeR to say. Clearly, this gene is differentially expressed between your groups. Perhaps not in every sample, but the mean expression is definitely different, so what more do you want? Similar scenarios arise in analyses of single-cell RNA-seq data where a gene may not be expressed in every cell of a population, but the average population-level expression is still different between two groups. I've never found this hard to explain.