Question

EdgeR differentially expressed genes vs normal boxplot visualization


snowru • 0

@snowru-12343

Last seen 7.2 years ago

I got a differentially expressed gene from edgeR, with log(mean CPM) = 2.2447, logFC = 11.2344 and adjusted p-value = 0.0016.

This looks neat. But the problem arises when I take the TPM (transcripts per million) values of these samples in the 2 groups and draw boxplots.

Attached is a boxplot.

It turns out that the medians of both groups are ZERO, and visually, these two groups should not be called different at all!

Here are the two arrays that I used for boxplot:

[0,0,0.0363,0,0,0,0,15.1621,0,0,0,0.091,13.1992,0,0.064,0,0,27.9052,15.4516,0,0,0,22.6814,0,0.0124,5.3274]

[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

Here is the boxplot picture: https://drive.google.com/open?id=0B0AM3r3EIYRUVl8zNFphWWJCbEk (somehow I can't attach it to this form)

Has anyone already encountered this problem? I would like to know how to justify this case (the statistical package edgeR calls the gene differentially expressed, but visually it clearly is not).

Thanks.

SnowRu

edgeR boxplot • 2.0k views

ADD COMMENT • link 7.2 years ago snowru • 0

score 1 · Answer 1 · 2017-02-12

The statistical tests in edgeR don't care about the median. If you have two groups, then edgeR will test the null hypothesis that the mean count (normalized by effective library size) is equal between groups. In your case, the means are clearly different as one of the groups has all-zero expression and the other group has many samples with non-zero expression. The data provides evidence against the null hypothesis; ergo, you get a low p-value.
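To make the mean-versus-median point concrete, here is a minimal sketch using the two TPM arrays posted in the question. It is only an illustration: edgeR fits a model to counts normalized by effective library size rather than to TPM values, and the variable names `group_a` and `group_b` are made up for this example.

```python
from statistics import mean, median

# TPM values copied from the question (not the counts edgeR actually models)
group_a = [0, 0, 0.0363, 0, 0, 0, 0, 15.1621, 0, 0, 0, 0.091, 13.1992,
           0, 0.064, 0, 0, 27.9052, 15.4516, 0, 0, 0, 22.6814, 0,
           0.0124, 5.3274]
group_b = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

for name, values in (("group_a", group_a), ("group_b", group_b)):
    # The median ignores the non-zero minority; the mean does not
    print(f"{name}: median = {median(values):.4f}, mean = {mean(values):.4f}")
```

Both medians come out as zero, which is why the boxplots look flat, while the mean of the first group is well above zero and the mean of the second is exactly zero.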

score 0 · Answer 2 · 2017-02-12


Agreed, but if you present this boxplot to your audience, you will have a very hard time persuading them that this gene is differentially expressed.

ADD COMMENT • link 7.2 years ago snowru • 0


Well, I'm not sure what you want edgeR to say. Clearly, this gene is differentially expressed between your groups. Perhaps not in every sample, but the mean expression is definitely different, so what more do you want? Similar scenarios arise in analyses of single-cell RNA-seq data where a gene may not be expressed in every cell of a population, but the average population-level expression is still different between two groups. I've never found this hard to explain.
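One way to present "expressed in some samples but not all" is to report the fraction of samples with non-zero expression in each group next to the boxplot. The sketch below does this for the TPM arrays from the question; the helper name `detected_fraction` is hypothetical, and this is a descriptive summary, not the test edgeR performs.

```python
# TPM values copied from the question
group_a = [0, 0, 0.0363, 0, 0, 0, 0, 15.1621, 0, 0, 0, 0.091, 13.1992,
           0, 0.064, 0, 0, 27.9052, 15.4516, 0, 0, 0, 22.6814, 0,
           0.0124, 5.3274]
group_b = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


def detected_fraction(values):
    """Return (number of samples with TPM > 0, fraction of such samples)."""
    detected = sum(1 for v in values if v > 0)
    return detected, detected / len(values)


for name, values in (("group_a", group_a), ("group_b", group_b)):
    detected, fraction = detected_fraction(values)
    print(f"{name}: {detected}/{len(values)} samples detected ({fraction:.0%})")
```

A summary like this shows an audience that the gene is detected in a sizeable minority of one group and in none of the other, which a median-based boxplot hides.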

P.S. Reply to answers using the "add comment" or "add reply" buttons, not the "add answer" button.

ADD REPLY • link 7.2 years ago Aaron Lun ★ 28k


This isn't a problem specific to RNA-seq. Any measurement at or near the detection limit of any assay is going to have a boxplot that looks like this.

ADD REPLY • link 7.2 years ago Ryan C. Thompson ★ 7.9k