edgeR differentially expressed genes vs. normal boxplot visualization
snowru • 0
@snowru-12343

edgeR reported a differentially expressed gene with log(mean CPM) = 2.2447, logFC = 11.2344, and adjusted p-value = 0.0016.

This looks neat. But the problem arises when I take the TPM (transcripts per million) values of the samples in the 2 groups and draw boxplots.

Attached is a boxplot.

It turns out that the medians of both groups are ZERO, and visually, these two groups should not be called different at all!

Here are the two arrays that I used for boxplot: 

[0,0,0.0363,0,0,0,0,15.1621,0,0,0,0.091,13.1992,0,0.064,0,0,27.9052,15.4516,0,0,0,22.6814,0,0.0124,5.3274]

 [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

 

Here is the boxplot picture: https://drive.google.com/open?id=0B0AM3r3EIYRUVl8zNFphWWJCbEk (somehow I can't attach it to this form)

 

Has anyone already encountered this problem? I would also like to know how to justify this case (the statistical package edgeR calls the gene differentially expressed, but visually it clearly is not).

Thanks. 

 

SnowRu

 

 

edgeR boxplot • 2.0k views
Aaron Lun ★ 28k
@alun

The statistical tests in edgeR don't care about the median. If you have two groups, then edgeR will test the null hypothesis that the mean count (normalized by effective library size) is equal between groups. In your case, the means are clearly different as one of the groups has all-zero expression and the other group has many samples with non-zero expression. The data provides evidence against the null hypothesis; ergo, you get a low p-value.
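To see the mean-versus-median distinction on the numbers posted in the question, here is a minimal Python sketch using only the standard library. It works on the posted TPM values, not on the counts that edgeR actually models, so it only illustrates why the two summaries disagree and does not reproduce edgeR's test. The variable names are made up for illustration.

```python
from statistics import mean, median

# TPM values copied from the question; the group names are illustrative.
group_a = [0, 0, 0.0363, 0, 0, 0, 0, 15.1621, 0, 0, 0, 0.091, 13.1992,
           0, 0.064, 0, 0, 27.9052, 15.4516, 0, 0, 0, 22.6814, 0,
           0.0124, 5.3274]
group_b = [0] * 19

# Summarise each group by both its mean and its median.
summary = {
    "group_a": {"mean": mean(group_a), "median": median(group_a)},
    "group_b": {"mean": mean(group_b), "median": median(group_b)},
}

for name, stats in summary.items():
    print(f"{name}: mean = {stats['mean']:.4f}, median = {stats['median']:.4f}")
```

Both medians come out as zero because more than half of the samples in each group are zero, while the mean of the first group is well above zero. That is the difference the boxplot hides.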

snowru • 0
@snowru-12343

Agreed, but if you present this boxplot to your audience, you will have a very hard time persuading them that this gene is differentially expressed.


Well, I'm not sure what you want edgeR to say. Clearly, this gene is differentially expressed between your groups. Perhaps not in every sample, but the mean expression is definitely different, so what more do you want? Similar scenarios arise in analyses of single-cell RNA-seq data where a gene may not be expressed in every cell of a population, but the average population-level expression is still different between two groups. I've never found this hard to explain.
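One possible way to describe this pattern to an audience is to report, for each group, the fraction of samples in which the gene is detected at all. Below is a rough Python sketch on the TPM values from the question. The helper `detected_fraction` and the zero threshold are illustrative assumptions, not something edgeR computes.

```python
# TPM values copied from the question; the group names are illustrative.
group_a = [0, 0, 0.0363, 0, 0, 0, 0, 15.1621, 0, 0, 0, 0.091, 13.1992,
           0, 0.064, 0, 0, 27.9052, 15.4516, 0, 0, 0, 22.6814, 0,
           0.0124, 5.3274]
group_b = [0] * 19


def detected_fraction(values, threshold=0.0):
    """Return the fraction of samples strictly above the threshold."""
    return sum(v > threshold for v in values) / len(values)


frac_a = detected_fraction(group_a)
frac_b = detected_fraction(group_b)

print(f"group_a: detected in {frac_a:.1%} of samples")
print(f"group_b: detected in {frac_b:.1%} of samples")
```

The gene is detected in 10 of the 26 samples in the first group and in none of the 19 samples in the second, which matches the "expressed in some but not all samples" situation described above.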

P.S. Reply to answers using the "add comment" or "add reply" buttons, not the "add answer" button.


This isn't a problem specific to RNA-seq. Any measurement at or near the detection limit of any assay is going to have a boxplot that looks like this.
