Question

unexpected p-value histogram in edgeR

0

Entering edit mode

approximately.random • 0

@approximatelyrandom-16287

Last seen 7.6 years ago

Greetings,

I have an RNA Seq experiment looking at the impact of a stressor on several fish species. The analysis in edgeR went smoothly in all but one species where I could only identify a few deferentially expressed genes. This fish has been shown to show muted transcriptional responses in the past and it's not a complete surprise to see a small response in this experiment, but looking at the p-value histogram showed a rising hill (shown in the link below) rather than the flat distribution that I would traditionally expect :

https://imgur.com/a/zcRjwIO

Is this the sign of a missing covariate? Outside of gender (which attempted to block on without impact) I'm hard pressed to think of any. The PCA for this particular species showed no clustering at all and nothing that would suggest another impact. Any advice on how to troubleshoot the analysis would be appreciated.

thanks in advance,

Kevin

edgeR p-value distribution • 2.2k views

ADD COMMENT • link updated 7.6 years ago by Ryan C. Thompson ★ 7.9k • written 7.6 years ago by approximately.random • 0

Ryan C. Thompson · Answer 1 · 2018-06-27

0

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

Yes, the rising hill p-value histogram could arise from a systematic batch effect or a missing covariate. If that is the cause, then it would typically also be associated with large dispersion estimates and a large prior df, so the BCV plot may also look unusual with a very strong trend.

You could perhaps search for hidden covariates using RUVseq or sva. However it appears that there really is no DE for your data, so persuing analysis refinements may not have much purpose.

You don't give any code details of how you have analysed your data in edgeR, but I am assuming that you have used estimateDisp() followed by either likelihood ratio tests with glmLRT or quasi-lilkelihood F-tests with glmQLFTest().

ADD COMMENT • link 7.6 years ago Gordon Smyth 53k

0

Entering edit mode

Thanks for the suggestion, I'm going to try those and see if they help. One follow up question about the dispersion- I actually see marginally smaller dispersion for the samples that are giving me the strange p-value distribution albeit they do also show an uptick in BCV at higher logCPM. They're shown here in comparison to two species that don't give me trouble when I analyze them:

https://imgur.com/a/lWsiEIg

Is this still consistent with a missing covariat or are there any other confounding influences I should keep my eye out for?

thank you again!

cheers,

Kevin

ADD REPLY • link updated 7.6 years ago by Ryan C. Thompson ★ 7.9k • written 7.6 years ago by approximately.random • 0

0

Entering edit mode

I have seen the type of p-value distribution that you are seeing in similar situations. Specifically, when a particular contrast has both little to no real differential expression and low biological variability relative to other contrasts in the same data set. The result is that the null distribution of fold changes for that specific contrast is narrower than the BCV for the whole data set would predict, yielding a p-value histogram biased towards 1. Hence, this could be caused by heteroskedasticity between species rather than (or in addition to) an unobserved covariate.

You might try using limma and voomWithQualityWeights, which are better equipped to handle sample and group heteroskedasticity than edgeR. If this flattens the p-value histogram, you can be fairly confident that heteroskedasticity was the cause.

ADD REPLY • link 7.6 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

In edgeR, all the contrasts share the same BCV so, if heteroskedasticity was a substantial consideration, it would affect all contrasts in more or less the same way.

ADD REPLY • link 7.6 years ago Gordon Smyth 53k

0

Entering edit mode

Sorry, I think I explained poorly. I didn't mean that different contrasts could have different BCV values. I meant that if one contrast involves only samples in the data set with lower-than-average variability, then the null distribution of logFC values will be smaller than expected for the whole-dataset BCV, yielding excessive large p-values for that contrast. This seems to be the case in the second plot, since the BCV estimated from only the "strange fish" samples is smaller than the other fish.

ADD REPLY • link 7.6 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Well, if the dispersion is not large, then it doesn't appear that a batch effect is the problem. It appears more likely that there just is no DE.

Yes, an uptick in the BCV trend can occur from a missing covariate, but the effect here doesn't appear very great.

I notice that you have the same prior.df for all three datasets. It would appear than you are using the original edgeR-glm pipeline from 2012 instead of the newer pipelines with estimateDisp().

ADD REPLY • link 7.6 years ago Gordon Smyth 53k