I have an RNA Seq experiment looking at the impact of a stressor on several fish species. The analysis in edgeR went smoothly in all but one species where I could only identify a few deferentially expressed genes. This fish has been shown to show muted transcriptional responses in the past and it's not a complete surprise to see a small response in this experiment, but looking at the p-value histogram showed a rising hill (shown in the link below) rather than the flat distribution that I would traditionally expect :
Is this the sign of a missing covariate? Outside of gender (which attempted to block on without impact) I'm hard pressed to think of any. The PCA for this particular species showed no clustering at all and nothing that would suggest another impact. Any advice on how to troubleshoot the analysis would be appreciated.
Yes, the rising hill p-value histogram could arise from a systematic batch effect or a missing covariate. If that is the cause, then it would typically also be associated with large dispersion estimates and a large prior df, so the BCV plot may also look unusual with a very strong trend.
You could perhaps search for hidden covariates using RUVseq or sva. However it appears that there really is no DE for your data, so persuing analysis refinements may not have much purpose.
You don't give any code details of how you have analysed your data in edgeR, but I am assuming that you have used estimateDisp() followed by either likelihood ratio tests with glmLRT or quasi-lilkelihood F-tests with glmQLFTest().
Thanks for the suggestion, I'm going to try those and see if they help. One follow up question about the dispersion- I actually see marginally smaller dispersion for the samples that are giving me the strange p-value distribution albeit they do also show an uptick in BCV at higher logCPM. They're shown here in comparison to two species that don't give me trouble when I analyze them:
I have seen the type of p-value distribution that you are seeing in similar situations. Specifically, when a particular contrast has both little to no real differential expression and low biological variability relative to other contrasts in the same data set. The result is that the null distribution of fold changes for that specific contrast is narrower than the BCV for the whole data set would predict, yielding a p-value histogram biased towards 1. Hence, this could be caused by heteroskedasticity between species rather than (or in addition to) an unobserved covariate.
You might try using limma and voomWithQualityWeights, which are better equipped to handle sample and group heteroskedasticity than edgeR. If this flattens the p-value histogram, you can be fairly confident that heteroskedasticity was the cause.
In edgeR, all the contrasts share the same BCV so, if heteroskedasticity was a substantial consideration, it would affect all contrasts in more or less the same way.
Sorry, I think I explained poorly. I didn't mean that different contrasts could have different BCV values. I meant that if one contrast involves only samples in the data set with lower-than-average variability, then the null distribution of logFC values will be smaller than expected for the whole-dataset BCV, yielding excessive large p-values for that contrast. This seems to be the case in the second plot, since the BCV estimated from only the "strange fish" samples is smaller than the other fish.
Well, if the dispersion is not large, then it doesn't appear that a batch effect is the problem. It appears more likely that there just is no DE.
Yes, an uptick in the BCV trend can occur from a missing covariate, but the effect here doesn't appear very great.
I notice that you have the same prior.df for all three datasets. It would appear than you are using the original edgeR-glm pipeline from 2012 instead of the newer pipelines with estimateDisp().
Thanks for the suggestion, I'm going to try those and see if they help. One follow up question about the dispersion- I actually see marginally smaller dispersion for the samples that are giving me the strange p-value distribution albeit they do also show an uptick in BCV at higher logCPM. They're shown here in comparison to two species that don't give me trouble when I analyze them:
https://imgur.com/a/lWsiEIg
Is this still consistent with a missing covariat or are there any other confounding influences I should keep my eye out for?
thank you again!
cheers,
Kevin
I have seen the type of p-value distribution that you are seeing in similar situations. Specifically, when a particular contrast has both little to no real differential expression and low biological variability relative to other contrasts in the same data set. The result is that the null distribution of fold changes for that specific contrast is narrower than the BCV for the whole data set would predict, yielding a p-value histogram biased towards 1. Hence, this could be caused by heteroskedasticity between species rather than (or in addition to) an unobserved covariate.
You might try using limma and voomWithQualityWeights, which are better equipped to handle sample and group heteroskedasticity than edgeR. If this flattens the p-value histogram, you can be fairly confident that heteroskedasticity was the cause.
In edgeR, all the contrasts share the same BCV so, if heteroskedasticity was a substantial consideration, it would affect all contrasts in more or less the same way.
Sorry, I think I explained poorly. I didn't mean that different contrasts could have different BCV values. I meant that if one contrast involves only samples in the data set with lower-than-average variability, then the null distribution of logFC values will be smaller than expected for the whole-dataset BCV, yielding excessive large p-values for that contrast. This seems to be the case in the second plot, since the BCV estimated from only the "strange fish" samples is smaller than the other fish.
Well, if the dispersion is not large, then it doesn't appear that a batch effect is the problem. It appears more likely that there just is no DE.
Yes, an uptick in the BCV trend can occur from a missing covariate, but the effect here doesn't appear very great.
I notice that you have the same prior.df for all three datasets. It would appear than you are using the original edgeR-glm pipeline from 2012 instead of the newer pipelines with estimateDisp().