Reference paper or resource for limma::diffSplice and edgeR::diffSpliceDGE methods?
maltethodberg ▴ 180
Last seen 5 weeks ago

I have recently obtained very promising results using the diffSplice and diffSpliceDGE functions from limma and edgeR, respectively. I was surprised to find that neither method has a cited reference, despite diffSplice being included in the main limma paper and both methods appearing in the edgeR and limma user guides. DEXSeq, in comparison, has a separate reference in addition to DESeq/DESeq2.

This meant that I had to piece together what the methods actually do from the help files for diffSplice/topSplice and diffSpliceDGE/topSpliceDGE.

As far as I can tell, diffSplice works directly from the model fitted in a normal limma/edgeR analysis, unlike DEXSeq, which fits a separate model including the exons, although DEXSeq still uses the same dispersion estimation as DESeq2.

As I understand it, the F-test asks whether any exon's logFC differs from that of the other exons, yielding a single gene-level p-value. The exon-level test asks whether each exon has a logFC different from the average of the other exons in the same gene. These exon-level p-values are then corrected using the Simes method, and the lowest corrected p-value among the exons is used to represent the gene.

I am unfamiliar with the Simes method for correcting p-values. Conceptually, it seems similar to DEXSeq's approach with perGeneQvalue, where p-values are defined first at the exon level and then aggregated at the gene level (asking whether at least one exon-level p-value in the gene is significant). Intuitively, how is aggregating exon-level p-values using the Simes method different from using DEXSeq's perGeneQvalue? Does it possibly relate to the comment from the help files that "The exon-level tests are not recommended for formal error rate control."?

Any insight or pointers to resources are much appreciated.


edgeR limma diffSplice DEXSeq • 3.3k views
Charity Law ▴ 90
Last seen 5.7 years ago

I'm glad to hear that you are finding promising results using diffSplice and diffSpliceDGE from limma/edgeR. It is true that neither method has a cited reference as yet, but we are hoping to write something up in the near future.

It's not clear to me how DEXSeq's perGeneQvalue function works, so I can't comment much on the similarities between that and diffSplice's gene-level tests. Both diffSplice and diffSpliceDGE offer two gene-level tests: one using an F-test and the other using the Simes correction. In practice, the main difference between the two is that the F-test is better at picking out genes where the evidence of differential splicing comes from several exons (such that there are many exons with logFCs that are different from the rest), whereas the Simes correction is better at picking out genes where fewer exons are affected. For example, if there is a gene where the logFC of only one exon is very different from the rest, then the Simes method would pick this out better than the F-test.

"The exon-level tests are not recommended for formal error rate control" because our tests look at overall changes in exon expression patterns between groups. The expression of individual exons can be affected by the expression of multiple transcripts containing that exon for that gene. Depending on how the transcript-level expression translates into exon-level counts, looking at exon-level tests can be misleading and have inaccurate error rate control. This is why we don't recommend it.

Entering edit mode

Thanks for your reply. I haven't done any systematic investigation, but on the face of it, it does indeed seem that the F-test mainly finds differential splicing in genes with many exons, whereas the Simes correction seems to be more stable across different numbers of exons.

With regard to the exon-level test, I'm actually not using RNA-Seq data, but rather looking at expression from different promoters of the same gene. In that case, there is no uncertainty in the quantification of counts, since each transcript uniquely uses a single promoter. Would that mean that the error rate is controlled in this case?

Entering edit mode

No, it has nothing to do with uncertainty of quantification. Regardless of the nature of your data, it is not statistically correct to apply FDR control at a lower level (promoters or exons) when the ultimate aim is to interpret results at a higher level (genes). Simply looking for genes in which any exon has a low p-value will tend to select genes with a large number of exons, just by chance. Simes' method has the effect of making the minimum p-value for each gene uniformly distributed, regardless of the number of exons in that gene. See my reply to your other comment.
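The point about gene size can be checked with a small simulation. The following is only a sketch, assuming every exon is null with independent uniform p-values; the function names (`simes`, `false_positive_rates`) are hypothetical and none of this is limma or edgeR code. It compares how often a null gene is called at level 0.05 when it is represented by its raw minimum exon p-value versus its Simes p-value.

```python
import random


def simes(pvalues):
    """Simes combined p-value: min over i of n * p_(i) / i, p_(i) sorted ascending."""
    n = len(pvalues)
    return min(n * p / rank for rank, p in enumerate(sorted(pvalues), start=1))


def false_positive_rates(n_exons, n_genes=20000, alpha=0.05, seed=1):
    """Simulate null genes (assumption: independent uniform exon p-values).

    Returns the fraction of genes called at level alpha when each gene is
    represented by (raw minimum exon p-value, Simes p-value).
    """
    rng = random.Random(seed)
    hits_min = 0
    hits_simes = 0
    for _ in range(n_genes):
        p = [rng.random() for _ in range(n_exons)]
        hits_min += min(p) < alpha
        hits_simes += simes(p) < alpha
    return hits_min / n_genes, hits_simes / n_genes


for n_exons in (2, 10, 50):
    rate_min, rate_simes = false_positive_rates(n_exons)
    print(f"{n_exons:>3} exons: min-p rate = {rate_min:.3f}, Simes rate = {rate_simes:.3f}")
```

Under these assumptions the raw minimum is expected to be called at rate 1 - (1 - 0.05)^n, which grows quickly with the number of exons n, while the Simes rate should stay close to 0.05 whatever the gene size.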

Yunshun Chen ▴ 840
Last seen 9 weeks ago

The Simes method was introduced and described in the following paper:

R. J. Simes. An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73(3):751-754, 1986.

Simes' method controls the family-wise error rate in the weak sense, i.e., only when all null hypotheses are true (no exons within the gene are differentially used). I'm not sure how DEXSeq's perGeneQvalue works, though.
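For readers who have not met it before, the Simes combined p-value for n ordered p-values p_(1) <= ... <= p_(n) is the minimum over i of n * p_(i) / i. Below is a minimal sketch with made-up exon p-values; the helper name `simes` and the numbers are illustrative assumptions, not output from diffSplice.

```python
def simes(pvalues):
    """Simes combined p-value: min over i of n * p_(i) / i."""
    n = len(pvalues)
    ordered = sorted(pvalues)
    return min(n * p / i for i, p in enumerate(ordered, start=1))


# Hypothetical gene with a single strongly affected exon.
one_strong_exon = [0.001, 0.40, 0.55, 0.70, 0.90]

# Hypothetical gene with several moderately affected exons.
several_moderate_exons = [0.02, 0.03, 0.04, 0.05, 0.90]

print(simes(one_strong_exon))         # driven by the smallest p-value: 5 * 0.001 / 1
print(simes(several_moderate_exons))  # driven by the fourth smallest: 5 * 0.05 / 4
```

In the second gene a Bonferroni-style gene p-value would be 5 * 0.02 = 0.1, whereas Simes gives 0.0625, which is the sense in which the procedure improves on Bonferroni.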

Entering edit mode

Interesting, so what motivated the choice of this particular statistic over something more common like the Benjamini-Hochberg correction? Does it have to do with the fact that p-values can be correlated, as described in the introduction of the paper?

Entering edit mode

Actually, Simes' method is just as well known in mathematical statistics circles as Benjamini-Hochberg. In fact, Simes and BH are essentially the same algorithm, just used for slightly different purposes.

We use Simes simply because it is the most statistically powerful adjustment method that gives the required result, which is weak FWER control within a gene. We then apply BH to the gene-level Simes-adjusted p-values.

If you want to understand this approach, you could look at this paper:

Although the setting is different, the principles are the same. This article shows that applying the BH algorithm to window-level p-values fails to give correct FDR control at the region level. We solve this problem by using Simes' method to aggregate the window-level p-values for each region, then applying BH to the region-level Simes p-values. This process controls the FDR correctly at the region level, whereas other methods do not.
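The two-stage procedure described here (Simes within each gene or region, then BH across genes or regions) can be sketched as follows. This is an illustration under stated assumptions, not the limma, edgeR or csaw implementation: the gene names and exon p-values are invented, and `simes` and `benjamini_hochberg` are hypothetical helpers written for this example.

```python
def simes(pvalues):
    """Simes combined p-value: min over i of n * p_(i) / i."""
    n = len(pvalues)
    return min(n * p / i for i, p in enumerate(sorted(pvalues), start=1))


def benjamini_hochberg(pvalues):
    """BH adjusted p-values, returned in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical exon-level p-values, grouped by gene.
exon_pvalues = {
    "geneA": [0.0004, 0.30, 0.62, 0.81],
    "geneB": [0.20, 0.45, 0.50],
    "geneC": [0.010, 0.012, 0.015, 0.70, 0.90],
    "geneD": [0.35, 0.60],
}

genes = list(exon_pvalues)
# Stage 1: one Simes p-value per gene.
gene_p = [simes(exon_pvalues[g]) for g in genes]
# Stage 2: BH across genes, so the FDR is controlled at the gene level.
gene_fdr = benjamini_hochberg(gene_p)

for g, p, q in zip(genes, gene_p, gene_fdr):
    print(f"{g}: Simes p = {p:.4f}, gene-level FDR = {q:.4f}")
```

The sketch also shows why Simes and BH are "essentially the same algorithm": the Simes p-value of a gene equals the smallest BH-adjusted p-value computed within that gene.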


