Question

Designating contrasts in diffSpliceDGE to test for nested effects

0

Entering edit mode

rachel.steward • 0

@rachelsteward-24298

Last seen 4.7 years ago

I am comparing differential gene expression (using glmQLFTest) and differential splicing (using diffSpliceDGE). While I have been able to test my contrasts of interest for gene expression, I am having a hard time doing the same for splicing.

I have designed contrast matrices to test my comparisons of interest: the effects of history (Ext. v. Cor) and chemistry (Ast. v. Ran. v. Ros) within four different tissues (Fat, Lab, Gut, Mal). I have created a large design matrix:

> design.all <- model.matrix(~0 + group, data = VC_metadata) # interaction effects
> colnames(design.all) <- levels(VC_metadata$group)
> colnames(design.all)
# [1]  "Fat.Cor.Ast" "Fat.Cor.Ros" "Fat.Ext.Ast" "Fat.Ext.Ran" 
# [5]  "Gut.Cor.Ast" "Gut.Cor.Ros" "Gut.Ext.Ast" "Gut.Ext.Ran" 
# [9]  "Lab.Cor.Ast" "Lab.Cor.Ros" "Lab.Ext.Ast" "Lab.Ext.Ran"
# [13] "Mal.Cor.Ast" "Mal.Cor.Ros" "Mal.Ext.Ast" "Mal.Ext.Ran"

Focusing on the "Fat" tissue, I created three contrast matrices targeting my contrasts of interest: Main effect of history

> FB_Main.F <- makeContrasts((Fat.Cor.Ast + Fat.Cor.Ros)/2 - (Fat.Ext.Ast + Fat.Ext.Ran)/2, levels = design.all)
> colnames(FB_Main.F) <- "FB_Main.F"
> rownames(FB_Main.F) <- levels(VC_metadata$group)
> colSums(FB_Main.F)
FB_Main.F 
        0 
> sum(FB_Main.F[which(FB_Main.F > 0),])
[1] 1
> sum(FB_Main.F[which(FB_Main.F < 0),])
[1] -1

Main effect of chemistry

> HPcontrasts <- rbind(
+   c(0.5,1)/2,
+   c(0.5, -1), 
+   c(0.5,1)/2,
+   c(-1,0))
> nullcontrasts <-  rbind(c(0,0), c(0,0), c(0,0), c(0,0))
> FB_Main.HP <- rbind(
+   HPcontrasts,
+   nullcontrasts,
+   nullcontrasts, 
+   nullcontrasts)
> rownames(FB_Main.HP) <- levels(VC_metadata$group)
> colnames(FB_Main.HP) <- paste("C",1:2,sep="_")
> crossprod(FB_Main.HP)
       C_1   C_2
C_1  1.375 -0.25
C_2 -0.250  1.50
> colSums(FB_Main.HP)
C_1 C_2 
  0   0 
> for (ct in 1:2){
+   print(sum(FB_Main.HP[which(FB_Main.HP[,ct] > 0),ct]))
+   print(sum(FB_Main.HP[which(FB_Main.HP[,ct] < 0),ct]))
+ }
[1] 1
[1] -1
[1] 1
[1] -1

History x chemistry interaction

> FxHPcontrasts <- cbind(
+   c(rep(1/3,3),-1),
+   c(rep(1/2,2),-1,0),
+   c(rep(1/1,1),-1,0,0))
> nullcontrast2 <-  rbind(c(rep(0,3)), c(rep(0,3)), c(rep(0,3)), c(rep(0,3)))
> FB_Int.FxHP <- rbind(FxHPcontrasts,
+                      nullcontrast2,
+                      nullcontrast2, 
+                      nullcontrast2) 
> rownames(FB_Int.FxHP) <- levels(VC_metadata$group)
> colnames(FB_Int.FxHP) <- paste("C",1:3,sep="_")
> crossprod(FB_Int.FxHP)
         C_1 C_2 C_3
C_1 1.333333 0.0   0
C_2 0.000000 1.5   0
C_3 0.000000 0.0   2
> colSums(FB_Int.FxHP)
          C_1           C_2           C_3 
-5.551115e-17  0.000000e+00  0.000000e+00 
> for (ct in 1:3){
+   print(sum(FB_Int.FxHP[which(FB_Int.FxHP[,ct] > 0),ct]))
+   print(sum(FB_Int.FxHP[which(FB_Int.FxHP[,ct] < 0),ct]))
+ }  
[1] 1
[1] -1
[1] 1
[1] -1
[1] 1
[1] -1

I was able to effectively use these contrasts with the glmQLFTest command:

> fit.genes.filt.all <- glmQLFit(VC.genes.filt.all,design.all, robust=TRUE)
> modelQLF <- list()
> modelQLF$FB_Main.F <- glmQLFTest(fit.genes.filt.all, contrast = FB_Main.F)
> modelQLF$FB_Main.HP <- glmQLFTest(fit.genes.filt.all, contrast = FB_Main.HP)
> modelQLF$FB_Int.FxHP <- glmQLFTest(fit.genes.filt.all, contrast = FB_Int.FxHP)

> head(topTags(modelQLF$FB_Main.HP, 10))
 # (chr, start, end, strand, length excluded for space)
Coefficient:  LR test on 2 degrees of freedom 
                       GeneID       logFC.C_1        logFC.C_2      logCPM                    F              PValue               FDR
VCARD_00002726-RA    -5.8067313    -2.165252532    1.80375345   65.45306    1.049847e-15    1.108743e-11
VCARD_00007039-RA    -4.8165992     0.009733779   -0.10870504   49.90271    2.971616e-13    1.569162e-09
VCARD_00003307-RA    -5.2875728    1.523010670    -0.02549543   47.68861    1.167581e-12    3.186382e-09
VCARD_00009318-RA    -3.2893327    -0.835207267    2.58535467   45.27999    1.206849e-12    3.186382e-09
VCARD_00012005-RA      0.6319163   -1.897763282    7.53346103   41.71435     5.101614e-12   1.077563e-08
VCARD_00012006-RA      0.5993499   -1.898216585    8.13562869   36.39604     5.038427e-11   8.868471e-08

Can I use the same contrast matrices within the diffSpliceDGE command? I know that I can designate coefficients withing diffSpliceDGE, but when I use the same structure as above, I get the following error for both the chemistry and interaction comparison: e.g.

> fit.exons.filt.all <- glmQLFit(VC.exons.filt.all,design.all, robust=TRUE)
...
> modeldiffSplice$FB_Main.HP <- diffSpliceDGE(fit.exons.filt.all, contrast = FB_Main.HP, geneid="GeneID", exonid="ExonID")
Total number of exons:  64211 
Total number of genes:  10269 
Number of genes with 1 exon:  1713 
Mean number of exons in a gene:  6 
Max number of exons in a gene:  134 
Error in diffSpliceDGE(fit.exons.filt.all, contrast = FB_Main.HP, geneid = "GeneID",  : 
  dims [product 62498] do not match the length of object [124996]

I appreciate any suggestions on how I can best move forward with my analysis. Thank you, Rachel

diffSpliceDGE Contrast matrix edger • 1.7k views

ADD COMMENT • link updated 4.7 years ago by Gordon Smyth 52k • written 4.7 years ago by rachel.steward • 0

Gordon Smyth · Answer 1 · 2020-10-09

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 10 hours ago

WEHI, Melbourne, Australia

glmQLFTest will accept a matrix of contrasts, but diffSpliceDGE will not. For diffSpliceDGE, the contrast must be a vector or a single column.

The help page for diffSpliceDGE says:

contrast: numeric vector specifying the contrast of the linear model coefficients to be tested for differential exon usage. Length must equal to the number of columns of design.

The above is different from the help page for glmLRT and glmQLFTest which says:

contrast: numeric vector or matrix specifying one or more contrasts of the linear model coefficients to be tested equal to zero. Number of rows must equal to the number of columns of design.

ADD COMMENT • link 4.7 years ago Gordon Smyth 52k

0

Entering edit mode

For clarification, does this mean that diffSpliceDGE can only handle pairwise contrasts coded as a single vector, and that I should reorganize my design so that I can instead specify coefficients?

ADD REPLY • link 4.7 years ago rachel.steward • 0

0

Entering edit mode

Sorry, I don't understand what is concerning you. diffSpliceDGE can accept any contrast at all, there is no restriction. There is no reason why you would need to reorganize the design matrix. diffSpliceDGE explores all differential exon usage possibilities for a single comparison, and it is exactly the same whether you specify the comparison via coefficients or contrasts. diffSpliceDGE allows you to specify exactly one coefficient or exactly one contrast.

I don't understand why you are trying to give more than one contrast the diffSpliceDGE. To me it wouldn't make any sense to do so.

ADD REPLY • link 4.7 years ago Gordon Smyth 52k

0

Entering edit mode

I would like to know whether I can use diffspliceDGE to identify genes that are deferentially spliced among three or more levels of my variable of interest, for example, levels A, B and C. I think what you are saying is yes, but only if I test AvB, BvC, and CvA. Is that correct? Can you suggest any resources in addition to the manual and the help page that describe how diffspliceDGE functions as a t-test? The manual (pages 27, 86-88) and the help page do not mention t-tests, and instead only refer to the F-tests.

ADD REPLY • link 4.7 years ago rachel.steward • 0

0

Entering edit mode

Yes, if you have more than two groups to compare and you want to explore alternative splicing between any of them, then you would need to run diffSpliceDGE more than once with different contrasts. The contrasts don't necessarily have to be pairwise comparisons however.

I agree that my remark about t-tests was misleading, so I've edited my previous comment to remove it. What I meant was that diffSpliceDGE conducts tests with have a single numerator degree of freedom, like a t-test. If you run diffSpliceDGE on a glmQL fit object, then diffSpliceDGE will conduct exon-level F-tests where the numerator df is 1, and these F-statistics can be interpreted as squared t-statistics. The analogous function diffSplice in the limma package really does conduct t-tests.

I have no plans to allow you to input multiple contrasts to diffSpliceDGE because I think the results would be too difficult to interpret. Looking for any possible different splicing event between any pair of groups would be too confusing. When you got a signficant result, it could be difficult to track down what the specific alternative splicing pattern was.

ADD REPLY • link 4.7 years ago Gordon Smyth 52k

0

Entering edit mode

Thanks so much for the clarification.

The weird thing is, I have been able to use diffspliceDGE with multiple coefficients. For example, I tried to compare splicing differences among families using the following design matrix.

> head(design.exons.filt.SandF<- model.matrix(~ 1 + family)) 
  (Intercept) family29 family30 family37 family53 family60 family63
1           1        0        0        0        0        0        0
2           1        0        0        0        0        0        0
3           1        0        0        0        0        0        0
4           1        0        0        0        0        0        0 
5           1        1        0        0        0        0        0
> exons.filt.SandF <- estimateDisp(y.exons.filt, design.exons.filt.SandF, robust=TRUE)
> fit.exons.filt.SandF <- glmQLFit(y.exons.filt.SandF,design.exons.filt.SandF, robust=TRUE)
> sp.exons.filt.SandF_family <- diffSpliceDGE(fit.exons.filt.SandF, coef = 2:7, geneid="GeneID", exonid="ExonID")

From your explanation above, diffSpliceDGE should not be able to handle the full comparison. The help page suggests it should default to the last column. But, the resulting diffSpliceDGE object has six columns of coefficients, with only a single vector of p-values at both the exon and gene level.

> tail(matrix(sp.exons.filt.SandF_family$coefficients, ncol = 6, byrow = F))
                 [,1]         [,2]        [,3]       [,4]        [,5]         [,6]
[67111,] -0.003655635  0.012434268  0.02913014 0.02207139  0.00310111 0.0116667890
[67112,]  0.067607229  0.005769708  0.03048908 0.02170920  0.06296046 0.0170454409
[67113,] -0.001359356  0.021987396  0.05159830 0.01190305  0.01064321 0.0005981719
[67114,]  0.067607229  0.005769708  0.03048908 0.02170920  0.06296046 0.0170454409
[67115,] -0.044111599 -0.105977316 -0.13420187 0.10551209 -0.10046394 0.0696666262

> tail(sp.exons.filt.SandF_family$exon.p.value)
      0.6892649       0.9999710       0.9971867       0.9994914       0.9971867       0.6892649 

> topSpliceDGE(sp.exons.filt.SandF_family, test="simes") 
                   GeneID            Chr Strand NExons       P.Value           FDR
gene11495 NW_019863186.1      -     16 7.112354e-124 6.315770e-120
gene5387 NW_019862849.1      +     48  3.128567e-88  1.389084e-84
gene15105 NW_019864020.1      -     13  8.264995e-86  2.446438e-82
 gene7529 NW_019862934.1      -     32  8.456534e-85  1.836734e-81

The p-values are not the same as if diffSpliceDGE had just reverted to test the pairwise comparison of coefficient 7 with the intercept:

> sp.exons.filt.SandF_family_7only <- diffSpliceDGE(fit.exons.filt.SandF, coef = 7, geneid="GeneID", exonid="ExonID")

> tail(sp.exons.filt.SandF_family_7only$coefficients)
[1] 0.0696666262 0.0116667890 0.0170454409 0.0005981719 0.0170454409 0.0696666262

> tail(sp.exons.filt.SandF_family_7only$exon.p.value)
[1] 0.6863963 0.9330268 0.9034488 0.9966823 0.9034488 0.6863963

> topSpliceDGE(sp.exons.filt.SandF_family_7only, test="simes")                  
GeneID            Chr Strand NExons      P.Value          FDR
gene10703 NW_019863119.1      +     12 2.502626e-52 2.222332e-48
gene9658 NW_019863047.1      +     69 1.095286e-40 4.863072e-37
gene13844 NW_019863529.1      -     10 7.211268e-37 2.134535e-33
gene8767 NW_019862996.1      -      9 1.753817e-27 3.893474e-24

ADD REPLY • link 4.7 years ago rachel.steward • 0

0

Entering edit mode

I also tested whether releveling families had any effect on the output, and found that while the coefficients change relative to the base-level family (here family 37), the exon-level p-values do not change, nor do the genes identified as differentially spliced by topSpliceDGE.

> family_rl <- relevel(family, ref = "37")
> design.exons.filt.SandF_rl<- model.matrix(~ 1 + family_rl + seasons)
 > y.exons.filt.SandF_rl <- estimateDisp(y.exons.filt, design.exons.filt.SandF_rl, robust=TRUE)
> fit.exons.filt.SandF_rl <- glmQLFit(y.exons.filt.SandF_rl,design.exons.filt.SandF_rl, robust=TRUE)
> sp.exons.filt.SandF_family_rl <- diffSpliceDGE(fit.exons.filt.SandF_rl, coef = 2:7, geneid="GeneID", exonid="ExonID")

> tail(matrix(sp.exons.filt.SandF_family_rl$coefficients, ncol = 6, byrow = F))
                [,1]        [,2]        [,3]         [,4]        [,5]        [,6]
[67111,] -0.02913014 -0.03278578 -0.01669587 -0.007058750 -0.02602903 -0.01746335
[67112,] -0.03048908  0.03711815 -0.02471937 -0.008779877  0.03247138 -0.01344364
[67113,] -0.05159830 -0.05295766 -0.02961091 -0.039695251 -0.04095509 -0.05100013
[67114,] -0.03048908  0.03711815 -0.02471937 -0.008779877  0.03247138 -0.01344364
[67115,]  0.13420187  0.09009027  0.02822456  0.239713962  0.03373793  0.20386850

> tail(matrix(sp.exons.filt.SandF_family_rl$exon.p.value)
      0.6892649       0.9999710       0.9971867       0.9994914       0.9971867       0.6892649 

> topSpliceDGE(sp.exons.filt.SandF_family_rl, test="simes")
      GeneID            Chr Strand NExons       P.Value           FDR effect
1  gene11495 NW_019863186.1      -     16 7.112212e-124 6.315644e-120 family
2   gene5387 NW_019862849.1      +     48  3.128567e-88  1.389084e-84 family
3  gene15105 NW_019864020.1      -     13  8.264995e-86  2.446438e-82 family
4   gene7529 NW_019862934.1      -     32  8.456534e-85  1.836734e-81 family

I am sorry to flood you with so much information, and I trust you when you say that diffSpliceDGE has not been designed to accommodate comparing more than two levels, but I am still confused about how I should interpret this output.

ADD REPLY • link updated 4.7 years ago by Gordon Smyth 52k • written 4.7 years ago by rachel.steward • 0

0

Entering edit mode

diffSpliceDGE() shouldn't be accepting more than one coefficient. We will add a warning to the function to prevent this misunderstanding in the future.

I have checked with Yunshun Chen, who is the author of the function. He says you should ignore the results you have now and rerun with coef set to a single value.

ADD REPLY • link 4.7 years ago Gordon Smyth 52k