Making a Full Rank Model for Allele-Specific Expression with DESeq2
2
0
Entering edit mode
chp265 • 0
@7be55a3e
Last seen 12 hours ago
United States

I'm trying to do a parental allele-specific expression analysis using DESeq2 based on this approach from Michael Love and this post elsewhere on this forum. I have the following samples:

   cross allele time mouse
1    BxC    mat    0     1
2    BxC    mat    0     2
3    BxC    mat    0     3
4    CxB    mat    0     4
5    CxB    mat    0     5
6    CxB    mat    0     6
7    BxC    mat    1     7
8    BxC    mat    1     8
9    BxC    mat    1     9
10   BxC    mat    2    10
11   BxC    mat    2    11
12   BxC    mat    2    12
13   BxC    mat    5    13
14   BxC    mat    5    14
15   BxC    mat    5    15
16   BxC    mat    7    16
17   BxC    mat    7    17
18   BxC    mat    7    18
19   BxC    pat    0     1
20   BxC    pat    0     2
21   BxC    pat    0     3
22   CxB    pat    0     4
23   CxB    pat    0     5
24   CxB    pat    0     6
25   BxC    pat    1     7
26   BxC    pat    1     8
27   BxC    pat    1     9
28   BxC    pat    2    10
29   BxC    pat    2    11
30   BxC    pat    2    12
31   BxC    pat    5    13
32   BxC    pat    5    14
33   BxC    pat    5    15
34   BxC    pat    7    16
35   BxC    pat    7    17
36   BxC    pat    7    18

Explanation of variables

  • cross: parental mouse strains for the sample; BxC means maternal C57BL/6J and paternal CAST/EiJ, CxB means the opposite.
  • allele: maternal or paternal allelic counts for the associated sample
  • time: mouse age normalized to my earliest time point, E8.5
  • mouse: the individual mouse sequenced

My ideal design is ~0+cross+time+mouse+allele+allele:time, but unsurprisingly this doesn't produce a full rank model matrix as each individual mouse is part of an isolated group of 3. The DESeq2 vignette suggests that the solution is to create a nested "individual-in-group" variable, which would look like this:

   cross allele time mouse mouseInGroup
1    BxC    mat    0     1            1
2    BxC    mat    0     2            2
3    BxC    mat    0     3            3
4    CxB    mat    0     4            1
5    CxB    mat    0     5            2
6    CxB    mat    0     6            3
7    BxC    mat    1     7            1
8    BxC    mat    1     8            2
9    BxC    mat    1     9            3
10   BxC    mat    2    10            1
11   BxC    mat    2    11            2
12   BxC    mat    2    12            3
13   BxC    mat    5    13            1
14   BxC    mat    5    14            2
15   BxC    mat    5    15            3
16   BxC    mat    7    16            1
17   BxC    mat    7    17            2
18   BxC    mat    7    18            3
19   BxC    pat    0     1            1
20   BxC    pat    0     2            2
21   BxC    pat    0     3            3
22   CxB    pat    0     4            1
23   CxB    pat    0     5            2
24   CxB    pat    0     6            3
25   BxC    pat    1     7            1
26   BxC    pat    1     8            2
27   BxC    pat    1     9            3
28   BxC    pat    2    10            1
29   BxC    pat    2    11            2
30   BxC    pat    2    12            3
31   BxC    pat    5    13            1
32   BxC    pat    5    14            2
33   BxC    pat    5    15            3
34   BxC    pat    7    16            1
35   BxC    pat    7    17            2
36   BxC    pat    7    18            3

Since each group is defined by both cross and time, my understanding is my design would need to be ~0+cross+time+cross:time:mouseInGroup+allele+allele:time. This still does not produce a full rank model matrix, though. The DESeq2 vignette suggests this may also be a result of some columns being all 0s:

all.zero = colSums(design) == 0
all.zero
                   crossBxC                    crossCxB                        time                   allelepat 
                      FALSE                       FALSE                       FALSE                       FALSE 
             time:allelepat crossBxC:time:mouseInGroup1 crossCxB:time:mouseInGroup1 crossBxC:time:mouseInGroup2 
                      FALSE                       FALSE                        TRUE                       FALSE 
crossCxB:time:mouseInGroup2 crossBxC:time:mouseInGroup3 crossCxB:time:mouseInGroup3 
                       TRUE                       FALSE                        TRUE

Removing those columns, however...

idx = which(all.zero)
testDesign = design[,-idx]
colSums(testDesign) == 0
                   crossBxC                    crossCxB                        time                   allelepat 
                      FALSE                       FALSE                       FALSE                       FALSE 
             time:allelepat crossBxC:time:mouseInGroup1 crossBxC:time:mouseInGroup2 crossBxC:time:mouseInGroup3 
                      FALSE                       FALSE                       FALSE                       FALSE

... still doesn't produce a model matrix that satisfies DESeq2. Any idea what the problem is and how to solve it? My suspicion is the problem is related to the fact I only have CxB samples at time 0, but they're also the only way for me to determine if a gene's ASE is due to the parent-of-origin's strain instead of actual parental allele regulation. Any help is appreciated!

Edit: for the sake of ease, some code to build the sample table:

cross = rep(c("BxC", "CxB", rep("BxC", 4)), each = 3, times = 2)
allele = rep(c("mat", "pat"), each = 18)
time = rep(c(0, 0, 1, 2, 5, 7), each = 3, times = 2)
mouse = factor(rep(1:18, times = 2))
sampleInfo = data.frame(cross, allele, time, mouse)
DESeq2 • 255 views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 13 hours ago
United States

You may want to look at the coefficients you are trying to estimate with ExploreModelMatrix:

https://www.bioconductor.org/packages/release/bioc/vignettes/ExploreModelMatrix/inst/doc/ExploreModelMatrix.html

The ones with cross and time involve seeing how the cross behaves at different time points, but this isn't estimable.

You can have a cross term and a time term, maybe the easiest is to combine cross and time into one variable and use that to find differences in allelic across groups defined by the combination of those two.

But do take a look at EMM -- it was designed to help folks visualize the terms they are estimating.

ADD COMMENT
0
Entering edit mode

Ah, this is handy! Hopefully this will help with troubleshooting; at the very least, I can see it being helpful when I try to explain to people what the different coefficients mean.

If I'm understanding you correctly, you're suggesting creating an additional variable like so:

crossTime = paste(cross, time, sep="-")

Assuming so, here are the results of a few test design formulae:

  • ~0+cross+time+crossTime:mouseInGroup+allele+allele:time: not full rank
  • ~0+crossTime+crossTime:mouseInGroup+allele+allele:time: not full rank
  • ~0+crossTime+crossTime:mouseInGroup+allele+allele:crossTime: full rank, but I'm not sure how I'd extricate "here is how ASE changes over time" from allele:crossTime without like, isolating the BxC-containing coefficients and doing a linear regression on those.
  • ~0+cross+time+crossTime:mouseInGroup+allele+allele:crossTime: not full rank, not really sure what I was hoping for with this one.

It seems like the issue is that, if I use cross and time separately in addition to crossTime, they end up describing the same groups. So, still not sure how to build a working design formula for this experiment. Maybe it'll help if I try to start from an ideal model and derive a design formula from there? It may also be time to change tools; I wanted to use DESeq2 because of its normalization strategy, but it might work better to just calculate FPKM -> log2(allelic ratio) and then use something to model & evaluate that...

ADD REPLY
0
Entering edit mode

Hm, okay, in building a fake model by hand I think I'm realizing the problem: mouse and its variants are ideally acting as basically a random effect variable to try and establish an imaginary "baseline expression" for each mouse. I think I have to effectively average that out and pick a variable to incorporate that average into, then figure out relevant coefficients based on that? Maybe? Confirmation would be appreciated; the joys of independently learning statistical modeling is I never feel sure I know what I'm doing. x_x

ADD REPLY
0
Entering edit mode

You just need the mouse identifier (controls baseline) and the group specific allelic effect.

ADD REPLY
0
Entering edit mode

You may want to speak with a local statistician, because you don't just want to find the first full rank matrix and run with that if it doesn't represent what you are after.

Alternatively you can examine the model matrix yourself and spot what is happening.

ADD REPLY
0
Entering edit mode
@gordon-smyth
Last seen 5 hours ago
WEHI, Melbourne, Australia

Aaron Lun was a PhD student in my Lab back in 2015, when he posted the answers about allele-specific expression that you and Mike Love have linked to. The idea of the approach is to use sample effects in the linear model in order to estimate the allele ratios. The idea is an extension of what has sometimes been called the "Poisson trick" in generalized linear model theory, which allows one to turn a Poisson log-linear model into a binomial or multinomial logit model. We have used the idea for a number of purposes, for example Aaron adapted the idea for normalization purposes in the csaw and diffHic packages.

It is helpful to draw a connection with methylation data, where the focus on ratios is very explicit. Yunshun Chen and I developed the Poisson-trick idea further for analysing reduced represention methylation data (RRBS) in 2016, and we published the method in F1000Research in November 2017 (Chen et al 2017). The 2017 paper includes some exposition of the method, including some toy illustrative examples and a theoretical discussion of why the method might out-perform existing binomial approaches. We also collaborated with the DMRcate developers, who incorporated the approach into their core pipeline for whole genome BS-seq (Peters et al 2021). You can learn more about how the allele-specific method works from the F1000Research paper, even though the paper is on methylation rather than RNA-seq, because the statistical method is the same in either case. In the methylation pipeline, you can provide a design matrix for the original biological samples, and edgeR can expand it to make the full design matrix with methylation effects.

Coming back to your data, the key to creating a clean simple design matrix is to form the baseline effects and the allele-specific effects separately.

First, make the mouse effects (total expression). Note that mouse must be a factor rather than a numeric vector:

design.mouse <- model.matrix(~0+mouse)

Then make a design matrix that describes your groups (cross by time combinations). I prefer the group-means parametrization without an intercept, but you can parametrize the groups anyway you want.

group <- factor(paste(cross,time,sep="."))
design.group <- model.matrix(~0+group)
colnames(design.group) <- levels(group)

Then make groupwise allele-specific effects (paternal vs maternal):

design.allele <- design.group * (allele == "pat")
colnames(design.allele) <- paste(colnames(design.group),"patvsmat",sep=".")

Now put them all together:

design <- cbind(design.allele, design.mouse)

The matrix is full rank and the coefficients are all interpretable.

In this formulation, the first six coefficients represent the log2-ratios of paternal to maternal expression at each cross.time combination. You can test each of these coefficients directly to test for allele-specificity. For example, the coefficient BxC.0.patvsmat tests for paternal vs maternal specificity in the BxC cross at time 0. A significant positive coefficient represents paternal bias while a negative coefficient would be maternal bias. The coefficient BxC.7.patvsmat tests for the same thing at time 7.

Forming the contrast BxC.7.patvsmat - BxC.0.patvsmat tests for changes in specificity between time 0 and time 7. Forming the contrast BxC.0.patvsmat - CxB.0.patvsmat tests for differences in specificity between the two crosses.

My group has papers currently in preparation where we have used this sort of approach to examine allele-specificity in RNA-seq, ATAC-seq, and Hi-C data.

References

Chen Y, Pal B, Visvader JE, Smyth GK (2017). Differential methylation analysis of reduced representation bisulfite sequencing experiments using edgeR. F1000Research 6, 2055. https://bioinf.wehi.edu.au/edgeR/F1000Research2017

Peters TJ, Buckley MJ, Chen Y, Smyth GK, Goodnow CC, Clark SJ (2021). Calling differentially methylated regions from whole genome bisulphite sequencing with DMRcate. Nucleic Acids Research 49(19), e109. https://doi.org/10.1093/nar/gkab637

Note: the above code was corrected 11:50am 23 Jan 2026 Melbourne time.

ADD COMMENT
0
Entering edit mode

Confirming my understanding after executing the code provided: this is similar to a design formula of ~0+mouse+group+group:allele, except we also have an explicit additional column for group == BxC.0, right? Comparing the output I got to a model matrix with that design formula:

> colnames(design)
 [1] "BxC.0.patvsmat" "BxC.1.patvsmat" "BxC.2.patvsmat" "BxC.5.patvsmat" "BxC.7.patvsmat" "CxB.0.patvsmat" "BxC.0"         
 [8] "BxC.1"          "BxC.2"          "BxC.5"          "BxC.7"          "CxB.0"          "mouse1"         "mouse2"        
[15] "mouse3"         "mouse4"         "mouse5"         "mouse6"         "mouse7"         "mouse8"         "mouse9"        
[22] "mouse10"        "mouse11"        "mouse12"        "mouse13"        "mouse14"        "mouse15"        "mouse16"       
[29] "mouse17"        "mouse18"

> colnames(model.matrix(~0+mouse+group+group:allele))
 [1] "mouse1"               "mouse2"               "mouse3"               "mouse4"               "mouse5"              
 [6] "mouse6"               "mouse7"               "mouse8"               "mouse9"               "mouse10"             
[11] "mouse11"              "mouse12"              "mouse13"              "mouse14"              "mouse15"             
[16] "mouse16"              "mouse17"              "mouse18"              "groupBxC.1"           "groupBxC.2"          
[21] "groupBxC.5"           "groupBxC.7"           "groupCxB.0"           "groupBxC.0:allelepat" "groupBxC.1:allelepat"
[26] "groupBxC.2:allelepat" "groupBxC.5:allelepat" "groupBxC.7:allelepat" "groupCxB.0:allelepat"

Firstly, I want to note: this doesn't seem to produce a full rank model matrix, either as a design formula or as an assemblage of sub-matrices. Secondly, if I'm not mistaken, even if this design were to work, it would force many of the comparisons I want to make to be individual contrasts, correct? For example, if I wanted to see which genes had significant changes to ASE over time---not just between one time point and another, but over time in general---I'd need to contrast each of the BxC.t.patvsmat groups with BxC.0.patvsmat, store their coefficients, then... do a t-test or some such to evaluate if they're consistently significantly different from 0? It feels like I'm losing a lot of information by turning the quantitative time variable into simply a component of the categorical group variable, and it's unclear to me if this is actually answering the questions I have as a model.

ADD REPLY
0
Entering edit mode

My apologies, I forgot that design.group is not included in the final matrix, because the baseline group effects are already captured by mouse effects. I have corrected my answer now to give the correct full-rank design matrix.

Yes, this design matrix does force you to form contrasts, which I view as an advantage, but it does not restrict you to testing individual contrasts. I don't know DESeq2, but in the edgeR syntax you just pass a number of contrasts at once to edge::glmLRT and it will perform the appropriate anova-like test on all the degrees of freedom. To test for any changes in ASE over time in the BxC cross, you have to pass a vector of four contrasts, e.g. comparing each time to 0.

The design in my answer does not lose any information nor does it restrict in any way the tests you can conduct. It is just an explicit parametrization, not a restriction or a reduction.

If you did want to test for any change in ASE over time, without forming contrasts, then you could define design.group without an intercept:

design.group <- model.matrix(~group)

and proceed to form design.allele and design in the same way as above. In the edgeR pipeline, you would then use

lrt <- glmLRT(fit, coef=2:5)

to do an anova-like test for any changes in ASE over the five time-points.

The design matrix approach that I outlined in my answer above mimics what is done by the edgeR::modelMatrixMeth function for methylation data. The beauty of this approach is that you can define design.group in any valid way that describes your experimental factors, and the final full design matrix will still be correct. You can code design.group to be a factorial design if you want, or even to be one with a spline trend over time. It does not need to be a oneway layout.

ADD REPLY

Login before adding your answer.

Traffic: 1362 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6