Question

Making a Full Rank Model for Allele-Specific Expression with DESeq2

0

Entering edit mode

chp265 • 0

@7be55a3e

Last seen 4 weeks ago

United States

I'm trying to do a parental allele-specific expression analysis using DESeq2 based on this approach from Michael Love and this post elsewhere on this forum. I have the following samples:

   cross allele time mouse
1    BxC    mat    0     1
2    BxC    mat    0     2
3    BxC    mat    0     3
4    CxB    mat    0     4
5    CxB    mat    0     5
6    CxB    mat    0     6
7    BxC    mat    1     7
8    BxC    mat    1     8
9    BxC    mat    1     9
10   BxC    mat    2    10
11   BxC    mat    2    11
12   BxC    mat    2    12
13   BxC    mat    5    13
14   BxC    mat    5    14
15   BxC    mat    5    15
16   BxC    mat    7    16
17   BxC    mat    7    17
18   BxC    mat    7    18
19   BxC    pat    0     1
20   BxC    pat    0     2
21   BxC    pat    0     3
22   CxB    pat    0     4
23   CxB    pat    0     5
24   CxB    pat    0     6
25   BxC    pat    1     7
26   BxC    pat    1     8
27   BxC    pat    1     9
28   BxC    pat    2    10
29   BxC    pat    2    11
30   BxC    pat    2    12
31   BxC    pat    5    13
32   BxC    pat    5    14
33   BxC    pat    5    15
34   BxC    pat    7    16
35   BxC    pat    7    17
36   BxC    pat    7    18

Explanation of variables

cross: parental mouse strains for the sample; BxC means maternal C57BL/6J and paternal CAST/EiJ, CxB means the opposite.
allele: maternal or paternal allelic counts for the associated sample
time: mouse age normalized to my earliest time point, E8.5
mouse: the individual mouse sequenced

My ideal design is ~0+cross+time+mouse+allele+allele:time, but unsurprisingly this doesn't produce a full rank model matrix as each individual mouse is part of an isolated group of 3. The DESeq2 vignette suggests that the solution is to create a nested "individual-in-group" variable, which would look like this:

   cross allele time mouse mouseInGroup
1    BxC    mat    0     1            1
2    BxC    mat    0     2            2
3    BxC    mat    0     3            3
4    CxB    mat    0     4            1
5    CxB    mat    0     5            2
6    CxB    mat    0     6            3
7    BxC    mat    1     7            1
8    BxC    mat    1     8            2
9    BxC    mat    1     9            3
10   BxC    mat    2    10            1
11   BxC    mat    2    11            2
12   BxC    mat    2    12            3
13   BxC    mat    5    13            1
14   BxC    mat    5    14            2
15   BxC    mat    5    15            3
16   BxC    mat    7    16            1
17   BxC    mat    7    17            2
18   BxC    mat    7    18            3
19   BxC    pat    0     1            1
20   BxC    pat    0     2            2
21   BxC    pat    0     3            3
22   CxB    pat    0     4            1
23   CxB    pat    0     5            2
24   CxB    pat    0     6            3
25   BxC    pat    1     7            1
26   BxC    pat    1     8            2
27   BxC    pat    1     9            3
28   BxC    pat    2    10            1
29   BxC    pat    2    11            2
30   BxC    pat    2    12            3
31   BxC    pat    5    13            1
32   BxC    pat    5    14            2
33   BxC    pat    5    15            3
34   BxC    pat    7    16            1
35   BxC    pat    7    17            2
36   BxC    pat    7    18            3

Since each group is defined by both cross and time, my understanding is my design would need to be ~0+cross+time+cross:time:mouseInGroup+allele+allele:time. This still does not produce a full rank model matrix, though. The DESeq2 vignette suggests this may also be a result of some columns being all 0s:

all.zero = colSums(design) == 0
all.zero
                   crossBxC                    crossCxB                        time                   allelepat 
                      FALSE                       FALSE                       FALSE                       FALSE 
             time:allelepat crossBxC:time:mouseInGroup1 crossCxB:time:mouseInGroup1 crossBxC:time:mouseInGroup2 
                      FALSE                       FALSE                        TRUE                       FALSE 
crossCxB:time:mouseInGroup2 crossBxC:time:mouseInGroup3 crossCxB:time:mouseInGroup3 
                       TRUE                       FALSE                        TRUE

Removing those columns, however...

idx = which(all.zero)
testDesign = design[,-idx]
colSums(testDesign) == 0
                   crossBxC                    crossCxB                        time                   allelepat 
                      FALSE                       FALSE                       FALSE                       FALSE 
             time:allelepat crossBxC:time:mouseInGroup1 crossBxC:time:mouseInGroup2 crossBxC:time:mouseInGroup3 
                      FALSE                       FALSE                       FALSE                       FALSE

... still doesn't produce a model matrix that satisfies DESeq2. Any idea what the problem is and how to solve it? My suspicion is the problem is related to the fact I only have CxB samples at time 0, but they're also the only way for me to determine if a gene's ASE is due to the parent-of-origin's strain instead of actual parental allele regulation. Any help is appreciated!

Edit: for the sake of ease, some code to build the sample table:

cross = rep(c("BxC", "CxB", rep("BxC", 4)), each = 3, times = 2)
allele = rep(c("mat", "pat"), each = 18)
time = rep(c(0, 0, 1, 2, 5, 7), each = 3, times = 2)
mouse = factor(rep(1:18, times = 2))
sampleInfo = data.frame(cross, allele, time, mouse)

DESeq2 • 952 views

ADD COMMENT • link updated 5 weeks ago by Gordon Smyth 53k • written 6 weeks ago by chp265 • 0

score 0 · Answer 1 · 2026-01-21

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 5 days ago

United States

You may want to look at the coefficients you are trying to estimate with ExploreModelMatrix:

https://www.bioconductor.org/packages/release/bioc/vignettes/ExploreModelMatrix/inst/doc/ExploreModelMatrix.html

The ones with cross and time involve seeing how the cross behaves at different time points, but this isn't estimable.

You can have a cross term and a time term, maybe the easiest is to combine cross and time into one variable and use that to find differences in allelic across groups defined by the combination of those two.

But do take a look at EMM -- it was designed to help folks visualize the terms they are estimating.

ADD COMMENT • link 5 weeks ago Michael Love 43k

0

Entering edit mode

Ah, this is handy! Hopefully this will help with troubleshooting; at the very least, I can see it being helpful when I try to explain to people what the different coefficients mean.

If I'm understanding you correctly, you're suggesting creating an additional variable like so:

crossTime = paste(cross, time, sep="-")

Assuming so, here are the results of a few test design formulae:

~0+cross+time+crossTime:mouseInGroup+allele+allele:time: not full rank
~0+crossTime+crossTime:mouseInGroup+allele+allele:time: not full rank
~0+crossTime+crossTime:mouseInGroup+allele+allele:crossTime: full rank, but I'm not sure how I'd extricate "here is how ASE changes over time" from allele:crossTime without like, isolating the BxC-containing coefficients and doing a linear regression on those.
~0+cross+time+crossTime:mouseInGroup+allele+allele:crossTime: not full rank, not really sure what I was hoping for with this one.

It seems like the issue is that, if I use cross and time separately in addition to crossTime, they end up describing the same groups. So, still not sure how to build a working design formula for this experiment. Maybe it'll help if I try to start from an ideal model and derive a design formula from there? It may also be time to change tools; I wanted to use DESeq2 because of its normalization strategy, but it might work better to just calculate FPKM -> log2(allelic ratio) and then use something to model & evaluate that...

ADD REPLY • link 5 weeks ago chp265 • 0

0

Entering edit mode

Hm, okay, in building a fake model by hand I think I'm realizing the problem: mouse and its variants are ideally acting as basically a random effect variable to try and establish an imaginary "baseline expression" for each mouse. I think I have to effectively average that out and pick a variable to incorporate that average into, then figure out relevant coefficients based on that? Maybe? Confirmation would be appreciated; the joys of independently learning statistical modeling is I never feel sure I know what I'm doing. x_x

ADD REPLY • link 5 weeks ago chp265 • 0

0

Entering edit mode

You just need the mouse identifier (controls baseline) and the group specific allelic effect.

ADD REPLY • link 5 weeks ago Michael Love 43k

0

Entering edit mode

You may want to speak with a local statistician, because you don't just want to find the first full rank matrix and run with that if it doesn't represent what you are after.

Alternatively you can examine the model matrix yourself and spot what is happening.

ADD REPLY • link 5 weeks ago Michael Love 43k

score 0 · Answer 2 · 2026-01-21

0

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 4 hours ago

WEHI, Melbourne, Australia

Aaron Lun was a PhD student in my Lab back in 2015, when he posted the answers about allele-specific expression that you and Mike Love have linked to. The idea of the approach is to use sample effects in the linear model in order to estimate the allele ratios. The idea is an extension of what has sometimes been called the "Poisson trick" in generalized linear model theory, which allows one to turn a Poisson log-linear model into a binomial or multinomial logit model. We have used the idea for a number of purposes, for example Aaron adapted the idea for normalization purposes in the csaw and diffHic packages.

It is helpful to draw a connection with methylation data, where the focus on ratios is very explicit. Yunshun Chen and I developed the Poisson-trick idea further for analysing reduced represention methylation data (RRBS) in 2016, and we published the method in F1000Research in November 2017 (Chen et al 2017). The 2017 paper includes some exposition of the method, including some toy illustrative examples and a theoretical discussion of why the method might out-perform existing binomial approaches. We also collaborated with the DMRcate developers, who incorporated the approach into their core pipeline for whole genome BS-seq (Peters et al 2021). You can learn more about how the allele-specific method works from the F1000Research paper, even though the paper is on methylation rather than RNA-seq, because the statistical method is the same in either case. In the methylation pipeline, you can provide a design matrix for the original biological samples, and edgeR can expand it to make the full design matrix with methylation effects.

Coming back to your data, the key to creating a clean simple design matrix is to form the baseline effects and the allele-specific effects separately.

First, make the mouse effects (total expression). Note that mouse must be a factor rather than a numeric vector:

design.mouse <- model.matrix(~0+mouse)

Then make a design matrix that describes your groups (cross by time combinations). I prefer the group-means parametrization without an intercept, but you can parametrize the groups anyway you want.

group <- factor(paste(cross,time,sep="."))
design.group <- model.matrix(~0+group)
colnames(design.group) <- levels(group)

Then make groupwise allele-specific effects (paternal vs maternal):

design.allele <- design.group * (allele == "pat")
colnames(design.allele) <- paste(colnames(design.group),"patvsmat",sep=".")

Now put them all together:

design <- cbind(design.allele, design.mouse)

The matrix is full rank and the coefficients are all interpretable.

In this formulation, the first six coefficients represent the log2-ratios of paternal to maternal expression at each cross.time combination. You can test each of these coefficients directly to test for allele-specificity. For example, the coefficient BxC.0.patvsmat tests for paternal vs maternal specificity in the BxC cross at time 0. A significant positive coefficient represents paternal bias while a negative coefficient would be maternal bias. The coefficient BxC.7.patvsmat tests for the same thing at time 7.

Forming the contrast BxC.7.patvsmat - BxC.0.patvsmat tests for changes in specificity between time 0 and time 7. Forming the contrast BxC.0.patvsmat - CxB.0.patvsmat tests for differences in specificity between the two crosses.

My group has papers currently in preparation where we have used this sort of approach to examine allele-specificity in RNA-seq, ATAC-seq, and Hi-C data.

References

Chen Y, Pal B, Visvader JE, Smyth GK (2017). Differential methylation analysis of reduced representation bisulfite sequencing experiments using edgeR. F1000Research 6, 2055. https://bioinf.wehi.edu.au/edgeR/F1000Research2017

Peters TJ, Buckley MJ, Chen Y, Smyth GK, Goodnow CC, Clark SJ (2021). Calling differentially methylated regions from whole genome bisulphite sequencing with DMRcate. Nucleic Acids Research 49(19), e109. https://doi.org/10.1093/nar/gkab637

Note: the above code was corrected 11:50am 23 Jan 2026 Melbourne time.

ADD COMMENT • link 5 weeks ago Gordon Smyth 53k

0

Entering edit mode

Confirming my understanding after executing the code provided: this is similar to a design formula of ~0+mouse+group+group:allele, except we also have an explicit additional column for group == BxC.0, right? Comparing the output I got to a model matrix with that design formula:

> colnames(design)
 [1] "BxC.0.patvsmat" "BxC.1.patvsmat" "BxC.2.patvsmat" "BxC.5.patvsmat" "BxC.7.patvsmat" "CxB.0.patvsmat" "BxC.0"         
 [8] "BxC.1"          "BxC.2"          "BxC.5"          "BxC.7"          "CxB.0"          "mouse1"         "mouse2"        
[15] "mouse3"         "mouse4"         "mouse5"         "mouse6"         "mouse7"         "mouse8"         "mouse9"        
[22] "mouse10"        "mouse11"        "mouse12"        "mouse13"        "mouse14"        "mouse15"        "mouse16"       
[29] "mouse17"        "mouse18"

> colnames(model.matrix(~0+mouse+group+group:allele))
 [1] "mouse1"               "mouse2"               "mouse3"               "mouse4"               "mouse5"              
 [6] "mouse6"               "mouse7"               "mouse8"               "mouse9"               "mouse10"             
[11] "mouse11"              "mouse12"              "mouse13"              "mouse14"              "mouse15"             
[16] "mouse16"              "mouse17"              "mouse18"              "groupBxC.1"           "groupBxC.2"          
[21] "groupBxC.5"           "groupBxC.7"           "groupCxB.0"           "groupBxC.0:allelepat" "groupBxC.1:allelepat"
[26] "groupBxC.2:allelepat" "groupBxC.5:allelepat" "groupBxC.7:allelepat" "groupCxB.0:allelepat"

Firstly, I want to note: this doesn't seem to produce a full rank model matrix, either as a design formula or as an assemblage of sub-matrices. Secondly, if I'm not mistaken, even if this design were to work, it would force many of the comparisons I want to make to be individual contrasts, correct? For example, if I wanted to see which genes had significant changes to ASE over time---not just between one time point and another, but over time in general---I'd need to contrast each of the BxC.t.patvsmat groups with BxC.0.patvsmat, store their coefficients, then... do a t-test or some such to evaluate if they're consistently significantly different from 0? It feels like I'm losing a lot of information by turning the quantitative time variable into simply a component of the categorical group variable, and it's unclear to me if this is actually answering the questions I have as a model.

ADD REPLY • link 5 weeks ago chp265 • 0

0

Entering edit mode

My apologies, I forgot that design.group is not included in the final matrix, because the baseline group effects are already captured by mouse effects. I have corrected my answer now to give the correct full-rank design matrix.

Yes, this design matrix does force you to form contrasts, which I view as an advantage, but it does not restrict you to testing individual contrasts. I don't know DESeq2, but in the edgeR syntax you just pass a number of contrasts at once to edge::glmLRT and it will perform the appropriate anova-like test on all the degrees of freedom. To test for any changes in ASE over time in the BxC cross, you have to pass a vector of four contrasts, e.g. comparing each time to 0.

The design in my answer does not lose any information nor does it restrict in any way the tests you can conduct. It is just an explicit parametrization, not a restriction or a reduction.

If you did want to test for any change in ASE over time, without forming contrasts, then you could define design.group without an intercept:

design.group <- model.matrix(~group)

and proceed to form design.allele and design in the same way as above. In the edgeR pipeline, you would then use

lrt <- glmLRT(fit, coef=2:5)

to do an anova-like test for any changes in ASE over the five time-points.

The design matrix approach that I outlined in my answer above mimics what is done by the edgeR::modelMatrixMeth function for methylation data. The beauty of this approach is that you can define design.group in any valid way that describes your experimental factors, and the final full design matrix will still be correct. You can code design.group to be a factorial design if you want, or even to be one with a spline trend over time. It does not need to be a oneway layout.

ADD REPLY • link 5 weeks ago Gordon Smyth 53k

0

Entering edit mode

I don't think DESeq2 has the ability to do multi-contrast comparisons, at least based on what I can understand of the manual; I can pick multiple factors to be part of the numerator or denominator of a contrast, but I can't establish multiple contrasts. I also worry about the fact that my time variable isn't linearly spaced. That said, it seems like I can adapt this method as follows:

> design.mouse = model.matrix(~0+mouse)
> design.cross = model.matrix(~0+cross)
> design.time = model.matrix(~0+time)
> design.cross.allele = design.cross * (allele == "pat")
> colnames(design.cross.allele) = paste(colnames(design.cross), "patvsmat", sep=".")
> design.time.allele = design.time * (allele == "pat")
> colnames(design.time.allele) = paste(colnames(design.time), "patvsmat", sep=".")
> design = cbind(design.mouse, design.cross.allele, design.time.allele)

If I'm not mistaken, this is equivalent to a design formula of ~0+mouse+cross:allele+time:allele after dropping all the allelemat columns, which ExploreModelMatrix indicates is a full rank model matrix. This means that time effects are appropriately quantitatively scaled, and I can directly interrogate strain effects separately from time effects. If I want to do a comparison between specific groups, I calculate a contrast by averaging the parameters for the three mice of each group (e.g. colMeans(cross == "BxC" & time == 0)) and subtracting one group from the other.

I know this complicates my analysis beyond "just contrast group:allele A with group:allele B," but it seems like this is the only way to make use of time as a quantitative variable? Plus, considering my scientific questions are also specifically about cross:allele and time:allele, I feel like sacrificing some ease of analysis is worth making the output easier to interpret. That said, if something about this design is about to introduce a glaring flaw, please say so.

ADD REPLY • link 5 weeks ago chp265 • 0

0

Entering edit mode

Yes, so for one thing, note that Gordon has provided code for doing what you are asking in edgeR, which you can go ahead and use.

You asked about DESeq2: we don't allow user to specify all their contrasts in one go, you have to run results() separately to perform different linear contrasts (see ?results for full details on this).

ADD REPLY • link 5 weeks ago Michael Love 43k

0

Entering edit mode

Cool, glad to hear I wasn't missing some functionality in DESeq2. I don't know that much about the differences between DESeq2 vs edgeR; my understanding is the practical outcome of using one vs the other is a matter of nuance that's lost on most folks, myself included. Since I've got DESeq2 already set up in my code I'm favoring using that, but if I absolutely have to go the group:allele route I'll probably switch to edgeR so I can do the multi-contrast stuff more readily.

ADD REPLY • link 5 weeks ago chp265 • 0

0

Entering edit mode

The design matrix that I described above is completely general. I imagine that it would be possible to use the same approach to generate full and reduced matrices for DESeq2, avoiding the need for multiple contrasts, but I don't know DESeq2 well enough to say.

To treat time as quantitative, you could have simply replaced design.group in my code by

> design.group <- model.matrix(~cross+time)

although I think your code may be equivalent. However, I would not be doing this if it was me. I would want the data to show me how large are the changes in specificity at each step in age, rather that building in a priori the assumption that the allele proportions must be perfectly logit-linear in mouse age. If you were willing to assume linearity, then you woudn't have needed such an expensive RNA-seq experiment with so many time points in the first place.

I don't find interaction model formulas easier to interpret, but it's up to you.

You mention your scientific questions, but I don't think that you have explained what they are. I did a quick search through your question and comments above and couldn't find anything.

Anyway, I have answered your question about how to define a full-rank matrix. Whatever you do and whichever package you use, you can compare back to what I have shown you to check that the model you are using is valid. Good luck.

ADD REPLY • link 5 weeks ago Gordon Smyth 53k

0

Entering edit mode

Good point, I hadn't considered the fact quantitative time implies a (logit-)linear response to time. You're correct that I haven't explained my scientific questions; for the sake of posterity, they're basically "which genes have ASE," "how does that ASE change over time," and "is that ASE caused by the strain or by the parent of origin." I'll do some thinking/discussing with my PI about how I should construct my model matrix to answer those questions given all the discussion so far; at the very least, I think these conversations have explained 1) how to properly build a full-rank model matrix that appropriately controls for the necessary factors and 2) how to build the appropriate contrasts to interrogate any version of those questions. Thanks for your help!

ADD REPLY • link 5 weeks ago chp265 • 0