edgeR design/contrast matrix troubles
Entering edit mode
alex.hart • 0
Last seen 4 days ago


I am trying to analyse the differential gene expression between groups in a complicated multi-factor experiment and am having trouble setting up the contrast matrix. In this experiment, samples were collected from three different geographic origins (Brazil, California and Yemen), were kept under one of four different regimes (ancestral, hot, cold, switch), and RNA was assayed at one of three temperatures (23, 29, 35). I wish for the assay temperature to be treated as a factor rather than covariate.

I initially added the group info during the readDGE stage:

counts <- readDGE(files$Files, header=FALSE, labels = files$Summary, group = files$group)

where 'files' is a spreadsheet of file names, and 'group' is a unique indicator for each possible combination of origin, regime, and assay temp (36 groups). I tried using numeric and character string names for each group, for what it's worth. I did it at this stage because the filterByExpr() function works best when you define groups.

The issue arises at a later stage, with:

design <- model.matrix(~0 + as.factor(files$Assay) + files$Regime + files$Origin)

Which I'm attempting to generate an intercept-free model for each of the groups. The design matrix generated has columns for all three assay temps but is missing a column for regime (only has hot, cold and switch), and for origin (only California and Yemen, no Brazil). I understand that this is because the information is redundant - i.e. the presence of Ancestral is implied with by an absence of hot, cold, or switch, that's how it's encoded. What I can't figure out is how to set up the resulting DE testing, because what would be nice and easy would be:

regime.HotvsAnc <- makeContrasts(Regime.Hot-Regime.Anc, levels=colnames(design))
assay.HotvsAnc.results <- glmQLFTest(fit, contrast = regime.HotvsAnc)

but this is not possible, because there's no ancestral column in the design matrix. Instead I've ended up with this absolute monstrosity, which is based on the files$groups:

regime.HotvsAnc <- makeContrasts(
  (BraAnc23 + BraAnc29 + BraAnc35 +
   CalAnc23 + CalAnc29 + CalAnc35 +
  YemAnc23 + YemAnc29 + YemAnc35)/9 -
  (BraHot23 + BraHot29 + BraHot35 +
   CalHot23 + CalHot29 + CalHot35 +
   YemHot23 + YemHot29 + YemHot35)/9, 
regime.HotvsAnc <- glmQLFTest(fit, contrast = regime.HotvsAnc)

Does anyone know how I can take a less convoluted approach? Or is it necessary to make these horrible collections for every DE comparison?

Thank you so much!


edgeR • 298 views
Entering edit mode
Last seen 1 day ago
United States

You don't say what your goals are, apart from one contrast. For that one contrast it's easiest to simply set up as a conventional factor effects model.

> design <- model.matrix(~Assay + Regime + Origin, d.f)
> colnames(design)
[1] "(Intercept)"  "Assay29"     
[3] "Assay35"      "Regimecold"  
[5] "Regimehot"    "Regimeswitch"
[7] "OriginCA"     "OriginYE"

And now the fifth coefficient (Regeimehot) tests for the difference between the hot and ancestral regime. The fourth, and sixth coefficients test for the difference between cold and ancestral and switch and ancestral, respectively. If you want hot vs cold, it's Regimehot - Regimecold. But getting at more complicated comparisons is more difficult with this paramerization. For example, if you wanted to know if there are any genes that react differently to the hot vs ancestral regime depending on the origin (e.g., do Brazilian samples react differently than Yemeni samples), then you need to either include interaction terms or you need to fit the cell means model you already tried and then just compute contrasts by hand.

Entering edit mode

Hi James,

Thanks very much for your explanation, indeed, the reason I was trying to avoid using the intercept-based model was because while we will also do simple comparisons, the interactions between the factors - assay, regime and origin - are the main point of the study.

In that case, would it be best to run the intercept model for the simple comparisons - such as ancestral vs. other regimes, Brazil vs the other origins - and then use a non-intercept model for the interactive effects? How would you set up the design model for something like origin by regime effects? Something like this?

design <- model.matrix(~0 + as.factor(files$Assay) + files$Regime + files$Origin + files$Regime*files$Origin)
Entering edit mode

You have a fairly complex experiment that will take some sophistication to analyze appropriately. You would do well to find a local statistician with whom you can collaborate.

Entering edit mode
Last seen 2 hours ago
WEHI, Melbourne, Australia

You're getting mixed up between simple additive models on one hand and factorial models using all possible interactions of factors on the other. The group-means approach without an intercept is only applicable for the latter. Your situation is actually much simpler. You don't even need contrasts at all.

As James has said, you just define:

> Assay <- factor(files$Assay)
> Regime <- factor(files$Regime)
> Origin <- factor(files$Origin)
> design <- model.matrix(~Assay + Regime + Origin)


keep <- filterByExpr(counts, design)
y < counts[keep, ,keep.lib.size=FALSE]
y <- normLibSizes(y)
y <- estimateDisp(y,design)
fit <- glmQLFit(y, design)

To compare Hot to Ancestral you just use

q <- glmQLFTest(fit, coef="RegimeHot")

All other contrasts can be extracted just as easily. There's no need to define groups at all, in fact defining groups is wrong because your simple additive linear model is not equivalent to groups.

I'm not sure where you got the idea that filterByExpr requires a group variable. What it actually needs is the correct design matrix!

Entering edit mode

Hi Gordon,

Thank you for your help. I have not set up models and matrices like this before, so even after reading the very extensive guide by Law et al. 2020, I am still struggling. My understanding of filterByExpr was that it filters features based on their absence/presence in groups that are set, hence my concern for setting the groups up during the filtering steps.

The aims of the analysis are to investigate the origin x regime, regime x assay, assay x origin effects, and maybe even the origin x regime x assay effects if we can get enough samples. In which case, I think maybe I will need a group-means intercept-free approach for factor combinations. I will read the manual and the Law et al. 2020 paper again and see if I can figure it out!


Login before adding your answer.

Traffic: 229 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6