Question

edgeR design/contrast matrix troubles

0

Entering edit mode

alex.hart • 0

@4d22b863

Last seen 7 months ago

Sweden

Hello!

I am trying to analyse the differential gene expression between groups in a complicated multi-factor experiment and am having trouble setting up the contrast matrix. In this experiment, samples were collected from three different geographic origins (Brazil, California and Yemen), were kept under one of four different regimes (ancestral, hot, cold, switch), and RNA was assayed at one of three temperatures (23, 29, 35). I wish for the assay temperature to be treated as a factor rather than covariate.

I initially added the group info during the readDGE stage:

counts <- readDGE(files$Files, header=FALSE, labels = files$Summary, group = files$group)

where 'files' is a spreadsheet of file names, and 'group' is a unique indicator for each possible combination of origin, regime, and assay temp (36 groups). I tried using numeric and character string names for each group, for what it's worth. I did it at this stage because the filterByExpr() function works best when you define groups.

The issue arises at a later stage, with:

design <- model.matrix(~0 + as.factor(files$Assay) + files$Regime + files$Origin)

Which I'm attempting to generate an intercept-free model for each of the groups. The design matrix generated has columns for all three assay temps but is missing a column for regime (only has hot, cold and switch), and for origin (only California and Yemen, no Brazil). I understand that this is because the information is redundant - i.e. the presence of Ancestral is implied with by an absence of hot, cold, or switch, that's how it's encoded. What I can't figure out is how to set up the resulting DE testing, because what would be nice and easy would be:

regime.HotvsAnc <- makeContrasts(Regime.Hot-Regime.Anc, levels=colnames(design))
assay.HotvsAnc.results <- glmQLFTest(fit, contrast = regime.HotvsAnc)

but this is not possible, because there's no ancestral column in the design matrix. Instead I've ended up with this absolute monstrosity, which is based on the files$groups:

regime.HotvsAnc <- makeContrasts(
  (BraAnc23 + BraAnc29 + BraAnc35 +
   CalAnc23 + CalAnc29 + CalAnc35 +
  YemAnc23 + YemAnc29 + YemAnc35)/9 -
  (BraHot23 + BraHot29 + BraHot35 +
   CalHot23 + CalHot29 + CalHot35 +
   YemHot23 + YemHot29 + YemHot35)/9, 
  levels=colnames(design))
regime.HotvsAnc <- glmQLFTest(fit, contrast = regime.HotvsAnc)

Does anyone know how I can take a less convoluted approach? Or is it necessary to make these horrible collections for every DE comparison?

Thank you so much!

Alex

edgeR • 667 views

ADD COMMENT • link updated 7 months ago by James W. MacDonald 65k • written 7 months ago by alex.hart • 0

score 1 · Answer 1 · 2023-09-06

1

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 day ago

United States

You don't say what your goals are, apart from one contrast. For that one contrast it's easiest to simply set up as a conventional factor effects model.

> design <- model.matrix(~Assay + Regime + Origin, d.f)
> colnames(design)
[1] "(Intercept)"  "Assay29"     
[3] "Assay35"      "Regimecold"  
[5] "Regimehot"    "Regimeswitch"
[7] "OriginCA"     "OriginYE"

And now the fifth coefficient (Regeimehot) tests for the difference between the hot and ancestral regime. The fourth, and sixth coefficients test for the difference between cold and ancestral and switch and ancestral, respectively. If you want hot vs cold, it's Regimehot - Regimecold. But getting at more complicated comparisons is more difficult with this paramerization. For example, if you wanted to know if there are any genes that react differently to the hot vs ancestral regime depending on the origin (e.g., do Brazilian samples react differently than Yemeni samples), then you need to either include interaction terms or you need to fit the cell means model you already tried and then just compute contrasts by hand.

ADD COMMENT • link 7 months ago James W. MacDonald 65k

0

Entering edit mode

Hi James,

Thanks very much for your explanation, indeed, the reason I was trying to avoid using the intercept-based model was because while we will also do simple comparisons, the interactions between the factors - assay, regime and origin - are the main point of the study.

In that case, would it be best to run the intercept model for the simple comparisons - such as ancestral vs. other regimes, Brazil vs the other origins - and then use a non-intercept model for the interactive effects? How would you set up the design model for something like origin by regime effects? Something like this?

design <- model.matrix(~0 + as.factor(files$Assay) + files$Regime + files$Origin + files$Regime*files$Origin)

ADD REPLY • link 7 months ago alex.hart • 0

0

Entering edit mode

You have a fairly complex experiment that will take some sophistication to analyze appropriately. You would do well to find a local statistician with whom you can collaborate.

ADD REPLY • link 7 months ago James W. MacDonald 65k

score 0 · Answer 2 · 2023-09-07

You're getting mixed up between simple additive models on one hand and factorial models using all possible interactions of factors on the other. The group-means approach without an intercept is only applicable for the latter. Your situation is actually much simpler. You don't even need contrasts at all.

As James has said, you just define:

> Assay <- factor(files$Assay)
> Regime <- factor(files$Regime)
> Origin <- factor(files$Origin)
> design <- model.matrix(~Assay + Regime + Origin)

Then

keep <- filterByExpr(counts, design)
y < counts[keep, ,keep.lib.size=FALSE]
y <- normLibSizes(y)
y <- estimateDisp(y,design)
fit <- glmQLFit(y, design)

To compare Hot to Ancestral you just use

q <- glmQLFTest(fit, coef="RegimeHot")
topTags(q)

All other contrasts can be extracted just as easily. There's no need to define groups at all, in fact defining groups is wrong because your simple additive linear model is not equivalent to groups.

I'm not sure where you got the idea that filterByExpr requires a group variable. What it actually needs is the correct design matrix!