Hello!
I am trying to analyse the differential gene expression between groups in a complicated multi-factor experiment and am having trouble setting up the contrast matrix. In this experiment, samples were collected from three different geographic origins (Brazil, California and Yemen), were kept under one of four different regimes (ancestral, hot, cold, switch), and RNA was assayed at one of three temperatures (23, 29, 35). I wish for the assay temperature to be treated as a factor rather than covariate.
I initially added the group info during the readDGE stage:
counts <- readDGE(files$Files, header=FALSE, labels = files$Summary, group = files$group)
where 'files' is a spreadsheet of file names, and 'group' is a unique indicator for each possible combination of origin, regime, and assay temp (36 groups). I tried using numeric and character string names for each group, for what it's worth. I did it at this stage because the filterByExpr() function works best when you define groups.
The issue arises at a later stage, with:
design <- model.matrix(~0 + as.factor(files$Assay) + files$Regime + files$Origin)
Which I'm attempting to generate an intercept-free model for each of the groups. The design matrix generated has columns for all three assay temps but is missing a column for regime (only has hot, cold and switch), and for origin (only California and Yemen, no Brazil). I understand that this is because the information is redundant - i.e. the presence of Ancestral is implied with by an absence of hot, cold, or switch, that's how it's encoded. What I can't figure out is how to set up the resulting DE testing, because what would be nice and easy would be:
regime.HotvsAnc <- makeContrasts(Regime.Hot-Regime.Anc, levels=colnames(design))
assay.HotvsAnc.results <- glmQLFTest(fit, contrast = regime.HotvsAnc)
but this is not possible, because there's no ancestral column in the design matrix. Instead I've ended up with this absolute monstrosity, which is based on the files$groups:
regime.HotvsAnc <- makeContrasts(
(BraAnc23 + BraAnc29 + BraAnc35 +
CalAnc23 + CalAnc29 + CalAnc35 +
YemAnc23 + YemAnc29 + YemAnc35)/9 -
(BraHot23 + BraHot29 + BraHot35 +
CalHot23 + CalHot29 + CalHot35 +
YemHot23 + YemHot29 + YemHot35)/9,
levels=colnames(design))
regime.HotvsAnc <- glmQLFTest(fit, contrast = regime.HotvsAnc)
Does anyone know how I can take a less convoluted approach? Or is it necessary to make these horrible collections for every DE comparison?
Thank you so much!
Alex
Hi James,
Thanks very much for your explanation, indeed, the reason I was trying to avoid using the intercept-based model was because while we will also do simple comparisons, the interactions between the factors - assay, regime and origin - are the main point of the study.
In that case, would it be best to run the intercept model for the simple comparisons - such as ancestral vs. other regimes, Brazil vs the other origins - and then use a non-intercept model for the interactive effects? How would you set up the design model for something like origin by regime effects? Something like this?
You have a fairly complex experiment that will take some sophistication to analyze appropriately. You would do well to find a local statistician with whom you can collaborate.