Question

edgeR setting up glm/contrasts/ANOVA-like tests

1

Entering edit mode

ericalarson ▴ 10

@ericalarson-6890

Last seen 5.0 years ago

United States

Hello all,

I am working with RNASeq data using edgeR. The software is very clear and straightforward, but I would like to double check that I am correctly implementing my contrasts/glm approach for my data. Thank you in advance for taking the time to read my question.

My data contain two species (mus and dom) and I have three replicates for each species (inbred mouse strains). For each species I have four cell types (SP, DIP, LZ, RS) that represent different stages of spermatogenesis (they can be considered time points).

I want to identify the following:
1. genes that are differentially expressed between species in each cell type
2. genes that are differentially expressed (highly expressed) in a single cell type compared to the other three (within a single species)
3. genes that are differentially expressed at any time point within one species

So for, I have approached this in two steps (outlined below). My questions for the list are:
Q1. Is my approach to implementing contrasts/glm appropriate for my data?
Q2. I currently test for differentially expressed genes in two steps, the first using pairwise contrasts (comparing cell types between species) and the second as a time series. Is there a way to do all of this in one step?
Q3. To get at the number of genes that are differentially expressed in a single cell type compared to the other three, I have been using a series of pairwise comparisons, and then identifying genes that are differentially expressed in all three comparisons (I take the intersection of my significant toptags with positive or negative logFC in all three comparisons). Is there a better way?

What I have done so far:

Set up contrasts and identified genes that are differentially expressed between species in each cell type (#1) following section 3.3 in the edgeR manual (Experiments with all combinations of multiple factors).

I set a design that combines my experimental factors into a single factor (called Group)

> design.musvdom
          sample species cell   Group
1  CCPP21.1M.DIP     mus  DIP mus.DIP
2  CCPP21.2M.DIP     mus  DIP mus.DIP
3  CCPP21.3M.DIP     mus  DIP mus.DIP
4   CCPP21.1M.LZ     mus   LZ  mus.LZ
5   CCPP21.2M.LZ     mus   LZ  mus.LZ
6   CCPP21.3M.LZ     mus   LZ  mus.LZ
7   CCPP21.1M.RS     mus   RS  mus.RS
8   CCPP21.2M.RS     mus   RS  mus.RS
9   CCPP21.3M.RS     mus   RS  mus.RS
10  CCPP21.1M.SP     mus   SP  mus.SP
11  CCPP21.2M.SP     mus   SP  mus.SP
12  CCPP21.3M.SP     mus   SP  mus.SP
13  WWLL3.1M.DIP     dom  DIP dom.DIP
14  WWLL7.1M.DIP     dom  DIP dom.DIP
15  WWLL7.2M.DIP     dom  DIP dom.DIP
16   WWLL3.1M.RS     dom   RS  dom.RS
17   WWLL6.1M.RS     dom   RS  dom.RS
18   WWLL7.2M.RS     dom   RS  dom.RS
19   WWLL3.1M.SP     dom   SP  dom.SP
20   WWLL7.2M.SP     dom   SP  dom.SP
21   WWLL7.3M.SP     dom   SP  dom.SP
22   WWLL4.1M.LZ     dom   LZ  dom.LZ
23   WWLL7.2M.LZ     dom   LZ  dom.LZ
24   WWLL7.3M.LZ     dom   LZ  dom.LZ

I generated a design matrix:

designmatrix.musvdom <- model.matrix(~0+Group)

> designmatrix.musvdom
              dom.DIP dom.LZ dom.RS dom.SP mus.DIP mus.LZ mus.RS mus.SP
CCPP21.1M.DIP       0      0      0      0       1      0      0      0
CCPP21.2M.DIP       0      0      0      0       1      0      0      0
CCPP21.3M.DIP       0      0      0      0       1      0      0      0
CCPP21.1M.LZ        0      0      0      0       0      1      0      0
CCPP21.2M.LZ        0      0      0      0       0      1      0      0
CCPP21.3M.LZ        0      0      0      0       0      1      0      0
CCPP21.1M.RS        0      0      0      0       0      0      1      0
CCPP21.2M.RS        0      0      0      0       0      0      1      0
CCPP21.3M.RS        0      0      0      0       0      0      1      0
CCPP21.1M.SP        0      0      0      0       0      0      0      1
CCPP21.2M.SP        0      0      0      0       0      0      0      1
CCPP21.3M.SP        0      0      0      0       0      0      0      1
WWLL3.1M.DIP        1      0      0      0       0      0      0      0
WWLL7.1M.DIP        1      0      0      0       0      0      0      0
WWLL7.2M.DIP        1      0      0      0       0      0      0      0
WWLL3.1M.RS         0      0      1      0       0      0      0      0
WWLL6.1M.RS         0      0      1      0       0      0      0      0
WWLL7.2M.RS         0      0      1      0       0      0      0      0
WWLL3.1M.SP         0      0      0      1       0      0      0      0
WWLL7.2M.SP         0      0      0      1       0      0      0      0
WWLL7.3M.SP         0      0      0      1       0      0      0      0
WWLL4.1M.LZ         0      1      0      0       0      0      0      0
WWLL7.2M.LZ         0      1      0      0       0      0      0      0
WWLL7.3M.LZ         0      1      0      0       0      0      0      0

and a list of contrasts to test:

my.contrasts<-makeContrasts(
  musvdom.sp = mus.SP - dom.SP,
  musvdom.lz = mus.LZ - dom.LZ,
  musvdom.dip = mus.DIP - dom.DIP,
  musvdom.rs = mus.RS - dom.RS,
  mus.sp.lz = mus.SP - mus.LZ,
  mus.sp.dip = mus.SP - mus.DIP,
  mus.sp.rs = mus.SP - mus.RS,
  mus.lz.dip = mus.LZ - mus.DIP,
  mus.lz.rs = mus.LZ - mus.RS,
  mus.dip.rs = mus.DIP - mus.RS, 
levels = designmatrix.musvdom
)

Finally, to identify genes that are differentially expressed at any time point, I followed the edgeR manual section 3.3.3 (Treatment effects over all times) and this user post (edgeR - ANOVA-like test for differential gene expression)

I use a similar design table and matrix:

          sample species cell   Group
1  CCPP21.1M.DIP     mus  DIP mus.DIP
2  CCPP21.2M.DIP     mus  DIP mus.DIP
3  CCPP21.3M.DIP     mus  DIP mus.DIP
4   CCPP21.1M.LZ     mus   LZ  mus.LZ
5   CCPP21.2M.LZ     mus   LZ  mus.LZ
6   CCPP21.3M.LZ     mus   LZ  mus.LZ
7   CCPP21.1M.RS     mus   RS  mus.RS
8   CCPP21.2M.RS     mus   RS  mus.RS
9   CCPP21.3M.RS     mus   RS  mus.RS
10  CCPP21.1M.SP     mus   SP  mus.SP
11  CCPP21.2M.SP     mus   SP  mus.SP
12  CCPP21.3M.SP     mus   SP  mus.SP

I assign the first cell type as the reference level

>design.mus$cell <- relevel(design.mus$cell, ref = "SP")

>designmatrix.mus <- model.matrix(~cell, data = design.mus)

              (Intercept) cellDIP cellLZ cellRS
CCPP21.1M.DIP           1       1      0      0
CCPP21.2M.DIP           1       1      0      0
CCPP21.3M.DIP           1       1      0      0
CCPP21.1M.LZ            1       0      1      0
CCPP21.2M.LZ            1       0      1      0
CCPP21.3M.LZ            1       0      1      0
CCPP21.1M.RS            1       0      0      1
CCPP21.2M.RS            1       0      0      1
CCPP21.3M.RS            1       0      0      1
CCPP21.1M.SP            1       0      0      0
CCPP21.2M.SP            1       0      0      0
CCPP21.3M.SP            1       0      0      0

When I determine differentially expressed genes I specify the coefficients in my model:

>fit <- glmFit(dge.mus, designmatrix.mus)
>colnames(fit)
   [1] "(Intercept)" "cellDIP"     "cellLZ"      "cellRS" 
>de <- glmLRT(fit, coef=c(2:4))

Again thanks in advance for any advice!

Erica Larson

edger rnaseq • 2.8k views

ADD COMMENT • link updated 9.5 years ago by Aaron Lun ★ 28k • written 9.5 years ago by ericalarson ▴ 10

score 2 · Accepted Answer · 2014-10-21

Your first aim is to identify changes between species in each cell type. In that case, you want to specify and run the contrast for each cell type separately, i.e.,

contrast.SP <- makeContrasts(mus.SP - dom.SP, levels = designmatrix.musvdom)
de.SP <- glmLRT(fit, contrast=contrast.SP)

and so on, for each cell type. Right now, your my.contrasts will be a matrix, such that glmLRT will perform an ANOVA-like test for changes in any of the comparisons. That may not be the desired result if you're investigating one cell type at a time.

For your second aim, there are at least two ways to do it. The first is to take the intersection of pairwise comparisons as you have described. The second is to compare the expression of your chosen cell type (e.g., SP) with the average expression of all other cell types from the same species, i.e.,

ave.con <- makeContrasts(mus.SP - (mus.LZ + mus.DIP + mus.RS)/3, levels=designmatrix.musvdom)
de.ave <- glmLRT(fit, contrast=ave.con)

This isn't as stringent as the intersection approach, which may or may not be desirable. For example, consider a gene that is upregulated in SP over LZ/DIP but identical between SP and RS. This will be detected by comparing the averages with ave.con, which may be sensible the gene is, on average, upregulated. However, it won't be present in the intersection of the three pairwise comparisons.

For your final aim, your analysis is correct. However, you can get more precision for dispersion estimation by including the samples from the dom species in your design matrix. If you use your designmatrix.musvdom, the ANOVA-like contrast would become

anova.con <-makeContrasts(
  mus.sp.lz = mus.SP - mus.LZ,
  mus.sp.dip = mus.SP - mus.DIP,
  mus.sp.rs = mus.SP - mus.RS,
levels = designmatrix.musvdom)

In this case, there's no need to specify all pairwise comparisons. For example, if mus.LZ were different to mus.DIP, then at least one of them would not be equal to mus.SP. This would result in rejection of the null hypothesis through mus.sp.dip and/or mus.sp.lz.