edgeR setting up glm/contrasts/ANOVA-like tests
1
1
Entering edit mode
ericalarson ▴ 10
@ericalarson-6890
Last seen 5.0 years ago
United States

Hello all,

I am working with RNASeq data using edgeR. The software is very clear and straightforward, but I would like to double check that I am correctly implementing my contrasts/glm approach for my data. Thank you in advance for taking the time to read my question.

My data contain two species (mus and dom) and I have three replicates for each species (inbred mouse strains). For each species I have four cell types (SP, DIP, LZ, RS) that represent different stages of spermatogenesis (they can be considered time points).

I want to identify the following:
1. genes that are differentially expressed between species in each cell type
2. genes that are differentially expressed (highly expressed) in a single cell type compared to the other three (within a single species) 
3. genes that are differentially expressed at any time point within one species 

So for, I have approached this in two steps (outlined below). My questions for the list are:
Q1. Is my approach to implementing contrasts/glm appropriate for my data?
Q2. I currently test for differentially expressed genes in two steps, the first using pairwise contrasts (comparing cell types between species) and the second as a time series. Is there a way to do all of this in one step?
Q3. To get at the number of genes that are differentially expressed in a single cell type compared to the other three, I have been using a series of pairwise comparisons, and then identifying genes that are differentially expressed in all three comparisons (I take the intersection of my significant toptags with positive or negative logFC in all three comparisons). Is there a better way?

 

What I have done so far:

Set up contrasts and identified genes that are differentially expressed between species in each cell type (#1) following section 3.3 in the edgeR manual (Experiments with all combinations of multiple factors). 

I set a design that combines my experimental factors into a single factor (called Group)

> design.musvdom
          sample species cell   Group
1  CCPP21.1M.DIP     mus  DIP mus.DIP
2  CCPP21.2M.DIP     mus  DIP mus.DIP
3  CCPP21.3M.DIP     mus  DIP mus.DIP
4   CCPP21.1M.LZ     mus   LZ  mus.LZ
5   CCPP21.2M.LZ     mus   LZ  mus.LZ
6   CCPP21.3M.LZ     mus   LZ  mus.LZ
7   CCPP21.1M.RS     mus   RS  mus.RS
8   CCPP21.2M.RS     mus   RS  mus.RS
9   CCPP21.3M.RS     mus   RS  mus.RS
10  CCPP21.1M.SP     mus   SP  mus.SP
11  CCPP21.2M.SP     mus   SP  mus.SP
12  CCPP21.3M.SP     mus   SP  mus.SP
13  WWLL3.1M.DIP     dom  DIP dom.DIP
14  WWLL7.1M.DIP     dom  DIP dom.DIP
15  WWLL7.2M.DIP     dom  DIP dom.DIP
16   WWLL3.1M.RS     dom   RS  dom.RS
17   WWLL6.1M.RS     dom   RS  dom.RS
18   WWLL7.2M.RS     dom   RS  dom.RS
19   WWLL3.1M.SP     dom   SP  dom.SP
20   WWLL7.2M.SP     dom   SP  dom.SP
21   WWLL7.3M.SP     dom   SP  dom.SP
22   WWLL4.1M.LZ     dom   LZ  dom.LZ
23   WWLL7.2M.LZ     dom   LZ  dom.LZ
24   WWLL7.3M.LZ     dom   LZ  dom.LZ

I generated a design matrix:

designmatrix.musvdom <- model.matrix(~0+Group)
> designmatrix.musvdom
              dom.DIP dom.LZ dom.RS dom.SP mus.DIP mus.LZ mus.RS mus.SP
CCPP21.1M.DIP       0      0      0      0       1      0      0      0
CCPP21.2M.DIP       0      0      0      0       1      0      0      0
CCPP21.3M.DIP       0      0      0      0       1      0      0      0
CCPP21.1M.LZ        0      0      0      0       0      1      0      0
CCPP21.2M.LZ        0      0      0      0       0      1      0      0
CCPP21.3M.LZ        0      0      0      0       0      1      0      0
CCPP21.1M.RS        0      0      0      0       0      0      1      0
CCPP21.2M.RS        0      0      0      0       0      0      1      0
CCPP21.3M.RS        0      0      0      0       0      0      1      0
CCPP21.1M.SP        0      0      0      0       0      0      0      1
CCPP21.2M.SP        0      0      0      0       0      0      0      1
CCPP21.3M.SP        0      0      0      0       0      0      0      1
WWLL3.1M.DIP        1      0      0      0       0      0      0      0
WWLL7.1M.DIP        1      0      0      0       0      0      0      0
WWLL7.2M.DIP        1      0      0      0       0      0      0      0
WWLL3.1M.RS         0      0      1      0       0      0      0      0
WWLL6.1M.RS         0      0      1      0       0      0      0      0
WWLL7.2M.RS         0      0      1      0       0      0      0      0
WWLL3.1M.SP         0      0      0      1       0      0      0      0
WWLL7.2M.SP         0      0      0      1       0      0      0      0
WWLL7.3M.SP         0      0      0      1       0      0      0      0
WWLL4.1M.LZ         0      1      0      0       0      0      0      0
WWLL7.2M.LZ         0      1      0      0       0      0      0      0
WWLL7.3M.LZ         0      1      0      0       0      0      0      0

and a list of contrasts to test:

my.contrasts<-makeContrasts(
  musvdom.sp = mus.SP - dom.SP,
  musvdom.lz = mus.LZ - dom.LZ,
  musvdom.dip = mus.DIP - dom.DIP,
  musvdom.rs = mus.RS - dom.RS,
  mus.sp.lz = mus.SP - mus.LZ,
  mus.sp.dip = mus.SP - mus.DIP,
  mus.sp.rs = mus.SP - mus.RS,
  mus.lz.dip = mus.LZ - mus.DIP,
  mus.lz.rs = mus.LZ - mus.RS,
  mus.dip.rs = mus.DIP - mus.RS, 
levels = designmatrix.musvdom
)

Finally, to identify genes that are differentially expressed at any time point, I followed the edgeR manual section 3.3.3 (Treatment effects over all times) and this user post (edgeR - ANOVA-like test for differential gene expression)

I use a similar design table and matrix:

          sample species cell   Group
1  CCPP21.1M.DIP     mus  DIP mus.DIP
2  CCPP21.2M.DIP     mus  DIP mus.DIP
3  CCPP21.3M.DIP     mus  DIP mus.DIP
4   CCPP21.1M.LZ     mus   LZ  mus.LZ
5   CCPP21.2M.LZ     mus   LZ  mus.LZ
6   CCPP21.3M.LZ     mus   LZ  mus.LZ
7   CCPP21.1M.RS     mus   RS  mus.RS
8   CCPP21.2M.RS     mus   RS  mus.RS
9   CCPP21.3M.RS     mus   RS  mus.RS
10  CCPP21.1M.SP     mus   SP  mus.SP
11  CCPP21.2M.SP     mus   SP  mus.SP
12  CCPP21.3M.SP     mus   SP  mus.SP

I assign the first cell type as the reference level

>design.mus$cell <- relevel(design.mus$cell, ref = "SP")

>designmatrix.mus <- model.matrix(~cell, data = design.mus)

              (Intercept) cellDIP cellLZ cellRS
CCPP21.1M.DIP           1       1      0      0
CCPP21.2M.DIP           1       1      0      0
CCPP21.3M.DIP           1       1      0      0
CCPP21.1M.LZ            1       0      1      0
CCPP21.2M.LZ            1       0      1      0
CCPP21.3M.LZ            1       0      1      0
CCPP21.1M.RS            1       0      0      1
CCPP21.2M.RS            1       0      0      1
CCPP21.3M.RS            1       0      0      1
CCPP21.1M.SP            1       0      0      0
CCPP21.2M.SP            1       0      0      0
CCPP21.3M.SP            1       0      0      0

When I determine differentially expressed genes  I specify the coefficients in my model:

>fit <- glmFit(dge.mus, designmatrix.mus)
>colnames(fit)
   [1] "(Intercept)" "cellDIP"     "cellLZ"      "cellRS" 
>de <- glmLRT(fit, coef=c(2:4))

Again thanks in advance for any advice!

Erica Larson

 

 

edger rnaseq • 2.8k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 28 minutes ago
The city by the bay

Your first aim is to identify changes between species in each cell type. In that case, you want to specify and run the contrast for each cell type separately, i.e., 

contrast.SP <- makeContrasts(mus.SP - dom.SP, levels = designmatrix.musvdom)
de.SP <- glmLRT(fit, contrast=contrast.SP)

and so on, for each cell type. Right now, your my.contrasts will be a matrix, such that glmLRT will perform an ANOVA-like test for changes in any of the comparisons. That may not be the desired result if you're investigating one cell type at a time.

For your second aim, there are at least two ways to do it. The first is to take the intersection of pairwise comparisons as you have described. The second is to compare the expression of your chosen cell type (e.g., SP) with the average expression of all other cell types from the same species, i.e.,

ave.con <- makeContrasts(mus.SP - (mus.LZ + mus.DIP + mus.RS)/3, levels=designmatrix.musvdom)
de.ave <- glmLRT(fit, contrast=ave.con)

This isn't as stringent as the intersection approach, which may or may not be desirable. For example, consider a gene that is upregulated in SP over LZ/DIP but identical between SP and RS. This will be detected by comparing the averages with ave.con, which may be sensible the gene is, on average, upregulated. However, it won't be present in the intersection of the three pairwise comparisons.

For your final aim, your analysis is correct. However, you can get more precision for dispersion estimation by including the samples from the dom species in your design matrix. If you use your designmatrix.musvdom, the ANOVA-like contrast would become

anova.con <-makeContrasts(
  mus.sp.lz = mus.SP - mus.LZ,
  mus.sp.dip = mus.SP - mus.DIP,
  mus.sp.rs = mus.SP - mus.RS,
levels = designmatrix.musvdom)

In this case, there's no need to specify all pairwise comparisons. For example, if mus.LZ were different to mus.DIP, then at least one of them would not be equal to mus.SP. This would result in rejection of the null hypothesis through mus.sp.dip and/or mus.sp.lz.

ADD COMMENT
0
Entering edit mode

Aaron,

Thanks for the quick reply!

You are correct that I have to run each contrast in the "mycontrasts" list separately. I failed to mention in the original post that I have listed the contrasts as a matrix, which I then loop through and call a series of commands and plots for each contrast.

Thanks for the feedback regarding aim two. I had tried taking the average expression of the other three cell types and I ended up with a very large number of differentially expressed genes. I intuitively thought that the reason was as you said, one cell type may be different from two, but not the third, but I am glad to have confirmation that that my code was correct. In this case, I am looking only for genes that are higher in one than all the other three, so the intersection does seem to be the best method.

Finally, thanks for the tip on the ANOVA-like contrast. If the dispersion estimates are better when all samples are included, then it does seem preferable (and less code!) to use pairwise contrasts.

 

ADD REPLY

Login before adding your answer.

Traffic: 691 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6