Hello all,
I am working with RNASeq data using edgeR. The software is very clear and straightforward, but I would like to double check that I am correctly implementing my contrasts/glm approach for my data. Thank you in advance for taking the time to read my question.
My data contain two species (mus and dom) and I have three replicates for each species (inbred mouse strains). For each species I have four cell types (SP, DIP, LZ, RS) that represent different stages of spermatogenesis (they can be considered time points).
I want to identify the following:
1. genes that are differentially expressed between species in each cell type
2. genes that are differentially expressed (highly expressed) in a single cell type compared to the other three (within a single species)
3. genes that are differentially expressed at any time point within one species
So for, I have approached this in two steps (outlined below). My questions for the list are:
Q1. Is my approach to implementing contrasts/glm appropriate for my data?
Q2. I currently test for differentially expressed genes in two steps, the first using pairwise contrasts (comparing cell types between species) and the second as a time series. Is there a way to do all of this in one step?
Q3. To get at the number of genes that are differentially expressed in a single cell type compared to the other three, I have been using a series of pairwise comparisons, and then identifying genes that are differentially expressed in all three comparisons (I take the intersection of my significant toptags with positive or negative logFC in all three comparisons). Is there a better way?
What I have done so far:
Set up contrasts and identified genes that are differentially expressed between species in each cell type (#1) following section 3.3 in the edgeR manual (Experiments with all combinations of multiple factors).
I set a design that combines my experimental factors into a single factor (called Group)
> design.musvdom sample species cell Group 1 CCPP21.1M.DIP mus DIP mus.DIP 2 CCPP21.2M.DIP mus DIP mus.DIP 3 CCPP21.3M.DIP mus DIP mus.DIP 4 CCPP21.1M.LZ mus LZ mus.LZ 5 CCPP21.2M.LZ mus LZ mus.LZ 6 CCPP21.3M.LZ mus LZ mus.LZ 7 CCPP21.1M.RS mus RS mus.RS 8 CCPP21.2M.RS mus RS mus.RS 9 CCPP21.3M.RS mus RS mus.RS 10 CCPP21.1M.SP mus SP mus.SP 11 CCPP21.2M.SP mus SP mus.SP 12 CCPP21.3M.SP mus SP mus.SP 13 WWLL3.1M.DIP dom DIP dom.DIP 14 WWLL7.1M.DIP dom DIP dom.DIP 15 WWLL7.2M.DIP dom DIP dom.DIP 16 WWLL3.1M.RS dom RS dom.RS 17 WWLL6.1M.RS dom RS dom.RS 18 WWLL7.2M.RS dom RS dom.RS 19 WWLL3.1M.SP dom SP dom.SP 20 WWLL7.2M.SP dom SP dom.SP 21 WWLL7.3M.SP dom SP dom.SP 22 WWLL4.1M.LZ dom LZ dom.LZ 23 WWLL7.2M.LZ dom LZ dom.LZ 24 WWLL7.3M.LZ dom LZ dom.LZ
I generated a design matrix:
designmatrix.musvdom <- model.matrix(~0+Group)
> designmatrix.musvdom dom.DIP dom.LZ dom.RS dom.SP mus.DIP mus.LZ mus.RS mus.SP CCPP21.1M.DIP 0 0 0 0 1 0 0 0 CCPP21.2M.DIP 0 0 0 0 1 0 0 0 CCPP21.3M.DIP 0 0 0 0 1 0 0 0 CCPP21.1M.LZ 0 0 0 0 0 1 0 0 CCPP21.2M.LZ 0 0 0 0 0 1 0 0 CCPP21.3M.LZ 0 0 0 0 0 1 0 0 CCPP21.1M.RS 0 0 0 0 0 0 1 0 CCPP21.2M.RS 0 0 0 0 0 0 1 0 CCPP21.3M.RS 0 0 0 0 0 0 1 0 CCPP21.1M.SP 0 0 0 0 0 0 0 1 CCPP21.2M.SP 0 0 0 0 0 0 0 1 CCPP21.3M.SP 0 0 0 0 0 0 0 1 WWLL3.1M.DIP 1 0 0 0 0 0 0 0 WWLL7.1M.DIP 1 0 0 0 0 0 0 0 WWLL7.2M.DIP 1 0 0 0 0 0 0 0 WWLL3.1M.RS 0 0 1 0 0 0 0 0 WWLL6.1M.RS 0 0 1 0 0 0 0 0 WWLL7.2M.RS 0 0 1 0 0 0 0 0 WWLL3.1M.SP 0 0 0 1 0 0 0 0 WWLL7.2M.SP 0 0 0 1 0 0 0 0 WWLL7.3M.SP 0 0 0 1 0 0 0 0 WWLL4.1M.LZ 0 1 0 0 0 0 0 0 WWLL7.2M.LZ 0 1 0 0 0 0 0 0 WWLL7.3M.LZ 0 1 0 0 0 0 0 0
and a list of contrasts to test:
my.contrasts<-makeContrasts( musvdom.sp = mus.SP - dom.SP, musvdom.lz = mus.LZ - dom.LZ, musvdom.dip = mus.DIP - dom.DIP, musvdom.rs = mus.RS - dom.RS, mus.sp.lz = mus.SP - mus.LZ, mus.sp.dip = mus.SP - mus.DIP, mus.sp.rs = mus.SP - mus.RS, mus.lz.dip = mus.LZ - mus.DIP, mus.lz.rs = mus.LZ - mus.RS, mus.dip.rs = mus.DIP - mus.RS, levels = designmatrix.musvdom )
Finally, to identify genes that are differentially expressed at any time point, I followed the edgeR manual section 3.3.3 (Treatment effects over all times) and this user post (edgeR - ANOVA-like test for differential gene expression)
I use a similar design table and matrix:
sample species cell Group 1 CCPP21.1M.DIP mus DIP mus.DIP 2 CCPP21.2M.DIP mus DIP mus.DIP 3 CCPP21.3M.DIP mus DIP mus.DIP 4 CCPP21.1M.LZ mus LZ mus.LZ 5 CCPP21.2M.LZ mus LZ mus.LZ 6 CCPP21.3M.LZ mus LZ mus.LZ 7 CCPP21.1M.RS mus RS mus.RS 8 CCPP21.2M.RS mus RS mus.RS 9 CCPP21.3M.RS mus RS mus.RS 10 CCPP21.1M.SP mus SP mus.SP 11 CCPP21.2M.SP mus SP mus.SP 12 CCPP21.3M.SP mus SP mus.SP
I assign the first cell type as the reference level
>design.mus$cell <- relevel(design.mus$cell, ref = "SP") >designmatrix.mus <- model.matrix(~cell, data = design.mus) (Intercept) cellDIP cellLZ cellRS CCPP21.1M.DIP 1 1 0 0 CCPP21.2M.DIP 1 1 0 0 CCPP21.3M.DIP 1 1 0 0 CCPP21.1M.LZ 1 0 1 0 CCPP21.2M.LZ 1 0 1 0 CCPP21.3M.LZ 1 0 1 0 CCPP21.1M.RS 1 0 0 1 CCPP21.2M.RS 1 0 0 1 CCPP21.3M.RS 1 0 0 1 CCPP21.1M.SP 1 0 0 0 CCPP21.2M.SP 1 0 0 0 CCPP21.3M.SP 1 0 0 0
When I determine differentially expressed genes I specify the coefficients in my model:
>fit <- glmFit(dge.mus, designmatrix.mus) >colnames(fit) [1] "(Intercept)" "cellDIP" "cellLZ" "cellRS" >de <- glmLRT(fit, coef=c(2:4))
Again thanks in advance for any advice!
Erica Larson
Aaron,
Thanks for the quick reply!
You are correct that I have to run each contrast in the "mycontrasts" list separately. I failed to mention in the original post that I have listed the contrasts as a matrix, which I then loop through and call a series of commands and plots for each contrast.
Thanks for the feedback regarding aim two. I had tried taking the average expression of the other three cell types and I ended up with a very large number of differentially expressed genes. I intuitively thought that the reason was as you said, one cell type may be different from two, but not the third, but I am glad to have confirmation that that my code was correct. In this case, I am looking only for genes that are higher in one than all the other three, so the intersection does seem to be the best method.
Finally, thanks for the tip on the ANOVA-like contrast. If the dispersion estimates are better when all samples are included, then it does seem preferable (and less code!) to use pairwise contrasts.