Hi,
I am running an RNA Seq analysis for differential gene expressions for the data described as follows with biological replicates: 3 Control, 3 Treatment a, 4 Treatment b and 3 Treatment c.
I need to make the following comparisons: Treatment a vs Control, Treatment b vs Control, Treatment c vs Control and Treatment b vs Treatment c. I am confused as to where should I write the contrasts? Can you please explain the difference between factor levels and contrasts?
Thank you for your help.
# load libraries
library(DESeq2)
# Step 1: preparing the count data
  # read in counts data
data <- read.csv('gene_count_matrix.csv', header = T, row.names = 1L) # Don't forget to put row names as ENSEMBL IDs, else code will not run
head(data)
##                    Control_1 Control_2 Control_3 Treatment a_1
## ENSMUSG00000028180           397           372           433       442
## ENSMUSG00000028182             39             21             59         31
## ENSMUSG00000028185              0              0              0          0
## ENSMUSG00000028184           884           991           735       903
## ENSMUSG00000028187           11           54           11       146
## ENSMUSG00000028186              0              0              0          3
##                    Treatment a_2 Treatment a_3 Treatment b_1 Treatment b_2
## ENSMUSG00000028180       427       473           397           318
## ENSMUSG00000028182         9         6             6             4
## ENSMUSG00000028185          0          0              0              0
## ENSMUSG00000028184       814       771          110          154
## ENSMUSG00000028187       142       173           13           22
## ENSMUSG00000028186          0          0              8              0
##                   Treatment b_3 Treatment b_4 Treatment c_1 Treatment c_2
## ENSMUSG00000028180           35           31       93       44
## ENSMUSG00000028182             38             24         40         42
## ENSMUSG00000028185              0              0          0          0
## ENSMUSG00000028184           926           857      1018      1960
## ENSMUSG00000028187           152           196       179       160
## ENSMUSG00000028186              0              0          4          4
##                   Treatment c_3
## ENSMUSG00000028180       3453
## ENSMUSG00000028182         38
## ENSMUSG00000028185          0
## ENSMUSG00000028184       9514
## ENSMUSG00000028187       1663
## ENSMUSG00000028186          0
  # read in sample info
pheno <- read.csv('pheno_sheet.csv', header = T, row.names = 1L)
  # making sure the row names in colData matches to column names in counts_data
all(colnames(data) %in% rownames(pheno))
## [1] TRUE
  # are they in the same order?
all(colnames(data) == rownames(pheno))
## [1] TRUE
# Step 2: construct a DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = data,
                              colData = pheno,
                              design = ~ Sample)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
dds # n = 55,401
## class: DESeqDataSet 
## dim: 55401 13 
## metadata(1): version
## assays(1): counts
## rownames(55401): ENSMUSG00000028180 ENSMUSG00000028182 ...
##   ENSMUSG00000036958 ENSMUSG00000042678
## rowData names(0):
## colnames(13): Control_1 Control_2 ... Treatment c_2 Treatment c_3
## colData names(1): Sample
  # pre-filtering: removing rows with low gene counts
  # This is recommended step as it helps reduce the size of the dds object and increase speed of computation
  # keeping rows that have at least 10 reads total
keep <- rowSums(counts(dds)) >= 10  # >=10 because recommended in the vignette
dds <- dds[keep,]
dds # n= 24,038
## class: DESeqDataSet 
## dim: 24038 13 
## metadata(1): version
## assays(1): counts
## rownames(24038): ENSMUSG00000028180 ENSMUSG00000028182 ...
##   ENSMUSG00000036959 ENSMUSG00000042678
## rowData names(0):
## colnames(13): Control_1 Control_2 ... Treatment c_2 Treatment c_3
## colData names(1): Sample
# set the factor level
dds$Sample<- relevel(dds$Sample, ref = "Control") # If the reference level is not mentioned, then it will automatically choose the reference alphabetically
dds$Sample
##  [1] Control Control Control Treatment a     Treatment a     Treatment a     Treatment b Treatment b Treatment b
## [10] Treatment b Treatment c     Treatment c     Treatment c    
## Levels: Control Treatment a Treatment b Treatment c
# Step 3: Run DESeq
dds <- DESeq(dds)
## estimating size factors
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
## final dispersion estimates
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
## fitting model and testing
res_Treatment a <- results(dds, contrast = c("Sample", "Treatment a", "Control"))
res_Treatment a
## log2 fold change (MLE): Sample Treatment a vs Control 
## Wald test p-value: Sample Treatment a vs Control 
## DataFrame with 24038 rows and 6 columns
##                       baseMean log2FoldChange     lfcSE       stat      pvalue
##                      <numeric>      <numeric> <numeric>  <numeric>   <numeric>
## ENSMUSG00000028180  3862.6757     -0.057747 0.070335 -0.756516  0.4286236
## ENSMUSG00000028182    41.9097     -0.017546 0.488878 -0.035134  0.9708700
## ENSMUSG00000028184  934.53415      0.213321 0.04522  2.638894  0.0081781
## ENSMUSG00000028187  163.88391      0.178966 0.095555  1.526086  0.1618069
## ENSMUSG00000028186     1.4051      3.8020443 3.816178  0.961816  0.3116193
## ...                        ...            ...       ...        ...         ...
## ENSMUSG00000106108    0.8605       0.000000 4.620361   0.00000 1.00000e+00
## ENSMUSG00000042675 5026.25958      -0.33392 0.069803  -5.128677 2.9785e-07
## ENSMUSG00000083767   50.466359       0.1324 0.435256   0.9007 7.3404e-01
## ENSMUSG00000036959 126.95391      -0.55607 0.126976  -4.27439 2.6366e-05
## ENSMUSG00000042678    3.27304       5.45275 3.965736   1.75693 1.8917e-01
##                           padj
##                      <numeric>
## ENSMUSG00000028180   0.647043
## ENSMUSG00000028182   0.985538
## ENSMUSG00000028184   0.03958
## ENSMUSG00000028187   0.3520263
## ENSMUSG00000028186          NA
## ...                        ...
## ENSMUSG00000106108          NA
## ENSMUSG00000042675 8.26780-06
## ENSMUSG00000083767 8.5413e-01
## ENSMUSG00000036959 3.4132e-04
## ENSMUSG00000042678          NA
summary(res_Treatment a)
## 
## out of 24038 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 2002, 8.3%
## LFC < 0 (down)     : 2421, 10%
## outliers [1]       : 17, 0.071%
## low counts [2]     : 9315, 39%
## (mean count < 25)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# QUESTIONS:
# I SET OTHER CONTRASTS FOR COMPARISIONS AT THE SAME STEP WHERE I GENERATED Treatment a VS CONTROL?
# I use these contrasts to generate MA plots, differentially expressed genes 
and other data visualizations individually for each contrast?

Hi Michael, Thank you for the input.
Dear Michael Love . Some doubts on this question
1) Correct that as long as the contrasts are specified there's no need to set the factor level
2) If we were talking about multiple treatments and multiple time points, the way to go would be the same ?
e.g. sampleTable
3) is that important to filter genes like in the step described below? In the vignette it's written that this step can be skipped before running Deseq functions.
Thanks in advance
1)
https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#note-on-factor-levels
2)
This is one approach.
3)
It can be skipped but there's no harm in removing low count genes, we usually use something like:
where
Xis the smallest group size in the experiment.