Hi,
I am running an RNA Seq analysis for differential gene expressions for the data described as follows with biological replicates: 3 Control, 3 Treatment a, 4 Treatment b and 3 Treatment c.
I need to make the following comparisons: Treatment a vs Control, Treatment b vs Control, Treatment c vs Control and Treatment b vs Treatment c. I am confused as to where should I write the contrasts? Can you please explain the difference between factor levels and contrasts?
Thank you for your help.
# load libraries
library(DESeq2)
# Step 1: preparing the count data
# read in counts data
data <- read.csv('gene_count_matrix.csv', header = T, row.names = 1L) # Don't forget to put row names as ENSEMBL IDs, else code will not run
head(data)
## Control_1 Control_2 Control_3 Treatment a_1
## ENSMUSG00000028180 397 372 433 442
## ENSMUSG00000028182 39 21 59 31
## ENSMUSG00000028185 0 0 0 0
## ENSMUSG00000028184 884 991 735 903
## ENSMUSG00000028187 11 54 11 146
## ENSMUSG00000028186 0 0 0 3
## Treatment a_2 Treatment a_3 Treatment b_1 Treatment b_2
## ENSMUSG00000028180 427 473 397 318
## ENSMUSG00000028182 9 6 6 4
## ENSMUSG00000028185 0 0 0 0
## ENSMUSG00000028184 814 771 110 154
## ENSMUSG00000028187 142 173 13 22
## ENSMUSG00000028186 0 0 8 0
## Treatment b_3 Treatment b_4 Treatment c_1 Treatment c_2
## ENSMUSG00000028180 35 31 93 44
## ENSMUSG00000028182 38 24 40 42
## ENSMUSG00000028185 0 0 0 0
## ENSMUSG00000028184 926 857 1018 1960
## ENSMUSG00000028187 152 196 179 160
## ENSMUSG00000028186 0 0 4 4
## Treatment c_3
## ENSMUSG00000028180 3453
## ENSMUSG00000028182 38
## ENSMUSG00000028185 0
## ENSMUSG00000028184 9514
## ENSMUSG00000028187 1663
## ENSMUSG00000028186 0
# read in sample info
pheno <- read.csv('pheno_sheet.csv', header = T, row.names = 1L)
# making sure the row names in colData matches to column names in counts_data
all(colnames(data) %in% rownames(pheno))
## [1] TRUE
# are they in the same order?
all(colnames(data) == rownames(pheno))
## [1] TRUE
# Step 2: construct a DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = data,
colData = pheno,
design = ~ Sample)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
dds # n = 55,401
## class: DESeqDataSet
## dim: 55401 13
## metadata(1): version
## assays(1): counts
## rownames(55401): ENSMUSG00000028180 ENSMUSG00000028182 ...
## ENSMUSG00000036958 ENSMUSG00000042678
## rowData names(0):
## colnames(13): Control_1 Control_2 ... Treatment c_2 Treatment c_3
## colData names(1): Sample
# pre-filtering: removing rows with low gene counts
# This is recommended step as it helps reduce the size of the dds object and increase speed of computation
# keeping rows that have at least 10 reads total
keep <- rowSums(counts(dds)) >= 10 # >=10 because recommended in the vignette
dds <- dds[keep,]
dds # n= 24,038
## class: DESeqDataSet
## dim: 24038 13
## metadata(1): version
## assays(1): counts
## rownames(24038): ENSMUSG00000028180 ENSMUSG00000028182 ...
## ENSMUSG00000036959 ENSMUSG00000042678
## rowData names(0):
## colnames(13): Control_1 Control_2 ... Treatment c_2 Treatment c_3
## colData names(1): Sample
# set the factor level
dds$Sample<- relevel(dds$Sample, ref = "Control") # If the reference level is not mentioned, then it will automatically choose the reference alphabetically
dds$Sample
## [1] Control Control Control Treatment a Treatment a Treatment a Treatment b Treatment b Treatment b
## [10] Treatment b Treatment c Treatment c Treatment c
## Levels: Control Treatment a Treatment b Treatment c
# Step 3: Run DESeq
dds <- DESeq(dds)
## estimating size factors
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
## final dispersion estimates
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
## fitting model and testing
res_Treatment a <- results(dds, contrast = c("Sample", "Treatment a", "Control"))
res_Treatment a
## log2 fold change (MLE): Sample Treatment a vs Control
## Wald test p-value: Sample Treatment a vs Control
## DataFrame with 24038 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000028180 3862.6757 -0.057747 0.070335 -0.756516 0.4286236
## ENSMUSG00000028182 41.9097 -0.017546 0.488878 -0.035134 0.9708700
## ENSMUSG00000028184 934.53415 0.213321 0.04522 2.638894 0.0081781
## ENSMUSG00000028187 163.88391 0.178966 0.095555 1.526086 0.1618069
## ENSMUSG00000028186 1.4051 3.8020443 3.816178 0.961816 0.3116193
## ... ... ... ... ... ...
## ENSMUSG00000106108 0.8605 0.000000 4.620361 0.00000 1.00000e+00
## ENSMUSG00000042675 5026.25958 -0.33392 0.069803 -5.128677 2.9785e-07
## ENSMUSG00000083767 50.466359 0.1324 0.435256 0.9007 7.3404e-01
## ENSMUSG00000036959 126.95391 -0.55607 0.126976 -4.27439 2.6366e-05
## ENSMUSG00000042678 3.27304 5.45275 3.965736 1.75693 1.8917e-01
## padj
## <numeric>
## ENSMUSG00000028180 0.647043
## ENSMUSG00000028182 0.985538
## ENSMUSG00000028184 0.03958
## ENSMUSG00000028187 0.3520263
## ENSMUSG00000028186 NA
## ... ...
## ENSMUSG00000106108 NA
## ENSMUSG00000042675 8.26780-06
## ENSMUSG00000083767 8.5413e-01
## ENSMUSG00000036959 3.4132e-04
## ENSMUSG00000042678 NA
summary(res_Treatment a)
##
## out of 24038 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 2002, 8.3%
## LFC < 0 (down) : 2421, 10%
## outliers [1] : 17, 0.071%
## low counts [2] : 9315, 39%
## (mean count < 25)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# QUESTIONS:
# I SET OTHER CONTRASTS FOR COMPARISIONS AT THE SAME STEP WHERE I GENERATED Treatment a VS CONTROL?
# I use these contrasts to generate MA plots, differentially expressed genes
and other data visualizations individually for each contrast?
Hi Michael, Thank you for the input.
Dear Michael Love . Some doubts on this question
1) Correct that as long as the contrasts are specified there's no need to set the factor level
2) If we were talking about multiple treatments and multiple time points, the way to go would be the same ?
e.g. sampleTable
3) is that important to filter genes like in the step described below? In the vignette it's written that this step can be skipped before running Deseq functions.
Thanks in advance
1)
https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#note-on-factor-levels
2)
This is one approach.
3)
It can be skipped but there's no harm in removing low count genes, we usually use something like:
where
X
is the smallest group size in the experiment.