Hello,
I'm analysing an RNA-Seq dataset where I have two cell lines (WT and mutant) and nine stimulation conditions. I want to find the condition effects for the mutant cell line (ie which genes are differentially expressed between the mutant and WT cells in each of the nine conditions). When I run the analysis two different ways that I believe should be basically equivalent, I get very different results.
When I run DEseq2 using the design formula ~ group (as illustrated on p. 24 of the Oct 13, 2015 manual, where group is a combination of the experimental factors cell_line and stimulant) to compare expression between mutant and WT under the control condition, I get 483 differentially expressed genes by using a cutoff of FC>= +/- 1.5 and adj. p-value <= 0.05 between two "groups". I used the following code in order to get these results:
#start R R #read in your csv sample sheet samples = read.csv("sample_meta_data.csv",header=TRUE) #start edgeR, which you will use to collapse replicates and make the count matrix library("edgeR") #read in count matrix counts = read.csv("combined_counts.csv",header=TRUE,row.names=1) #filter counts to remove low reads noint = rownames(counts) %in% c("__no_feature","__ambiguous","__too_low_aQual", "__not_aligned","__alignment_not_unique") cpms = cpm(counts) keep = rowSums(cpms >1) >=3 & !noint counts = counts[keep,] #attach the column data from the sample sheet to the variable coldata coldata = with(samples,data.frame(shortname = I(sample_name), cell_line = cell_line, stimulant = stimulant, screen = screen)) #start DESeq2 library("DESeq2") #construct your DESeq2 data set, making sure to specify the design matrix here dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ stimulant) #rename columns in dds so they match the sample metadata sheet colnames(dds) <- coldata$shortname #add the combined factor group to dds, reset design to ~group dds$group <- factor(paste0(dds$stimulant, dds$cell_line)) design(dds) <- ~group #relevel controls dds$stimulant <- relevel(dds$stimulant, "control_4hr") dds$cell_line <- relevel(dds$cell_line, "WT") #run DESeq2 on the dataset dds <- DESeq(dds, parallel=TRUE) #print out results names resultsNames(dds) #find DE genes between mutant control and WT control resdds <- results(dds, contrast = c('group' , 'mutantcontrol', 'WTcontrol'))
However, when I delete all stimulation conditions from my dataset except for control and use the design ~ cell_type, I get 62 differentially expressed genes using the same cutoffs. I use the following code to get these results:
#start R R #read in your csv sample sheet samples = read.csv("sample_meta_data.csv",header=TRUE) #start edgeR, which you will use to collapse replicates and make the count matrix library("edgeR") #read in count matrix counts = read.csv("combined_counts.csv",header=TRUE,row.names=1) #filter counts to remove low reads noint = rownames(counts) %in% c("__no_feature","__ambiguous","__too_low_aQual", "__not_aligned","__alignment_not_unique") cpms = cpm(counts) keep = rowSums(cpms >1) >=3 & !noint counts = counts[keep,] #attach the column data from the sample sheet to the variable coldata coldata = with(samples,data.frame(shortname = I(sample_name), cell_line = cell_line, stimulant = stimulant, screen = screen)) #start DESeq2 library("DESeq2") #construct your DESeq2 data set, making sure to specify the design matrix here dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ cell_line) #rename columns in dds so they match the sample metadata sheet colnames(dds) <- coldata$shortname #relevel controls dds$cell_line <- relevel(dds$cell_line, "WT") #run DESeq2 on the dataset dds <- DESeq(dds, parallel=TRUE) #print out results names resultsNames(dds) #find DE genes between mutant control and WT control resdds <- results(dds, contrast = c('cell_line' , 'mutant', 'WT'))
I'm not sure why this is happening. I understand that the first dataset is larger, and this has an effect on the statistical calculations of the DESeq2 package, but I thought that essentially the two comparisons that I'm performing are identical. Could you please explain why the results of these two analyses are so different?
Thanks!
Erin
This is the output of sessionInfo()
R version 3.2.2 (2015-08-14) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.3 LTS locale: [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 [7] LC_PAPER=en_CA.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] DESeq2_1.10.0 RcppArmadillo_0.6.200.2.0 [3] Rcpp_0.12.1 SummarizedExperiment_1.0.1 [5] Biobase_2.30.0 GenomicRanges_1.22.1 [7] GenomeInfoDb_1.6.1 IRanges_2.4.1 [9] S4Vectors_0.8.1 BiocGenerics_0.16.1 [11] edgeR_3.12.0 limma_3.26.2 loaded via a namespace (and not attached): [1] RColorBrewer_1.1-2 futile.logger_1.4.1 plyr_1.8.3 [4] XVector_0.10.0 futile.options_1.0.0 tools_3.2.2 [7] zlibbioc_1.16.0 rpart_4.1-10 digest_0.6.8 [10] RSQLite_1.0.0 annotate_1.48.0 gtable_0.1.2 [13] lattice_0.20-33 DBI_0.3.1 proto_0.3-10 [16] gridExtra_2.0.0 genefilter_1.52.0 cluster_2.0.3 [19] stringr_1.0.0 locfit_1.5-9.1 nnet_7.3-11 [22] grid_3.2.2 AnnotationDbi_1.32.0 XML_3.98-1.3 [25] survival_2.38-3 BiocParallel_1.4.0 foreign_0.8-66 [28] latticeExtra_0.6-26 Formula_1.2-1 geneplotter_1.48.0 [31] ggplot2_1.0.1 reshape2_1.4.1 lambda.r_1.1.7 [34] magrittr_1.5 scales_0.3.0 Hmisc_3.17-0 [37] splines_3.2.2 MASS_7.3-44 xtable_1.8-0 [40] colorspace_1.2-6 stringi_1.0-1 acepack_1.3-3.3 [43] munsell_0.4.2