DESeq2 different results from similar datasets
2
0
Entering edit mode
erin.gill81 ▴ 60
@eringill81-6831
Last seen 5.3 years ago
Canada

Hello,

I'm analysing an RNA-Seq dataset where I have two cell lines (WT and mutant) and nine stimulation conditions. I want to find the condition effects for the mutant cell line (ie which genes are differentially expressed between the mutant and WT cells in each of the nine conditions). When I run the analysis two different ways that I believe should be basically equivalent, I get very different results.

When I run DEseq2 using the design formula ~ group (as illustrated on p. 24 of the Oct 13, 2015 manual, where group is a combination of the experimental factors cell_line and stimulant) to compare expression between mutant and WT under the control condition, I get 483 differentially expressed genes by using a cutoff of FC>= +/- 1.5 and adj. p-value <= 0.05 between two "groups".  I used the following code in order to get these results:

#start R
R

#read in your csv sample sheet
samples = read.csv("sample_meta_data.csv",header=TRUE)

#start edgeR, which you will use to collapse replicates and make the count matrix
library("edgeR")

#read in count matrix
counts = read.csv("combined_counts.csv",header=TRUE,row.names=1)

#filter counts to remove low reads
noint = rownames(counts) %in%

  c("__no_feature","__ambiguous","__too_low_aQual",

  "__not_aligned","__alignment_not_unique")

cpms = cpm(counts)

keep = rowSums(cpms >1) >=3 & !noint

counts = counts[keep,]

#attach the column data from the sample sheet to the variable coldata
coldata = with(samples,data.frame(shortname = I(sample_name), cell_line = cell_line, stimulant = stimulant, screen = screen))

#start DESeq2
library("DESeq2")

#construct your DESeq2 data set, making sure to specify the design matrix here
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = coldata,
design = ~ stimulant)

#rename columns in dds so they match the sample metadata sheet
colnames(dds) <- coldata$shortname

#add the combined factor group to dds, reset design to ~group
dds$group <- factor(paste0(dds$stimulant, dds$cell_line))
design(dds) <- ~group

#relevel controls
dds$stimulant <- relevel(dds$stimulant, "control_4hr")
dds$cell_line <- relevel(dds$cell_line, "WT")

#run DESeq2 on the dataset
dds <- DESeq(dds, parallel=TRUE)

#print out results names
resultsNames(dds)

#find DE genes between mutant control and WT control
resdds <- results(dds, contrast = c('group' , 'mutantcontrol', 'WTcontrol'))

 

However, when I delete all stimulation conditions from my dataset except for control and use the design ~ cell_type, I get 62 differentially expressed genes using the same cutoffs. I use the following code to get these results:

#start R
R

#read in your csv sample sheet
samples = read.csv("sample_meta_data.csv",header=TRUE)

#start edgeR, which you will use to collapse replicates and make the count matrix
library("edgeR")

#read in count matrix
counts = read.csv("combined_counts.csv",header=TRUE,row.names=1)


#filter counts to remove low reads
noint = rownames(counts) %in%

  c("__no_feature","__ambiguous","__too_low_aQual",

  "__not_aligned","__alignment_not_unique")

cpms = cpm(counts)

keep = rowSums(cpms >1) >=3 & !noint

counts = counts[keep,]

#attach the column data from the sample sheet to the variable coldata
coldata = with(samples,data.frame(shortname = I(sample_name), cell_line = cell_line, stimulant = stimulant, screen = screen))

#start DESeq2
library("DESeq2")

#construct your DESeq2 data set, making sure to specify the design matrix here
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = coldata,
design = ~ cell_line)

#rename columns in dds so they match the sample metadata sheet
colnames(dds) <- coldata$shortname

#relevel controls
dds$cell_line <- relevel(dds$cell_line, "WT")

#run DESeq2 on the dataset
dds <- DESeq(dds, parallel=TRUE)

#print out results names
resultsNames(dds)

#find DE genes between mutant control and WT control
resdds <- results(dds, contrast = c('cell_line' , 'mutant', 'WT'))

 

I'm not sure why this is happening. I understand that the first dataset is larger, and this has an effect on the statistical calculations of the DESeq2 package, but I thought that essentially the two comparisons that I'm performing are identical. Could you please explain why the results of these two analyses are so different?

Thanks!

Erin

 

This is the output of sessionInfo()

R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] DESeq2_1.10.0              RcppArmadillo_0.6.200.2.0
 [3] Rcpp_0.12.1                SummarizedExperiment_1.0.1
 [5] Biobase_2.30.0             GenomicRanges_1.22.1      
 [7] GenomeInfoDb_1.6.1         IRanges_2.4.1             
 [9] S4Vectors_0.8.1            BiocGenerics_0.16.1       
[11] edgeR_3.12.0               limma_3.26.2              

loaded via a namespace (and not attached):
 [1] RColorBrewer_1.1-2   futile.logger_1.4.1  plyr_1.8.3          
 [4] XVector_0.10.0       futile.options_1.0.0 tools_3.2.2         
 [7] zlibbioc_1.16.0      rpart_4.1-10         digest_0.6.8        
[10] RSQLite_1.0.0        annotate_1.48.0      gtable_0.1.2        
[13] lattice_0.20-33      DBI_0.3.1            proto_0.3-10        
[16] gridExtra_2.0.0      genefilter_1.52.0    cluster_2.0.3       
[19] stringr_1.0.0        locfit_1.5-9.1       nnet_7.3-11         
[22] grid_3.2.2           AnnotationDbi_1.32.0 XML_3.98-1.3        
[25] survival_2.38-3      BiocParallel_1.4.0   foreign_0.8-66      
[28] latticeExtra_0.6-26  Formula_1.2-1        geneplotter_1.48.0  
[31] ggplot2_1.0.1        reshape2_1.4.1       lambda.r_1.1.7      
[34] magrittr_1.5         scales_0.3.0         Hmisc_3.17-0        
[37] splines_3.2.2        MASS_7.3-44          xtable_1.8-0        
[40] colorspace_1.2-6     stringi_1.0-1        acepack_1.3-3.3     
[43] munsell_0.4.2  
deseq2 differential gene expression rna-seq • 3.0k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 6 days ago
United States

This isn't an indication of anything wrong, and in fact, the results are probably not so different, just that the p-values moved slightly above your threshold. It shows how filtering on p-values can give you a variable number when you change parameters. The numbers here change, because you have removed samples which were used to estimate parameters. In particular, the dispersion estimation will depend on what samples are in the dataset. Typically it's better to have all the samples in the dataset, as this gives more robust estimates of dispersion.

One way this could have happened is: for some genes, the stimulation samples may have had lower within-group variability, and this could have brought down the dispersion estimate for those genes. When you remove those samples, the dispersion estimate may have risen. Note that even with a small change in the estimate, the p-values (and adjusted p-values) can go one way or the other past a threshold, and you observe a bigger or smaller number.

In addition, it looks like you are performing different pre-filtering here. (Note that results() function performs data-driven mean filtering automatically, so you could skip this step).

ADD COMMENT
0
Entering edit mode
erin.gill81 ▴ 60
@eringill81-6831
Last seen 5.3 years ago
Canada

Thank you so much for your explanation. I really appreciate it.

ADD COMMENT

Login before adding your answer.

Traffic: 589 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6