Hello,
Recently, I got RNA-Seq data and analyzed to find interesting genes in our mice model. While I was using DESeq2, I got some questions described below.
We have 4 experiment groups, such as Disease models at 8 weeks (D8) and at 16 weeks (D16) and Control model at 8 weeks (C8) and 16 weeks (C16). Each of groups have several biological replicates. Our goal is to find DEG by disease model (D8 vs. D16), not by aging effect (C8 vs. C16). Therefore, we obtained two DEG lists and did DEG(D8 vs. D16) — DEG(C8 vs. C16). To do this analysis, I used nbinomWaldTest of DESeq2 and I got a different p-values and logfc from below two different analysis strategies.
I am not sure which strategy is better in DESeq2. To me, the first strategy makes more sense than the second one. Could you help me to choose better analysis way using DESeq2? Also, we found logfc of many genes in DESeq2 analysis results are different from logfc we manually calculated from RPKMs normalized by DESeq2. I guess this is because DESeq2 reports estimated logfc. However, sometimes sign (- or + sign) is changed. Could you help us to understand why sign can be changed?
Analysis strategies
1. Read whole 4 groups together and compare C8 vs. C16 and D8 vs. D16 separately like below R code. In this strategy, I assumed those 4 groups are one dataset, so I normalized and estimated dispersions together, then I compared separately using 'contrast' option of 'result' function. When I read a document of "results" function of DESeq2, "contrast" option can extract specific comparisons from the model. However, "logfc" of results extracted by "contrast" is different from what I manually calculated from DESeq2's normalized RPKMs. Sometimes +/- sign is changed.
# Read whole groups together
countData <- Whole data frame
colData <- Whole data information
dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design=~group)
dds_norm <- estimateSizeFactors(dds)
sizeFactors(dds_norm)
normalized_countData <- counts(dds_norm, normalized=TRUE)
dds_estimated <- estimateDispersions(dds_norm)
model <- nbinomWaldTest(dds_estimated)
# Within Disease model
res <- results(model, contrast=c("group”,"D16”,"D8"))
tmp_df <- data.frame(res@listData)
row.names(tmp_df) <- res@rownames
write.csv(tmp_df, "./DESeq2_results/Wald_D.csv")
plotMA(res, alpha=0.05, main="Wald_Within_D") # MA plot
# Within Wide types
res2 <- results(model, contrast=c("group”,"C16”,"C8"))
tmp_df <- data.frame(res2@listData)
row.names(tmp_df) <- res2@rownames
write.csv(tmp_df, "./DESeq2_results/Wald_C.csv")
plotMA(res2, alpha=0.05, main="Wald_Within_C") # MA plot
2. Read separately and compare separately. I separated D8 and D16 from whole dataset and did same to C8 and C16, then normalized, estimated dispersions and compared separately.
# Read Disease model data only
countData_D <- Disease model only
colData_D <- Disease model only
# Comparison within Disease model
dds <- DESeqDataSetFromMatrix(countData = countData_D, colData = colData_D, design=~group)
dds_norm <- estimateSizeFactors(dds)
sizeFactors(dds_norm)
normalized_countData <- counts(dds_norm, normalized=TRUE)
dds_estimated <- estimateDispersions(dds_norm)
model3 <- nbinomWaldTest(dds_estimated)
res3 <- results(model3)
tmp_df <- data.frame(res3@listData)
row.names(tmp_df) <- res3@rownames
write.csv(tmp_df, "./DESeq2_results/Wald_Within_D.csv”)
plotMA(res3, alpha=0.05, main="Wald_Within_D") # MA plot
# Read Control model data only
countData_C <- Control model only
colData_C <- Control model only
# Comparison within Control model
dds <- DESeqDataSetFromMatrix(countData = countData_C, colData = colData_C, design=~group)
dds_norm <- estimateSizeFactors(dds)
sizeFactors(dds_norm)
normalized_countData <- counts(dds_norm, normalized=TRUE)
dds_estimated <- estimateDispersions(dds_norm)
model4 <- nbinomWaldTest(dds_estimated)
res4 <- results(model4)
tmp_df <- data.frame(res4@listData)
row.names(tmp_df) <- res4@rownames
write.csv(tmp_df, "./DESeq2_results/Wald_Within_D.csv”)
plotMA(res4, alpha=0.05, main="Wald_Within_D") # MA plot
No errors were in there two analyses.
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (unknown)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.8.2 RcppArmadillo_0.6.200.2.0 Rcpp_0.12.1 GenomicRanges_1.20.8
[5] GenomeInfoDb_1.4.3 IRanges_2.2.9 S4Vectors_0.6.6 Biobase_2.28.0
[9] BiocGenerics_0.14.0 ggplot2_1.0.1
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-2 futile.logger_1.4.1 plyr_1.8.3 XVector_0.8.0 futile.options_1.0.0
[6] tools_3.2.1 rpart_4.1-10 digest_0.6.8 RSQLite_1.0.0 annotate_1.46.1
[11] gtable_0.1.2 lattice_0.20-33 DBI_0.3.1 proto_0.3-10 gridExtra_2.0.0
[16] genefilter_1.50.0 stringr_1.0.0 cluster_2.0.3 locfit_1.5-9.1 grid_3.2.1
[21] nnet_7.3-11 AnnotationDbi_1.30.1 XML_3.98-1.3 survival_2.38-3 BiocParallel_1.2.22
[26] foreign_0.8-66 latticeExtra_0.6-26 Formula_1.2-1 geneplotter_1.46.0 reshape2_1.4.1
[31] lambda.r_1.1.7 magrittr_1.5 scales_0.3.0 Hmisc_3.17-0 MASS_7.3-44
[36] splines_3.2.1 xtable_1.8-0 colorspace_1.2-6 stringi_1.0-1 acepack_1.3-3.3
[41] munsell_0.4.2