Question

Data handling in DESeq2

0

Entering edit mode

bioboy • 0

@bioboy-9172

Last seen 5.7 years ago

United States

Hello,

Recently, I got RNA-Seq data and analyzed to find interesting genes in our mice model. While I was using DESeq2, I got some questions described below.

We have 4 experiment groups, such as Disease models at 8 weeks (D8) and at 16 weeks (D16) and Control model at 8 weeks (C8) and 16 weeks (C16). Each of groups have several biological replicates. Our goal is to find DEG by disease model (D8 vs. D16), not by aging effect (C8 vs. C16). Therefore, we obtained two DEG lists and did DEG(D8 vs. D16) — DEG(C8 vs. C16). To do this analysis, I used nbinomWaldTest of DESeq2 and I got a different p-values and logfc from below two different analysis strategies.

I am not sure which strategy is better in DESeq2. To me, the first strategy makes more sense than the second one. Could you help me to choose better analysis way using DESeq2? Also, we found logfc of many genes in DESeq2 analysis results are different from logfc we manually calculated from RPKMs normalized by DESeq2. I guess this is because DESeq2 reports estimated logfc. However, sometimes sign (- or + sign) is changed. Could you help us to understand why sign can be changed?

Analysis strategies

1. Read whole 4 groups together and compare C8 vs. C16 and D8 vs. D16 separately like below R code. In this strategy, I assumed those 4 groups are one dataset, so I normalized and estimated dispersions together, then I compared separately using 'contrast' option of 'result' function. When I read a document of "results" function of DESeq2, "contrast" option can extract specific comparisons from the model. However, "logfc" of results extracted by "contrast" is different from what I manually calculated from DESeq2's normalized RPKMs. Sometimes +/- sign is changed.

# Read whole groups together

countData <- Whole data frame

colData <- Whole data information

dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design=~group)

dds_norm <- estimateSizeFactors(dds)

sizeFactors(dds_norm)

normalized_countData <- counts(dds_norm, normalized=TRUE)

dds_estimated <- estimateDispersions(dds_norm)

model <- nbinomWaldTest(dds_estimated)

# Within Disease model

res <- results(model, contrast=c("group”,"D16”,"D8"))

tmp_df <- data.frame(res@listData)

row.names(tmp_df) <- res@rownames

write.csv(tmp_df, "./DESeq2_results/Wald_D.csv")

plotMA(res, alpha=0.05, main="Wald_Within_D") # MA plot

# Within Wide types

res2 <- results(model, contrast=c("group”,"C16”,"C8"))

tmp_df <- data.frame(res2@listData)

row.names(tmp_df) <- res2@rownames

write.csv(tmp_df, "./DESeq2_results/Wald_C.csv")

plotMA(res2, alpha=0.05, main="Wald_Within_C") # MA plot

2. Read separately and compare separately. I separated D8 and D16 from whole dataset and did same to C8 and C16, then normalized, estimated dispersions and compared separately.

# Read Disease model data only

countData_D <- Disease model only

colData_D <- Disease model only

# Comparison within Disease model

dds <- DESeqDataSetFromMatrix(countData = countData_D, colData = colData_D, design=~group)

dds_norm <- estimateSizeFactors(dds)

sizeFactors(dds_norm)

normalized_countData <- counts(dds_norm, normalized=TRUE)

dds_estimated <- estimateDispersions(dds_norm)

model3 <- nbinomWaldTest(dds_estimated)

res3 <- results(model3)

tmp_df <- data.frame(res3@listData)

row.names(tmp_df) <- res3@rownames

write.csv(tmp_df, "./DESeq2_results/Wald_Within_D.csv”)

plotMA(res3, alpha=0.05, main="Wald_Within_D") # MA plot

# Read Control model data only

countData_C <- Control model only

colData_C <- Control model only

# Comparison within Control model

dds <- DESeqDataSetFromMatrix(countData = countData_C, colData = colData_C, design=~group)

dds_norm <- estimateSizeFactors(dds)

sizeFactors(dds_norm)

normalized_countData <- counts(dds_norm, normalized=TRUE)

dds_estimated <- estimateDispersions(dds_norm)

model4 <- nbinomWaldTest(dds_estimated)

res4 <- results(model4)

tmp_df <- data.frame(res4@listData)

row.names(tmp_df) <- res4@rownames

write.csv(tmp_df, "./DESeq2_results/Wald_Within_D.csv”)

plotMA(res4, alpha=0.05, main="Wald_Within_D") # MA plot

No errors were in there two analyses.

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (unknown)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] DESeq2_1.8.2 RcppArmadillo_0.6.200.2.0 Rcpp_0.12.1 GenomicRanges_1.20.8
[5] GenomeInfoDb_1.4.3 IRanges_2.2.9 S4Vectors_0.6.6 Biobase_2.28.0
[9] BiocGenerics_0.14.0 ggplot2_1.0.1

loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-2 futile.logger_1.4.1 plyr_1.8.3 XVector_0.8.0 futile.options_1.0.0
[6] tools_3.2.1 rpart_4.1-10 digest_0.6.8 RSQLite_1.0.0 annotate_1.46.1
[11] gtable_0.1.2 lattice_0.20-33 DBI_0.3.1 proto_0.3-10 gridExtra_2.0.0
[16] genefilter_1.50.0 stringr_1.0.0 cluster_2.0.3 locfit_1.5-9.1 grid_3.2.1
[21] nnet_7.3-11 AnnotationDbi_1.30.1 XML_3.98-1.3 survival_2.38-3 BiocParallel_1.2.22
[26] foreign_0.8-66 latticeExtra_0.6-26 Formula_1.2-1 geneplotter_1.46.0 reshape2_1.4.1
[31] lambda.r_1.1.7 magrittr_1.5 scales_0.3.0 Hmisc_3.17-0 MASS_7.3-44
[36] splines_3.2.1 xtable_1.8-0 colorspace_1.2-6 stringi_1.0-1 acepack_1.3-3.3
[41] munsell_0.4.2

deseq2 data handling comparing more than two groups normalization logfc • 1.7k views

ADD COMMENT • link updated 8.4 years ago by Michael Love 41k • written 8.4 years ago by bioboy • 0

score 1 · Answer 1 · 2015-11-13

hi,

There is not one correct approach here. In the first, the dispersions are estimated using all the samples. The benefit is that there are more samples for a more robust estimate. But this assumes that there is a common overdispersion parameter for all groups (a value linking the extra variance from biological and technical sources). We have found this assumption to nevertheless give good results for many experiments, so we typically recommend running all groups together.

The second approach you estimate dispersions for the subsets separately. The advantages and disadvantages are flipped: less samples so less robust estimates of dispersion, but if the dispersion are very different across the subsets, this approach will more estimate them more accurately.

You can use plotPCA (see vignette) to get an idea if the within-group variance is wildly different across the subsets you propose. If not, that is, if the groups do not have wildly different amount of spread of points (samples) in the PCA plot, I'd recommend the first approach.