Dear Michael,
Sorry in advance for the long question. My name is Sara, and I am a PhD student doing a metatranscriptomic study in some salt ponds in Mallorca and studying the changes in gene expression at various times of the day. As I am relatively new to bioinformatics, a colleague told me about the DEApp application (https://yanli.shinyapps.io/DEApp/), where they implement your DESeq2 package. However, I like to understand what is being done, so I have tried to replicate the results using DESeq2 in R in a standalone way. Unsuccessfully.
The baseMean, pvalue or lfcSE values are quite similar, but the log2FoldChange is really different. I contacted the App developer and he told me that the code I was using seemed correct, and that maybe the problem was with DESeq2 and the new version of R (>4), where the 'results()' function does not work maybe well. Again I tried to implement it in a previous version, without success. I also tried the same code without removing the low expression tags from the counts.
I assume this is a bug in my code and was wondering if you could help me.
counts <- read.delim(paste("Raw_reads_D1_vs_All.txt",sep="/"), header=T, row.names=1, skipNul = T)
head(counts)
Day1A Day1B Day1C Avg1 Avg2 Avg3 Avg4 Avg5 Avg6 Avg7 Avg8 Avg9 Avg10 Avg11 Avg12 Avg13 Avg14
samplewt_001 0 0 0 0 0 0 2 0 2 0 1 0 1 3 0 1 1
samplewt_002 24 19 29 24 19 29 19 65 86 105 86 97 125 130 99 86 74
samplewt_003 2 1 0 2 1 0 7 3 0 3 4 2 4 3 1 4 2
samplewt_004 21 21 12 21 21 12 52 37 41 25 30 19 15 28 31 32 33
samplewt_005 29 19 16 29 19 16 15 35 16 50 53 59 34 41 25 31 32
samplewt_006 38 29 23 38 29 23 38 41 29 64 54 73 40 35 38 39 39
Avg15 Avg16 Avg17 Avg18
samplewt_001 2 4 2 1
samplewt_002 45 43 27 43
samplewt_003 2 2 2 4
samplewt_004 45 27 18 39
samplewt_005 28 25 30 22
samplewt_006 38 31 35 24
coldata <- read.delim(paste("Metatable_D1_vs_All.txt",sep=""), header=T, row.names = 1)
coldata
Treatment
Day1A Day1
Day1B Day1
Day1C Day1
Avg1 Avg
Avg2 Avg
Avg3 Avg
Avg4 Avg
Avg5 Avg
Avg6 Avg
Avg7 Avg
Avg8 Avg
Avg9 Avg
Avg10 Avg
Avg11 Avg
Avg12 Avg
Avg13 Avg
Avg14 Avg
Avg15 Avg
Avg16 Avg
Avg17 Avg
Avg18 Avg
coldata$Treatment <- factor(coldata$Treatment)
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~Treatment)
dge.count <- calcNormFactors(dds)
cpm.count <- cpm(dge.count$counts)
keep <- rowSums(cpm.count > 1) >=2
dge.count.rmlow <- dge.count[keep,]
dge.count.rmlow$samples$lib.size <- colSums(dge.count.rmlow$counts)
dge.count.rmlow <- calcNormFactors(dge.count.rmlow, lib.size = TRUE, method = "TMM")
counts2 <- dge.count.rmlow$counts
coldata <- dge.count.rmlow$samples$Treatment
colData <- data.frame(coldata)
dds <- DESeqDataSetFromMatrix(counts2, colData = colData, design = formula(~coldata))
dds <- DESeq(dds, test = "Wald")
resultsNames(dds)
[1] "Intercept" "coldata_Day1_vs_Avg"
res <- results(dds, contrast = c("coldata", "Day1","Avg"))
log2 fold change (MLE): coldata Day1 vs Avg
Wald test p-value: coldata Day1 vs Avg
DataFrame with 21692 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
samplewt_001 0.889321 -2.385359 1.442652 -1.653454 0.09823858 NA
samplewt_002 57.939492 -1.240640 0.469598 -2.641921 0.00824373 0.205107
samplewt_003 2.262310 -1.169396 0.963996 -1.213071 0.22510264 NA
samplewt_004 26.855727 -0.517509 0.339327 -1.525105 0.12723290 0.411924
samplewt_005 29.231810 -0.357366 0.368943 -0.968621 0.33273415 0.630221
... ... ... ... ... ... ...
samplewt_9993 38.88551 0.1753683 0.266028 0.659209 0.50976148 0.768292
samplewt_9994 35.78226 -0.0505332 0.266456 -0.189649 0.84958409 0.943335
samplewt_9995 45.59429 -0.7032288 0.261269 -2.691593 0.00711117 0.205107
samplewt_9996 4.82784 -0.0637269 0.557457 -0.114317 0.90898626 NA
samplewt_9999 20.25589 0.1051234 0.421189 0.249587 0.80290656 0.921288
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252
[4] LC_NUMERIC=C LC_TIME=Spanish_Spain.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_3.34.0 limma_3.48.0 DESeq2_1.32.0
[4] SummarizedExperiment_1.22.0 Biobase_2.52.0 MatrixGenerics_1.4.0
[7] matrixStats_0.59.0 GenomicRanges_1.44.0 GenomeInfoDb_1.28.0
[10] IRanges_2.26.0 S4Vectors_0.30.0 BiocGenerics_0.38.0
loaded via a namespace (and not attached):
[1] KEGGREST_1.32.0 genefilter_1.74.0 locfit_1.5-9.4 splines_4.1.0
[5] lattice_0.20-44 colorspace_2.0-1 vctrs_0.3.8 utf8_1.2.1
[9] blob_1.2.1 XML_3.99-0.6 survival_3.2-11 rlang_0.4.11
[13] pillar_1.6.1 glue_1.4.2 DBI_1.1.1 BiocParallel_1.26.0
[17] bit64_4.0.5 RColorBrewer_1.1-2 GenomeInfoDbData_1.2.6 lifecycle_1.0.0
[21] zlibbioc_1.38.0 Biostrings_2.60.0 munsell_0.5.0 gtable_0.3.0
[25] memoise_2.0.0 geneplotter_1.70.0 fastmap_1.1.0 fansi_0.5.0
[29] AnnotationDbi_1.54.0 Rcpp_1.0.6 xtable_1.8-4 scales_1.1.1
[33] cachem_1.0.5 DelayedArray_0.18.0 annotate_1.70.0 XVector_0.32.0
[37] bit_4.0.4 ggplot2_3.3.3 png_0.1-7 grid_4.1.0
[41] tools_4.1.0 bitops_1.0-7 magrittr_2.0.1 RCurl_1.98-1.3
[45] RSQLite_2.2.7 tibble_3.1.2 pkgconfig_2.0.3 crayon_1.4.1
[49] ellipsis_0.3.2 Matrix_1.3-3 httr_1.4.2 R6_2.5.0
[53] compiler_4.1.0
And here you have the same log2FoldChange for that genes through the DEApp software that I cannot replicate:
baseMean log2FoldChange lfcSE stat pvalue padj
samplewt_001 0.889314286 -0.602643368 0.373997325 -1.61135743 0.10710184 NA
samplewt_002 57.93917767 -0.931255767 0.355965285 -2.6161421 0.008892954 0.201065875
samplewt_003 2.262309455 -0.534065195 0.420066974 -1.271381062 0.203593117 NA
samplewt_004 26.85569236 -0.444635895 0.290359802 -1.531327314 0.125688518 0.405507326
samplewt_005 29.23173355 -0.299055164 0.308682478 -0.968811597 0.332639199 0.626920005
........
samplewt_9993 38.88541672 0.159459313 0.241886366 0.659232333 0.509746587 0.765425095
samplewt_9994 35.78216478 -0.045925545 0.241822378 -0.189914372 0.84937623 0.943275421
samplewt_9995 45.59412138 -0.641157931 0.236590841 -2.70998627 0.006728599 0.201065875
samplewt_9996 4.827812856 -0.044329744 0.387192212 -0.114490277 0.908849153 NA
samplewt_9997 0.466748965 -0.067189502 0.33105697 -0.2029545 0.839170596 NA
samplewt_9998 0.227260594 0.089864329 0.263303759 0.341295277 0.732881303 NA
samplewt_9999 20.25584553 0.083946598 0.337284386 0.248889666 0.803446129 0.921117579
Thank you in advance for your help and for the great package!
I totally agree with you about GUI interfaces and I trust above all DESeq2 results. However, I cannot really understand the difference, since I provide exactly the same count tables for both. I also tried omitting the filtering before providing counts to DESeq2, with the same result. I will try to contact again the developer of the App to ask if there is any change with respect to the original code in DESeq2, although I cannot see anything at first sight.
Thank you for your answer!