Hello,
I have recently been trying to analyse 12 non-replicated metagenome samples using DESeq2 as an exploratory analysis to find genes with differing abundance to allow for inferences to be obtained with respect to GO terms and KEGG pathways associated with these genes. I have formatted a counts table from a Bowtie2 mapping into HTSeq gene region count extraction procedure which I can successfully read into DESeq2. However, when I attempt to run the program, I receive the following error (note that I'm running it in parallel, and my data set is very, very large):
mean-dispersion relationship
final dispersion estimates, MLE betas: 12 workers
fitting model and testing: 12 workers
-- replacing outliers and refitting for 2151512 genes
-- DESeq argument 'minReplicatesForReplace' = 7
=-- original counts are preserved in counts(dds)
estimating dispersions
Error in estimateDispersionsGeneEst(objectSub, quiet = quiet, modelMatrix = modelMatrix) :
the number of samples and the number of model coefficients are equal,
i.e., there are no replicates to estimate the dispersion.
use an alternate design formula
Calls: DESeq -> refitWithoutOutliers -> estimateDispersionsGeneEst
In addition: Warning message:
In checkForExperimentalReplicates(object, modelMatrix) :
same number of samples and coefficients to fit,
estimating dispersion by treating samples as replicates.
read the ?DESeq section on 'Experiments without replicates'
Execution halted
The R output including headers for my input files as well as the sessionInfo are provided below:
R version 3.2.5 (2016-04-14) -- "Very, Very Secure Dishes" Copyright (C) 2016 The R Foundation for Statistical Computing Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R.
> ############## START > > # Load deseq2 and set working directory > library(DESeq2) > library(pheatmap) > library(RColorBrewer) > library(BiocParallel) > setwd("/home/workingdir") > > register(MulticoreParam(12)) > > # Load in data > countData = as.matrix(read.table(file='combined.htseq.counts', header = TRUE, sep = "\t", stringsAsFactors = FALSE, row.names="gene_id")) > head(countData) AR DW GH1 LS OF STP1 STP2 STP11A STP12B STP13A STP15C WS OF0002336 0 8 0 0 1744 0 0 0 0 0 0 0 OF0057878 0 0 0 0 1560 0 6 2 4 0 0 0 OF0008217 0 0 0 0 1004 0 0 0 0 0 0 0 STP11A0014293 0 0 2 0 33 2 378 697 1674 294 474 0 LS0032436 0 0 0 981 0 0 0 0 0 1 0 149 OF0055859 0 0 0 0 892 0 0 0 1 1 0 0 > > colData = read.csv('colData.csv') > head(colData) condition 1 AR 2 DW 3 GH1 4 LS 5 OF 6 STP1 > > # Create DESeq2 data set > dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ condition) > dds <- dds[ rowSums(counts(dds)) > 1, ] # 1.3.6 Pre-filtering - removes genes for which only 0 or 1 reads are present > > # Reduce RAM usage > rm(countData) > rm(colData) > > # Session info > sessionInfo() R version 3.2.5 (2016-04-14) Platform: x86_64-pc-linux-gnu (64-bit) Running under: SUSE Linux Enterprise Server 11 SP4
locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base
other attached packages: [1] BiocParallel_1.4.3 RColorBrewer_1.1-2 [3] pheatmap_1.0.8 DESeq2_1.10.1 [5] RcppArmadillo_0.7.500.0.0 Rcpp_0.12.8 [7] SummarizedExperiment_1.0.2 Biobase_2.30.0 [9] GenomicRanges_1.22.4 GenomeInfoDb_1.6.3 [11] IRanges_2.4.8 S4Vectors_0.8.11 [13] BiocGenerics_0.16.1
loaded via a namespace (and not attached): [1] genefilter_1.52.1 locfit_1.5-9.1 splines_3.2.5 [4] lattice_0.20-33 colorspace_1.3-1 htmltools_0.3.5 [7] chron_2.3-47 survival_2.40-1 XML_3.98-1.5 [10] foreign_0.8-66 DBI_0.5-1 lambda.r_1.1.9 [13] plyr_1.8.4 stringr_1.1.0 zlibbioc_1.16.0 [16] munsell_0.4.3 gtable_0.2.0 futile.logger_1.4.3 [19] latticeExtra_0.6-28 knitr_1.15.1 geneplotter_1.48.0 [22] AnnotationDbi_1.32.3 htmlTable_1.7 acepack_1.4.1 [25] xtable_1.8-2 scales_0.4.1 Hmisc_4.0-0 [28] annotate_1.48.0 XVector_0.10.0 gridExtra_2.2.1 [31] ggplot2_2.2.0 digest_0.6.10 stringi_1.1.2 [34] grid_3.2.5 tools_3.2.5 magrittr_1.5 [37] lazyeval_0.2.0 tibble_1.2 RSQLite_1.0.0 [40] Formula_1.2-1 cluster_2.0.5 futile.options_1.0.0 [43] Matrix_1.2-4 data.table_1.9.6 assertthat_0.1 [46] rpart_4.1-10 nnet_7.3-12 > > # Perform the differential expression (1.4) > dds <- DESeq(dds, parallel=TRUE)
^^ This is where the code dies
If someone could help me figure out why DESeq2 is not allowing me to run without replicates (as I understand it is capable of despite encouraging you not to take any results as conclusive) that would be appreciated. Thanks.
Hi Simon,
Thanks for your response. From reading elsewhere I was given the impression that running DESeq() normally should still provide me with some level of results rather than crashing, but I may have misinterpreted these comments. I will look into how I can best analyse rlog data in as scientific a manner as possible, then.