Dear DESeq development team,
My name is Alex SMITH and I am currently using DESeq2 to analyse mouse RNA-seq data across 2 conditions with 3 samples each.
I recently upgraded my DESeq2 version from ~1.5 to 1.6.1 (R version: 3.1.0, session info below) to access the new functionalities. I was investigating the reason why my differentially expressed gene list had increased from ~530 out of ~10 500 tested (~5%) to ~760 out of ~11 250 (~7%) when I came across something that might be a bug (but which is not, in fact, related to the change in DEG counts).
After running the DESeq command as below, I checked the maxCooks column, and it is full of NA's:
dgd <- DESeq(dgd, test="Wald", fitType="parametric", parallel=TRUE) summary( mcols(dgd)$maxCooks ) # => only NA's
I have temporarily replaced the values as calculated using the code in the vignette:
maxCooks <- apply(assays(dgd)[['cooks']],1,max) summary( maxCooks ) # => no NA's mcols(dgd)$maxCooks <- maxCooks
In my data, replacing the NA's with these calculated values does not impact the DEG count by much (less than 5 different), but this does indicate that adding these values back in does lead to outlier removal.
I have not been able to find an explanation to this column of NA's. Is this a bug? Or is it due to how I am performing the analysis?
Thanks in advance and best regards,
-- Alex
> sessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] C attached base packages: [1] parallel stats4 grDevices utils datasets graphics stats methods base other attached packages: [1] FactoMineR_1.27 BiocInstaller_1.16.0 DESeq2_1.6.1 RcppArmadillo_0.4.450.1.0 Rcpp_0.11.3 [6] GenomicRanges_1.18.1 GenomeInfoDb_1.2.0 IRanges_2.0.0 S4Vectors_0.4.0 BiocGenerics_0.12.0 [11] ellipse_0.3-8 car_2.0-21 loaded via a namespace (and not attached): [1] AnnotationDbi_1.28.0 BBmisc_1.7 BatchJobs_1.4 Biobase_2.26.0 BiocParallel_1.0.0 DBI_0.3.1 Formula_1.1-2 [8] Hmisc_3.14-5 MASS_7.3-31 RColorBrewer_1.0-5 RSQLite_0.11.4 XML_3.98-1.1 XVector_0.6.0 acepack_1.3-3.3 [15] annotate_1.44.0 base64enc_0.1-2 brew_1.0-6 checkmate_1.5.0 cluster_1.15.2 codetools_0.2-8 colorspace_1.2-4 [22] digest_0.6.4 fail_1.2 foreach_1.4.2 foreign_0.8-61 genefilter_1.48.1 geneplotter_1.44.0 ggplot2_1.0.0 [29] grid_3.1.0 gtable_0.1.2 iterators_1.0.7 labeling_0.3 lattice_0.20-29 latticeExtra_0.6-26 leaps_2.9 [36] locfit_1.5-9.1 munsell_0.4.2 nnet_7.3-8 plyr_1.8.1 proto_0.3-10 reshape2_1.4 rpart_4.1-8 [43] scales_0.2.4 scatterplot3d_0.3-35 sendmailR_1.2-1 splines_3.1.0 stringr_0.6.2 survival_2.37-7 tools_3.1.0 [50] xtable_1.7-4
Can you provide the code used to construct the dds in particular the design and show colData(dds)?
hi Alex,
Thanks for your report. I haven't been able to replicate this using the same code on an example data set with condition A and B and 3 samples each. If you can send me a reproducible example I can try to figure out what's going on.
By two conditions, do you mean condition A/B, or do you mean condition1 A/B and condition2 A/B?
Note that version 1.5 is a development version (odd second digit) which is the branch to allow developer to experiment and add new features and functionality. You can't really count on something like sum(padj < threshold) -- where p-values are tail probabilities which are highly sensitive to slight changes in the model parameters -- staying constant between a development version and a release version.
Hi Michael,
Thanks for your reply! Sorry for my delay - I've been away during the NZ bank holiday...
My main experimental setup is a treatment vs control one: I am doing the DEA of 3 "Nb"-treated samples vs 3 "PBS" control samples.
I agree that I would not expect constant DEGs counts between versions, though the change was more than I would have expected. However I'm quite surprised that I had a development version to begin with!
I have prepared some slightly-reduced code and some partial data to reproduce my observation. If you wish to run it yourself, you can find a ~300Kb file here: https://www.dropbox.com/s/r5c1v0oaubma2rj/test.tar.gz?dl=0
In essence, my "experimental setup" matrix looks like this:
My call to prepare the HTSeq count data for DESeq2 is:
I get a warning:
But this is due to how I've designed my pipeline and I can't see how this would affect anything.
The rest you already know.
Thanks again and best regards,
-- Alex