I'm wondering why some specific genes are not being filtered out based on Cook's distance. Specifically, I'm doing a test for F1 vs F3. Both have more than 3 samples, hence outliers based on Cook's distance should be flagging this gene, but the p-values are not being set to NA. Should it be? If not, why? I know I can override it with my own filtering, but curious why this would happen. Thanks!
Edit: link to a plot of counts (http://figshare.com/articles/abo_png/1269307)
> dds <- DESeqDataSetFromHTSeqCount(sampleTableTmp, directory=htseqfiledir, design=~ covariate) > dds <- DESeq(dds, parallel=TRUE) estimating size factors estimating dispersions gene-wise dispersion estimates: 8 workers mean-dispersion relationship final dispersion estimates, MLE betas: 8 workers fitting model and testing: 8 workers -- replacing outliers and refitting for 29047 genes -- DESeq argument 'minReplicatesForReplace' = 7 -- original counts are preserved in counts(dds) estimating dispersions fitting model and testing > with(colData(dds), table(covariate)) covariate F1 F2 F3 F4 F5 F6 6 1 5 9 1 6 > res <- results(dds, contrast=c("covariate", "F1", "F3"), addMLE=TRUE) > res[ensid, ] log2 fold change (MAP): covariate F1 vs F3 Wald test p-value: covariate F1 vs F3 DataFrame with 1 row and 7 columns baseMean log2FoldChange lfcMLE lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> mygeneofinterest 58.3895 -2.311909 -5.585649 0.3041011 -7.602435 2.906089e-14 6.310572e-10 > mcols(dds, use.names=TRUE)[ensid, c("dispOutlier", "replace", "maxCooks")] DataFrame with 1 row and 3 columns dispOutlier replace maxCooks <logical> <logical> <numeric> mygeneofinterest TRUE TRUE 0.3178658 > sum(assays(dds)[["cooks"]][ensid, ] > mcols(dds, use.names=TRUE)[ensid, ]$maxCooks) 2
> sessionInfo() R version 3.1.2 (2014-10-31) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] mygene_1.0.1 GenomicFeatures_1.18.1 AnnotationDbi_1.28.0 Biobase_2.26.0 gtools_3.4.1 BiocParallel_1.0.0 [7] ggplot2_1.0.0 knitr_1.7 synapseClient_1.4-6 reshape2_1.4 plyr_1.8.1 dplyr_0.3.0.2 [13] DESeq2_1.6.1 RcppArmadillo_0.4.500.0 Rcpp_0.11.3 GenomicRanges_1.18.1 GenomeInfoDb_1.2.0 IRanges_2.0.0 [19] S4Vectors_0.4.0 BiocGenerics_0.12.0 loaded via a namespace (and not attached): [1] acepack_1.3-3.3 annotate_1.44.0 assertthat_0.1 base64enc_0.1-2 BatchJobs_1.4 BBmisc_1.7 [7] biomaRt_2.22.0 Biostrings_2.34.0 bitops_1.0-6 brew_1.0-6 checkmate_1.5.0 chron_2.3-45 [13] cluster_1.15.3 codetools_0.2-9 colorspace_1.2-4 DBI_0.3.1 digest_0.6.4 evaluate_0.5.5 [19] fail_1.2 foreach_1.4.2 foreign_0.8-61 formatR_1.0 Formula_1.1-2 genefilter_1.48.1 [25] geneplotter_1.44.0 GenomicAlignments_1.2.0 grid_3.1.2 gsubfn_0.6-6 gtable_0.1.2 Hmisc_3.14-5 [31] httr_0.5 iterators_1.0.7 jsonlite_0.9.13 labeling_0.3 lattice_0.20-29 latticeExtra_0.6-26 [37] lazyeval_0.1.9 locfit_1.5-9.1 magrittr_1.0.1 MASS_7.3-35 munsell_0.4.2 nnet_7.3-8 [43] proto_0.3-10 RColorBrewer_1.0-5 RCurl_1.95-4.3 RJSONIO_1.3-0 rpart_4.1-8 Rsamtools_1.18.0 [49] RSQLite_0.11.4 RSQLite.extfuns_0.0.1 rtracklayer_1.26.1 scales_0.2.4 sendmailR_1.2-1 splines_3.1.2 [55] sqldf_0.4-7.1 stringr_0.6.2 survival_2.37-7 tools_3.1.2 XML_3.98-1.1 xtable_1.7-4 [61] XVector_0.6.0 zlibbioc_1.12.0
Hmm! I'm actually surprised that the two samples in F3 are not the ones flagged (which indeed is the case and I shouldn't have assumed).
hi Kenneth,
Because there are 2 out of only 5 samples with high counts in that group, it appears as high within-group variability to the outlier mechanism, rather than as 2 outliers and the other 3 samples being "normal".
I see, that's what I'm gathering. Interesting because this is the most differentially expressed gene in this comparison, but given the high within-group variability I would not have expected that.
The two outliers were actually the singletons in F2 and F5:
What steps might you recommend to account for this kind of difference? Set a lower `minReplicateForReplace`? Or, post-fit filtering?
Thanks again for your feedback.
The singletons won't be replaced or filtered (nor will samples in groups with only 2 replicates). You can find the replaced count by comparing the replaced counts and the original counts. See the paragraph in the Details section of ?DESeq on how to access these matrices.
I'm familiar with the sections you're referring to. I was just curious why the Cook's distance would be so large for the singletons (even though there are no downstream consequences, and I'm not doing comparisons with those). I suppose I should remove those factor levels before fitting instead of when deciding which Wald tests I'll do!
It was unfortunately coincidental that there were two 'outliers', and two different samples that looked like outliers.
Thanks again!
Hi Kenneth,
Cook's distance is a scaled version of the distance the beta vector (the intercept and log fold changes) moves if the sample was removed. If you have a group with a single sample, one dimension of the beta vector will be the fold change involving that group. So the actual value of the distance is probably undefined (and we don't use these values in the software) because the fold change cannot be defined in that dimension when the only sample in that group is removed.