DESeq2 outlier detection not filtering

Entering edit mode

kenneth.daily • 0

@kennethdaily-7167

Last seen 10.2 years ago

United States

I'm wondering why some specific genes are not being filtered out based on Cook's distance. Specifically, I'm doing a test for F1 vs F3. Both have more than 3 samples, hence outliers based on Cook's distance should be flagging this gene, but the p-values are not being set to NA. Should it be? If not, why? I know I can override it with my own filtering, but curious why this would happen. Thanks!

Edit: link to a plot of counts (http://figshare.com/articles/abo_png/1269307)

> dds <- DESeqDataSetFromHTSeqCount(sampleTableTmp,
                                    directory=htseqfiledir,
                                    design=~ covariate)
> dds <- DESeq(dds, parallel=TRUE)
estimating size factors
estimating dispersions
gene-wise dispersion estimates: 8 workers
mean-dispersion relationship
final dispersion estimates, MLE betas: 8 workers
fitting model and testing: 8 workers
-- replacing outliers and refitting for 29047 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing

> with(colData(dds), table(covariate))
covariate
          F1         F2 F3          F4        F5      F6
            6             1             5             9             1             6

> res <- results(dds, contrast=c("covariate", "F1", "F3"), addMLE=TRUE)
> res[ensid, ]
log2 fold change (MAP): covariate F1 vs F3
Wald test p-value: covariate F1 vs F3
DataFrame with 1 row and 7 columns
                   baseMean log2FoldChange    lfcMLE     lfcSE      stat       pvalue         padj
                  <numeric>      <numeric> <numeric> <numeric> <numeric>    <numeric>    <numeric>
mygeneofinterest   58.3895      -2.311909 -5.585649 0.3041011 -7.602435 2.906089e-14 6.310572e-10

> mcols(dds, use.names=TRUE)[ensid, c("dispOutlier", "replace", "maxCooks")]
DataFrame with 1 row and 3 columns
                  dispOutlier   replace  maxCooks
                    <logical> <logical> <numeric>
mygeneofinterest        TRUE      TRUE 0.3178658

> sum(assays(dds)[["cooks"]][ensid, ] > mcols(dds, use.names=TRUE)[ensid, ]$maxCooks)
2

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] mygene_1.0.1            GenomicFeatures_1.18.1  AnnotationDbi_1.28.0    Biobase_2.26.0          gtools_3.4.1            BiocParallel_1.0.0     
 [7] ggplot2_1.0.0           knitr_1.7               synapseClient_1.4-6     reshape2_1.4            plyr_1.8.1              dplyr_0.3.0.2          
[13] DESeq2_1.6.1            RcppArmadillo_0.4.500.0 Rcpp_0.11.3             GenomicRanges_1.18.1    GenomeInfoDb_1.2.0      IRanges_2.0.0          
[19] S4Vectors_0.4.0         BiocGenerics_0.12.0    

loaded via a namespace (and not attached):
 [1] acepack_1.3-3.3         annotate_1.44.0         assertthat_0.1          base64enc_0.1-2         BatchJobs_1.4           BBmisc_1.7             
 [7] biomaRt_2.22.0          Biostrings_2.34.0       bitops_1.0-6            brew_1.0-6              checkmate_1.5.0         chron_2.3-45           
[13] cluster_1.15.3          codetools_0.2-9         colorspace_1.2-4        DBI_0.3.1               digest_0.6.4            evaluate_0.5.5         
[19] fail_1.2                foreach_1.4.2           foreign_0.8-61          formatR_1.0             Formula_1.1-2           genefilter_1.48.1      
[25] geneplotter_1.44.0      GenomicAlignments_1.2.0 grid_3.1.2              gsubfn_0.6-6            gtable_0.1.2            Hmisc_3.14-5           
[31] httr_0.5                iterators_1.0.7         jsonlite_0.9.13         labeling_0.3            lattice_0.20-29         latticeExtra_0.6-26    
[37] lazyeval_0.1.9          locfit_1.5-9.1          magrittr_1.0.1          MASS_7.3-35             munsell_0.4.2           nnet_7.3-8             
[43] proto_0.3-10            RColorBrewer_1.0-5      RCurl_1.95-4.3          RJSONIO_1.3-0           rpart_4.1-8             Rsamtools_1.18.0       
[49] RSQLite_0.11.4          RSQLite.extfuns_0.0.1   rtracklayer_1.26.1      scales_0.2.4            sendmailR_1.2-1         splines_3.1.2          
[55] sqldf_0.4-7.1           stringr_0.6.2           survival_2.37-7         tools_3.1.2             XML_3.98-1.1            xtable_1.7-4           
[61] XVector_0.6.0           zlibbioc_1.12.0

DESeq2 outlier RNASeq • 4.8k views

ADD COMMENT • link updated 10.2 years ago by Michael Love 43k • written 10.2 years ago by kenneth.daily • 0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 3 hours ago

United States

hi Kenneth,

It looks like there was an outlier in the group with 9 samples, and it was replaced (see the note during the DESeq() run). This is indicated with the column replace set to TRUE. There is more information about this replacement behavior in the details of ?DESeq. If you prefer filtering alone, you can set DESeq()'s minReplicateForReplace=Inf. You can also set results()'s cooksCutoff=FALSE and perform manual inspection of those genes with large mcols(dds)$maxCooks.

Note that dispOutlier is indicating something other than the Cook's distance outliers. The dispersion outliers are genes whose dispersion is not shrunk because they more than 2 SDs above the trend. Cook's filtering is a separate step regarding the individual counts and their influence on the log fold change vector.

ADD COMMENT • link 10.2 years ago Michael Love 43k

Entering edit mode

Hmm! I'm actually surprised that the two samples in F3 are not the ones flagged (which indeed is the case and I shouldn't have assumed).

ADD REPLY • link 10.2 years ago kenneth.daily • 0

Entering edit mode

hi Kenneth,

Because there are 2 out of only 5 samples with high counts in that group, it appears as high within-group variability to the outlier mechanism, rather than as 2 outliers and the other 3 samples being "normal".

ADD REPLY • link 10.2 years ago Michael Love 43k

Entering edit mode

I see, that's what I'm gathering. Interesting because this is the most differentially expressed gene in this comparison, but given the high within-group variability I would not have expected that.

The two outliers were actually the singletons in F2 and F5:

sample	count	cooks	covariate
5	4	4.44E+008	F2
9	6	6.43E+008	F5
1	7	3.04E-004	F6
2	1	7.56E-004	F4
3	7	8.25E-004	F1
4	572	4.51E-002	F3
6	6	3.35E-004	F4
7	3	8.59E-004	F1
8	7	3.56E-004	F4
10	1297	3.18E-001	F3
11	2	7.51E-004	F4
12	4	4.82E-004	F4
13	8	2.85E-004	F4
14	5	1.06E-003	F1
15	12	5.49E-004	F6
16	1	1.04E-003	F1
17	3	6.06E-004	F1
18	3	7.33E-004	F4
19	12	1.08E-004	F4
20	5	1.99E-003	F3
21	6	7.76E-004	F6
22	11	3.01E-005	F1
23	9	1.47E-004	F4
24	5	1.92E-003	F3
25	11	5.22E-004	F6
26	8	8.63E-004	F6
27	3	2.00E-003	F3
28	8	5.74E-004	F6

view raw deseqoutlier.tsv hosted with ❤ by GitHub

What steps might you recommend to account for this kind of difference? Set a lower `minReplicateForReplace`? Or, post-fit filtering?

Thanks again for your feedback.

ADD REPLY • link 10.2 years ago kenneth.daily • 0

Entering edit mode

The singletons won't be replaced or filtered (nor will samples in groups with only 2 replicates). You can find the replaced count by comparing the replaced counts and the original counts. See the paragraph in the Details section of ?DESeq on how to access these matrices.

ADD REPLY • link 10.2 years ago Michael Love 43k

Entering edit mode

I'm familiar with the sections you're referring to. I was just curious why the Cook's distance would be so large for the singletons (even though there are no downstream consequences, and I'm not doing comparisons with those). I suppose I should remove those factor levels before fitting instead of when deciding which Wald tests I'll do!

It was unfortunately coincidental that there were two 'outliers', and two different samples that looked like outliers.

Thanks again!

ADD REPLY • link 10.2 years ago kenneth.daily • 0

Entering edit mode

Hi Kenneth,

Cook's distance is a scaled version of the distance the beta vector (the intercept and log fold changes) moves if the sample was removed. If you have a group with a single sample, one dimension of the beta vector will be the fold change involving that group. So the actual value of the distance is probably undefined (and we don't use these values in the software) because the fold change cannot be defined in that dimension when the only sample in that group is removed.

ADD REPLY • link 10.2 years ago Michael Love 43k