DESeq2 outlier detection not filtering
1
0
Entering edit mode
@kennethdaily-7167
Last seen 9.9 years ago
United States

I'm wondering why some specific genes are not being filtered out based on Cook's distance. Specifically, I'm doing a test for F1 vs F3. Both have more than 3 samples, hence outliers based on Cook's distance should be flagging this gene, but the p-values are not being set to NA. Should it be? If not, why? I know I can override it with my own filtering, but curious why this would happen. Thanks!

Edit: link to a plot of counts (http://figshare.com/articles/abo_png/1269307)

> dds <- DESeqDataSetFromHTSeqCount(sampleTableTmp,
                                    directory=htseqfiledir,
                                    design=~ covariate)
> dds <- DESeq(dds, parallel=TRUE)
estimating size factors
estimating dispersions
gene-wise dispersion estimates: 8 workers
mean-dispersion relationship
final dispersion estimates, MLE betas: 8 workers
fitting model and testing: 8 workers
-- replacing outliers and refitting for 29047 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing

> with(colData(dds), table(covariate))
covariate
          F1         F2 F3          F4        F5      F6
            6             1             5             9             1             6

> res <- results(dds, contrast=c("covariate", "F1", "F3"), addMLE=TRUE)
> res[ensid, ]
log2 fold change (MAP): covariate F1 vs F3
Wald test p-value: covariate F1 vs F3
DataFrame with 1 row and 7 columns
                   baseMean log2FoldChange    lfcMLE     lfcSE      stat       pvalue         padj
                  <numeric>      <numeric> <numeric> <numeric> <numeric>    <numeric>    <numeric>
mygeneofinterest   58.3895      -2.311909 -5.585649 0.3041011 -7.602435 2.906089e-14 6.310572e-10

> mcols(dds, use.names=TRUE)[ensid, c("dispOutlier", "replace", "maxCooks")]
DataFrame with 1 row and 3 columns
                  dispOutlier   replace  maxCooks
                    <logical> <logical> <numeric>
mygeneofinterest        TRUE      TRUE 0.3178658

> sum(assays(dds)[["cooks"]][ensid, ] > mcols(dds, use.names=TRUE)[ensid, ]$maxCooks)
2

 

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] mygene_1.0.1            GenomicFeatures_1.18.1  AnnotationDbi_1.28.0    Biobase_2.26.0          gtools_3.4.1            BiocParallel_1.0.0     
 [7] ggplot2_1.0.0           knitr_1.7               synapseClient_1.4-6     reshape2_1.4            plyr_1.8.1              dplyr_0.3.0.2          
[13] DESeq2_1.6.1            RcppArmadillo_0.4.500.0 Rcpp_0.11.3             GenomicRanges_1.18.1    GenomeInfoDb_1.2.0      IRanges_2.0.0          
[19] S4Vectors_0.4.0         BiocGenerics_0.12.0    

loaded via a namespace (and not attached):
 [1] acepack_1.3-3.3         annotate_1.44.0         assertthat_0.1          base64enc_0.1-2         BatchJobs_1.4           BBmisc_1.7             
 [7] biomaRt_2.22.0          Biostrings_2.34.0       bitops_1.0-6            brew_1.0-6              checkmate_1.5.0         chron_2.3-45           
[13] cluster_1.15.3          codetools_0.2-9         colorspace_1.2-4        DBI_0.3.1               digest_0.6.4            evaluate_0.5.5         
[19] fail_1.2                foreach_1.4.2           foreign_0.8-61          formatR_1.0             Formula_1.1-2           genefilter_1.48.1      
[25] geneplotter_1.44.0      GenomicAlignments_1.2.0 grid_3.1.2              gsubfn_0.6-6            gtable_0.1.2            Hmisc_3.14-5           
[31] httr_0.5                iterators_1.0.7         jsonlite_0.9.13         labeling_0.3            lattice_0.20-29         latticeExtra_0.6-26    
[37] lazyeval_0.1.9          locfit_1.5-9.1          magrittr_1.0.1          MASS_7.3-35             munsell_0.4.2           nnet_7.3-8             
[43] proto_0.3-10            RColorBrewer_1.0-5      RCurl_1.95-4.3          RJSONIO_1.3-0           rpart_4.1-8             Rsamtools_1.18.0       
[49] RSQLite_0.11.4          RSQLite.extfuns_0.0.1   rtracklayer_1.26.1      scales_0.2.4            sendmailR_1.2-1         splines_3.1.2          
[55] sqldf_0.4-7.1           stringr_0.6.2           survival_2.37-7         tools_3.1.2             XML_3.98-1.1            xtable_1.7-4           
[61] XVector_0.6.0           zlibbioc_1.12.0        

 

DESeq2 outlier RNASeq • 4.7k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 1 hour ago
United States

hi Kenneth,

It looks like there was an outlier in the group with 9 samples, and it was replaced (see the note during the DESeq() run). This is indicated with the column replace set to TRUE. There is more information about this replacement behavior in the details of ?DESeq. If you prefer filtering alone, you can set DESeq()'s minReplicateForReplace=Inf. You can also set results()'s cooksCutoff=FALSE and perform manual inspection of those genes with large mcols(dds)$maxCooks.

Note that dispOutlier is indicating something other than the Cook's distance outliers. The dispersion outliers are genes whose dispersion is not shrunk because they more than 2 SDs above the trend. Cook's filtering is a separate step regarding the individual counts and their influence on the log fold change vector. 

 

ADD COMMENT
0
Entering edit mode

Hmm! I'm actually surprised that the two samples in F3 are not the ones flagged (which indeed is the case and I shouldn't have assumed).

ADD REPLY
0
Entering edit mode

hi Kenneth, 

Because there are 2 out of only 5 samples with high counts in that group, it appears as high within-group variability to the outlier mechanism, rather than as 2 outliers and the other 3 samples being "normal".

ADD REPLY
0
Entering edit mode

I see, that's what I'm gathering. Interesting because this is the most differentially expressed gene in this comparison, but given the high within-group variability I would not have expected that.

The two outliers were actually the singletons in F2 and F5:

What steps might you recommend to account for this kind of difference? Set a lower `minReplicateForReplace`? Or, post-fit filtering?

Thanks again for your feedback.

ADD REPLY
0
Entering edit mode

The singletons won't be replaced or filtered (nor will samples in groups with only 2 replicates). You can find the replaced count by comparing the replaced counts and the original counts. See the paragraph in the Details section of ?DESeq on how to access these matrices.

ADD REPLY
0
Entering edit mode

I'm familiar with the sections you're referring to. I was just curious why the Cook's distance would be so large for the singletons (even though there are no downstream consequences, and I'm not doing comparisons with those). I suppose I should remove those factor levels before fitting instead of when deciding which Wald tests I'll do!

It was unfortunately coincidental that there were two 'outliers', and two different samples that looked like outliers.

Thanks again!

ADD REPLY
0
Entering edit mode

Hi Kenneth,

Cook's distance is a scaled version of the distance the beta vector (the intercept and log fold changes) moves if the sample was removed. If you have a group with a single sample, one dimension of the beta vector will be the fold change involving that group. So the actual value of the distance is probably undefined (and we don't use these values in the software) because the fold change cannot be defined in that dimension when the only sample in that group is removed.

ADD REPLY

Login before adding your answer.

Traffic: 628 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6