DESeq outlier detection for unbalanced groups
akaever ▴ 30
@akaever-7380
Last seen 7.8 years ago

When the number of samples per group is unbalanced, the standard DESeq outlier replacement (minReplicatesForReplace=7) can result in drastically reduced p-values. This happens when the trimmed mean used for the replacement leaves out all samples from the smaller group. See the following example:

The first two values belong to the smaller group. The last value (in the larger group) is replaced by 154:

1272, 751, 275, 298, 113, 116, 161, 176, 294, 172, 327,  93, 108,  84, 151, 728
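To make the trimming effect concrete, here is a minimal base R sketch on the values above. It assumes a trim fraction of 0.2 (my understanding of the default in DESeq2's replaceOutliers(), so treat it as an assumption) and works on the numbers as posted. The size factors are not given in the post, so it does not try to reproduce the replacement value of 154.

```r
# Values from the example above; the first two samples form the smaller group.
counts <- c(1272, 751, 275, 298, 113, 116, 161, 176,
            294, 172, 327, 93, 108, 84, 151, 728)
group <- factor(c(rep("small", 2), rep("large", 14)))

trim <- 0.2                              # assumed trim fraction
n_drop <- floor(length(counts) * trim)   # values dropped from each tail

# Indices of the samples that survive trimming
ord <- order(counts)
kept <- ord[(n_drop + 1):(length(counts) - n_drop)]

# How many samples from each group contribute to the trimmed mean
table(group[kept])

# Trimmed mean of the posted values (no size-factor scaling applied)
mean(counts, trim = trim)
```

Both samples of the smaller group sit in the upper tail and are trimmed away, so the trimmed mean is computed from the larger group only.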

I am aware that unbalanced groups and small sample numbers should be avoided, but this happens quite often in reality ;-). I would prefer having the outlier replacement deactivated by default or a check for unbalanced groups...


You might want to experiment with edgeR's quasi-likelihood framework to mitigate the effect that outlier observations have on your differential expression statistics.

Given that you're looking at a very specific use case and have observed specific instances of behavior that might not be ideal with your current workflow, it would also be interesting and valuable to the community if you tried this and came back with a report of your findings ;-)

 

@mikelove
Last seen 2 days ago
United States

The most reasonable approach to outliers is certainly a bit of a trade-off, in terms of catching the obvious technical artifacts (what we call in the paper "extreme count outliers"), not losing control of FDR for data with just high variability, and meanwhile not reducing sensitivity when there are many samples.

The default outlier replacement procedure (detected outliers are replaced only in those groups with 7 or more samples) we feel does a reasonable thing for most designs and RNA-seq data we encounter, but it's hard to know in advance which designs it may reduce sensitivity for. While it seems like we could just add more rules onto the procedure, we don't want a rule that is too complicated to explain to users.

For this dataset and others with unbalanced designs, I'd recommend you turn off outlier replacement (minReplicatesForReplace=Inf) and outlier filtering (cooksCutoff=FALSE) and just check rows with high mcols(dds)$maxCooks by eye.
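A rough sketch of that recommendation is below. The simulated dataset from makeExampleDESeqDataSet() and the 2-vs-14 split are stand-ins I made up for illustration; swap in your own DESeqDataSet. Note that minReplicatesForReplace is an argument to DESeq(), while cooksCutoff is an argument to results().

```r
library(DESeq2)

set.seed(1)
# Simulated stand-in data (assumption); replace with your own DESeqDataSet.
dds <- makeExampleDESeqDataSet(n = 1000, m = 16)
# Hypothetical unbalanced design: 2 samples vs 14 samples
dds$condition <- factor(c(rep("A", 2), rep("B", 14)))

# Fit the model with outlier replacement turned off
dds <- DESeq(dds, minReplicatesForReplace = Inf)

# Extract results with Cook's distance filtering turned off
res <- results(dds, cooksCutoff = FALSE)

# Rows with the largest maximum Cook's distance, to inspect by eye
top <- order(mcols(dds)$maxCooks, decreasing = TRUE)[1:20]
data.frame(gene = rownames(dds)[top],
           maxCooks = mcols(dds)$maxCooks[top],
           padj = res$padj[top])

# Normalized counts for the top row, to see which sample drives it
counts(dds, normalized = TRUE)[top[1], ]
```

The cutoff of 20 rows is arbitrary; how many rows are worth checking depends on the dataset.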
