Question

Handling outliers in DESeq2?

0

Entering edit mode

supremerulersuraj • 0

@supremerulersuraj-12999

Last seen 5.8 years ago

I am currently using DESeq2 to analyze differential expression in some bulk RNA-seq samples, and am running into an issue that I assume is an outlier issue? I essentially have two groups with two conditions (in vitro and invivo, tbx and hcn). I perform DESeq analyses on the in vitro and in vivo samples separately (tbx vs hcn for both). The invivo group is rather small - 5 hcn samples and 4 tbx samples (the in vitro group is relatively larger, with about 40 samples each). For the in vivo sample, I find that I have significantly fewer differentially expressed genes (we'd expect them to be similar). A large number (about 12,000/25,000) of genes are marked with p-value NA, which I presume is because there was an outlier (detected by Cook's distance) throwing the error - and since the sample size is under 7 the trimmed means estimate isn't used (I should mention that this number of NA genes didn't notably change when I used the replaceOutliersWithTrimmedMeans function and minReplicates = 3).

However, even ignoring these, I am finding some strange results. For example, the CXCR4 gene has p-value 0.898 (padj 0.965), but as you can see from the attached boxplot (which plots log-transformed DESeq-scaled values), it appears by eye to be significantly different. There is, however, one major outlier in the in vivo HCN population (both noticed visually and by Cook's distance).

Is there a way to try and account for this in my small sample number dataset without going manually through genes of interest?

deseq2 sample outliers • 3.9k views

ADD COMMENT • link updated 7.0 years ago by Michael Love 41k • written 7.0 years ago by supremerulersuraj • 0

score 0 · Answer 1 · 2017-05-09

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 7 hours ago

United States

hi,

Note that you can use the plotCounts() function to make these plots, which in my opinion is better, as it let's you see the individual observations.

For this gene, you have, in HCN 4 samples at 0 and one sample at some positive value larger than all the four samples in TBX. Then it makes sense that the p-value is not small for this gene.

I would not remove outliers with only 5 and 4 samples. Is it the case that one sample is consistently an outlier?

ADD COMMENT • link 7.0 years ago Michael Love 41k

0

Entering edit mode

Thanks for the tip on plotCounts()!

This sample is not consistently an outlier. It seems most probable that that value is technical error/noise as opposed to real signal. I understand that n = 4, 5 is small, but for the sake of my understanding, is there a reason why replaceOutliersWithTrimmedMeans(minReplicates = 3) doesn't change the p-value here?

edit: Actually, I'm guessing it doesn't because that value isn't considered an outlier based on default criteria?

ADD REPLY • link 7.0 years ago supremerulersuraj • 0

0

Entering edit mode

It could be because it is not an outlier. The outlier replacement procedure is designed for replacing likely technical outliers, e.g. a count of 100,000 when all the other samples have counts in the range 0-10 is likely a technical artifact. If you are showing natural log scale above, that's a count of 7 with the other samples 0. The outlier replacement, if it went after what I would call "inconvienient samples" would make the method lose control of the false positive rate and false discovery rate.

ADD REPLY • link 7.0 years ago Michael Love 41k