Question

DESeq2 minReplicatesForReplace with different group size

1

Entering edit mode

Jane Merlevede ▴ 90

@jane-merlevede-5019

Last seen 5.6 years ago

Dear DESeq2 users,

I have an RNASeq experiment with 18 samples, 4 normal and 14 tumor. I am comparing the 2 conditions: normal vs tumor.

I ran DESeq(dds, minReplicatesForReplace=7) in order to remove outliers. By doing so, I remove outliers only from the tumor group from what I understand. I get:

summary(res)
out of 30754 with nonzero total read count
adjusted p-value < 0.01
LFC > 0 (up)     : 520, 1.7%
LFC < 0 (down)   : 268, 0.87%
outliers [1]     : 472, 1.5%
low counts [2]   : 3523, 11%
(mean count < 1)

I hesitate to use DESeq(dds, minReplicatesForReplace=4) in order to account for outliers from the normal group as well, since it is recommended to use this option with at least 7 replicates per condition.

In this case, I get:

summary(res)
out of 30751 with nonzero total read count
adjusted p-value < 0.01
LFC > 0 (up)     : 522, 1.7%
LFC < 0 (down)   : 271, 0.88%
outliers [1]     : 0, 0%
low counts [2]   : 4196, 14%
(mean count < 1)

The results are pretty similar. Do you think one way is better compared to the other?

Thank you in advance

DESeq2 • 3.7k views

ADD COMMENT • link 6.9 years ago Jane Merlevede ▴ 90

0

Entering edit mode

Thank you Michael for your answer.

When I use minReplicatesForReplace=Inf, there are 1295 outliers, while there are 472 with minReplicatesForReplace=7.

With minReplicatesForReplace=7 (or minReplicatesForReplace=14), if I am right, outliers among the 14 tumor samples will be replaced by the mean of the 18 samples. While outliers among the 4 normal samples won't be replaced. Do you think this could be too aggressive?

I look at some gene counts with plotCounts(). It seems hard to detect an outlier within 4 values only.

I did not catch what you meant with: "if a particular count was called as an outlier, does it end up in the results when you turn outlier calling off, and if so, is that reasonable"

Do you mean to check if the 472 outliers are included in the 1295 outliers obtained with minReplicatesForReplace=Inf?

ADD REPLY • link 6.9 years ago Jane Merlevede ▴ 90

0

Entering edit mode

If you don't see extreme outliers (what I mean by this is, e.g. 3 samples with single digit counts and 1 sample with a count of 1000+), then I would continue with the default (where outliers are only replaced among the 14). Re: my comment which you didn't catch, if you don't see outliers in the genes which remain flagged (the genes designated in the 'filtered' variable I defined below) then you might try to go ahead and set cooksCutoff=FALSE in results(). I was recommending you to check by eye that any significance among these genes (defined by 'filtered') is not driven by extreme outliers.

You can find the largest outliers by looking at the Cook's distances. For example, to get the maximum Cook's distance for the samples in condition "A", you would use some code like this:

maxCooksConditionA <- apply(assays(dds)[["cooks"]][ ,dds$condition == "A" ], 1, max)

ADD REPLY • link 6.9 years ago Michael Love 41k

0

Entering edit mode

Thank you Michael.

If there were no extreme outlier, would you continue with minReplicatesForReplace=7 and cooksCutoff=FALSE or minReplicatesForReplace=4?

That's just for curiosity because there are "extreme outliers". For example, there are 3 samples around 0 and 1 at 100 or 3 samples around 50 and 1 at ~10000.

With "extreme outliers", would you use minReplicatesForReplace=7 and cooksCutoff=TRUE or minReplicatesForReplace=4?

In my case, I looked at the 20 "largest outliers":

maxCooksConditionCtrl <- apply(assays(dds)[["cooks"]][ ,dds$Status == "Ctrl" ], 1, max)
maxoutlier=order(maxCooksConditionCtrl, decreasing=TRUE)

16/20 showed 1/4 outsider value.

Using minReplicatesForReplace=7 and cooksCutoff=TRUE, pvalue and padj were of course set to NA.

Using minReplicatesForReplace=7 and cooksCutoff=FALSE, 8/16 padj were below the significant level (0.01).

Using minReplicatesForReplace=4 and cooksCutoff=TRUE/FALSE, 0/16 padj were below the significant level (9/16 padj were set to NA. In fact, baseMean was very low for most of these genes.)

minReplicatesForReplace=7 and cooksCutoff=TRUE and minReplicatesForReplace=4 induce no significant pvalues on these genes, that are probably not deregulated. I would not use minReplicatesForReplace=7 and cooksCutoff=FALSE, since the significativity seems to be due to a single sample.

I have not checked for the 472 genes if minReplicatesForReplace=4 lead to a significant padj.

Finally, I still do not know which is the best, minReplicatesForReplace=7 and cooksCutoff=TRUE or minReplicatesForReplace=4

ADD REPLY • link 6.9 years ago Jane Merlevede ▴ 90

0

Entering edit mode

From this, I'd recommend the default settings then minRep=7 and using cooksCutoff to set the outliers to NA in the small group

ADD REPLY • link 6.9 years ago Michael Love 41k

0

Entering edit mode

Okay.

Thank you a lot for your help

ADD REPLY • link 6.9 years ago Jane Merlevede ▴ 90

score 1 · Answer 1 · 2017-05-23

hi Jane,

I'd look into the counts for the ~500 genes that have outliers in the small group, before trusting whether outlier replacement is reasonable for those genes. You can find which genes were filtered due to outliers by looking at:

filtered <- which(is.na(res$pvalue) & res$baseMean > 0)

Then you can look at individual genes and their counts with plotCounts(). It may be that the outlier calling is too aggressive potentially on this dataset, and then you could just turn it off with cooksCutoff=FALSE to results(). I would check by eye using plotCounts() to see that if a particular count was called as an outlier, does it end up in the results when you turn outlier calling off, and if so, is that reasonable.