Dear DESeq2 users,
I have an RNASeq experiment with 18 samples, 4 normal and 14 tumor. I am comparing the 2 conditions: normal vs tumor.
I ran DESeq(dds, minReplicatesForReplace=7) in order to remove outliers. By doing so, I remove outliers only from the tumor group from what I understand. I get:
summary(res) out of 30754 with nonzero total read count adjusted p-value < 0.01 LFC > 0 (up) : 520, 1.7% LFC < 0 (down) : 268, 0.87% outliers [1] : 472, 1.5% low counts [2] : 3523, 11% (mean count < 1)
I hesitate to use DESeq(dds, minReplicatesForReplace=4) in order to account for outliers from the normal group as well, since it is recommended to use this option with at least 7 replicates per condition.
In this case, I get:
summary(res) out of 30751 with nonzero total read count adjusted p-value < 0.01 LFC > 0 (up) : 522, 1.7% LFC < 0 (down) : 271, 0.88% outliers [1] : 0, 0% low counts [2] : 4196, 14% (mean count < 1)
The results are pretty similar. Do you think one way is better compared to the other?
Thank you in advance
Thank you Michael for your answer.
When I use minReplicatesForReplace=Inf, there are 1295 outliers, while there are 472 with minReplicatesForReplace=7.
With minReplicatesForReplace=7 (or minReplicatesForReplace=14), if I am right, outliers among the 14 tumor samples will be replaced by the mean of the 18 samples. While outliers among the 4 normal samples won't be replaced. Do you think this could be too aggressive?
I look at some gene counts with plotCounts(). It seems hard to detect an outlier within 4 values only.
I did not catch what you meant with: "if a particular count was called as an outlier, does it end up in the results when you turn outlier calling off, and if so, is that reasonable"
Do you mean to check if the 472 outliers are included in the 1295 outliers obtained with minReplicatesForReplace=Inf?
If you don't see extreme outliers (what I mean by this is, e.g. 3 samples with single digit counts and 1 sample with a count of 1000+), then I would continue with the default (where outliers are only replaced among the 14). Re: my comment which you didn't catch, if you don't see outliers in the genes which remain flagged (the genes designated in the 'filtered' variable I defined below) then you might try to go ahead and set cooksCutoff=FALSE in results(). I was recommending you to check by eye that any significance among these genes (defined by 'filtered') is not driven by extreme outliers.
You can find the largest outliers by looking at the Cook's distances. For example, to get the maximum Cook's distance for the samples in condition "A", you would use some code like this:
Thank you Michael.
If there were no extreme outlier, would you continue with minReplicatesForReplace=7 and cooksCutoff=FALSE or minReplicatesForReplace=4?
That's just for curiosity because there are "extreme outliers". For example, there are 3 samples around 0 and 1 at 100 or 3 samples around 50 and 1 at ~10000.
With "extreme outliers", would you use minReplicatesForReplace=7 and cooksCutoff=TRUE or minReplicatesForReplace=4?
In my case, I looked at the 20 "largest outliers":
16/20 showed 1/4 outsider value.
Using minReplicatesForReplace=7 and cooksCutoff=TRUE, pvalue and padj were of course set to NA.
Using minReplicatesForReplace=7 and cooksCutoff=FALSE, 8/16 padj were below the significant level (0.01).
Using minReplicatesForReplace=4 and cooksCutoff=TRUE/FALSE, 0/16 padj were below the significant level (9/16 padj were set to NA. In fact, baseMean was very low for most of these genes.)
minReplicatesForReplace=7 and cooksCutoff=TRUE and minReplicatesForReplace=4 induce no significant pvalues on these genes, that are probably not deregulated. I would not use minReplicatesForReplace=7 and cooksCutoff=FALSE, since the significativity seems to be due to a single sample.
I have not checked for the 472 genes if minReplicatesForReplace=4 lead to a significant padj.
Finally, I still do not know which is the best, minReplicatesForReplace=7 and cooksCutoff=TRUE or minReplicatesForReplace=4
From this, I'd recommend the default settings then minRep=7 and using cooksCutoff to set the outliers to NA in the small group
Okay.
Thank you a lot for your help