Question: DESeq2: Outliers, covariates and number of replicates
gravatar for j.viana
2.8 years ago by
j.viana0 wrote:


I'm analysing RNA-seq data and I switched from edgeR to DESeq2 because most of my differentially expressed genes had an outlier sample, which was not an outlier in the entire dataset (judged by PCA and overall Cook's distance).

My question is related to the criteria used by DESeq2 to count the number of replicates in my experiment.

I have 3 experimental groups, A,B and C, each of them with 16 samples. If I run a LRT model with no batch, the model considers I have over 7 replicates per group and replaces counts with large Cook’s distance with the trimmed mean.

dgeszexp4 <- DESeqDataSetFromMatrix(countData = counts5,
colData = phenomodel,
design = ~exp)

dgeszLRT <- DESeq(dgeszexp, test="LRT", reduced=~1)
dgeszLRTres <- results(dgeszLRT)

I get:

-- replacing outliers and refitting for 240 genes

outliers [1]     : 0, 0%

If I include sex and week of experiment as covariates in my model, the model considers I have less than 7 but more than 3 replicates per group (I have 4 samples same sex, same week, same exposure) and removes the genes with outliers instead instead of replacing the counts.

dgeszexp4 <- DESeqDataSetFromMatrix(countData = counts5,
colData = phenomodel1,
design = ~sex+week+exp)

dgeszLRT4 <- DESeq(dgeszexp4, test="LRT", reduced=~sex + week)
dgeszLRTres4 <- results(dgeszLRT4)

I get:

outliers [1]     : 11, 0.057%

In my opinion, I believe I still have 16 replicates per group and I'm only trying to control for the fact that I have different sex and the experiment was run in two weeks.

This gets worse if I try to include normalisation factors estimated by RUV-seq. Since each factor is different for each sample, DESeq2 will assume I only have 1 sample per group and it won't deal with my outliers.

dgeszLRT <- DESeq(dgesznowat, test="LRT", reduced=~sex + week +  ruv)

Any ideas on how to get around this?

Thanks in advance!



ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by j.viana0

I don't want to go on a tangent here, but I did use the quasi likelihood framework from edgeR, which did not solve the problem. My guess is that these genes do not have outlying dispersions. 

ADD REPLYlink written 2.8 years ago by j.viana0

I got around it by running the model without covariates, extract the replaces counts and re-run the model with the covariates. Since in my model with RUV estimates there will be absolutely no replacement of outliers of exclusion of genes I feel this is ok.

But I think something should be added to the reference manual to warn users about the fact that adding batch might change the way DESeq2 deals with the number of replicates - this wasn't intuitive for me.

Thanks for you help!

ADD REPLYlink written 2.8 years ago by j.viana0
Answer: DESeq2: Outliers, covariates and number of replicates
gravatar for Steve Lianoglou
2.8 years ago by
Steve Lianoglou12k wrote:

I don't have a comment regarding your DESeq2 question, but I don't understand your reasoning for switching between edgeR and DESeq2.

I mean, using either is a fine choice but it sounds like you think there is a reason that forced your switch. Your 'outlier sample' scenario should be rather nicely taken care of when you use edgeR's quasi likelihood framework. If you're not familiar with it, have a read through their f1000research paper.

ADD COMMENTlink written 2.8 years ago by Steve Lianoglou12k
Answer: DESeq2: Outliers, covariates and number of replicates
gravatar for Michael Love
2.8 years ago by
Michael Love24k
United States
Michael Love24k wrote:

Outliers are defined with respect to the fitted coefficients and the expected values, which change when you introduce covariates to account for more differences. So there's not really a way around this. What you could do is run the model with just the three groups and minReplicatesForReplace=Inf, and keep track of the genes that have relatively high maximum Cook's distance per gene, mcols(dds)$maxCooks. Then you could still run the final model with sex, week and RUV factors, just using the maxCooks information from the previous analysis to flag genes as potential outliers which you inspect by eye with plotCounts().

ADD COMMENTlink written 2.8 years ago by Michael Love24k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 212 users visited in the last hour