Question

DESeq2 problem with outlier replacement in paired multi-factor design

0

Entering edit mode

gregory.l.stone ▴ 10

@gregorylstone-12225

Last seen 5.5 years ago

Please excuse my ignorance as I am new to DESeq2 and differential gene expression analysis in general. Also please excuse the formatting of this post (this is my first post to this forum). I appreciate in advance any and all advice and guidance.

I am attempting to conduct a paired multi-factor analysis in DESeq2 but I seem to be detecting a lot of differentially expressed genes that are pushed into siginificance due to the presence of an outlier. I constructed a PCA plot and removed 5 samples that appeared to be outliers (1 of which included a consistent count outlier that was responsible for shoving those genes into significance) and re-ran the DESeq2 analysis. I then got no significant genes. I know that DESeq2 has an internal outlier detection and replacement algorithm that requires a specified (I think default is 7) replicates. I have around 200 paired samples, ~100 pre and ~100 post treatment. In accordance with the paired multi-factor DESeq2 vignette, I have structured my sample table like so:

HTSeq_file sex condition nested
sample1 Male pre 1
sample1 Male post 1
sample2 Male pre 2
sample2 Male post 2
sample3 Female pre 1
sample3 Female post 1
... ... ... ...

When I remove the nested column, which to my understanding is the same as removing pairing, DESeq2 calls its internal filtering and outlier replacing methods. Is the inclusion of pairing making DESeq2 think that I have no replicates? How can I keep my paired design but still take advantage of DESeq2's outlier detection?

All comments are greatly appreciated.

Thanks,
Greg

deseq2 differential gene expression multiple factor design paired design • 1.5k views

ADD COMMENT • link 7.2 years ago gregory.l.stone ▴ 10

score 1 · Answer 1 · 2017-01-24

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 4 hours ago

United States

hi Greg,

Yes, the assessment of replicates is just looking at the design matrix, and allowing outlier filtering (3-6) or replacement (7+) based on how many other rows (samples) of the design matrix are identical. The outlier filtering/replacement procedure is not perfect, but it's the bare minimum we wanted to have in place so that individual outliers are not dominating results in experiments with replication. For more complicated situations like yours, I'd recommend visual inspection of suspicious genes based on the underlying statistic used for identifying outliers, the Cook's distance. This gives some measure of how much the coefficient vector depends on a single sample, and we produce this for every gene, giving a matrix of Cook's distances in assays(dds)[["cooks"]]. The maximum value for each gene is in mcols(dds)$maxCooks. What you could do, is to fit a separate model solely for identifying outliers that may be driving pre vs post differences, by using a simpler model, e.g. ~sex + condition. Then use the maxCooks per gene for this model to make a rule of thumb for identifying outliers in your main model, where you turn off outlier filtering and replacement (using minReplicatesForReplace=Inf and cooksCutoff=FALSE).

ADD COMMENT • link 7.2 years ago Michael Love 41k

0

Entering edit mode

I followed your advice and simplified my design from ~ sex + sex:nested + sex:condition to ~ sex + sex:condition to examine max cook's values and establish a cooks cutoff. When I go to my original design and set the cooksCutoff value with res = results(dds, cooksCutoff = x) the padj don't change, even though it is my understanding that genes with a count with a cooks value above cooksCutoff get set to na. Am I implementing the cooksCutoff value wrong?

Additionally, on a somewhat related note, what is your opinion on independent hypothesis weighting vs the implemented BH correction?

Thank you for any and all advice.

ADD REPLY • link 7.2 years ago gregory.l.stone ▴ 10

1

Entering edit mode

There are not replicates for the filtering, so it's not attempting filtering. You can filter yourself using:

res$pvalue[ other.max.cooks > x ] <- NA

Then

res$padj <- p.adjust(res$pvalue, method="BH")

Or try the IHW function to do weighting. IHW is very promising and I would try it out.

ADD REPLY • link 7.2 years ago Michael Love 41k

0

Entering edit mode

Thank you again for your great help. I apologize I should have seen your response at C: [DESeq2] Cook's distance flagging for continuous variables

ADD REPLY • link 7.2 years ago gregory.l.stone ▴ 10

score 0 · Answer 2 · 2017-01-24

0

Entering edit mode

gregory.l.stone ▴ 10

@gregorylstone-12225

Last seen 5.5 years ago

Thank you very much for your thoughtful and quick response. That plan sounds great and I'll be sure to give it a try.

ADD COMMENT • link 7.2 years ago gregory.l.stone ▴ 10