Question

DESeq2 with high dispersions

0

Entering edit mode

meeta.mistry ▴ 30

@meetamistry-7355

Last seen 21 months ago

United States

Hello;

I am working with a mouse RNA-Seq dataset that has four samples groups with seven replicates for each group. There was genomic DNA contamination due to low amounts of DNAase which resulted in a low exonic mapping rate (~40%) for samples. Also, amongst those mapped reads the number of uniquely mapping reads is also quite low leaving us with ~7k genes detected for each sample (not the same 7k genes in each sample).

After removing any genes that have zero counts across the board I am left with 15,056 genes. I used DESeq with fitType='local' and I get 57 significant genes at padj < 0.1 (with very low fold changes).

After looking at a few diagnostic plots, I have some question and wonder whether given the nature of our dataset this is giving us anything meaningful (and frankly, is it worth analyzing)?

1. The dispersion plot is not typical based on what I have seen before. Rather than a decreasing trend with higher counts, we see slight increase in dispersion with low counts, followed by a decrease. I am not sure how to interpret this trend. Also, dispersion values are higher than what I normally find, although maybe expected since we are dealing with low counts?

2. The MA plot which is plotted using plotMA, I'm assuming are the shrunken fold changes. In which case, we would expect the shrinkage to be greater for the log2 fold change estimates from genes with low counts and high dispersion, but we don't really see narrowing of spread of leftmost points in the plot. Rather, it looks quite disperse.

3. In the results table, if I count how many genes !is.na(res$padj) I get 753 genes. This seems quite low for the total number of total tests within a dataset. My guess is that majority of genes lost here are due to independent filtering (based on low mean normalized count) that is applied. Is there a way to find out what threshold was used for this filtering?

The remaining NA genes are likely due to Cook's distance filtering. When running DESeq I get the following message during model fitting "replacing outliers and refitting for 3417 genes". Even though values are getting replaced (with the trimmed mean over all samples) - it seems they are still being flagged? What is the purpose of replacement in that case?

Thanks in advance,

Meeta

Dispersion plot: https://db.tt/zV58Kon5

MA plot: https://db.tt/wK36ltmI

deseq2 dispersion low counts • 4.0k views

ADD COMMENT • link 8.7 years ago meeta.mistry ▴ 30

0

Entering edit mode

meeta.mistry ▴ 30

@meetamistry-7355

Last seen 21 months ago

United States

Thanks Mike, your explanations are very helpful. It turns out there are three major outlier samples - which after removing I no longer get as many outliers. I still get quite a few genes filtered by mean normalized count, but it's expected. Turning it off only results in less significant genes which makes sense since we are preforming many more unneccessary tests.

Cheers,

Meeta

ADD COMMENT • link 8.7 years ago meeta.mistry ▴ 30

score 3 · Accepted Answer · 2015-07-28

hi Meeta,

"I am not sure how to interpret this trend. Also, dispersion values are higher than what I normally find, although maybe expected since we are dealing with low counts?"

Some experiments can have different relationship of dispersion over mean, which is why we provide the local fit. This is fine. The methods are still valid with the local fit.

"we don't really see narrowing of spread of leftmost points in the plot. Rather, it looks quite disperse."

It looks like the beta prior distribution (by "beta" we mean log fold change) is so wide that there is not as much shrinkage as in other experiments. The beta prior is estimated from the observed differences between conditions across all genes, so if there are many large differences here, there is less shrinkage. So this is expected.

"In the results table, if I count how many genes !is.na(res$padj) I get 753 genes. This seems quite low for the total number of total tests within a dataset. My guess is that majority of genes lost here are due to independent filtering (based on low mean normalized count) that is applied. Is there a way to find out what threshold was used for this filtering?"

Yes this is due to the low counts in the experiment, and independent filtering. This is expected.

See the vignette section on Independent Filtering for how to find out what threshold was used. Also summary(res) will give you this information.

"The remaining NA genes are likely due to Cook's distance filtering. When running DESeq I get the following message during model fitting "replacing outliers and refitting for 3417 genes"."

With this many outliers, I would turn off the replacement steps (by setting minReplicatesForReplace=Inf in DESeq) and investigate to see if there is a sample which is consistently an outlier. The automatic outlier replacement is useful for limited replacement, but this number of outliers indicates that you might need to do more manual inspection of the dataset.

You can find such a consistent outlier sample through PCA plots (see vignette).

You might also turn off the Cooks cutoff with cooksCutoff=FALSE in results(), and examine the top genes yourself by looking at plotCounts for the genes with large mcols(dds)$maxCooks and smallest adjusted p-value.