I am working with a mouse RNA-Seq dataset that has four samples groups with seven replicates for each group. There was genomic DNA contamination due to low amounts of DNAase which resulted in a low exonic mapping rate (~40%) for samples. Also, amongst those mapped reads the number of uniquely mapping reads is also quite low leaving us with ~7k genes detected for each sample (not the same 7k genes in each sample).
After removing any genes that have zero counts across the board I am left with 15,056 genes. I used DESeq with fitType='local' and I get 57 significant genes at padj < 0.1 (with very low fold changes).
After looking at a few diagnostic plots, I have some question and wonder whether given the nature of our dataset this is giving us anything meaningful (and frankly, is it worth analyzing)?
1. The dispersion plot is not typical based on what I have seen before. Rather than a decreasing trend with higher counts, we see slight increase in dispersion with low counts, followed by a decrease. I am not sure how to interpret this trend. Also, dispersion values are higher than what I normally find, although maybe expected since we are dealing with low counts?
2. The MA plot which is plotted using plotMA, I'm assuming are the shrunken fold changes. In which case, we would expect the shrinkage to be greater for the log2 fold change estimates from genes with low counts and high dispersion, but we don't really see narrowing of spread of leftmost points in the plot. Rather, it looks quite disperse.
3. In the results table, if I count how many genes !is.na(res$padj) I get 753 genes. This seems quite low for the total number of total tests within a dataset. My guess is that majority of genes lost here are due to independent filtering (based on low mean normalized count) that is applied. Is there a way to find out what threshold was used for this filtering?
The remaining NA genes are likely due to Cook's distance filtering. When running DESeq I get the following message during model fitting "replacing outliers and refitting for 3417 genes". Even though values are getting replaced (with the trimmed mean over all samples) - it seems they are still being flagged? What is the purpose of replacement in that case?
Thanks in advance,
Dispersion plot: https://db.tt/zV58Kon5
MA plot: https://db.tt/wK36ltmI