I'm hoping to use DEseq2 to look for regions of open chromatin that differ between cell types. It seems like the general statistical framework should be highly analogous between RNA-seq and DNase-seq (or, in my case, ATAC-seq). However, I wanted to check that some of the assumptions underlying DEseq2 do in fact translate, as some of my diagnostic plots differ from what I expected from reading the (excellent) DEseq2 documentation.
Data source/background: ATAC-seq was performed on three different cell types within a single tissue (two biological replicates per cell type). Open chromatin peaks were called on independent samples (~30,000 peaks per sample), and I then calculated the per region coverage in the regions defined by the union of all peaks (~50,000 peaks) for each sample. These counts (50,00 regions x 6 samples) constitute the count table I load into DEseq2. I then ran DEseq2 using design = ~ cell_type to identify regions of differentially open chromatin across the three cell types.
Of the 50,000 peaks I examined, 23,000 pass padj < 0.1 (comparing the first two cell types). I get a similar number of differences between the second cell-type comparison, and ~2,000 differences between the third pair (these are qualitatively what I expect for these cell types, but seem quantitatively high). Here's the MA plot for the first cell-type comparison (23,000 differences):
Many of these differences are small or identified in regions with low mean read counts--I can easily filter on LFC to get a more manageable list (setting lfcThreshold=1 and altHypothesis="greaterAbs" reduces this to 1034 regions). When I look at top hits, I get interesting and tissue-relevant results, so I think this is generally working. But I'm worried that getting back so many significant differences might mean my data aren't appropriate for DEseq2 (e.g. too noisy). Additional specific questions:
1) Is the DEseq2 normalization framework appropriate for open chromatin data? Specifically, the normalization by scale factors derived from the median of "pseudo-references" seems to assume similar overall distributions of reads between conditions ("most genes don't change" for RNA-seq, or "most open chromatin doesn't change" if looking at DHS peaks). Given that I get so many significant differences, is this normalization not appropriate for my data, and are there specific alternatives I should consider (total read depth seems inadequate given different signal to noise ratios between samples)? Is part of the problem that RNA-seq considers the entire transcriptome (lots of low-level background), whereas my approach to feature selection has enriched for regions that have higher "expression."
2) I expected some of the lower read depth regions to be removed by independent filtering. But the independent filtering threshold is set at 0%.
When I test for larger differences (setting lfcThreshold=1 and altHypothesis="greaterAbs"), the independent filtering threshold moves to 40%.
3) Dispersion fit: this isn't related to my concern about too many hits, but parametric fitting of my dispersion curve failed. Local fitting was successful, but I don't see any asymptotic shift at higher read depths as I've seen in other example dispersion fit plots. Is this an acceptable fit or--again--does this suggest a problem with applying DEseq2 to a different type of dataset (do my data have a weaker mean-dispersion relationship and will this undermine my results)?
Please let me know if I can clarify question(s) provide additional information/if I've posted this in the wrong format/forum. I consulted the posting guide--I haven't provided sample code because nothing in DEseq2 is reporting any errors, but I am happy to provide sample data/code if it will help better illustrate my question. I absolutely appreciate any input.
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)
 LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
 LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
 LC_PAPER=en_US.UTF-8 LC_NAME=C
 LC_ADDRESS=C LC_TELEPHONE=C
 LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
 parallel stats graphics grDevices utils datasets methods
other attached packages:
 DESeq2_1.4.5 RcppArmadillo_0.4.400.0 Rcpp_0.11.2
 GenomicRanges_1.16.4 GenomeInfoDb_1.0.2 IRanges_1.22.10
loaded via a namespace (and not attached):
 annotate_1.42.1 AnnotationDbi_1.26.0 Biobase_2.24.0
 DBI_0.3.0 genefilter_1.46.1 geneplotter_1.42.0
 grid_3.1.1 lattice_0.20-29 locfit_1.5-9.1
 RColorBrewer_1.0-5 RSQLite_0.11.4 splines_3.1.1
 stats4_3.1.1 survival_2.37-7 tools_3.1.1
 XML_3.98-1.1 xtable_1.7-4 XVector_0.4.0