Question

Gene set enrichment analysis with ChIP-seq data using test statistics directly

0

Entering edit mode

gabriel.hoffman ▴ 170

@gabrielhoffman-8391

Last seen 3 months ago

United States

In RNA-seq/microarray data, enrichment analysis of gene sets generally takes 2 forms: 1) Identify a set of genes that are significant by some metric (usually FDR < 5%) and evaluate enrichment compared to some background set of genes; 2) use the test statistics directly to avoid an arbitrary cutoff. In limma #2 is implemented in cameraPR(), geneSetTest() and wilcoxGST() functions.

Let's turn to ChIP-seq, ATAC-seq data or any assay that yields genomic intervals. Just like standard differential expression analysis, consider a differential 'binding' test between two sets of samples. After getting the test statistics for each interval, I want to see if these differential intervals are enrichments for some biological function. Test #1, where a subset of intervals are called 'significant', is implemented in GREAT great.stanford.edu). GREAT accounts for false positive enrichments due to overlap of genomic intervals with genes in specific function set even under the null.

How do I perform test #2 with ChiP-seq intervals? The analysis must a) use the test statistics directly, rather than dividing intervals into 'significant' and 'non-significant' sets, and b) use a GREAT-style method to avoid false positive enrichments.

Does a method / implementation like this exist?

limma csaw • 1.1k views

ADD COMMENT • link updated 6.6 years ago by Aaron Lun ★ 28k • written 6.6 years ago by gabriel.hoffman ▴ 170

score 0 · Answer 1 · 2018-09-11

The concept of assigning a binding site to the closest gene has always seemed rather... courageous to me, regardless of the specifics of how the testing is done. That's speaking from some experience with Hi-C data; at 1 Mbp, you're typically beyond the range of the ability of non-specific local interactions to maintain 3D proximity. Rather, you're reliant on looping interactions for a distal TF binding site to exert some cis effect on a gene, and such loops tend not to be promiscuous, i.e., I would not assume that one exists for a random site:gene pair. Of course, structural proteins like CTCF and lamins will have quite far-reaching effects, but - at least anecdotally - relating their binding to activity of genes within the surrounding domain seems even harder.

Nonetheless, I'll shoot from the hip and try to answer your question. Given the csaw tag, I assume you have a test statistic for each window. The simplest approach to achieve your request would be to define a weight for each window based on the distance to any gene in the functional set. Specifically, each [window]:[gene in set] pair has a weight that decreases with distance between them (or zero, if they lie on different chromosomes); I would probably use a Gaussian kernel with a bandwidth of 5-10 kbp. You can sum this up to obtain a total weight for the window, which can then be used as the weights= argument in roast or camera.

This approach has some nice properties. Assuming you're working with a TF of some sort, you can define distances to favour windows near the TSS where binding is most likely to have an effect on transcription. You avoid "false positives" from uneven intergenic region sizes, as windows that lie in large intergenic regions would not have high weight unless they were actually close to the genes. The summation would also favour windows that lie in regions that are enriched for genes in your set, especially bidirectional promoters for which it would be arbitrary to assign a window to one gene or the other.

Note that, no matter what weighting scheme you pick, it is important to use a gene set test that is robust to correlations, as immediately adjacent windows will be strongly correlated for obvious reasons. ROAST would be my method of choice here. You could reduce the correlations to some extent by "thinning" the windows (e.g., using findMaxima) but that won't get rid of the biologically-driven correlations when multiple binding sites occur in the same genomic neighbourhood.