Differential Pol2 pausing analysis
Entering edit mode
dalerr • 0
Last seen 4.7 years ago

I am trying to use edgeR for differential Pol2 pausing analysis but I am unsure if I'm on the right track.

"Pol2 pausing index" is the ratio of Pol2 ChIP in a gene's promoter vs downstream. I'd like to identify how this changes between control and treatment. I'm unable to find a published method that accounts for replicates or input or gives an indication of statistical significance, so I'm trying to come up with one.

The experimental design and sample table is as follows:

  • Two treatments
  • Two regions; "tss" indicates the counts for that column are from TSS+300bp downstream and "body" indicates 300bp-2000bp downstream
  • Pol2 ChIP and its matched input
  • Two replicates
  • Additional columns "bam" and "material" attempt to show the nesting
    • bam: for a particular BAM file, there are two sets of read counts -- those falling in TSS and those falling into the gene body
    • material: for a particular biological starting material, we get two bams, one for input and one for pol2 IP

Here is the sample table:

sampleID  treatment region    antibody  experiment bam material group
1         control   tss       input     1          1   1        control.tss.input
2         control   tss       pol2      1          2   1        control.tss.pol2
3         control   body      input     1          1   1        control.body.input
4         control   body      pol2      1          2   1        control.body.pol2
5         treatment tss       input     1          3   2        treatment.tss.input
6         treatment tss       pol2      1          4   2        treatment.tss.pol2
7         treatment body      input     1          3   2        treatment.body.input
8         treatment body      pol2      1          4   2        treatment.body.pol2
9         control   tss       input     2          5   3        control.tss.input
10        control   tss       pol2      2          6   3        control.tss.pol2
11        control   body      input     2          5   3        control.body.input
12        control   body      pol2      2          6   3        control.body.pol2
13        treatment tss       input     2          7   4        treatment.tss.input
14        treatment tss       pol2      2          8   4        treatment.tss.pol2
15        treatment body      input     2          7   4        treatment.body.input
16        treatment body      pol2      2          8   4        treatment.body.pol2

The counts matrix has a row for each gene. So Sample 1 and Sample 3 columns in the counts matrix are counts from the same BAM file, just different parts of each gene.

A first attempt:

y <- DGEList(counts=counts.matrix, group=sampletable$group)
y <- calcNormFactors(y)
design <- model.matrix(~0+sampletable$group, data=y$samples)
y <- estimateDisp(y, design)
fit <- glmFit(y, design)

my.contrasts <- makeContrasts(
  (treatment.tss.pol2 - treatment.tss.input) -
  (treatment.body.pol2 - treatment.body.input)
 ) - (
  (control.tss.pol2 - control.tss.input) -
  (control.body.pol2 - control.body.input)
tt.full <- topTags(glmLRT(fit, contrast=my.contrasts[,'full']))


To evaluate this, I visually identified in the genome browser some genes that appear to be changing their pausing index and find where they fall in the top tags. The results from the contrast do not match what I'm seeing in the browser. Some possible reasons:

  • nesting is not properly accounted for
  • "samples" are not independent (e.g., control.tss.input and control.body.input come from the same BAM file), causing normalization to be incorrect
  • systematic difference in window sizes (TSS is 300bp; body is 1700bp) possibly skewing normalization
  • input introduces noise/artifacts that are not worth the benefit of controlling for open chromatin
  • genome browser normalization (scaled by million mapped reads) not consistent with edgeR normalization

Some things I'm considering:

  • I ran across edgeR::calcNormOffsetsforChIP. Sounds potentially useful, but I'm not sure how to correctly use it in this context
  • Manually enter normalization factors to reflect library size and region size and bypass TMM
  • Use csaw? No overlapping windows here so I'm not sure what changes from the vignette would be necessary
  • Ignore input altogether, as suggested in e.g. the csaw docs

Any corrections, suggestions, or nudges in the right direction would be much appreciated. Thanks!


edger csaw chipseq differential binding analysis • 1.0k views
Entering edit mode
Aaron Lun ★ 27k
Last seen 4 hours ago
The city by the bay

For starters, I would ignore the input libraries, at least for the time being. This is because "correcting" for chromatin state will also eliminate some (or all) of the genuine changes in Pol2 binding. For example, it seems entirely reasonable that Pol2 would have increased binding in open, transcriptionally active regions of the genome. If the DNA is twice as open (i.e., twice as much coverage in the input), and there's also twice as much Pol2 binding in these open regions, normalizing on the former would eliminate the signal from the latter.

As for what I would recommend, the analysis actually becomes quite simple because of the nature of your question. You're looking to compare regions in the same sample - if you use the sample identity as a blocking factor, then you don't need to normalize between samples. Moreover, because you're comparing the TSS/gene body fold change between conditions, you don't even need to normalize for differences in size, GC content, etc. of the two regions. Under the null hypothesis, you can assume that the coverage bias between the TSS and the gene body are constant between conditions, so it all cancels out - theoretically, at least.

To sum up, I would use a design matrix that looks like this:

chip.only <- sampletable[sampletable$antibody!="input",]
chip.only$bam <- factor(chip.only$bam)
design <- model.matrix(~ 0 + bam + treatment:region, chip.only)
design <- design[,-grep("regionbody", colnames(design))] # get to full rank
colnames(design) <- sub(":", "_", colnames(design)) # get rid of colons

The first four coefficients here are the sample blocking factors, representing the coverage in the gene body. The last two coefficients represent the condition-specific log-fold change in coverage of the TSS over the body. You can check for changes in pausing with:

con <- makeContrasts(treatmentcontrol_regiontss - treatmenttreatment_regiontss, levels=design)

... which tests whether the TSS vs. gene body enrichment in the control condition is equal to that in the treatment.

Entering edit mode

Thank you so much for this. Way more elegant than the path I was on and the results are now looking quite good.

Follow-up question: I hope to do some downstream filtering on which genes were paused in the control. Based on your arguments above on ignoring input and sample as blocking factor, would you recommend something like this?:

control.chip.only <- chip.only[chip.only$treatment == 'control',]
control.chip.only$bam <- droplevels(control.chip.only$bam)
design <- model.matrix(~ bam + region, chip.only)
Entering edit mode

That's going to be harder, because that kind of comparison will require normalization between region widths, GC content, mappability, etc. between the TSS and gene body. We managed to avoid them in my original answer because we assumed that these biases cancelled out between conditions; but if you're going to look for differences within each condition, then you'll have to deal with them. This could be quite difficult, depending on how what factors drive the bias for each gene. If Pol2 coverage is mostly uniform across the TSS/gene body, you might be able to just normalize based on region size - otherwise, some multivariate regression on relevant features (e.g., exon length, GC content) would be in order.

Entering edit mode

Hmm. Let's make the simplifying assumption that coverage is uniform across TSS and across gene body. Practically speaking, how would I incorporate region size normalization into edgeR? Would this be an offset?

Entering edit mode

Yes. Log-transform (base e) the region sizes and use them as the offsets for the corresponding counts, i.e., log-TSS width for the TSS counts, log-body width for the gene body counts. This should be a matrix if the widths differ across genes. Finally, use scaleOffset to add the offsets to the DGEList object.

Entering edit mode

OK, I'll give that a shot. Thanks very much for your help!


Login before adding your answer.

Traffic: 351 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6