I am trying to use edgeR for differential Pol2 pausing analysis but I am unsure if I'm on the right track.
"Pol2 pausing index" is the ratio of Pol2 ChIP in a gene's promoter vs downstream. I'd like to identify how this changes between control and treatment. I'm unable to find a published method that accounts for replicates or input or gives an indication of statistical significance, so I'm trying to come up with one.
The experimental design and sample table is as follows:
- Two treatments
- Two regions; "tss" indicates the counts for that column are from TSS+300bp downstream and "body" indicates 300bp-2000bp downstream
- Pol2 ChIP and its matched input
- Two replicates
- Additional columns "bam" and "material" attempt to show the nesting
- bam: for a particular BAM file, there are two sets of read counts -- those falling in TSS and those falling into the gene body
- material: for a particular biological starting material, we get two bams, one for input and one for pol2 IP
Here is the sample table:
sampleID treatment region antibody experiment bam material group 1 control tss input 1 1 1 control.tss.input 2 control tss pol2 1 2 1 control.tss.pol2 3 control body input 1 1 1 control.body.input 4 control body pol2 1 2 1 control.body.pol2 5 treatment tss input 1 3 2 treatment.tss.input 6 treatment tss pol2 1 4 2 treatment.tss.pol2 7 treatment body input 1 3 2 treatment.body.input 8 treatment body pol2 1 4 2 treatment.body.pol2 9 control tss input 2 5 3 control.tss.input 10 control tss pol2 2 6 3 control.tss.pol2 11 control body input 2 5 3 control.body.input 12 control body pol2 2 6 3 control.body.pol2 13 treatment tss input 2 7 4 treatment.tss.input 14 treatment tss pol2 2 8 4 treatment.tss.pol2 15 treatment body input 2 7 4 treatment.body.input 16 treatment body pol2 2 8 4 treatment.body.pol2
The counts matrix has a row for each gene. So Sample 1 and Sample 3 columns in the counts matrix are counts from the same BAM file, just different parts of each gene.
A first attempt:
y <- DGEList(counts=counts.matrix, group=sampletable$group) y <- calcNormFactors(y) design <- model.matrix(~0+sampletable$group, data=y$samples) y <- estimateDisp(y, design) fit <- glmFit(y, design) my.contrasts <- makeContrasts( full=( (treatment.tss.pol2 - treatment.tss.input) - (treatment.body.pol2 - treatment.body.input) ) - ( (control.tss.pol2 - control.tss.input) - (control.body.pol2 - control.body.input) ), levels=design ) tt.full <- topTags(glmLRT(fit, contrast=my.contrasts[,'full']))
To evaluate this, I visually identified in the genome browser some genes that appear to be changing their pausing index and find where they fall in the top tags. The results from the contrast do not match what I'm seeing in the browser. Some possible reasons:
- nesting is not properly accounted for
- "samples" are not independent (e.g., control.tss.input and control.body.input come from the same BAM file), causing normalization to be incorrect
- systematic difference in window sizes (TSS is 300bp; body is 1700bp) possibly skewing normalization
- input introduces noise/artifacts that are not worth the benefit of controlling for open chromatin
- genome browser normalization (scaled by million mapped reads) not consistent with edgeR normalization
Some things I'm considering:
- I ran across edgeR::calcNormOffsetsforChIP. Sounds potentially useful, but I'm not sure how to correctly use it in this context
- Manually enter normalization factors to reflect library size and region size and bypass TMM
- Use csaw? No overlapping windows here so I'm not sure what changes from the vignette would be necessary
- Ignore input altogether, as suggested in e.g. the csaw docs
Any corrections, suggestions, or nudges in the right direction would be much appreciated. Thanks!
Thank you so much for this. Way more elegant than the path I was on and the results are now looking quite good.
Follow-up question: I hope to do some downstream filtering on which genes were paused in the control. Based on your arguments above on ignoring input and sample as blocking factor, would you recommend something like this?:
That's going to be harder, because that kind of comparison will require normalization between region widths, GC content, mappability, etc. between the TSS and gene body. We managed to avoid them in my original answer because we assumed that these biases cancelled out between conditions; but if you're going to look for differences within each condition, then you'll have to deal with them. This could be quite difficult, depending on how what factors drive the bias for each gene. If Pol2 coverage is mostly uniform across the TSS/gene body, you might be able to just normalize based on region size - otherwise, some multivariate regression on relevant features (e.g., exon length, GC content) would be in order.
Hmm. Let's make the simplifying assumption that coverage is uniform across TSS and across gene body. Practically speaking, how would I incorporate region size normalization into edgeR? Would this be an offset?
Yes. Log-transform (base e) the region sizes and use them as the offsets for the corresponding counts, i.e., log-TSS width for the TSS counts, log-body width for the gene body counts. This should be a matrix if the widths differ across genes. Finally, use
scaleOffset
to add the offsets to theDGEList
object.OK, I'll give that a shot. Thanks very much for your help!