I am trying to use edgeR for differential Pol2 pausing analysis but I am unsure if I'm on the right track.
"Pol2 pausing index" is the ratio of Pol2 ChIP in a gene's promoter vs downstream. I'd like to identify how this changes between control and treatment. I'm unable to find a published method that accounts for replicates or input or gives an indication of statistical significance, so I'm trying to come up with one.
The experimental design and sample table is as follows:
- Two treatments
- Two regions; "tss" indicates the counts for that column are from TSS+300bp downstream and "body" indicates 300bp-2000bp downstream
- Pol2 ChIP and its matched input
- Two replicates
- Additional columns "bam" and "material" attempt to show the nesting
- bam: for a particular BAM file, there are two sets of read counts -- those falling in TSS and those falling into the gene body
- material: for a particular biological starting material, we get two bams, one for input and one for pol2 IP
Here is the sample table:
sampleID treatment region antibody experiment bam material group 1 control tss input 1 1 1 control.tss.input 2 control tss pol2 1 2 1 control.tss.pol2 3 control body input 1 1 1 control.body.input 4 control body pol2 1 2 1 control.body.pol2 5 treatment tss input 1 3 2 treatment.tss.input 6 treatment tss pol2 1 4 2 treatment.tss.pol2 7 treatment body input 1 3 2 treatment.body.input 8 treatment body pol2 1 4 2 treatment.body.pol2 9 control tss input 2 5 3 control.tss.input 10 control tss pol2 2 6 3 control.tss.pol2 11 control body input 2 5 3 control.body.input 12 control body pol2 2 6 3 control.body.pol2 13 treatment tss input 2 7 4 treatment.tss.input 14 treatment tss pol2 2 8 4 treatment.tss.pol2 15 treatment body input 2 7 4 treatment.body.input 16 treatment body pol2 2 8 4 treatment.body.pol2
The counts matrix has a row for each gene. So Sample 1 and Sample 3 columns in the counts matrix are counts from the same BAM file, just different parts of each gene.
A first attempt:
y <- DGEList(counts=counts.matrix, group=sampletable$group) y <- calcNormFactors(y) design <- model.matrix(~0+sampletable$group, data=y$samples) y <- estimateDisp(y, design) fit <- glmFit(y, design) my.contrasts <- makeContrasts( full=( (treatment.tss.pol2 - treatment.tss.input) - (treatment.body.pol2 - treatment.body.input) ) - ( (control.tss.pol2 - control.tss.input) - (control.body.pol2 - control.body.input) ), levels=design ) tt.full <- topTags(glmLRT(fit, contrast=my.contrasts[,'full']))
To evaluate this, I visually identified in the genome browser some genes that appear to be changing their pausing index and find where they fall in the top tags. The results from the contrast do not match what I'm seeing in the browser. Some possible reasons:
- nesting is not properly accounted for
- "samples" are not independent (e.g., control.tss.input and control.body.input come from the same BAM file), causing normalization to be incorrect
- systematic difference in window sizes (TSS is 300bp; body is 1700bp) possibly skewing normalization
- input introduces noise/artifacts that are not worth the benefit of controlling for open chromatin
- genome browser normalization (scaled by million mapped reads) not consistent with edgeR normalization
Some things I'm considering:
- I ran across edgeR::calcNormOffsetsforChIP. Sounds potentially useful, but I'm not sure how to correctly use it in this context
- Manually enter normalization factors to reflect library size and region size and bypass TMM
- Use csaw? No overlapping windows here so I'm not sure what changes from the vignette would be necessary
- Ignore input altogether, as suggested in e.g. the csaw docs
Any corrections, suggestions, or nudges in the right direction would be much appreciated. Thanks!