Question

Interpretation of cross correlation plot to determine fragment length

0

Entering edit mode

siklenkak • 0

@siklenkak-12153

Last seen 6.5 years ago

Hi,

I have a question about interpreting my cross-correlation plots from native ChIP-seq data with 100bp SE reads aligned uniquely to the genome using bowtie2. The bam files have duplicates marked and removed for this step.

To estimate fragment length I used correlateReads() from the csaw package then plotted a very different graph than the smooth one showed in the user guide... I still see a prominent peak occurring at my expected fragment size of 147 bp for histone data, but it's hard to interpret what the other peaks are. I don't have a lot of data-sets to test this on, but how critical is the smoothness of this plot? From the commentary and other readings I understand that a cross correlation plot will have a second peak at the read-length if the efficiency was poor or duplicates not removed, but there is no mention about jaggedness. Based on the browser tracks and peak-calling results I always assumed my efficiency of pull down was very good, as the data is very clean and clear.

At first I thought the shape may be due to read-trimming and processing, but raw and processed .bam files give the same result.

Any help with interpretation is appreciated.

Cross correlation plot:

Code:

library(rtracklayer)

ch <- import.chain("~/ChIPdata/reference/ChainFiles/mm9ToMm10.over.chain")
original <- import("~/ChIPdata/reference/ChainFiles/mm9-blacklist.bed")
blacklist <- liftOver(x=original, chain=ch)
blacklist <- unlist(blacklist)

library(csaw)
x <- correlateReads(bamfile, max.dist=500, param=readParam(minq=20, discard=blacklist, dedup=T))
plot(1:length(x)-1, x, xlab="Delay (bp)", ylab="CCF", type="l")

csaw chip-seq histone • 1.9k views

ADD COMMENT • link updated 7.3 years ago by Aaron Lun ★ 28k • written 7.3 years ago by siklenkak • 0

score 2 · Answer 1 · 2017-01-11

2

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 9 hours ago

The city by the bay

I don't have much experience with native ChIP-seq, so take my comments with a grain of salt. Anyway, you haven't mentioned what the histone mark is. If it's something broad like H3K27me3 or H3K36me3, I wouldn't even bother with the cross-correlation plots; there's not going to be strong bimodality if the enrichment is spread out across a large region. If it's something like H3K9ac or H3K4me3, then your plot is more concerning - I would have definitely expected a smoother peak, driven by strong bimodality at highly bound sites. Perhaps have a look at your tracks and see if the peaks are actually bimodal or whether they're something else, e.g., repeats.

In the absence of a good cross-correlation plot, you can also get the fragment length from the Bioanalyzer traces. This isn't quite right as fragments of a certain length are more likely to form colonies on the flow cell, but it's better than a wild guess. I also notice that you say you removed duplicates from the BAM file, rather than just marking them. I generally do not do this unless I have a good reason, i.e., the data are really noisy with lots of obvious duplicates upon manual inspection in a genome browser. See the user's guide for an explanation, if you haven't already.

As for what the wobbles are in your plot - they seem pretty regular, about 10-15 bp apart. The closest value I can think of is one turn of the double helix, though I don't know why this would be relevant - something to do with micrococcal nuclease cleavage preferences, perhaps? Also, I don't know what the up/down spike near zero represents (unless your reads are very short).

ADD COMMENT • link 7.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron, thanks for your answer. Sorry to not mention that this is H3K4me3 data from mouse sperm cells which are a-typical and have only ~1% of the nucleosomes present in somatic cells. These nucleosomes are mainly retained at TSS and preferentially enrich at CpG dense regions. Median peak width is around 2000 bp. Here's a browser shot of a couple peak regions... To me it looks bimodal, but there are some broad regions throughout as well.

Re: duplicates, I worded it wrong, I marked them and only set readParam(dedup=T) for this plot as per the documentation... All that changed between the two was the presence of another wobbly peak at the 100bp size of my read length.

I like your thinking about the double helix and MNase sensitivity. I'll look into that.

Thanks again

ADD REPLY • link 7.3 years ago siklenkak • 0

1

Entering edit mode

Those peaks are on the larger side, so perhaps it's not surprising to fail to see any strand bimodality. Generally, it seems that as soon as peaks get above 1 kbp, the bimodality becomes less pronounced; you're basically looking for a 100 bp interval of strand-specific coverage on the flanks of the bound region, which will be harder to spot as the region size increases. In the tracks you've shown, there's not much strand bimodality, assuming the bars represent reads coloured by strand - I would have expected a pure red block adjacent to a pure blue block, rather than the colours being mixed together.

In short, I wouldn't worry about the strange cross-correlation plot. Yes, it's odd, but so is your biological system, and as long as you're confident that the enriched regions aren't caused by something else (e.g., repeats), then you should just proceed with your analysis. In fact, for large regions, the fragment length doesn't really matter because the count is mainly determined by the window width, e.g., if you're using windows of 2 kbp, then it doesn't make much difference to give or take 100 bp.

Also, try to get in the habit of typing out TRUE in its entirety. Otherwise you can get surprises like this:

T <- 0
T==TRUE # evaluates to FALSE