Question: Removing GC content bias and trended bias from ChIP-seq data
3.0 years ago, s14376430 wrote:

Note: I originally posted this on Biostars, but it was suggested that I post here instead.

I am doing a differential binding (DB) analysis of ChIP-seq data using the csaw package. It seems I have a small trended bias and a possible GC content bias in my data. Below are MA plots showing signal in the merged peaks using CPM normalisation (LHS) and loess normalisation (RHS) to account for trended biases:

If I create MA plots coloured by GC content (showing only the top/bottom 10% by GC content), there also appears to be a GC content bias (LHS), which can be fixed using CQN normalisation (RHS):

However, looking at the effect of CQN normalisation (RHS) on all of the data, I'm not sure that it corrects the trended bias as well as the loess normalisation does.

Also, when I look at the called differential peak regions in a genome browser using the CQN normalisation, some of the calls don't match what I can measure roughly by eye, suggesting the normalisation isn't appropriate. Both of these methods output an offset matrix which I supply to edgeR.

1. Is there a way to combine the offsets produced by cqn and csaw to correct for both the trended bias and the GC bias?
2. Is there a better way to correct for these biases?

Answer: Removing GC content bias and trended bias from ChIP-seq data
3.0 years ago, Aaron Lun (Cambridge, United Kingdom) wrote:
1. No. To illustrate, the two methods will both remove the MA trend to some degree; combining the offsets would result in over-normalization where you effectively remove the trend twice.
2. Well, cqn should do a pretty good job, and it seems to do so in your plots. (I'm not very familiar with how it works, but it's pretty close to quantile normalization, so it should remove trends as well.) I don't know why you would expect the calls to match what you measure by eye; obviously they won't, because the normalization does a fair bit of work to correct the biases present in the raw data.
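
The over-normalization in point 1 can be sketched with a toy example. This is only an illustration under stated assumptions: the M and A values are simulated, the trend is a straight line, and an ordinary least-squares fit stands in for both normalization methods (real loess and cqn fits are more flexible and would not be identical to each other). The helper names `fit_line` and `trend_offsets` are hypothetical and are not part of csaw, cqn or edgeR.

```python
# Toy illustration only: simulated M/A values, not csaw or cqn output.
import random


def fit_line(x, y):
    """Ordinary least-squares fit of y on x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def trend_offsets(a, m):
    """Fitted trend of M on A, standing in for one method's offsets."""
    intercept, slope = fit_line(a, m)
    return [intercept + slope * ai for ai in a]


rng = random.Random(0)
# A: average abundance; M: log-fold change carrying an abundance-dependent trend.
a_values = [rng.uniform(2.0, 10.0) for _ in range(500)]
m_values = [0.3 * ai - 1.5 + rng.gauss(0.0, 0.2) for ai in a_values]

# Both stand-in "methods" estimate their offsets from the same raw data.
offsets_one = trend_offsets(a_values, m_values)
offsets_two = trend_offsets(a_values, m_values)

# Apply one set of offsets, then the sum of both sets.
once = [m - o for m, o in zip(m_values, offsets_one)]
twice = [m - o1 - o2 for m, o1, o2 in zip(m_values, offsets_one, offsets_two)]

raw_slope = fit_line(a_values, m_values)[1]
slope_once = fit_line(a_values, once)[1]
slope_twice = fit_line(a_values, twice)[1]
print(f"raw: {raw_slope:.3f}, one offset: {slope_once:.3f}, both offsets: {slope_twice:.3f}")
```

In this simplified setting a single set of offsets flattens the trend, whereas summing the two sets leaves a trend of the same size in the opposite direction.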

In general, we don't routinely perform GC correction for DB analyses of ChIP-seq data because:

1. It seems unnecessary in most cases, where the GC bias cancels out in DB comparisons between samples.
2. It requires extra work if you want to combine it with trend-based corrections. Getting rid of trends is a bigger priority as these can mess up mean-dispersion trend fitting in edgeR.
3. GC content may be correlated with interesting biology, e.g., proteins binding in GC-rich regions. So normalizing to remove GC content-correlated changes may end up removing the DB that you're looking for in the first place.

Also, looking at the 10% of windows with the most extreme GC content may give you a misleading impression of what's actually happening to the bulk of the data - perhaps make a GC content % vs M-value plot and see whether there's any trend there. (Some trend is likely due to the correlation between GC content and abundance; I guess the question is how much remains after loess correction.)
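
One rough way to look at GC content against M-values across all of the data, rather than only the extremes, is to bin by GC content and summarize the M-values in each bin. The sketch below is a minimal version of that idea under loud assumptions: the GC fractions and M-values are simulated, and `median_m_by_gc_bin` is a hypothetical helper rather than a function from csaw, cqn or edgeR. In practice the inputs would be the per-window GC content and the M-values remaining after loess correction.

```python
# Toy illustration only: simulated GC fractions and M-values.
import random
from statistics import median


def median_m_by_gc_bin(gc, m, n_bins=10):
    """Sort windows by GC fraction, split them into n_bins roughly
    equal-sized bins, and return (median GC, median M) for each bin."""
    if len(gc) != len(m):
        raise ValueError("gc and m must have the same length")
    if len(gc) < n_bins:
        raise ValueError("need at least one window per bin")
    ordered = sorted(zip(gc, m))
    size = len(ordered) // n_bins
    summary = []
    for i in range(n_bins):
        start = i * size
        # The last bin absorbs any leftover windows.
        stop = (i + 1) * size if i < n_bins - 1 else len(ordered)
        chunk = ordered[start:stop]
        summary.append((median([g for g, _ in chunk]),
                        median([v for _, v in chunk])))
    return summary


rng = random.Random(1)
# Simulated windows whose M-values drift upwards with GC content.
gc_fraction = [rng.uniform(0.3, 0.7) for _ in range(2000)]
m_values = [1.5 * (g - 0.5) + rng.gauss(0.0, 0.3) for g in gc_fraction]

summary = median_m_by_gc_bin(gc_fraction, m_values)
for gc_mid, m_mid in summary:
    print(f"GC {gc_mid:.2f}: median M {m_mid:+.2f}")
```

Bin medians that stay near zero across the GC range would suggest little GC-associated trend in the bulk of the data; a steady drift, as in the simulated values here, would suggest otherwise.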