ChIP-seq MA plot possibly skewed - how to normalise?

hasse.bossenbroek • 0

@hassebossenbroek-23193

Hi everyone,

I'm using DiffBind to analyse some H3K4me3 ChIP-seq data. I've used DiffBind for H3K27ac ChIP-seq data as well, and the MA plots for those data were nicely symmetrical and centered around 0. However, this is not the case with my H3K4me3 data: the central line of the MA plot looks diagonal rather than horizontal, with a lot of positive log fold changes at low "concentrations" (peak height) and negative log fold changes at high concentrations (see attached figure). When I use these data for differential binding analysis with DiffBind's default settings, both of these groups (low conc/positive LFC and high conc/negative LFC) contain lots of differentially bound sites. However, when I use edgeR instead, this is not the case and only about a quarter of the sites are differentially bound (all low conc/positive LFC).

I have matching RNA-seq data, and these do not suggest that there is a global shift in transcription or anything. I also know that there is a difference in experimental efficiency between the two groups I am comparing (disease v. healthy). When I use bFullLibrarySize=FALSE in dba.analyze() with DESeq2 to compensate for this, I get results very similar to the edgeR result. Basically, I don't have other evidence to suggest there is a global change in signal between my two conditions (but I have not systematically tried to rule this out in the lab).

But even with stringent normalisation, the central line in the MA plot is diagonal. Does anyone know if this means anything, and what the consequences are for the normalisation method I should apply?

Any help would be much appreciated.

Thank you!

[Figure: MA plot]

DiffBind MA plot normalization ChIP-seq • 1.1k views

updated 3.8 years ago by Rory Stark ★ 5.2k • written 3.8 years ago by hasse.bossenbroek • 0

score 0 · Answer 1 · 2020-07-23

Given that the mark is H3K4me3, which tends to accumulate in the promoters of actively transcribed genes, seeing the highest density at relatively high concentrations, with relatively low fold changes, is consistent with expectations. It would also be interesting to see the non-normalized plot (bNormalized=FALSE), but it does look like there is greater signal in the Control condition. Overall, I don't see anything too concerning here.

Observations: The main differences between the analyses are that DESeq2 with full library size identifies many more DB sites, while the edgeR analysis identifies a few more sites that lose signal in the disease condition than the DESeq2 analysis using reads overlapping consensus peaks, which in turn identifies more regions with lower fold changes.

Given that the data are overall shifted towards the Control, while the DB sites are shifted towards the Leukaemia, I would go with the edgeR/TMM analysis, or apply a fold change cutoff to a DESeq2 analysis.
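
To make the difference between these options concrete, here is a small toy sketch in plain Python (not DiffBind or edgeR code). It assumes invented counts for ten sites, equal sequencing depth in both groups, and a roughly two-fold lower ChIP efficiency in the disease samples. The function names (`m_values`, `trimmed_mean_scale`, `n_passing`) are hypothetical, and the trimmed mean is a simplified, unweighted stand-in for TMM, which additionally trims on A-values and applies precision weights.

```python
import math

def m_values(disease, control, scale=1.0, pseudo=1.0):
    """Per-site log2 fold change (disease over control) after scaling disease counts."""
    return [math.log2((d * scale + pseudo) / (c + pseudo))
            for d, c in zip(disease, control)]

def trimmed_mean_scale(disease, control, trim=0.3, pseudo=1.0):
    """Simplified TMM-like factor: drop the most extreme M-values at both ends,
    average the rest, and return the multiplier that centres that average on 0.
    Unlike real TMM, this is unweighted and does not trim on A-values."""
    m = sorted(m_values(disease, control, pseudo=pseudo))
    k = int(len(m) * trim)
    kept = m[k:len(m) - k]
    return 2.0 ** (-sum(kept) / len(kept))

def n_passing(m, min_abs_lfc=1.0):
    """Count sites whose absolute log2 fold change meets the cutoff."""
    return sum(1 for v in m if abs(v) >= min_abs_lfc)

# Invented toy counts: eight sites are halved in disease (efficiency effect),
# two sites genuinely gain signal four-fold.
control = [100, 200, 400, 800, 1600, 3200, 6400, 12800, 100, 200]
disease = [50, 100, 200, 400, 800, 1600, 3200, 6400, 400, 800]

# Assumption: equal sequencing depth, so depth-based scaling changes nothing.
depth_scale = 1.0
tmm_like_scale = trimmed_mean_scale(disease, control, pseudo=0.0)

# pseudo=0.0 is safe here only because the toy data contain no zero counts.
m_depth = m_values(disease, control, scale=depth_scale, pseudo=0.0)
m_tmm = m_values(disease, control, scale=tmm_like_scale, pseudo=0.0)

print("scale factor (trimmed mean):", tmm_like_scale)
print("sites with |LFC| >= 1, depth scaling:", n_passing(m_depth))
print("sites with |LFC| >= 1, trimmed-mean scaling:", n_passing(m_tmm))
```

In this toy, depth-based scaling leaves every site at or beyond a two-fold change, while the trimmed-mean scaling recentres the bulk of sites on zero so that only the two genuinely changed sites pass the cutoff. That recentring is only appropriate if most sites really are unchanged, which is the same judgment call as the choice between full library size and TMM above.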