Question

RiP RLE normalisation using spike-in peaks in DiffBind for ChIP-seq

0

Entering edit mode

spg • 0

@d3272e17

Last seen 15 months ago

Canada

Hi,

I have two conceptual/technical questions about spike-in normalisation of ChIP-seq data.

We did spike-in for PolII ChIP-seq from a mouse cell line with ActiveMotif kit, which uses anti-H2Av antibody with drosophila cells. The mouse and fly chromatin are mixed and a mix of anti-mouse PolII and anti-H2Av antibodies are used to IP. I aligned the reads to mouse and drosophila genome separately. And now I am trying to get normalization factors for each replicate using DiffBind (v. 3.8.3), which I would like to use for differential peak analysis as well as scaling bigwig files.

Qn1: From what I can gather, in DiffBind, by including the spikein bam filles in the Spikein column of sampleSheet, I have the option to use the spikein library size or the csaw background bins method with RLE or TMM. Is there a built-in option to use the reads in peaks (RiP) method from the spike-in datasets (i.e., read counts in H2Av peaks) for normalizing the mouse reads? If there is an option to supply the peaks in the spike-in dataset, I cannot find it.

To my naive thinking, this option better controls for IP efficiency (~signal/noise ratio) per replicate than library size or background bins (assuming the antibody and chromatin concentrations in the replicates are constant). Am I missing something conceptually?

Qn2: To get the normalisation factors using the RiP from the spike-in datasets, I made a dba object using the drosophila sequences ONLY (drosophila bam files + H2Av peaks) and tried the various normalisations (code below). Differential peaks should not be found between conditions here, since these are spike-in chromatin IPs. It looks like the RiP RLE normalisation does an eerily excellent job at normalising the counts, while the library size or background bins methods still give a lot significantly different peaks when they shouldn't (although the fold change is minimal).

I understand, with RiP RLE normalisation, I am more directly normalising the peak signals and plotting the peak signal, so the fact that it has the best effect in this analysis is not surprising in itself. But I am surprised by the near absence of variation between the replicates (meaning a lot of the variation comes from IP efficiency between samples?). Does this mean, RiP RLE normalisation using spike-in peaks is the correct method to normalise my mouse read counts?

I am pretty new to ChIP-seq/DiffBind analyses. Please let me know if anything is wrong or unclear. Thank you!

noNorm libSize bkgd_bins_RLE RiP_RLE RiP_RLE_y_zoomed

>spik.db <- dba(sampleSheet = "../input_data/sampleSheets/P2_dm3_spikein_sampleSheet_DiffBind.csv",
         minOverlap = 1)
>spik.db
4 Samples, 11449 sites in matrix:
ID Tissue Factor Condition  Treatment Replicate Intervals
1  1   CL    P2        WT Full-media         1     10715
2  2   CL    P2        WT Full-media         2      9763
3  3   CL    P2        KO Full-media         1     11166
4  4   CL    P2        KO Full-media         2     11480
>spik.counts <- dba.count(spik.db,minOverlap = 1)
>spik.model <- dba.contrast(spik.counts, minMembers=2, reorderMeta=list(Condition="KO"))
>spik.model <- dba.analyze(spik.model) #default lib size
>dba.plotMA(spik.model, bNormalized = F, sub="No norm")
>dba.plotMA(spik.model, bNormalized = T, sub="Norm: Lib size")
>rm(spik.model)

>spik.model <- dba.contrast(spik.counts, minMembers=2, reorderMeta=list(Condition="KO"))
>spik.model<-dba.normalize(spik.model,normalize = "RLE",background=T) # bkgd bins RLE normalisation
>spik.model <- dba.analyze(spik.model)
>dba.plotMA(spik.model, bNormalized = T, sub="Norm: csaw bkgd bins RLE")
>rm(spik.model)

>spik.model <- dba.contrast(spik.counts, minMembers=2, reorderMeta=list(Condition="KO"))
>spik.model<-dba.normalize(spik.model,normalize = "RLE") # RiP RLE normalisation
>spik.model <- dba.analyze(spik.model)
>dba.plotMA(spik.model, bNormalized = T, sub="Norm: RiP RLE")

>db_report<-dba.report(spik.model,th=1,bCounts = T) #to examine normalized counts of RiP RLE
>plot(db_report$Fold~db_report$Conc,main="RLE (RiP) y-axis zoomed")

DiffBind SpikeIn Normalization ChIP-seq csaw • 1.6k views

ADD COMMENT • link 16 months ago • updated 15 months ago spg • 0

score 2 · Accepted Answer · 2022-12-15

2

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 15 days ago

Cambridge, UK

Qn1: Is there a built-in option to use the reads in peaks (RiP) method from the spike-in datasets

If you specify spike-in read bam files in your sample sheet, you can pass a GRanges object as the value for the spikein parameter, and then only drosophila reads that overlap peaks in the GRanges object will be counted. This will become the "background", and you can further specify normalize="RLE" to do what you want. See the help page for dba.normalize() where it discusses the spikein parameter, specifically where it says how spikein can be a GRanges.

You did find an alternative way of deriving the normalisation factors you wanted by running a separate normalisation using only the spike-in reads and peaks, which is also valid, but the spikein=GRanges method is more straighforward.

Qn2: I am surprised by the near absence of variation between the replicates (meaning a lot of the variation comes from IP efficiency between samples?). Does this mean, RiP RLE normalisation using spike-in peaks is the correct method to normalise my mouse read counts?

RLE is a good choice when you expect most of the peaks not to be different between the conditions, which is exactly the assumption for spike-ins, where differences in the raw data should be due to IP efficiency and sequencing depth.

ADD COMMENT • link 16 months ago Rory Stark ★ 5.2k

0

Entering edit mode

I looked at the dba.normalize() help section where a GRanges can be supplied to the spikein parameter. This is the same interface for supplying parallel factor normalisation right?

I have two chromosomes that have the same names in mouse and drosophila genomes. Would that be a problem? It is a little bit unclear to me, whether supplying the spike-in bam files in the sample sheet forces dba.normalize() to look at the GRanges in its spikein parameter, only in eg.db.obj$samples$Spikein bam files or also eg.db.obj$samples$bamReads. I didn't realise this would be an issue downstream when aligning, or I would have aligned to the combined genomes with the fly chromosome names altered (probably best practice going forward regardless).

I would also like to edit the original question to reflect that TMM normalisation produces almost identical results to RLE. The distinction is really whether reads in peaks (RiP) vs library size/background bins method for this specific use case. But, I do not know the etiquette for editing after the question has been replied to. If it is frowned upon, this comment can be the record clarifying this.

And thanks a lot for actively supporting the package, responding to questions and the tutorials Rory. I learnt a lot about DiffBind from your replies to the different forum posts!

ADD REPLY • link 16 months ago spg • 0

1

Entering edit mode

Yes, this is the same interface as for parallel factors. The difference is whether or not Spikein files are specified in the sample sheet. If there are no Spikein files supplied, the GRanges will be evaluated against the primary read count files, as in the case of a parallel factor. When you specify Spikein files, only those files will be looked at, so in your case where they share a chromosome name, only reads on that chromosome in the Spikein files will be counted.

ADD REPLY • link 15 months ago Rory Stark ★ 5.2k

0

Entering edit mode

Yes, I can confirm supplying spikein peaks this way works as you mentioned.

After looking at the RiP proportions for mouse PolII and Drosophila H2AV in their respective genomes, I do have a follow-up conceptual question. The RiP values are not consistently changing between the replicates (i.e., one replicate has less PolII RIP and more H2AV RiP and the other replicate is the opposite). Assuming RiP proportion reflects IP efficiency, the IP efficiencies between PolII and H2AV antibodies do not seem to change consistently, meaning I cannot confidently use H2AV IP efficiency to normalise PolII IP. I guess if I had used a single antibody (a PolII antibody that binds to both mouse and Drosophila PolII), then mouse and Drosophila RiP values will more consistently change between replicates? Have you come across this scenario? Am I overthinking about this inconsistency mouse/Drosophila RiP proportion between replicates of the same sample? In the end, I am using the csaw background bins method with RLE on the spikein H2AV reads to normalise mouse PolII reads.

ADD REPLY • link 15 months ago spg • 0