Question

Spike-in normalization in edgeR

Hesh (@hesh-14437), University of Washington

Hi,

We are trying to analyze differential binding of transcription factors (TFs). The TFs were mapped using the CUT&RUN method.

For the differential analysis we used edgeR with TMM normalization. However, in one comparison the total level of the TF is higher in one condition than in the other (as shown by western blot), so TMM normalization does not seem to be accurate. We would like to use spike-in normalization instead. Does anyone know how to incorporate spike-in normalization into an edgeR differential analysis?

Thank you.

Tags: CUTandRUN, edgeR, Normalization, SpikeIn, ChIPSeq

Written 3.5 years ago by Hesh • updated 17 months ago by Bogdan

Do you have spike-ins? If you do, can you explain what was spiked-in and what measurements you have on them?

Reply by Gordon Smyth • 3.5 years ago

The TF was mapped in human cells, and yeast DNA was spiked in. Both were sequenced, so I now have human and yeast reads aligned and quantified.

Reply by Hesh • 3.5 years ago

You are correct that you should not use TMM normalization directly on ChIP-seq counts when the binding enrichment changes systematically between conditions. Spike-ins are one way to get around this, but you can also normalize to "background" reads using large bins across the genome. The csaw package guide shows how to do this, and this ability is now built into the DiffBind package (the vignette walks through this process in some detail).
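As a rough sketch of the background-bin idea, the simulated example below computes TMM factors on large-bin counts and transfers them to the peak counts. All counts, library sizes and fold changes are invented for illustration. With real data, the csaw guide obtains the bin counts with windowCounts(..., bin=TRUE) and the factors with normFactors(); plain edgeR calls stand in for those steps here.

```r
library(edgeR)

set.seed(42)
group <- factor(c("ctrl", "ctrl", "treat", "treat"))
lib.sizes <- c(1.0e6, 1.4e6, 0.9e6, 1.2e6)  # hypothetical total mapped reads per sample

# Simulated background: 2000 large bins whose coverage depends only on sequencing depth.
bin.counts <- sapply(lib.sizes, function(n) rpois(2000, lambda = 40 * n / 1e6))

# Simulated peaks: binding is doubled everywhere in the treated samples.
fold <- c(1, 1, 2, 2)
peak.counts <- sapply(seq_along(lib.sizes), function(i) {
    rnbinom(500, mu = 200 * fold[i] * lib.sizes[i] / 1e6, size = 20)
})

# TMM directly on the peaks treats the global doubling as a bias and removes it.
ypeak <- calcNormFactors(DGEList(peak.counts, lib.size = lib.sizes, group = group))
peak.factors <- ypeak$samples$norm.factors

# TMM on the background bins only captures depth-related differences.
ybin <- calcNormFactors(DGEList(bin.counts, lib.size = lib.sizes))
bin.factors <- ybin$samples$norm.factors

# Use the bin-derived factors for the differential binding analysis of the peaks.
y <- DGEList(peak.counts, lib.size = lib.sizes, group = group)
y$samples$norm.factors <- bin.factors

round(rbind(peaks = peak.factors, bins = bin.factors), 2)
```

In this simulation the bin-derived factors stay close to 1, while the peak-derived factors differ markedly between conditions, which is the behaviour that would mask a genuine global change in binding.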

For spike-ins, there are a number of protocols and commercial kits for performing ChIP-seq spike-ins, usually using Drosophila chromatin. After sequencing, the reads can be aligned against the Drosophila reference genome (either separately or combined with the target genome). Once you have the alignments, the latest version of DiffBind has built-in support for normalizing to spike-in data and performing differential binding analysis using edgeR. The vignette has a section showing how to do this.

Note that, once the normalization parameters have been set, you can export the edgeR DGEList object from within DiffBind for fine-grained control over the edgeR analysis.

Reply by Rory Stark • 3.5 years ago

Thank you. I will look into this. I also want to point out that the method used was CUT&RUN, not ChIP-seq.

Reply by Hesh • 3.5 years ago

score 1 · Answer 1 · 2020-11-01

edgeR itself can incorporate alternative normalization schemes fairly easily; the real question is whether the assumptions behind the spike-in process are applicable. IIRC, there are two major assumptions:

1. The spike-in antibody (usually against some Drosophila histone mark) is subject to the same technical biases as the actual antibody against your desired target.
2. Your spike-in addition is sufficiently accurate so that the ratio of the concentrations of spike-in chromatin to your actual chromatin of interest is constant across samples.

To make it all work, you can align reads to a combined genome containing both your human and spike-in reference sequence. Then it's a simple matter of:

1. Identifying enriched regions in the combined genome. The safest way to do so is to pool reads from all samples together for a single round of peak calling.
2. Creating the usual DGEList, where each row corresponds to an enriched region in the combined genome.
3. Subsetting the DGEList to the regions from the spike-in genome (do not set keep.lib.sizes=FALSE!) and running calcNormFactors() on the subset.
4. Transferring the normalization factors from the subset back to the full DGEList.

Steps 3 and 4 would look something like this, assuming your DGEList is named y and you have a GRanges named locations:

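library(edgeR)
library(GenomicRanges)

# "I", "II", "III" are placeholders: substitute the actual spike-in chromosome names.
is.spike.in <- as.logical(seqnames(locations) %in% c("I", "II", "III"))
ysub <- y[is.spike.in,]  # keep.lib.sizes=TRUE by default, so the full library sizes are retained
ysub <- calcNormFactors(ysub)
y$samples$norm.factors <- ysub$samples$norm.factors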

Here, the TMM step assumes that any difference in the spike-in coverage is technical and should be removed. The transfer of the normalization factors back to y further assumes that the biases affecting the spike-in chromatin are also applicable to the actual test chromatin.
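To see what these two assumptions buy you, here is a small simulated sketch. Everything in it is invented for illustration: the counts, the technical scaling factors, and the chromosome labels "chrH" and "chrS", with a plain character vector standing in for seqnames(locations). If the assumptions hold, the effective library sizes after the transfer should track only the technical factor, not the true global change in binding.

```r
library(edgeR)

set.seed(1)
tech <- c(1.0, 1.5, 0.8, 1.2)   # hypothetical technical efficiency of each sample
fold <- c(1, 1, 2, 2)           # simulated global change in binding (treated = 2x)

n.human <- 1000
n.spike <- 300
chrom <- rep(c("chrH", "chrS"), c(n.human, n.spike))  # made-up chromosome labels

counts <- sapply(1:4, function(i) {
    c(rpois(n.human, lambda = 100 * tech[i] * fold[i]),  # human regions
      rpois(n.spike, lambda = 200 * tech[i]))            # spike-in regions
})
y <- DGEList(counts, group = factor(c("ctrl", "ctrl", "treat", "treat")))

# Steps 3 and 4: TMM on the spike-in rows, keeping the full library sizes.
is.spike.in <- chrom == "chrS"
ysub <- y[is.spike.in, ]
ysub <- calcNormFactors(ysub)
y$samples$norm.factors <- ysub$samples$norm.factors

# Effective library sizes, rescaled by the technical factor; values near 1
# mean the spike-in factors recovered the technical differences.
eff.lib <- y$samples$lib.size * y$samples$norm.factors
round(eff.lib / tech / mean(eff.lib / tech), 3)
```

Note that the subset keeps the library sizes of the full DGEList, which is why the resulting factors can be copied back without any further adjustment.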

And that's it. After that, it's just the usual edgeR workflow. Personally I always felt that these assumptions were pretty sketchy, and I would prefer to use the binning approach (see Section 4.1 here for some background). But to each their own.
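For completeness, one common version of the downstream workflow is sketched below on simulated data, using edgeR's quasi-likelihood functions. The counts, technical factors and the 2-fold global change are made up, and the choice of the quasi-likelihood pipeline is an assumption here rather than something the thread prescribes.

```r
library(edgeR)

set.seed(2)
group <- factor(c("ctrl", "ctrl", "treat", "treat"))
tech <- c(1.0, 1.5, 0.8, 1.2)   # hypothetical technical efficiency of each sample
fold <- c(1, 1, 2, 2)           # simulated global change in binding
is.spike.in <- rep(c(FALSE, TRUE), c(1000, 500))

counts <- sapply(1:4, function(i) {
    c(rnbinom(1000, mu = 100 * tech[i] * fold[i], size = 50),  # regions of interest
      rnbinom(500,  mu = 200 * tech[i],           size = 50))  # spike-in regions
})
y <- DGEList(counts, group = group)

# Spike-in-derived normalization factors, transferred to the full object.
y$samples$norm.factors <- calcNormFactors(y[is.spike.in, ])$samples$norm.factors

# Differential binding on the regions of interest only; row subsetting
# keeps the library sizes and normalization factors.
yh <- y[!is.spike.in, ]
design <- model.matrix(~group)
yh <- estimateDisp(yh, design)
fit <- glmQLFit(yh, design)
res <- glmQLFTest(fit, coef = 2)
tab <- topTags(res, n = Inf)$table

median(tab$logFC)  # should sit near log2(2) = 1 if the global change is preserved
```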

I'll also add that just adding in yeast DNA is not really all that informative. The main appeal of spike-ins is to capture differences in immunoprecipitation efficiency across samples. If you're just throwing in yeast DNA without an antibody against it, you don't get that information; at that point, you might as well save yourself the trouble and use TMM on the bins, especially given that your TF probably isn't binding enough of the genome to compromise the accuracy of the binning approach.