Spike-in normalization in edgeR
Hesh ▴ 10
@hesh-14437
University of Washington

Hi,

We are trying to analyze differential binding of a transcription factor (TF). TF binding was mapped using the CUT&RUN method.

For the differential analysis we used edgeR with TMM normalization. However, in one case the total TF level is higher in one condition than in the other (based on western blot), and TMM normalization does not seem to be accurate there. We would like to use spike-in normalization instead. Does anyone know how to incorporate spike-in normalization into an edgeR differential analysis?

Thank you.

CUTandRUN edgeR Normalization SpikeIn ChIPSeq

Do you have spike-ins? If you do, can you explain what was spiked-in and what measurements you have on them?


The TF was mapped in human cells, and yeast DNA was spiked in. Both were sequenced, so I now have human and yeast reads aligned and quantified.


You are correct that you should not use TMM normalization directly on ChIP-seq counts when the binding enrichment changes systematically between conditions. Spike-ins are one way to get around this, but you can also normalize to "background" reads using large bins across the genome. The csaw package guide shows how to do this, and this ability is now built into the DiffBind package (the vignette walks through this process in some detail).

For spike-ins, there are a number of protocols and commercial kits for performing ChIP-seq spike-ins, usually using Drosophila chromatin. After sequencing, the reads can be aligned against the Drosophila reference genome (either separately or combined with the target genome). Once you have the alignments, the latest version of DiffBind has built-in support for normalizing to spike-in data and performing differential binding analysis using edgeR. The vignette has a section showing how to do this.

Note that, once the normalization parameters have been set, you can export the edgeR DGEList object from within DiffBind for fine-grained control over the edgeR analysis.


Thank you. I will check on this. I also want to point out that the method used was CUT&RUN, not ChIP-seq.

Aaron Lun ★ 28k
@alun
The city by the bay

edgeR itself can incorporate alternative normalization schemes fairly easily; the real question is whether the assumptions behind the spike-in process are applicable. IIRC, there are two major assumptions:

  • The spike-in antibody (usually against some Drosophila histone mark) is subject to the same technical biases as the actual antibody against your desired target.
  • Your spike-in addition is sufficiently accurate so that the ratio of the concentrations of spike-in chromatin to your actual chromatin of interest is constant across samples.

To make it all work, you can align reads to a combined genome containing both your human and spike-in reference sequence. Then it's a simple matter of:

  1. Identifying enriched regions in the combined genome. The safest way to do so is to pool reads from all samples together for a single round of peak calling.
  2. Creating the usual DGEList where each row corresponds to an enriched region in the combined genome.
  3. Subsetting the DGEList to your regions from the spike-in genome (do not set keep.lib.sizes=FALSE!) and running calcNormFactors() on the subset.
  4. Transferring the normalization factors from the subset back to the full DGEList.

Steps 3 and 4 would look something like this, assuming your DGEList is named y and you have a GRanges named locations:

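# Replace "I", "II", "III" with whatever the spike-in chromosome names are.
is.spike.in <- as.logical(seqnames(locations) %in% c("I", "II", "III"))
ysub <- y[is.spike.in,]        # default keep.lib.sizes=TRUE retains the full library sizes
ysub <- calcNormFactors(ysub)  # TMM on the spike-in regions only
y$samples$norm.factors <- ysub$samples$norm.factors  # transfer back to the full DGEList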

Here, the TMM step assumes that any difference in the spike-in coverage is technical and should be removed. The transfer of the normalization factors back to y further assumes that the biases affecting the spike-in chromatin are also applicable to the actual test chromatin.
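A minimal sketch of steps 3 and 4 on simulated data may help show what the transferred factors do. Everything here is a toy assumption (the peak numbers, the Poisson counts, and the doubling of target signal in the last two samples); the logical vector is.spike.in stands in for the GRanges-based lookup above.

```r
library(edgeR)

set.seed(1)
n.target <- 200  # hypothetical number of target-genome peaks
n.spike <- 50    # hypothetical number of spike-in peaks

# Toy counts for 4 samples: target signal doubles in samples 3 and 4,
# while spike-in coverage is the same in every sample.
target <- matrix(rpois(n.target * 4, lambda = rep(c(50, 50, 100, 100), each = n.target)), ncol = 4)
spike <- matrix(rpois(n.spike * 4, lambda = 50), ncol = 4)
counts <- rbind(target, spike)
is.spike.in <- rep(c(FALSE, TRUE), c(n.target, n.spike))

y <- DGEList(counts)

# Subset to spike-in rows; library sizes are still those of the full object.
ysub <- y[is.spike.in, ]
ysub <- calcNormFactors(ysub)

# Transfer the spike-in-derived factors back to the full DGEList.
y$samples$norm.factors <- ysub$samples$norm.factors

# Effective library sizes are what edgeR uses downstream.
eff <- y$samples$lib.size * y$samples$norm.factors
```

In this toy setup the effective library sizes should come out roughly equal across samples, because the spike-in coverage is equal, so the doubling of target signal in samples 3 and 4 is preserved rather than normalized away.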

And that's it. After that, it's just the usual edgeR workflow. Personally I always felt that these assumptions were pretty sketchy, and I would prefer to use the binning approach (see Section 4.1 here for some background). But to each their own.

I'll also add that just adding in yeast DNA is not really all that informative. The main appeal of spike-ins is to capture differences in immunoprecipitation efficiency across samples. If you're just throwing in yeast DNA without an antibody against it, you don't get that information; at that point, you might as well save yourself the trouble and use TMM on the bins, especially given that your TF probably isn't binding enough of the genome to compromise the accuracy of the binning approach.
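The binning idea can be sketched with edgeR alone on simulated counts. This is only an illustration under stated assumptions: in practice the bin counts would come from something like csaw's windowCounts() with a large bin width, whereas here the bins are simulated as pure background and the library size is a stand-in for the total mapped reads. The depths, fold changes and object names are all hypothetical.

```r
library(edgeR)

set.seed(2)
n.bins <- 2000
n.peaks <- 300
depth <- c(1, 1.5, 0.8, 1.2)  # hypothetical relative sequencing depths
fold <- c(1, 1, 2, 2)         # hypothetical true binding change in samples 3 and 4

# Large background bins scale with depth only.
bins <- matrix(rpois(n.bins * 4, lambda = rep(20 * depth, each = n.bins)), ncol = 4)
# Peak counts scale with depth and with the true binding change.
peaks <- matrix(rpois(n.peaks * 4, lambda = rep(100 * depth * fold, each = n.peaks)), ncol = 4)

# Stand-in for the total number of mapped reads per library.
lib.size <- colSums(bins) + colSums(peaks)

# TMM on the bins, where most of the signal is assumed to be background.
ybin <- DGEList(bins, lib.size = lib.size)
ybin <- calcNormFactors(ybin)

# Apply the bin-derived factors to the peak-level analysis.
ypeak <- DGEList(peaks, lib.size = lib.size)
ypeak$samples$norm.factors <- ybin$samples$norm.factors

# Ratio of average normalized peak coverage, samples 3-4 versus 1-2.
avg.cpm <- colMeans(cpm(ypeak))
ratio <- mean(avg.cpm[3:4]) / mean(avg.cpm[1:2])
```

Because the factors are estimated from background rather than from the peaks themselves, a genome-wide increase in binding is not forced to average out to zero; here the ratio should land close to the simulated two-fold change.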


Hi Aaron, thank you for the informative reply. I only included yeast spike-in DNA and did not use an antibody against it. As you suggested, we have been using TMM normalization all this time. The most recent data come from cells in which, based on western blot, the TF of interest is higher in the mutant than in WT. That is why I wasn't sure whether TMM normalization would be appropriate in this case. What are your thoughts on this?

I do understand that higher total protein levels don't necessarily mean that there are more chromatin-bound sites. But how can we know this for sure and make the appropriate assumption?

Thank you.


As a general rule: if we could figure out which assumptions were reasonable from our data, we'd probably have enough information to avoid making those assumptions in the first place.

Anyway, I would suggest starting with the binning approach and seeing what happens. If you observe a consistent direction for the differential binding across all regions of interest, only then should you start to worry about whether this is caused by (i) a genuine increase in binding associated with the increase in TF concentration, or (ii) a difference in the IP efficiency between conditions. For a WT vs mutant study with everything else being the same: if the phenotype is not too dramatic, I would guess that (ii) would be rather unlikely. But you never know, e.g., if the chromatin compaction changes or the mutant starts pumping out proteins that absorb all of the formaldehyde, degrade the antibody, inhibit the MNase, or whatever.

Personally, if I saw systematic DB in one direction, and I knew that the total TF also increased in the corresponding condition, I would be quite happy to accept that there actually was increased binding in that condition.
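One way to look for a consistent direction is to check the sign of the log-fold changes across all tested regions. The sketch below is a toy under stated assumptions: the counts are simulated with every region two-fold higher in the mutant, and the library sizes are fixed to a common value to stand in for scaling that has already been obtained from bins or spike-ins.

```r
library(edgeR)

set.seed(3)
group <- factor(c("WT", "WT", "mut", "mut"), levels = c("WT", "mut"))
n <- 500  # hypothetical number of regions

# Simulated counts: every region is two-fold higher in the mutant.
mu <- rep(c(100, 100, 200, 200), each = n)
counts <- matrix(rnbinom(n * 4, mu = mu, size = 20), ncol = 4)

# Fixed library sizes stand in for bin- or spike-in-based scaling.
y <- DGEList(counts, lib.size = rep(1e6, 4), group = group)

design <- model.matrix(~group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
res <- glmQLFTest(fit, coef = 2)  # mutant versus WT

# Fraction of regions with higher binding in the mutant.
frac.up <- mean(res$table$logFC > 0)
```

A fraction near 0.5 would be what composition-based normalization assumes; a fraction close to 0 or 1 is the systematic shift discussed above.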


Thank you for the helpful input!


Hi everyone,

We are working with SNAP-ChIP, an assay that includes synthetic spike-ins.

For the normalization with edgeR, Gordon suggested the following:

y <- DGEList(counts = counts_without_spike_in, samples = mapping_file)
norm.factors <- spike_in_factor / y$samples$lib.size
norm.factors <- norm.factors / prod(norm.factors)^(1/length(norm.factors))
y$samples$norm.factors <- norm.factors

I was wondering whether this procedure has been validated in independent studies, and whether it can be applied with limma/voom as well. Thank you!

I have cross-posted the question: Using edgeR and a spike-in to calculate absolute abundance
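A small worked example may make the effect of these lines concrete. The numbers are made up, and spike_in_factor is assumed here to be the total spike-in read count per sample, which the post does not state.

```r
library(edgeR)

# Hypothetical counts for 3 regions in 4 samples (spike-in reads excluded).
counts_without_spike_in <- matrix(
    c(100, 200, 300,
      150, 250, 350,
      120, 220, 320,
      400, 500, 600),
    ncol = 4
)

# Assumed meaning: total spike-in reads per sample (hypothetical values).
spike_in_factor <- c(1000, 2000, 1000, 4000)

y <- DGEList(counts = counts_without_spike_in)

# Factors chosen so that lib.size * norm.factors is proportional to the spike-in totals.
norm.factors <- spike_in_factor / y$samples$lib.size
# Rescale so the factors multiply to 1, following the edgeR convention.
norm.factors <- norm.factors / prod(norm.factors)^(1 / length(norm.factors))
y$samples$norm.factors <- norm.factors

# Effective library sizes used by downstream edgeR functions.
eff <- y$samples$lib.size * y$samples$norm.factors
scale <- eff / spike_in_factor
```

The values in scale are all equal, so the effective library sizes are simply the spike-in totals multiplied by a common constant. Whether that is the right scaling depends on the spike-in assumptions discussed earlier in the thread.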
