Question

Unexpected reversal in normalization factors from csaw

0

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 8 months ago

Scripps Research, La Jolla, CA

In csaw, there are two suggested ways to compute normalization factors: composition bias normalization factors, computed from large bins across the genome; and efficiency bias normalization factors, computed from a subset of windows that have high-abundance or that overlap called peaks.

In the csaw User's Guide section 4.3.2 "Checking normalization with MA plots", the composition and efficiency factors are plotted over the MA plot of all windows between two samples, with the composition factor lining up horizontally with the background cloud and the efficiency factor lining up with the high-abundance cloud. However, in my data, I am consistently seeing the reverse: the composition factor aligns with the high-abundance cloud and the efficiency factor (computed from either the high-abundance windows or peak-overlapping windows) lines up with the low-abundance cloud. You can see examples of my MA plots for 3 different histone marks here:

H3K4me3: https://www.dropbox.com/s/rba4qx1g4751ghb/H3K4me3%20Selected%20Sample%20MA%20Plots.pdf?dl=0
H3K4me2: https://www.dropbox.com/s/989lcfeeahbzv8p/H3K4me2%20Selected%20Sample%20MA%20Plots.pdf?dl=0
H3K27me3: https://www.dropbox.com/s/o2sz1bdnmhm1fbt/H3K27me3%20Selected%20Sample%20MA%20Plots.pdf?dl=0

I having a hard time explaining this reversal. My first guess was that I somehow flipped the signs on the normalization factor ratios when generating the plot, but as far as I can tell, my code is computing them in the same way as the example code in the User's Guide (see below). For both the M value and the line height, I'm doing Sample 2 minus Sample 1 on a log scale. So can anyone suggest a reason why my normalization factors are matching up with the wrong clouds of points on all my MA plots?

Relevant excerpts of setup code (not complete/runnable):

library(dplyr)
library(reshape2)
library(ks)
library(RColorBrewer)
library(ggplot2)
library(RColorBrewer)
library(rtracklayer)
bigbin.counts <-  windowCounts(sample.table$bampath, bin=TRUE, width=10000, param=param)
window.counts <- windowCounts(sample.table$bampath, ext=147, width=1, spacing=73, param=param)
chip.peaks <- import.narrowPeak(
    sprintf(fmt="data_files/ChIP-Seq/%s_peaks_IDR_filtered.bed",
            chiptype)) # chiptype is e.g. "H3K4me3"
## Computing normalization offset
colData(window.counts)$CompNormFactors <- normOffsets(bigbin.counts, type="scaling")
abundances <- aveLogCPM(asDGEList(window.counts))
## [ Pretty much the same code as the User's Guide for picking a filter threshold ]
high.ab.window.counts <- window.counts[abundances >= filter.threshold,]
colData(window.counts)$HANormFactors <- normOffsets(high.ab.window.counts, type="scaling")
peak.overlap <- overlapsAny(window.counts, chip.peaks)
peak.window.counts <- window.counts[peak.overlap,]
colData(window.counts)$PeakNormFactors <- normOffsets(peak.window.counts, type="scaling")

## Convert to DGEList, compute log2(cpm)
dge <- asDGEList(window.counts)
logcpm <- cpm(dge, log=TRUE)

MA plot code:

## Function for MA plot between 2 samples
doMAPlot <- function(s1, s2) {
    pointdata <- data.frame(S1=logcpm[,s1], S2=logcpm[,s2]) %>%
        ## (CALCULATION OF M-VALUES HAPPENS HERE)
        transmute(A=(S1+S2)/2, M=S2-S1) %>%
        filter(A >= -2)
    ## Compute bandwidth and kernel smooth surface
    H <- pointdata %>% Hbcv(binned=TRUE)
    k <- pointdata %>%
        as.matrix %>%
        kde(gridsize=1024, bgridsize=rep(1024, 2), verbose=TRUE,
            H=H, binned=TRUE)
    ## Convert kernel smooth to data frame
    densdata <- melt(k$estimate) %>%
        transmute(
            A=k$eval.points[[1]][Var1],
            M=k$eval.points[[2]][Var2],
            ## Sometimes the estimate goes a bit negative, which is no
            ## good
            Density=value %>% pmax(0),
            ## Part of a hack to make the alpha look less bad
            AlphaDens=value %>% pmax(1e-15))
    ## Compute height of normalization lines
    linedata <- c(Comp="CompNormFactors",
                  Peaks="PeakNormFactors",
                  HiAb="HANormFactors") %>%
        ## (CALCULATION OF NORM FACTOR LINE HEIGHTS HAPPENS HERE)
        sapply(. %>% colData(window.counts)[[.]] %>% log2 %>% {.[s2] - .[s1]}) %>%
        data.frame(NormFactor=., NormType=names(.))

    ggplot(densdata) +
        ## MA Plot kernel density
        geom_raster(aes(x=A, y=M, fill=Density, alpha=AlphaDens),
                    interpolate=TRUE) +
        scale_fill_gradientn(colors=suppressWarnings(brewer.pal(Inf, "Blues")),
                             trans=power_trans(1/8),
                             name="Density") +
        scale_alpha_continuous(trans=power_trans(1/40), guide=FALSE) +
        ## Normalization lines
        geom_hline(data=linedata, aes(yintercept=NormFactor, color=NormType)) +
        scale_color_discrete(name="Norm Type") +
        ## Scales
        scale_x_continuous(expand=c(0,0)) +
        scale_y_continuous(expand=c(0,0)) +
        coord_fixed(0.5)
}
## e.g. MA plot between first 2 samples
p <- doMAPlot(1,2)
print(p)

csaw normalization • 1.2k views

ADD COMMENT • link updated 8.0 years ago by Aaron Lun ★ 28k • written 8.0 years ago by Ryan C. Thompson ★ 7.9k

score 2 · Accepted Answer · 2016-04-22

My ggplot skills are rubbish, so I'll just assume that you've made the MA plot correctly. (What happened to the good old days of using smoothScatter?) Anyway, I'll guess that the difference from the user's guide is because you're plotting counts for windows rather than for the bins. I find that the counts for windows are usually too low to give sensible MA plots, as you can see from the discrete patterns throughout your images.

So, moving on to explaining your results; what you see as the "low-abundance cloud" in your plots is probably the truncated right-hand side of the actual high-abundance cloud of protein-bound regions (that should be more clearly visible in a plot made with the binned counts). On the same note, the true low-abundance cloud is not visible at all in your MA plot, as the relevant background regions will probably have counts of zero for small windows. I suspect your "high-abundance cloud" might be comprised of repeat regions, such that they're subject to the same sort of biases as the background regions, but they just show up with much higher abundances.

In summary, I'd repeat the plotting using (2 kbp?) binned counts, hopefully you should get plots similar to those in the UG. More generally, I'd suggest using some larger window sizes for histone mark data analyses. I usually use the nucleosomal DNA width as the window size (as the window size represents the footprint of the protein), and then the estimated DNA fragment length as the extension width. Using 147 bp windows ensures that reads within the nucleosome are still counted, regardless of the direction in which they point.