In csaw, there are two suggested ways to compute normalization factors: composition bias normalization factors, computed from large bins across the genome; and efficiency bias normalization factors, computed from a subset of windows that have high-abundance or that overlap called peaks.
In the csaw User's Guide section 4.3.2 "Checking normalization with MA plots", the composition and efficiency factors are plotted over the MA plot of all windows between two samples, with the composition factor lining up horizontally with the background cloud and the efficiency factor lining up with the high-abundance cloud. However, in my data, I am consistently seeing the reverse: the composition factor aligns with the high-abundance cloud and the efficiency factor (computed from either the high-abundance windows or peak-overlapping windows) lines up with the low-abundance cloud. You can see examples of my MA plots for 3 different histone marks here:
H3K4me3: https://www.dropbox.com/s/rba4qx1g4751ghb/H3K4me3%20Selected%20Sample%20MA%20Plots.pdf?dl=0
H3K4me2: https://www.dropbox.com/s/989lcfeeahbzv8p/H3K4me2%20Selected%20Sample%20MA%20Plots.pdf?dl=0
H3K27me3: https://www.dropbox.com/s/o2sz1bdnmhm1fbt/H3K27me3%20Selected%20Sample%20MA%20Plots.pdf?dl=0
I having a hard time explaining this reversal. My first guess was that I somehow flipped the signs on the normalization factor ratios when generating the plot, but as far as I can tell, my code is computing them in the same way as the example code in the User's Guide (see below). For both the M value and the line height, I'm doing Sample 2 minus Sample 1 on a log scale. So can anyone suggest a reason why my normalization factors are matching up with the wrong clouds of points on all my MA plots?
Relevant excerpts of setup code (not complete/runnable):
library(dplyr) library(reshape2) library(ks) library(RColorBrewer) library(ggplot2) library(RColorBrewer) library(rtracklayer) bigbin.counts <- windowCounts(sample.table$bampath, bin=TRUE, width=10000, param=param) window.counts <- windowCounts(sample.table$bampath, ext=147, width=1, spacing=73, param=param) chip.peaks <- import.narrowPeak( sprintf(fmt="data_files/ChIP-Seq/%s_peaks_IDR_filtered.bed", chiptype)) # chiptype is e.g. "H3K4me3" ## Computing normalization offset colData(window.counts)$CompNormFactors <- normOffsets(bigbin.counts, type="scaling") abundances <- aveLogCPM(asDGEList(window.counts)) ## [ Pretty much the same code as the User's Guide for picking a filter threshold ] high.ab.window.counts <- window.counts[abundances >= filter.threshold,] colData(window.counts)$HANormFactors <- normOffsets(high.ab.window.counts, type="scaling") peak.overlap <- overlapsAny(window.counts, chip.peaks) peak.window.counts <- window.counts[peak.overlap,] colData(window.counts)$PeakNormFactors <- normOffsets(peak.window.counts, type="scaling")
## Convert to DGEList, compute log2(cpm) dge <- asDGEList(window.counts) logcpm <- cpm(dge, log=TRUE)
MA plot code:
## Function for MA plot between 2 samples doMAPlot <- function(s1, s2) { pointdata <- data.frame(S1=logcpm[,s1], S2=logcpm[,s2]) %>% ## (CALCULATION OF M-VALUES HAPPENS HERE) transmute(A=(S1+S2)/2, M=S2-S1) %>% filter(A >= -2) ## Compute bandwidth and kernel smooth surface H <- pointdata %>% Hbcv(binned=TRUE) k <- pointdata %>% as.matrix %>% kde(gridsize=1024, bgridsize=rep(1024, 2), verbose=TRUE, H=H, binned=TRUE) ## Convert kernel smooth to data frame densdata <- melt(k$estimate) %>% transmute( A=k$eval.points[[1]][Var1], M=k$eval.points[[2]][Var2], ## Sometimes the estimate goes a bit negative, which is no ## good Density=value %>% pmax(0), ## Part of a hack to make the alpha look less bad AlphaDens=value %>% pmax(1e-15)) ## Compute height of normalization lines linedata <- c(Comp="CompNormFactors", Peaks="PeakNormFactors", HiAb="HANormFactors") %>% ## (CALCULATION OF NORM FACTOR LINE HEIGHTS HAPPENS HERE) sapply(. %>% colData(window.counts)[[.]] %>% log2 %>% {.[s2] - .[s1]}) %>% data.frame(NormFactor=., NormType=names(.)) ggplot(densdata) + ## MA Plot kernel density geom_raster(aes(x=A, y=M, fill=Density, alpha=AlphaDens), interpolate=TRUE) + scale_fill_gradientn(colors=suppressWarnings(brewer.pal(Inf, "Blues")), trans=power_trans(1/8), name="Density") + scale_alpha_continuous(trans=power_trans(1/40), guide=FALSE) + ## Normalization lines geom_hline(data=linedata, aes(yintercept=NormFactor, color=NormType)) + scale_color_discrete(name="Norm Type") + ## Scales scale_x_continuous(expand=c(0,0)) + scale_y_continuous(expand=c(0,0)) + coord_fixed(0.5) } ## e.g. MA plot between first 2 samples p <- doMAPlot(1,2) print(p)
Oh, I missed that the User's Guide was still using the large bins for those plots. I'll re-do them using my
bigbin.counts
object and see what happens.Ok, doing the same with the big bin counts revealed the same pattern as shown in the User's Guide: https://www.dropbox.com/s/3yl2s5g2wuliwz1/H3K4me3%20Selected%20Sample%2010KB%20Bin%20MA%20Plots.pdf?dl=0