Aaron - thank you for the excellent csaw package and extensive users guide! I am analyzing a ChIPSeq experiment that has 60 IP samples + 60 inputs. There are 3 treatments X 4 brain regions = 12 groups X 5 reps each. It seems a shame not to use the input samples and I was following section 3.5.3 to compare the scaled average of the IPs to the scaled average of the inputs. However, I'm worried that binding sites that only occur in one treatmentXregion group or even one brain region will not be enough to make the (average IP vs average input) > 3 FC. The inputs are good negative controls in that there is no correlation between an individual's IP and input, so I had the idea to do something similar to how I filter out "unexpressed" genes in RNA-Seq: compare each IP's scaled value to the average scaled input, and then require any 5 samples to have > 3 FC.
Assuming you think this is a reasonable approach, I can't quite figure out how to hack your functions to calculate this. filterWindows() returns the average background abundances but I'm not sure exactly what to pull out of scaledAverage() to get individual scaled averages for the IP. Or do I not even need to worry about scaledAverage() because for IP scale = 1, but instead call cpm() in something like this:
library(csaw) library(edgeR) #set up params and count individual windows param <- readParam( dedup = T, minq = 30) dt <- windowCounts(bam_files, ext = 250, width = 150, filter = 0, param = param) #get binned counts dt_bin <- windowCounts(bam_files, bin = TRUE, width = 10000, filter = 0, param = param) #separate binned windows & get scale factors for input controls dt_con_bin <- dt_bin[, grepl("input", bam_files)] dt_exp_bin <- dt_bin[, !grepl("input", bam_files)] dt_scale_inf <- scaleControlFilter(dt_exp_bin, dt_con_bin) #separate individual windows dt_con <- dt[, grepl("input", bam_files)] dt_exp <- dt[, !grepl("input", bam_files)] #run filterWindows to get average scaled background dt_filt <- filterWindows(dt_exp, dt_con, type = "control", prior.count = 5, scale.info = dt_scale_inf) #run cpm to get individual IP abundances IPcpm <- cpm(asDGEList(dt_exp, assay = 1), prior.count = 5, log = TRUE) # set up filter filter.stat <- IPcpm - dt_filt$back.abundances rep.size <- 5 keep <- rowSums(filter.stat > log2(3)) >= rep.size #filter windows dt_filtered <- dt_exp[keep, ]
Does this seem reasonable? One last question - should I re-bin the window counts with only IP samples in order to estimate normalization factors, or is it fine to use the object