I've loaded my matrix this way:
bin.coords <- import("./1000000.bed")
bin.coords <- as(bin.coords, "GRanges")
raw <- read.table("./1000000.txt")
gi <- GInteractions(raw$V1, raw$V2, bin.coords, mode="reverse")
iset <- InteractionSet(as.matrix(raw$V3), gi)
iset$totals <- colSums(assay(iset))
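(As a quick check that the object was built as intended, a sketch assuming the `iset` object constructed above:)

```r
library(InteractionSet)

dim(iset)        # number of bin pairs x number of libraries (here, 1)
regions(iset)    # the 1 Mb bins imported from the BED file
# totals should equal the column sums of the count matrix
stopifnot(identical(iset$totals, unname(colSums(assay(iset)))))
```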
Then I use the average log-CPM corresponding to a count of 5 to filter out bin pairs prior to the differential analysis with edgeR:
ave.ab <- aveLogCPM(asDGEList(iset))
threshold <- aveLogCPM(5, lib.size=mean(iset$totals))
hist(ave.ab, xlab="Average abundance", xlim=c(-5, 5), breaks=100)
abline(v=threshold)
count.keep <- ave.ab >= threshold
summary(count.keep)  # 11% of bin pairs kept

# Removing low counts as a function of distance. In Hi-C data, bin pairs
# separated by short distances interact frequently simply because they are
# linearly close on the chromosome, not because of a specific interaction.
trended <- filterTrended(iset)

# Plot the distance-dependent trend
smoothScatter(trended$log.distance, trended$abundances,
              xlab="Log-Distance", ylab="Normalized abundance")
o <- order(trended$log.distance)
lines(trended$log.distance[o], trended$threshold[o], col="red", lwd=2)
trend.keep <- trended$abundances > trended$threshold
summary(trend.keep)  # 50% of bin pairs kept

# Combining the count and distance filters
final.keep <- trend.keep & count.keep
iset_filtered <- iset[final.keep, ]
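(In case it helps frame the question: two things I could try, sketched below under the assumption that `filterTrended` here comes from diffHic, are lowering the count used to define the absolute threshold, and filtering relative to the estimated non-specific background with diffHic's `filterDirect` instead of a fixed count:)

```r
library(diffHic)
library(edgeR)

# A looser count-based threshold, e.g. a count of 2 instead of 5:
threshold2 <- aveLogCPM(2, lib.size=mean(iset$totals))
count.keep2 <- aveLogCPM(asDGEList(iset)) >= threshold2
summary(count.keep2)

# Alternatively, keep bin pairs well above the non-specific background,
# e.g. 5-fold above the median inter-chromosomal abundance:
direct <- filterDirect(iset)
direct.keep <- direct$abundances > log2(5) + direct$threshold
summary(direct.keep)
```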
However, with this I filter out 90% of the bin pairs. Is that too many? How can I choose a better threshold?