Question: diffHic: minimum counts per bin
0
2.8 years ago by
Spain
andreucat910 wrote:

Hi,

I've loaded my matrix this way:

bin.coords <- import("./1000000.bed")
bin.coords <- as(bin.coords, "GRanges")
gi <- GInteractions(raw$V1, raw$V2, bin.coords, mode="reverse")
iset1 <- InteractionSet(as.matrix(raw$V3), gi) iset1$totals <- colSums(assay(iset))

The I use the average log-CPM corresponding to a count of 5 to filter out pairs of bins prior to differential analysis with edgeR.

ave.ab <- aveLogCPM(asDGEList(iset))
threshold <- aveLogCPM(5, lib.size=mean(iset$totals)) hist(ave.ab, xlab="Average abundance", xlim = c(-5,5), breaks=100) abline(v = threshold) count.keep <- ave.ab >= threshold summary(count.keep) #11% of bins kept #removing low counts as a function of the distance. In Hi-C data, #bins within a very short distance have high interaction due to #they are linearly close, not due to a specific interaction. trended <- filterTrended(iset) #plot it smoothScatter(trended$log.distance, trended$abundances, xlab="Log-Distance", ylab="Normalized abundance") o <- order(trended$log.distance)
lines(trended$log.distance[o], trended$threshold[o], col="red", lwd=2)

trend.keep <- trended$abundances > trended$threshold
summary(trend.keep) #50% of bins kept

#combining filters counts+distances
final.keep <- trend.keep & count.keep
iset_filtered <- iset[final.keep,]

However, with this, I filter out the 90% of bins. Is it a lot? How can I choose a better threshold?

edger diffhic • 492 views
modified 2.8 years ago • written 2.8 years ago by andreucat910
Answer: diffHic: minimum counts per bin
1
2.8 years ago by
Aaron Lun24k
Cambridge, United Kingdom
Aaron Lun24k wrote:
1. I assume you mean "filter out bin pairs", not bins.
2. I don't recall using a "log cpm 5" filter. Do you mean an average log-CPM corresponding to a count of 5?
3. If you did any pre-filtering in squareCounts, then losing a further 90% of bin pairs is quite a lot. It's hard to tell whether it's caused by your data or your code, though, because you don't show what you did.
4. There are many filtering strategies described in the user's guide, which can give you a suitable threshold.

On further reflection, 90% is probably fine, as long as you're keeping 100000 bin pairs (or more), which is what I usually get. I tend not to use the trended filter because many of the low-distance interactions might be interesting (e.g., domain boundaries), so I try to hold onto them for downstream differential testing.

Thank you. I wanted to ask another question. I know that the lower the resolution, the best. In very low resolutions, we can have a very sparse matrix and we would filter too much bins. Let's say I have matrices for 20.000, 40.000, 100.000bp. Do you know how to choose the best resolution for differential analysis?

1

Firstly, you've got your terminology mixed up here; smaller bin sizes provide higher resolution, because their use improves your capacity to distinguish adjacent events in the interaction space. Secondly, it's not clear that there's a single "best" resolution for the DI analysis. For example, a small bin size might be optimal for detecting changes in looping intensity, while a large bin size would provide more power for detecting changes in large structures like TADs. To overcome this, it is possible to perform the analysis at multiple resolutions and then merge the results. This strategy exploits the potential for different discoveries at each resolution - see Section 7.3 of the diffHic user's guide for details.

Generally, though, I just pick a bin size, do the analysis and see if the results are "good enough". This evaluation is a balance between whether I get enough DIs (usually you get more DIs with a larger bin size, because of larger counts) and whether the results are interpretable (which is better with smaller bin sizes for assigning interactions to genes). For in situ Hi-C data, I've found that 50 kbp bins are a good place to start; large enough to get decently sized counts per bin pair, yet small enough so that you get manageable numbers of genes (< 5) assigned to each anchor region. If you, say, try 20 kbp bin pairs and find that you get much fewer DIs, then those counts are probably too low for analysis.

Thank you. Does it make sense to find bin1 vs bin1 (in the diagonal), differentialy expressed?