For simplicity, I'll assume that the node pairs are in the same order in both files, i.e., corresponding rows represent the same node pair in both files. You can then do something like:
logFC <- log2(a$intensity/b$intensity)
log.dist <- log10(a$distance + some.prior)
fit <- limma::loessFit(x=log.dist, y=logFC)
We use the log-distance rather than the raw distance, so that the covariate scale isn't skewed by the presence of a small number of large distances. In particular, the log-transformation provides a bit more resolution at lower distances. Otherwise, you could end up with a situation where all the interesting bin pairs are crammed into a small covariate range, which makes it difficult to fit a curve. You'll have to set
some.prior to something in the same ballpark as the raw distances. For example, if most of your distances are on the megabase scale, it would be worth setting
some.prior to, e.g., 500 kbp.
The general idea is that the trend in the log-fold changes represents some systematic distance-dependent difference between your studies. Our assumption is that this trend is a technical artefact, i.e., it is a bias that does not represent genuine differences in the interaction structure. We fit a curve to model this trend, and we normalize the intensities such that the trend is eliminated from the data:
adjustment <- 2^(fit$fitted/2)
new.a <- a$intensity/adjustment
new.b <- b$intensity*adjustment
We fit a loess curve to the log-fold changes against distance, rather than to the intensities directly. This is because the (log-)intensities decay dramatically with increasing distance, which puts a bit more pressure on the trend fitting algorithm to be accurate. In contrast, the trend in the log-fold changes should be mostly flat, which is easier to model with
The default approach in
diffHic is to fit a trend to the log-fold changes (more or less) against the average abundance. This is roughly equivalent to the distance-based approach after flipping the covariates, given that low distances correspond to high abundances and vice versa. I'd argue that trend fitting on average abundance is slightly better than that on distance, as node pairs with the same abundance should represent similar interactions (and thus, be subject to the same biases, such that the fitted trend will capture the common bias at a given abundance). This may not be true for node pairs at the same distance, e.g., if one lies inside a TAD while the other lies outside. That said, I'm probably nitpicking; on the whole, I expect that the distance-based approach will perform adequately.