Question

scaling loess curves

0

Entering edit mode

Bogdan ▴ 670

@bogdan-2367

Last seen 17 months ago

Palo Alto, CA, USA

Dear all,

please could you advise about a method to scale 2 plots of LOESS curves. More specifically, we do have 2 sets of 5C data, and the loess plots reflect the relationship between INTENSITY and DISTANCE (please see the R code below).

I am looking for a method/formula to scale these 2 LOESS plots and make them directly comparable.

many thanks,

-- bogdan

-------------- the R code ------------------

a <- read.delim("a",header=T)
qplot(data=a,distance,intensity)+geom_smooth(method = "loess", size = 1, span=0.01)+xlab("distance")+ylab("intensity")

b <- read.delim("b",header=T)
qplot(data=b,distance,intensity)+geom_smooth(method = "loess", size = 1, span=0.01)+xlab("distance")+ylab("intensity")

HiTC gothic diffhic HiC • 3.1k views

ADD COMMENT • link 9.6 years ago • updated 9.5 years ago Bogdan ▴ 670

0

Entering edit mode

It's not clear what package you're using, or what you're trying to do here. Are the counts in terms of bin pairs? Do you have matching bin pairs between your two experiments? Why are you trying to match the curves, e.g., to compare contact frequencies for the same bin pair between experiments?

ADD REPLY • link 9.6 years ago Aaron Lun ★ 28k

0

Entering edit mode

Bogdan ▴ 670

@bogdan-2367

Last seen 17 months ago

Palo Alto, CA, USA

Hi Aaron, thank you for your reply.

Yes, you are entirely correct : we do have 2 experiments, and for each experiment, the set of data is in the following format : "node1 (chr, start, end) - node 2(chr, start, end) - interaction intensity".

We could transform this data into matching bin pairs, and we are trying to SCALE the LOESS curves ( the graphs "distance between node 1 and node 2" vs "intensity") for experiment1 vs experiment2, in order to make the experiments directly comparable. The purpose would be to normalize the data based on the LOESS curve of experiment1 vs LOESS curve of the experiment2.

Could I do this possibly in diffHiC ? If I could have your email address, I could send you a concrete example. Mine is tanasa@gmail.com. Thank you ..

ADD COMMENT • link 9.5 years ago Bogdan ▴ 670

0

Entering edit mode

Bogdan ▴ 670

@bogdan-2367

Last seen 17 months ago

Palo Alto, CA, USA

Dear Aaron,

thank you for the R code and remarkable insights: it is extremely helpful. While I am applying the approach you suggested, if I may add a question, as the data is not derived from Hi-C but from 5C experiments.

by 5C typically, we expect to capture some dynamic changes in the looping pattern at particular distances, and this changes in the looping may be biologically relevant.

for example, if we consider the looping between promoter P and enhancer E, at 100kb distance promoter and enhancer, we may have an interaction count of 50 (before hormone treatment), and an interaction count of 100 (after hormone treatment) between P and E.

after normalization, would the approach that you suggest still allow us to call the interaction between P and E significantly changed between -hormone and +hormone conditions ?

I could get back to you a little later with a real 5C example from our data . Thanks a lot, I appreciate the help,

-- bogdan

ADD COMMENT • link 9.5 years ago Bogdan ▴ 670

0

Entering edit mode

I don't see why not.

ADD REPLY • link 9.5 years ago Aaron Lun ★ 28k

0

Entering edit mode

Bogdan ▴ 670

@bogdan-2367

Last seen 17 months ago

Palo Alto, CA, USA

Great thank you ! I will work on the code, and will let you know how it goes, especially on capturing the significant dynamic changes between Promoters and Enhancers ..

I was thinking also about the following approach (below), would you think it may be practical ?

<> for experiment A and B : using LOESS regression to model the relation between INTENSITY and DISTANCE.

qplot(data=a,distance,intensity)+geom_smooth(method = "loess", size = 1, span=0.01)+xlab("distance")+ylab("intensity")
qplot(data=b,distance,intensity)+geom_smooth(method = "loess", size = 1, span=0.01)+xlab("distance")+ylab("intensity")

<> for each experiment A and B, for a specific DISTANCE (let's say 50kb, or 100kb, or etc) using the RATIOS (loess fit in experiment A/loess fit in experiment B) as a scaling/normalizing factor for the interaction intensities in A and B at that specific DISTANCE ?

thanks,

bogdan

ADD COMMENT • link 9.5 years ago Bogdan ▴ 670

0

Entering edit mode

It would work, but it would be sensitive to any errors in the fitted loess curves in each of your separate experiments. This might not be optimal, for reasons I've described in my original answer.

ADD REPLY • link 9.5 years ago Aaron Lun ★ 28k

0

Entering edit mode

Bogdan ▴ 670

@bogdan-2367

Last seen 17 months ago

Palo Alto, CA, USA

Thanks, I am going carefully over your description and R code. Although it is a little later for us here in San Diego, wondering if we may be useful to consider also the log2 PLOTS :

log2(intensity) vs log2 (distance) for experiment A, and a linear regression fit to it (A)

log2(intensity) vs log2 (distance) for experiment B, and a linear regression fit to it (B),

and to normalize the experiments A and B based on the differences between the modes of linear regression A vs linear regression B. Any comments would be welcome. Many thanks again.

ADD COMMENT • link 9.5 years ago Bogdan ▴ 670

0

Entering edit mode

That might be a better approach, because the power-law relationship in typical Hi-C/5C data means that you should see a linear decreasing trend in a log-log plot. This is easier to fit with a loess curve of degree 1 (as done by default in loessFit).

ADD REPLY • link 9.5 years ago Aaron Lun ★ 28k

0

Entering edit mode

Bogdan ▴ 670

@bogdan-2367

Last seen 17 months ago

Palo Alto, CA, USA

Thanks Aaron, I appreciate your help ! happy weekend !

ADD COMMENT • link 9.5 years ago Bogdan ▴ 670

0

Entering edit mode

Bogdan ▴ 670

@bogdan-2367

Last seen 17 months ago

Palo Alto, CA, USA

Dear Aaron,

thank you again for the previous suggestions regarding the scaling of loess plots.

starting from our previous conversation, considering a set of 5C/HiC loops with intensity values in 2 conditions (-TREATMENT, and +TREATMENT)

i.e. a file with data of type " chr ; start ; end ; intensity-TREATMENT; intensity+TRETMENT",

please may I ask for a suggestion for data normalization INDEPENDENT on the distance between start and end, and only based on the numerical intensity values in -TREATMENT, and +TREATMENT.

many thanks again,

-- bogdan

ADD COMMENT • link 9.5 years ago Bogdan ▴ 670

0

Entering edit mode

Try normalizing on the abundance-dependent trend, i.e., compute the log-average intensity of each node pair, along with the log-fold change in intensity between conditions, and examine the trend in the latter with respect to the former. In particular, you can fit a loess curve to the log-fold changes against the log-average intensity, using loessFit as I've described above.

ADD REPLY • link 9.5 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you for advice, Aaron, it is very helpful !

ADD REPLY • link 9.5 years ago Bogdan ▴ 670

score 3 · Accepted Answer · 2015-09-11

3

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 2 hours ago

The city by the bay

For simplicity, I'll assume that the node pairs are in the same order in both files, i.e., corresponding rows represent the same node pair in both files. You can then do something like:

logFC <- log2(a$intensity/b$intensity)
log.dist <- log10(a$distance + some.prior)
fit <- limma::loessFit(x=log.dist, y=logFC)

We use the log-distance rather than the raw distance, so that the covariate scale isn't skewed by the presence of a small number of large distances. In particular, the log-transformation provides a bit more resolution at lower distances. Otherwise, you could end up with a situation where all the interesting bin pairs are crammed into a small covariate range, which makes it difficult to fit a curve. You'll have to set some.prior to something in the same ballpark as the raw distances. For example, if most of your distances are on the megabase scale, it would be worth setting some.prior to, e.g., 500 kbp.

The general idea is that the trend in the log-fold changes represents some systematic distance-dependent difference between your studies. Our assumption is that this trend is a technical artefact, i.e., it is a bias that does not represent genuine differences in the interaction structure. We fit a curve to model this trend, and we normalize the intensities such that the trend is eliminated from the data:

adjustment <- 2^(fit$fitted/2)
new.a <- a$intensity/adjustment
new.b <- b$intensity*adjustment

We fit a loess curve to the log-fold changes against distance, rather than to the intensities directly. This is because the (log-)intensities decay dramatically with increasing distance, which puts a bit more pressure on the trend fitting algorithm to be accurate. In contrast, the trend in the log-fold changes should be mostly flat, which is easier to model with loessFit.

The default approach in diffHic is to fit a trend to the log-fold changes (more or less) against the average abundance. This is roughly equivalent to the distance-based approach after flipping the covariates, given that low distances correspond to high abundances and vice versa. I'd argue that trend fitting on average abundance is slightly better than that on distance, as node pairs with the same abundance should represent similar interactions (and thus, be subject to the same biases, such that the fitted trend will capture the common bias at a given abundance). This may not be true for node pairs at the same distance, e.g., if one lies inside a TAD while the other lies outside. That said, I'm probably nitpicking; on the whole, I expect that the distance-based approach will perform adequately.

ADD COMMENT • link 9.5 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron, thank you very much again for your suggestions, really helpful !

If I may add a little question: shall we have multiple experiments 2*2 ie -TREATMENT WT, +TREATMENT WT, -TREATMENT KO, +TREATMENT KO, could we use the same strategy/principle for normalization as in LIMMA, ie. using a design matrix and a contrast matrix ?

ADD REPLY • link 9.5 years ago Bogdan ▴ 670

0

Entering edit mode

Yes, but you'll need biological replicates. So, if you have four conditions (various combinations of treatment and genotype), I'd suggest getting at least two replicates per condition, i.e., 8 libraries in total.

ADD REPLY • link 9.5 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron, thank you very much. Please could I have your insights on another topic, namely of statistically significant differential looping (by 5C) :

assuming that we do have a table of normalized counts in -TREAT and +TREAT , ie a table of a format that is shown below, what would be the most appropriate test to assess the DIFFERENTIAL LOOPING ?

loop1 : 10 counts(-TREAT); 20 counts (+TREAT);

loop2 : 15 counts(-TREAT); 20 counts (+TREAT);

loop3 : 100 counts(-TREAT); 20 counts (+TREAT);

ADD REPLY • link 9.5 years ago Bogdan ▴ 670

0

Entering edit mode

You could do this with the exact test or the LRT in edgeR, but you'd need a lot more than three loops. I don't know how you'd normalize if you only had three loops to work with. You'd also need to input a value for the NB dispersion, because it doesn't seem like you have any replicates.

ADD REPLY • link 9.5 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron, thanks, yes, I thought about edgeR; just had not been very sure whether the more basic assumptions, like NB distribution of our data will hold. How could I test for NB distribution of the data ?

We have replicates, but that were sequenced at different sequencing depth, and I shall probably use only one replicate at a time, possibly with a very small NB dispersion.

ADD REPLY • link 9.5 years ago Bogdan ▴ 670

1

Entering edit mode

The NB model is just that - a model, which is useful for data analysis but is expected to be wrong. There's no point testing for whether your data is NB-distributed, because it probably won't be. Rather, the aim is to use edgeR to capture enough of the data's attributes to allow you to make sensible conclusions. Obviously, this won't be appropriate if the data is highly non-NB, as the model won't be able to do a good job of capturing aspects of the data; but I see no evidence that this is the case for Hi-C (and presumably, for 5C as well).

Different sequencing depth doesn't really matter, as long as you have a decent number of reads for all libraries (i.e., large enough counts for most interactions, probably above 10). Some replication is better than none at all, so you might as well use them. However, if you're not going to use them, then putting in a very small NB dispersion is unwise. This will not capture the variability that is present between biological replicates. As a result, you are more likely to pick up spurious differences that are not reproducible between replicates.