How to use diffHiC with existing contact matrices?
2
0
Entering edit mode
Sameet Mehta ▴ 30
@sameet-mehta-2556
Last seen 4.1 years ago
United States

Currently I am in the process of analyzing HiC data from two different conditions.  When I talked to some colleagues about best way of analyzing and comparing two HiC data-sets, I was asked to look at the diffHiC.   I read the user-manual and the paper.  Congratulations on the paper.  I have a question.  The paper starts with reads or bam files.  What if I already have to HiC interaction matrices from two different conditions.  Can I still use diffHiC?  I did not see a simple path to do that.  If I am missing something please point me in the right direction.  Any help in this regard is highly appreciated.

What I have used is HiCPro (the first author of this paper also suggested that I use diffHiC) to process the data.  Each data has ~5 B reads.  I have already aligned and generated ICE normalized contact matrices.  There is quite a bit of difference between the two interms of read-depth.  When I plotted them using HiCPlotter, I can see that there are differences, but I would rather have some kind of statistical measure/FDR for those differences so that I can explain to the collaborator how to interpret that data.

diffHiC HiCPro diffhic • 1.7k views
2
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 4 hours ago
The city by the bay

Yes, you can use interaction matrices, provided that the entries are counts. That is, not normalized intensities, but the raw read pair counts. If you only have intensities, that's too bad; diffHic (and, in general, most statistical software for handling sequencing data) needs to know the original count size in order to do a proper analysis. I'm not sure what else HiC-Pro produces, but I bet the relevant information is there somewhere.

Assuming you have or can get interaction matrices containing counts, you can load them into memory and convert them into ContactMatrix objects using the code below. You'll also need GRanges objects describing the genomic coordinates of the region represented by each row/column.

cm1 <- ContactMatrix(first.matrix, row.regions, column.regions)
cm2 <- ContactMatrix(second.matrix, row.regions, column.regions)


This can be converted into an InteractionSet object for entry into diffHic, like so:

to.keep <- as.matrix(cm1)!=0 | as.matrix(cm2)!=0
iset1 <- deflate(cm1, extract=to.keep)
iset2 <- deflate(cm2, extract=to.keep)
data <- cbind(iset1, iset2)

The to.keep setting basically specifies that we want to keep all entries of the interaction matrix that are non-zero for one or more of the libraries, to avoid storing all possible region pairs. The data object needs a little bit of work before it's fully ready:

interactions(data) <- as(interactions(data), "ReverseStrictGInteractions")

... and then you can plug it into the downstream diffHic functions. I should note, though, that a proper analysis needs replicates for each group. It sounds as if you only have one matrix for each condition, which would be a struggle to analyse rigorously. In any case, the code above will work with multiple ContactMatrix objects; just expand it to cm3, cm4, iset3, iset4, etc. as necessary.

0
Entering edit mode

Hi @Aaron,

I have managed to reach this point.  In the diffHic User manual, I am reading chapter 7.  But I do not see how/where to plug in the data generated above.

1
Entering edit mode

The data object above is basically a substitute for what you get out of squareCounts. So, once you've made this object, you should proceed to filtering and normalization, i.e., step 3 and onwards of the "quick start" list in the introduction of the diffHic user's guide, corresponding to Chapter 4 and onwards.

0
Entering edit mode

Understood.

0
Entering edit mode

When I try to do this with a large sparse dsCMatrix matrix (human genome Hi-C at 5kb resolution), it fails. I narrowed it down to when you call as.matrix(cm), either directly or within deflate. The error message is

Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

presumably because it tries to convert it to a dense matrix. Is there a way around this? I'm only interested in intrachromosomal interactions by the way, but I'm not sure if it's ok to do the detection on each chromosome independently.

0
Entering edit mode
1. Please start a new question.
2. Please show the exact command you're using, read the posting guide.
0
Entering edit mode
Sameet Mehta ▴ 30
@sameet-mehta-2556
Last seen 4.1 years ago
United States

Hi @Aaron, thank you.  This is useful.  Can you give me an idea of what columns should be in the GRanges object.  I do have raw matrices generated as part of the pipeline.  The bin information is as follows:

chr1    0    2500    1
chr1    2500    5000    2

Is this information enough?  Additionally, I do have a total of two replicates per condition.  But at 1M resolution.  I will try this analysis with that resolution first, and go from there.

1
Entering edit mode

The minimal fields in a GenomicRanges object are the chromosome name and the start and end coordinates. You can generate an object with the GRanges constructor, which should give you something sensible. Note that if the interaction matrix is a square consisting of interactions between all pairs of regions, then row.regions and col.regions will be the same.

0
Entering edit mode

The interaction matrix with count data is in sparse format.  Will that still work?  I get the following error:

 wt.1 <- ContactMatrix(as.matrix(wt.df), bin.gr, bin.gr)

    Error in validObject(.Object) :       invalid class “ContactMatrix” object: 'matrix' nrow must be equal to length of 'anchor1'

I think the problem is that my data is in sparse format, so all the "interactions" with zero value are already eliminated.  Although I am working with counts the non-normalized data.  It is possible to coerce the contact matrix, which has only 3 columns and ~30 million rows into a ContactMatrix data-type?

0
Entering edit mode

What is wt.df? Is it a square matrix, with dimensions equal to the length of bin.gr? The error suggests that the dimensions aren't matching up, so you better fix that first. If you're using the sparse matrix class from the Matrix package, it should be okay to use as-is without any need to call as.matrix.

0
Entering edit mode

Sorry,

wt.df is the sparse matrix that holds the interaction strength (raw counts).

> head(wt.df)
bin1 bin2 strength
1:   40   99        1
2:   41  100        1
3:   65  279        1
4:   71  285        1
5:  285  299        1
6:  299  305        1

bin.gr is the GRanges object that holds the information on the bin coordinates.

> head(bin.gr)
GRanges object with 6 ranges and 1 metadata column:
seqnames         ranges strand |    bin.no
<Rle>      <IRanges>  <Rle> | <integer>
[1]     chr1 [    0,  2500]      * |         1
[2]     chr1 [ 2500,  5000]      * |         2
[3]     chr1 [ 5000,  7500]      * |         3
[4]     chr1 [ 7500, 10000]      * |         4
[5]     chr1 [10000, 12500]      * |         5
[6]     chr1 [12500, 15000]      * |         6
-------
seqinfo: 25 sequences from an unspecified genome; no seqlengths

Because of the size of the file, I thought data.table would be a better option.

1
Entering edit mode

ContactMatrix requires a matrix as input. wt.df is clearly not a matrix - at least, not to R. So I'm not sure what you're expecting will happen. Rather, use the sparseMatrix constructor from the Matrix package:

first.matrix <- sparseMatrix(wt.df$bin1, wt.df$bin2, x=wt.df\$strength)