saving raw interaction matrices using diffHic
2
0
Entering edit mode
Last seen 5.1 years ago

I have been trying to follow the user guide for diffHic. I would like to save the raw interaction matrices for each pair of chromosomes to be able to use it for other analysis but dont seem to be able to. I would like a file in the form of:

Bin1   Bin2   Bin3   Bin4   Bin5

Bin1       1       1        0         5       7

Bin2       3      4         1         1       2

Bin3

Bin4

Bin5

I understand that the data is saved in the DIList object but am not sure how to extract it in the format I need.

-- I tried as.matrix() but it does not returned non-integer negative numbers which are not read counts.

Thanks!

diffhic interaction matrix • 802 views
0
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 19 hours ago
The city by the bay

By default, as.matrix will return a contact matrix filled with the average log-CPM for each bin pair across all libraries. This is the simplest representation when you have multiple samples, because otherwise you'd need a three-dimensional array to contain all of the data. However, if you just want to get the counts for one sample, you can supply the read counts directly via the fill argument. For example, if you have a DIList object named blah, and you wanted to get the matrix for chromosome A, you could do something like:

as.matrix(blah, anchor="chrA", fill=counts(blah)[,1])

... to fill the matrix with the read counts for sample 1. The next release of diffHic will (hopefully) use a base class that handles this conversion more naturally - see here if you want a sneak peak of what it'll look like.

0
Entering edit mode
Last seen 5.1 years ago

Thanks for the quick response! It works of course but the bin numbers in the rows and columns don't match the expected number of bins; I have used hg19 and a bin size of 1Mb which roughly gives me 250 bins for chr1. However, I get 231.

> str(as.matrix(data, first="chr1", second="chr1", fill=counts(data)[,1]), )
int [1:231, 1:231] 562 816 61 46 14 10 43 19 14 19 ...
- attr(*, "dimnames")=List of 2
..$: chr [1:231] "983" "984" "985" "986" ... ..$ : chr [1:231] "983" "984" "985" "986" ...

I thought perhaps bins that have no interactions with any other bins are removed but that doesnt seem to be the case. What is more surprising is that chr2 yields more bins than chr1 (243 vs 231 respectively). I am not sure what is going on. Can I label the bins with their ranges to avoid ambiguity? and where are the missing bins?

str(as.matrix(data, first="chr2", second="chr2", fill=counts(data)[,1]), )
int [1:243, 1:243] 4334 2566 897 667 320 283 322 234 176 102 ...
- attr(*, "dimnames")=List of 2
..$: chr [1:243] "1356" "1357" "1358" "1359" ... ..$ : chr [1:243] "1356" "1357" "1358" "1359" ...

Thanks!

1
Entering edit mode

Bins in diffHic are rounded up to the nearest restriction site, befitting the underlying resolution limits of the Hi-C protocol. This means that you get some regions (e.g., centromeres, telomeres) which form huge bins, because there aren't any restriction sites inside them. My guess is that chromosome 1 has more/longer repetitive elements than chromosome 2; these form huge bins as described (e.g., > 3 Mb) which results in fewer bins despite the chromosome being longer. From an analysis perspective, this doesn't really matter because you shouldn't be able to map within those elements anyway; whether you get one large bin or several small bins across them is largely irrelevant if the counts are all near-zero.

The row and column names should refer to the indices of regions(data), so you should be able to figure out what genomic interval each row and column corresponds to.