Thanks for the quick response! It works, of course, but the number of bins in the rows and columns doesn't match what I expected. I used hg19 with a bin size of 1 Mb, which should give roughly 250 bins for chr1. However, I get 231.
> str(as.matrix(data, first="chr1", second="chr1", fill=counts(data)[,1]), )
int [1:231, 1:231] 562 816 61 46 14 10 43 19 14 19 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:231] "983" "984" "985" "986" ...
..$ : chr [1:231] "983" "984" "985" "986" ...
I thought perhaps bins that have no interactions with any other bins were removed, but that doesn't seem to be the case. What is more surprising is that chr2 yields more bins than chr1 (243 vs. 231, respectively). I am not sure what is going on. Can I label the bins with their ranges to avoid ambiguity? And where are the missing bins?
str(as.matrix(data, first="chr2", second="chr2", fill=counts(data)[,1]), )
int [1:243, 1:243] 4334 2566 897 667 320 283 322 234 176 102 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:243] "1356" "1357" "1358" "1359" ...
..$ : chr [1:243] "1356" "1357" "1358" "1359" ...
Thanks!
Bins in diffHic are rounded up to the nearest restriction site, befitting the underlying resolution limits of the Hi-C protocol. This means that you get some regions (e.g., centromeres, telomeres) which form huge bins, because there aren't any restriction sites inside them. My guess is that chromosome 1 has more/longer repetitive elements than chromosome 2; these form huge bins as described (e.g., > 3 Mb), which results in fewer bins despite the chromosome being longer. From an analysis perspective, this doesn't really matter because you shouldn't be able to map within those elements anyway; whether you get one large bin or several small bins across them is largely irrelevant if the counts are all near-zero.
The row and column names should refer to the indices of regions(data), so you should be able to figure out what genomic interval each row and column corresponds to.
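As a rough sketch of that lookup: the helper below (label_bins is a hypothetical name, not part of diffHic) assumes the row and column names are integer indices into a GRanges such as regions(data), and replaces them with "chr:start-end" labels. The commented usage lines assume the same data object as in the question and have not been run against it.

```r
library(GenomicRanges)

# Hypothetical helper: relabel a matrix whose dimnames are indices into 'regs'.
label_bins <- function(mat, regs) {
    ranges_of <- function(idx) {
        gr <- regs[as.integer(idx)]
        # sprintf with %d avoids labels such as "1e+06" for round coordinates.
        sprintf("%s:%d-%d", as.character(seqnames(gr)), start(gr), end(gr))
    }
    rownames(mat) <- ranges_of(rownames(mat))
    colnames(mat) <- ranges_of(colnames(mat))
    mat
}

# Usage with the objects from the question (assumed, not run here):
# mat <- as.matrix(data, first="chr1", second="chr1", fill=counts(data)[,1])
# chr1.bins <- regions(data)[as.integer(rownames(mat))]
# summary(width(chr1.bins))           # bin sizes on chr1
# chr1.bins[width(chr1.bins) > 3e6]   # the unusually large bins, if any
# mat <- label_bins(mat, regions(data))
```

Looking at width(chr1.bins) before relabelling should show where the "missing" bins went: a few bins much wider than 1 Mb would account for the lower count.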