Question

DiffHic-package. Identify the name of bins with over or underrepresentation of reads mapped -marginCounts- . Extracting data from DIList objects

0

Entering edit mode

rulicosentino • 0

@rulicosentino-8154

Last seen 8.4 years ago

Germany

Hi,

I am following the diffHic tutorial using my datasets, consisting in two biological replicates for one condition and four biological replicates for another condition. I have three questions which I am writing below:

1st question:

I am in the step 3.4 of the tutorial ("Counting into single bins"), by which I can asses if there are systematic differences in coverage between libraries for a given bin. When i compared each pair of datasets, I could see that there were some bins, that were always overrepresented in the datasets from one condition versus the datasets in the other condition, indicating, probably, copy number variation. So, I was interested to know, which region were represented by those bins. And I do not know how to do it.

this was my pipeline:

> genome <- "./genome.fasta"

> test.frag <- cutGenome(genome, "AAGCTT",4)

> test.param <- pairParam(test.frag)

> input <- c("wt1.h5", "wt2.h5", "ko1.h5", "ko2.h5", "ko3.h5", "ko4.h5")

> bin.size <- 1e5

> data <- squareCounts(input, test.param, width=bin.size, filter=1)

> data
DIList object for 6 libraries with 105542 pairs across 469 regions

> margin.data <- marginCounts(input, test.param, width=bin.size)

> margin.data
DIList object for 6 libraries with 467 pairs across 469 regions

adjc <-cpm(asDGEList(margin.data), log=TRUE, prior.count=5)

adjc contains the log2 value of mapped reads per million for each bin, but only for 467 of the 469, so it seems that two were not considered, I would guess that because they do not have enough counts. The thing is that I do not know which bins were not considered and, because of the shift, I do not know if regions(data)[N] corresponds to adjc[N,] or tho adjc[N-1,] or to adjc[N-2,]...How could I know that? I would also like now which are the bins that were not considered.

2nd question:

"counts(data)" contains the information of mapped read for each pair of bins in each dataset and looks like this:

> head(counts(data))
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   59    3   69   50   57   70
[2,]   73    5   54   66   62  171
[3,]  307   46  281  258  223  439
[4,]   80    6   45   53   40  112
[5,]  106    7  104  116  123  306
[6,]  273   35  212  212  195  334

so, as i understood, each column represents each dataset, and each row is each pair of bins, so 59 is the number of counts between "anchors(data)[1]" and "targets(data)[1]" in the dataset "input[1]" , and 54 is the number of counts between the bins "anchors(data)[2]" and "targets(data)[2]" in the dataset "input[3]". but, Is there an easy way to reconstruct for each dataset the interaction matrix in which the name of the rows and the columns is each bin region and each value inside the matrix is the number of interaction between each pair of bins?

Third question:

Given that my genome was divided in 469 bins, i am expecting (469^2 + 469)/2 interaction pairs, which are 110215 pairs. But my "data" object has only 105542 pairs, so there are almost 5000 interactions missing, which I supposed were filtered by "filter=1", but when I run squareCounts again with "filter=0" I obtain the same number of interactions (105542). Where are the missing interaction pairs? or am i thinking it or doing something wrong?

I hope you can help me!

Cheers,

Raúl

MarginCounts squareCounts DIList DiffHic • 1.0k views

ADD COMMENT • link updated 8.7 years ago by Aaron Lun ★ 28k • written 8.7 years ago by rulicosentino • 0

score 0 · Answer 1 · 2015-08-05

For your first question: marginCounts will automatically remove bins that are empty in all libraries. If you want to know which bin corresponds to each row of the margin.data object (or to each row of adjc), you should use anchors(margin.data) or targets(margin.data) instead of regions(margin.data). This is described in the marginCounts documentation:

Even though counts do not correspond to interactions, a 'DIList' object is still used to store the output for convenience. Bin coordinates can be extracted as the anchor or target intervals in the output.

Admittedly, this is more confusing than it needs to be, but I wanted to store output consistently in DIList objects, so I'm stuck with it. To find the missing bins, you can do something like this:

regions(margin.data)[-anchors(margin.data, id=TRUE)]

For your second question: In the development version of diffHic, there's an as.matrix method for DIList objects. This will generate an interaction matrix where the entries are filled according the average abundance. If you want to fill it with counts for a particular library, you can do something like:

as.matrix(data, fill=counts(data)[,1]) # for library 1

The solution becomes less obvious if you want to do multiple libraries at once - how do you stuff multiple counts into a single entry of an interaction matrix? This could be handled with array objects, but I haven't got any plans to do that.

Also watch out for memory usage; the above code will construct the interaction matrix for the entire genome. This is okay if you've only got ~500 bins, but it'll get quadratically worse as your number of bins increases. Try the first and second arguments, which limit construction to a few chromosomes at a time.

For your third question: Even with filter=0, empty bin pairs will not be reported. Under the hood, counting is accelerated by avoiding bin pairs that do not contain any read pairs. This is a necessity as the interaction space is quadratically large, and it seems pointless to try to traverse the tracts of empty space in a decently-sized genome. As a result, squareCounts won't care when you set filter=0, because the algorithm inherently filters on a value of 1. Identifying the missing bin pairs requires some effort:

everything <- combn(length(regions(data)), 2)
everything <- paste0(everything[2,], ".", everything[1,])
present <- paste0(anchors(data, id=TRUE), ".", targets(data, id=TRUE))
missed <- setdiff(everything, present)
missed.anchors <- as.integer(sub("\\..*", "", missed))
missed.targets <- as.integer(sub(".*\\.", "", missed))

These anchors and targets will point to the pairs of regions(data) representing the missing bin pairs. This will be rather memory inefficient if you have a large interaction space, but it should be fine here.