Resources to learn about s4 object extension/inheritance, custom print methods, and new operations
1
0
Entering edit mode
lklei001 • 0
@lklei001-23674
Last seen 3.7 years ago

I am a graduate student in Statistics at UC Riverside, and I have mapped out a plan to execute a new analysis method on HiC datasets. However, there is currently no software to implement this new approach, so I will need to write much of it on my own. However, I want to avoid reinventing the wheel wherever possible, so I’d like to ask for your input on how I can start with pre-existing S4 classes.

In order, I would like to do the following:

1. Read in raw HiC data into R (likely an InteractionSet or hic.table) (I CAN ALREADY DO THIS)

2. Identify the initial binning size, K, so that I might find a number between the largest observed genomic coordinate and the chromosome size that is proportional to a power of 2:

LargestObservedGenomicCoord   <=   K * 2^J    <=   ChromSize

3. Bin the Hi-C data into bins of size K Currently, binning seems limited to Kilobase resolutions. Can I be more precise than that?

If not, I can find a way to proceed with the limitation that the initial bin size be at least 1kb, but if I can have bins of arbitrary size K (i.e. 13.407 kB), that would be simpler for my workflow.

4. Map the 2D upper-triangular matrix coordinates (x,y) to a 1-D "pseudo space-filling curve" coordinate (d) which is also defined on upper-triangular spaces (I HAVE ALREADY DONE THIS with Rcpp)

One issue i may encounter is a limitation on the size of the d index values.

Seeing as the largest human chromosome is approximately 250 million basepairs in length, the index values (which will be used in sorting and defining “neighboring observations”) can be as large as ~2^55, necessitating 64-bit numbers.

I have already written a function to create these 64bit numbers (using Rcpp) but including them in an InteractionSet object and using InteractionSet methods on these 64-bit number may be a challenge.

5. Sort the Hi-C object (eg. InteractionSet, hic.table, or other) according to this new 1-dimensional coordinate (d)

I see there are fast sorting/ordering methods for InteractionSets.

However, I will be dealing with 64 bit numbers. I’m not sure how to reorder an Rle/S4Vector object based on a secondary vector of metadata indices.

6. Perform a 1D wavelet-transform/multiresolution-analysis on the sorted data (NEW OPERATION)

This would involve downsampling (performing successive sums and differences of neighboring pairs in the vectors, and repeating on the resulting sum. For example:

# Input data
A0 <-     c(x0, x1, x2, x3, x4, x5, x6, x7)

# Perform first round of differencing , summing, and downsampling

# call this c( d1.1,  d1.2, d1.3, d1.4  )
D1 <-     c(x0 - x1, x2 - x3, x4 - x5, x6 - x7)

# call this c( a1.1,  a1.2, a1.3, a1.4  )
A1 <-     c(x0 + x1, x2 + x3, x4 + x5, x6 + x7)

# D2 and A2 are calculated the same as above, but with A1 as the input

# = c( a1.1 - a1.2,   a1.3 - a1.4  )  = c( d2.1,  d1.2  )
D2 <-     c(x0 + x1 - (x2 + x3), x4 + x5 - (x6 + x7))

# = c( a1.1 + a1.2,   a1.3 + a1.4  ) = c( a2.1,  a1.2  )
A2 <-     c(x0 + x1 + (x2 + x3), x4 + x5 + (x6 + x7))

# Typically the result is stored in a list of lists (see this for an example of a 2D wavelet object)
#
#  However, I might use a list of InteractionSets??
#
# The original signal, A0, can be recovered without loss from this object.
#
# This could be a new class object, such as a “HicWavelet” object??
my.wavelet.object <- list( D1, D2, D3, …., DJ, AJ)


Thankfully, the size of each successive, downsampled S4Vector is fairly easy to determine to looking at how many none-zero successive runs exist in the current vector. So, memory allocation can be done to avoid major slowdowns that would otherwise be brought about by appending values to a vector of unknown length.

If you’ve made it this far, I would like to thank you! I do not have the requisite knowledge about how to make changes to S4 classes, and any input/guidance you might offer would be extremely helpful. I’ve ready Hadley Wickham’s Advanced R material on S4 classes, but I find it too superficial to get me to a place where I am extending S4 class objects, or modifying a current restrictions on slots.

Please, if you are familiar with any resources I might make use of to train myself up to the point where I can get myself up and running with S4 classes on my own, I would be endlessly grateful. SPECIFICALLY:

• How to extend/modify InteractionSet objects to accept 64 bit / double numbers in their slots

• How to modify the sorting methods of InteractionSets to work with 64 bit numbers

• How to write new functions for use on InteractionSets (such as differencing/summing neighbors and performing the reverse transformation)

• How to create a new class of object which will resemble the usual list-of-lists structure of a typical wavelet transform, but instead uses an InteractionSet type object for each entry (i.e. a list-of-InteractionSets).

• How to write custom print methods for objects of this type.

S4 Rcpp InteractionSet IRanges Rle • 772 views
0
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 11 minutes ago
The city by the bay

Bin the Hi-C data into bins of size K Currently, binning seems limited to Kilobase resolutions. Can I be more precise than that?

I don't know where this limitation is coming from, I certainly don't recall setting it anywhere in InteractionSet.

One issue i may encounter is a limitation on the size of the d index values.

Where are you looking at these d values? InteractionSet only has anchor1 and anchor2, which can be any integer of size up to .Machine$integer.max. Provided you're not using 1 bp anchor regions, that's plenty. However, I will be dealing with 64 bit numbers. I’m not sure how to reorder an Rle/S4Vector object based on a secondary vector of metadata indices. The usual order and subset paradigm would suffice. library(InteractionSet) example(GInteractions, echo=FALSE) gi$X <- runif(length(gi))
gi[order(gi\$X)]


How to write new functions for use on InteractionSets (such as differencing/summing neighbors and performing the reverse transformation)

You can define a function in the usual way that takes an InteractionSet and returns a list of... somethings. I don't know what the output is meant to be here, given that the vector shrinks in length and no longer has a 1:1 correspondence to the entries of the original InteractionSet.

How to create a new class of object which will resemble the usual list-of-lists structure of a typical wavelet transform, but instead uses an InteractionSet type object for each entry (i.e. a list-of-InteractionSets).

If you're not planning on doing more than list operations, then a list of somethings is perfectly adequate. The List from S4Vectors even allows you to put metadata on each entry (corresponding to resolution, I assume).