Question

making a RleViewsList object gets slow with many chromosomes

0

Entering edit mode

Janet Young ▴ 740

@janet-young-2360

Last seen 4.5 years ago

Fred Hutchinson Cancer Research Center,…

Hi again,

I'm playing with Herve's binnedAverage function, from the help page "?tileGenome". It works really nicely for me for well-behaved genomes (thank you! it'll be very useful).

However, I am also working with some new assemblies that comprise a lot of fairly short scaffolds (~45,000 of them, total genome size ~100 Mb), and binnedAverage gets pretty slow with those. I wondered whether you have any thoughts on alternative ways to do this more efficiently, or whether there's any opportunity for you to speed things up internally?

Digging into it a bit, I think the part of the function that seems slow is creating the Views - calculating the viewMeans seems to happen fast. Some test code is below (and sessionInfo - I'm using the devel version).

thanks in advance for any suggestions,

Janet

library(GenomicRanges)

binnedAverage <- function(bins, numvar, mcolname)
{
stopifnot(is(bins, "GRanges"))
stopifnot(is(numvar, "RleList"))
stopifnot(identical(seqlevels(bins), names(numvar)))
bins_per_chrom <- split(ranges(bins), seqnames(bins))
means_list <- lapply(names(numvar),
function(seqname) {
views <- Views(numvar[[seqname]],
bins_per_chrom[[seqname]])
viewMeans(views)
})
new_mcol <- unsplit(means_list, as.factor(seqnames(bins)))
mcols(bins)[[mcolname]] <- new_mcol
bins
}

################# some functions for convenience

### make a fake seqlengths object
makeSeqlengths <- function( totalGenomeSize, numChrs) {
mySeqlengths <- rep( round(totalGenomeSize/numChrs), numChrs)
names(mySeqlengths) <- paste("scaffold",1:numChrs,sep="")
mySeqlengths
}

### make a fake RleList object
makeRleObject <- function( mySeqlengths ) {
RleList(
lapply(mySeqlengths,
function(len) Rle(sample(-10:10, len, replace=TRUE))),
compress=FALSE)
}

### make some bins on the genome
makeBins <- function( mySeqlengths ) {
tileGenome(mySeqlengths, tilewidth=100, cut.last.tile.in.chrom=TRUE)
}

###### test a 100Mb genome with 90 chromosomes (each chromosome is 1.1Mb)
mySeqlengths90scaffolds <- makeSeqlengths( 100000000, 90 )
my_var1_90scaffolds <- makeRleObject(mySeqlengths90scaffolds)
bins1_90scaffolds <- makeBins(mySeqlengths90scaffolds)

system.time( bins1_90scaffolds <- binnedAverage(bins1_90scaffolds, my_var1_90scaffolds, "binned_var1") )
## 2.1 seconds

###### test a 100Mb genome with 900 chromosomes (each chromosome is 111.1kb)
mySeqlengths900scaffolds <- makeSeqlengths( 100000000, 900 )
my_var1_900scaffolds <- makeRleObject(mySeqlengths900scaffolds)
bins1_900scaffolds <- makeBins(mySeqlengths900scaffolds)

system.time( bins1_900scaffolds <- binnedAverage(bins1_900scaffolds, my_var1_900scaffolds, "binned_var1") )
## 4.4 seconds

###### test a 100Mb genome with 9000 chromosomes (each chromosome is 11.1kb)
mySeqlengths9000scaffolds <- makeSeqlengths( 100000000, 9000 )
my_var1_9000scaffolds <- makeRleObject(mySeqlengths9000scaffolds)
bins1_9000scaffolds <- makeBins(mySeqlengths9000scaffolds)

system.time( bins1_9000scaffolds <- binnedAverage(bins1_9000scaffolds, my_var1_9000scaffolds, "binned_var1") )
## 34 seconds

######## and my real genome is worse - haven't finished timing it yet....

######## check out a few components of the binnedAverage function:

### check the split by chr
system.time( bins1_9000scaffolds_per_chr <- split(ranges(bins1_9000scaffolds), seqnames(bins1_9000scaffolds)) )
## quick

### check the Views creation
system.time(
myViewsList <- lapply(names(my_var1_9000scaffolds), function(seqname) {
Views(my_var1_9000scaffolds[[seqname]], bins1_9000scaffolds_per_chr[[seqname]])
})
)
## 32 sec

#### check the Views creation a different way - almost as slow if we do it without an obvious lapply
system.time( myViewsListAgain <- Views(my_var1_9000scaffolds, bins1_9000scaffolds_per_chr) )
## 28 sec

### check the ViewMeans calculation - quick
system.time( myViewMeans <- lapply(myViewsList, viewMeans) )
## 2.7 sec

###########
sessionInfo()

R Under development (unstable) (2014-10-31 r66921)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] GenomicRanges_1.19.36 GenomeInfoDb_1.3.12 IRanges_2.1.38
[4] S4Vectors_0.5.19 BiocGenerics_0.13.4

loaded via a namespace (and not attached):
[1] XVector_0.7.4

views iranges rle • 1.4k views

ADD COMMENT • link 9.2 years ago Janet Young ▴ 740

0

Entering edit mode

Hi Janet,

Thanks for the report. Very useful as always :-) I'll look into this.

H.

ADD REPLY • link 9.2 years ago Hervé Pagès 16k

0

Entering edit mode

Janet Young ▴ 740

@janet-young-2360

Last seen 4.5 years ago

Fred Hutchinson Cancer Research Center,…

Thanks so much Herve - that's great! (and a very very fast response - do you ever sleep?) I had been having vague thoughts myself about trying to concatenate all the chromosomes together end-to-end, but hadn't even begun to think about implementing it.

That was a 10x speedup on the test case I gave you, and on my real data it was even more impressing: it went from about 3 hours per sample (I set it going before I went home last night) to 3 seconds! I'm sure you can imagine I'm pretty happy about that....

thanks again,

Janet

ADD COMMENT • link 9.2 years ago Janet Young ▴ 740

0

Entering edit mode

Glad it works so well for you!

Yeah the idea of concatenating all the chromosomes together end-to-end has been floating around for a while. However I was hesitating to formally introduce this in the GenomicRanges infrastructure because of the genome size limitation and also because we didn't really have a strong use case so far. I guess now we have one ;-)

FWIW I just added the isSmallGenome() utility to the GenomicRanges package as a mean for the user to check upfront whether the "absoluteRanges" hack can be used or not. The current situation is not entirely satisfying though from a user point of view. We still need to work on improving the high-level API so the end user can perform things like the binned average without having to switch between 2 implementations (binnedAverage2() or binnedAverage()) depending on whether the genome is "small" or "big".

Cheers,

H.

ADD REPLY • link 9.2 years ago Hervé Pagès 16k

0

Entering edit mode

Two thumbs up (again) for that absoluteRanges function: I just used it again for plotting. I have an assembly that consists of a lot of small scaffolds, and I have a rough idea of their order on the chromosome, so I used absoluteRanges to convert within-scaffold coordinates to fake chromosome coordinates for a plot. Very useful - thank you!

ADD REPLY • link 9.1 years ago Janet Young ▴ 740

score 2 · Accepted Answer · 2015-02-12

Hi Janet,

Please use this function instead.

### A replacement for binnedAverage() that is about 10x faster.
### With binnedAverage2() 'numvar' can be an RleList, a
### NumericList, or an ordinary list.
### HOWEVER, because it relies on absoluteRanges(), an important
### limitation of binnedAverage2() is that the length of the
### underlying genome must be < 2^31 (see ?absoluteRanges for the
### details). So for example it cannot be used if 'bins' and
### 'numvar' are covering the full Human genome :-(
binnedAverage2 <- function(bins, numvar, mcolname)
{
    stopifnot(is(bins, "GRanges"))
    stopifnot(is(numvar, "List") || is.list(numvar))
    stopifnot(identical(seqlengths(bins), elementLengths(numvar)))
    unlisted_numvar <- unlist(numvar, use.names=FALSE)
    abs_bins <- absoluteRanges(bins)
    views <- Views(unlisted_numvar, abs_bins)
    mcols(bins)[[mcolname]] <- viewMeans(views)
    bins
}

It relies on absoluteRanges() which is new in GenomicRanges 1.19.37 (I just added it) so you will need to use Bioc devel (requires current R devel). GenomicRanges 1.19.37 will become available thru biocLite() on Friday morning but you can grab it now directly from svn with:

svn co https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/GenomicRanges
user: readonly
password: readonly

Let me know if you run into problems.

Cheers,

H.