Hi again,
I'm playing with Herve's binnedAverage function, from the help page "?tileGenome". It works really nicely for me for well-behaved genomes (thank you! it'll be very useful).
However, I am also working with some new assemblies that comprise a lot of fairly short scaffolds (~45,000 of them, total genome size ~100 Mb), and binnedAverage gets pretty slow with those. I wondered whether you have any thoughts on alternative ways to do this more efficiently, or whether there's any opportunity for you to speed things up internally?
Digging into it a bit, I think the part of the function that seems slow is creating the Views - calculating the viewMeans seems to happen fast. Some test code is below (and sessionInfo - I'm using the devel version).
thanks in advance for any suggestions,
Janet
library(GenomicRanges)
binnedAverage <- function(bins, numvar, mcolname)
{
stopifnot(is(bins, "GRanges"))
stopifnot(is(numvar, "RleList"))
stopifnot(identical(seqlevels(bins), names(numvar)))
bins_per_chrom <- split(ranges(bins), seqnames(bins))
means_list <- lapply(names(numvar),
function(seqname) {
views <- Views(numvar[[seqname]],
bins_per_chrom[[seqname]])
viewMeans(views)
})
new_mcol <- unsplit(means_list, as.factor(seqnames(bins)))
mcols(bins)[[mcolname]] <- new_mcol
bins
}
################# some functions for convenience
### make a fake seqlengths object
makeSeqlengths <- function( totalGenomeSize, numChrs) {
mySeqlengths <- rep( round(totalGenomeSize/numChrs), numChrs)
names(mySeqlengths) <- paste("scaffold",1:numChrs,sep="")
mySeqlengths
}
### make a fake RleList object
makeRleObject <- function( mySeqlengths ) {
RleList(
lapply(mySeqlengths,
function(len) Rle(sample(-10:10, len, replace=TRUE))),
compress=FALSE)
}
### make some bins on the genome
makeBins <- function( mySeqlengths ) {
tileGenome(mySeqlengths, tilewidth=100, cut.last.tile.in.chrom=TRUE)
}
###### test a 100Mb genome with 90 chromosomes (each chromosome is 1.1Mb)
mySeqlengths90scaffolds <- makeSeqlengths( 100000000, 90 )
my_var1_90scaffolds <- makeRleObject(mySeqlengths90scaffolds)
bins1_90scaffolds <- makeBins(mySeqlengths90scaffolds)
system.time( bins1_90scaffolds <- binnedAverage(bins1_90scaffolds, my_var1_90scaffolds, "binned_var1") )
## 2.1 seconds
###### test a 100Mb genome with 900 chromosomes (each chromosome is 111.1kb)
mySeqlengths900scaffolds <- makeSeqlengths( 100000000, 900 )
my_var1_900scaffolds <- makeRleObject(mySeqlengths900scaffolds)
bins1_900scaffolds <- makeBins(mySeqlengths900scaffolds)
system.time( bins1_900scaffolds <- binnedAverage(bins1_900scaffolds, my_var1_900scaffolds, "binned_var1") )
## 4.4 seconds
###### test a 100Mb genome with 9000 chromosomes (each chromosome is 11.1kb)
mySeqlengths9000scaffolds <- makeSeqlengths( 100000000, 9000 )
my_var1_9000scaffolds <- makeRleObject(mySeqlengths9000scaffolds)
bins1_9000scaffolds <- makeBins(mySeqlengths9000scaffolds)
system.time( bins1_9000scaffolds <- binnedAverage(bins1_9000scaffolds, my_var1_9000scaffolds, "binned_var1") )
## 34 seconds
######## and my real genome is worse - haven't finished timing it yet....
######## check out a few components of the binnedAverage function:
### check the split by chr
system.time( bins1_9000scaffolds_per_chr <- split(ranges(bins1_9000scaffolds), seqnames(bins1_9000scaffolds)) )
## quick
### check the Views creation
system.time(
myViewsList <- lapply(names(my_var1_9000scaffolds), function(seqname) {
Views(my_var1_9000scaffolds[[seqname]], bins1_9000scaffolds_per_chr[[seqname]])
})
)
## 32 sec
#### check the Views creation a different way - almost as slow if we do it without an obvious lapply
system.time( myViewsListAgain <- Views(my_var1_9000scaffolds, bins1_9000scaffolds_per_chr) )
## 28 sec
### check the ViewMeans calculation - quick
system.time( myViewMeans <- lapply(myViewsList, viewMeans) )
## 2.7 sec
###########
sessionInfo()
R Under development (unstable) (2014-10-31 r66921)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GenomicRanges_1.19.36 GenomeInfoDb_1.3.12 IRanges_2.1.38
[4] S4Vectors_0.5.19 BiocGenerics_0.13.4
loaded via a namespace (and not attached):
[1] XVector_0.7.4
Hi Janet,
Thanks for the report. Very useful as always :-) I'll look into this.
H.