Question

mclapply to loop over a GRangesList object

2

Entering edit mode

foehn ▴ 100

@foehn-16281

Last seen 3.8 years ago

United States

Hello,

I'm trying to speed up lapply applied to a GRangesList object by replacing it with mclapply.

library(EnsDb.Hsapiens.v86)
GRsList <- exonsBy(EnsDb.Hsapiens.v86, by = "tx")
# the goal
res <- lapply(GRsList, FUN = gaps) 
# Here `gaps` is just for an example.

When I switch to mclapply, the code does not run faster at all

res <- mclapply(GRsList, FUN = gaps, mc.cores = 4)

Careful inspection shows that inside mclapply, much of the time is spent on converting GRangesList to list.

GRsList <- as(GRsList, "list") # takes forever

Therefore I'm curious about how to convert GRsList to list, or simply how to make mclapply work with large GRangesList objects.

Thanks,

GenomicRanges GRanges GRangesList parallel mclapply • 2.6k views

ADD COMMENT • link 6.4 years ago foehn ▴ 100

score 2 · Answer 1 · 2019-02-05

2

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 5 months ago

United States

Depending on what your goal is, a fast approach is to 'unlist-apply-relist'

gr = unlist(grl)
gr$foo = ...
relist(gr, grl)

Probably describing what you would like to do, in not too much detail, would lead to a more helpful answer.

ADD COMMENT • link 6.4 years ago Martin Morgan 25k

0

Entering edit mode

Thanks @Martin. My goal is to get introns from each transcript, that is what FUN in lapply does. Any suggestions?

ADD REPLY • link 6.4 years ago foehn ▴ 100

0

Entering edit mode

Then you're looking for GenomicFeatures::intronsByTranscript()!

ADD REPLY • link 6.4 years ago Martin Morgan 25k

0

Entering edit mode

But GenomicFeatures::intronsByTranscript() does not work for EnsDb.Hspaiens.v86 which is from ensembldb...

ADD REPLY • link 6.4 years ago foehn ▴ 100

1

Entering edit mode

intronsByTranscript() should probably be non-generic, as it should really just rely on functions in the "TxDb" API. But anyway, if you look at selectMethod(intronsByTranscript, "TxDb"), you will see the basic pattern for achieving what you want with ensembldb objects.

ADD REPLY • link 6.4 years ago Michael Lawrence ★ 11k

0

Entering edit mode

That's a good idea. For each transcript, intronsByTranscript uses psetdiff to find regions complementary to the exons and psetdiff is very fast.

ADD REPLY • link 6.4 years ago foehn ▴ 100

Martin Morgan · Answer 2 · 2019-02-05

1

Entering edit mode

foehn ▴ 100

@foehn-16281

Last seen 3.8 years ago

United States

so, if the goal is to get introns per transcript, thanks to @martin's suggestion, we can use the method under the hood of intronsByTranscript:

exonsByTx <- exonsBy(EnsDb.Hsapiens.v86, "tx")
tx <- transcripts(EnsDb.Hsapiens.v86)
tx <- tx[match(names(exonsByTx), mcols(tx)[, "tx_id"])]
intronsByTx <- psetdiff(tx, exonsByTx)

Back to my original question, however, this method is not so generalizable and does not fit other user defined FUN in lapply. I'm still open to more versatile solutions.

ADD COMMENT • link updated 6.4 years ago by Martin Morgan 25k • written 6.4 years ago by foehn ▴ 100

0

Entering edit mode

The infrastructure includes a good number of utilities for common use cases. Combined with the pattern Martin mentions above, the need for an actual lapply() is very rare. We've optimized lapply() to a reasonable extent, but the bottom line is that decompressing the GRangesList into an actual list is going to be expensive.

ADD REPLY • link 6.4 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thanks for the explanation.

ADD REPLY • link 6.4 years ago foehn ▴ 100

score 0 · Answer 3 · 2019-02-06

0

Entering edit mode

foehn ▴ 100

@foehn-16281

Last seen 3.8 years ago

United States

With regard to the original goal, one workaround is to create an external list to index the GRangesList:

index <- seq_along(GRsList)
res <- mclapply(index, FUN = function(i) gaps(GRsList[[i]]), mc.cores = 4)

Here gaps can be any other procedures applied to a GRanges object.

ADD COMMENT • link 6.4 years ago foehn ▴ 100

0

Entering edit mode

I don't think that's any faster than

mclapply(GRsList, gaps, mc.cores = 4)

? The 'cost' is in creating a GRanges() for each element of the list, the cost is paid either internally by lapply() or explicitly by GRsList[[i]], and either way distributed across cores.

The 'Bioconductor' way would use BiocParallel::bplapply(), which would mean that the code would work on Windows (where mclapply() is not available) and potentially with other back-ends like a cluster (obviously for more complicated FUNs).

ADD REPLY • link 6.4 years ago Martin Morgan 25k

0

Entering edit mode

I understand what you mean, and I agree there should be no fundamental difference. As mentioned before, the first step of mclapply is to convert GRangesList to list, which is time consuming and runs with only one core. Using an external index will avoid this step but distribute the cost to multiple cores, hence in total the running time can be reduced. This is true at least from my test.

ADD REPLY • link 6.4 years ago foehn ▴ 100