trigger package fails at parallelization - Transcriptional Regulatory Inference from Genetics of ExpRession
@affennacken-6905

Dear Bioconductor Community,

The reference manual (October 21, 2014) of the Bioconductor trigger package states that calculations are run in parallel, at least on large datasets (p. 11: trigger.mlink-methods; p. 12: trigger.net-method). That makes sense, because a large number of permutations may be involved. However, I cannot get parallel processing to run, neither on the minimal example provided below nor on larger datasets. As the example below shows, I am using doMC to handle parallelization. Should I install a parallelization package other than doMC? Am I misreading the reference manual? Or is the trigger package buggy in this respect?

Help is greatly appreciated,
Kind regards,

Jonas

 

No parallel processing is achieved using the following code:

library(doMC)
library(trigger)
## registering multiple cores
registerDoMC(cores = 4)
## loading trigger accompanied data:
data(yeast)
attach(yeast)
## sample gene indexes to idx
set.seed(666)
idx <- c(unique(sort(sample(1:nrow(exp), size = 150, replace = F)),383,590,5003,4949))
my_trigger <- trigger.build(exp = exp[idx,], exp.pos = exp.pos[idx,], marker=marker, marker.pos = marker.pos)
my_loclink <- trigger.loclink(my_trigger, window.size = 30000)
my_mlink <- trigger.mlink(my_loclink, B = 100,seed = 666)

 

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] doMC_1.3.3      iterators_1.0.7 foreach_1.4.2   trigger_1.10.0
[5] qtl_1.33-7      corpcor_1.6.7  

loaded via a namespace (and not attached):
[1] codetools_0.2-9 qvalue_1.38.0   sva_3.10.0      tcltk_3.1.1    
[5] tools_3.1.1  

 

@valerie-obenchain-4275

Hi Jonas,

Functions in the trigger package are not themselves run in parallel. I believe the authors intended that 'idx' would be used as the chunking argument to a parallel function outside the package. You can do this with doMC; you just need a foreach object and evaluation with %dopar%.

library(doMC)
cores <- 4
registerDoMC(cores = cores)
...
...

The gene index should be a list. For this example I'll split into approximately equal groups across the number of workers.

nrows <- nrow(my_loclink@exp)
idx <- split(seq_len(nrows), ceiling(seq_len(nrows)/(nrows/cores)))

> length(idx)
[1] 4
> elementLengths(idx)
 1  2  3  4 
37 38 37 38 
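
As a possible alternative for the chunking step, the base R parallel package (already attached in the session above) provides splitIndices(), which divides 1..n into a requested number of contiguous groups. The sketch below uses the 150 genes and 4 workers of this example; the names n_genes, n_workers and idx_alt are illustrative only and not part of the trigger package.

```r
library(parallel)

n_genes   <- 150   # rows of the expression matrix in this example
n_workers <- 4     # number of registered cores

# splitIndices() returns a list of integer vectors that together cover 1..n_genes
idx_alt <- splitIndices(n_genes, n_workers)

length(idx_alt)            # one list element per worker
sapply(idx_alt, length)    # group sizes, roughly equal
```

The resulting list can be used in the same way as 'idx' above.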

Create a foreach object and R expression then evaluate them with %dopar%. 

res <- foreach(i = idx) %dopar% {
    trigger.mlink(my_loclink, B=100, i=i, seed=666) }
> res <- foreach(i = idx) %dopar% {
+     trigger.mlink(my_loclink, B=100, i=i, seed=666) }
Error in { : 
  task 1 failed - "Please select at least 100 genes to compute multi-locus linkage for them"

Looks like we need at least 100 genes in each list element for a user-supplied 'idx'. This data set is small, only 150 genes, so we'll fake it just to demonstrate the parallel example.

idx <- list(1:100, 1:100)
res <- foreach(i = idx) %dopar% {
    trigger.mlink(my_loclink, B=100, i=i, seed=666) }

4 cores were registered, but the list has length 2, so only 2 workers are active ...

> res <- foreach(i = idx) %dopar% {
+     trigger.mlink(my_loclink, B=100, i=i, seed=666) }
[1] Start to calculate multi-locus linkage statistics ...
[1] Start to calculate multi-locus linkage statistics ...
[1] 10% completed
[1] 10% completed
[1] 20% completed
[1] 20% completed
[1] 30% completed
[1] 30% completed
...

and the result -

> res
[[1]]
*** TRIGGER object *** 
Marker matrix with  3244 rows and  112 columns 
Expression matrix with  150 rows and  112 columns 

[[2]]
*** TRIGGER object *** 
Marker matrix with  3244 rows and  112 columns 
Expression matrix with  150 rows and  112 columns 
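
On a larger data set the index list would need to respect the 100-gene minimum reported in the error message above. A minimal sketch of one way to do that in base R follows; chunk_min_size() is a hypothetical helper written for this post, not a function from trigger, foreach or doMC, and the default of 100 is taken only from that error message.

```r
# Hypothetical helper: split 1..n into contiguous chunks, using at most
# `workers` chunks, so that every chunk holds at least `min_size` indices.
chunk_min_size <- function(n, workers, min_size = 100) {
    if (n < min_size)
        stop("fewer than 'min_size' elements; cannot form a valid chunk")
    # the number of chunks is limited by both the worker count and the minimum size
    n_chunks <- max(1L, min(as.integer(workers), n %/% min_size))
    # assign each index to a chunk; chunk sizes differ by at most one
    grp <- ceiling(seq_len(n) * n_chunks / n)
    unname(split(seq_len(n), grp))
}

# the 150-gene example: only one chunk satisfies the minimum
sapply(chunk_min_size(150, workers = 4), length)

# a larger, made-up gene count: four chunks, one per worker
sapply(chunk_min_size(450, workers = 4), length)
```

With the 150 genes used here the helper returns a single chunk, so nothing can be gained from several workers under that minimum. With 450 genes and 4 workers it returns four chunks of 112 or 113 indices, which could be passed to foreach(i = idx) as above.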


Another option for parallel work is the BiocParallel package.

library(BiocParallel)

Multicore, Snow and BatchJobs backends are supported. We'll use Multicore since you were using doMC.

Register a MulticoreParam with 4 workers.

register(MulticoreParam(workers = 4))

BiocParallel has a family of bp*apply functions that are based on lapply(), sapply(), mapply() etc. but are run in parallel. bplapply() is similar to lapply(); the first argument is a list and each element is passed to FUN.

Create the FUN to be run on each worker.

FUN <- function(i) 
    trigger.mlink(my_loclink, B=100, i=i, seed=666)

Execute bplapply():

res <- bplapply(idx, FUN=FUN)

and we get the same result -

> res
[[1]]
*** TRIGGER object *** 
Marker matrix with  3244 rows and  112 columns 
Expression matrix with  150 rows and  112 columns 

[[2]]
*** TRIGGER object *** 
Marker matrix with  3244 rows and  112 columns 
Expression matrix with  150 rows and  112 columns 


Valerie
