How to use BiocParallel with large local variables?
oyoung122 ▴ 20
@oyoung122-14640
Last seen 3.2 years ago

I need to run a parallel computation over a large, read-only local variable. I am using 'MulticoreParam' on the assumption that forked workers share memory, which should be efficient. However, the performance is puzzling, and the tests I ran confused me further. test1 works as expected. test4 is what I am aiming for, but although all the cores are initiated, only 2~3 of them are actually used. test2 and test3 provide some hints, but I cannot figure out what is going on.

Any help is appreciated!

UPDATE: The behavior described below does not occur in R 3.6.1 or R 4.0.3; see the comments below.

library(BiocParallel)
register(MulticoreParam(20))
rN <- 30000
cN <- 5000
X <- matrix(rnorm(rN * cN), ncol = cN)
# test1: the workers read the global X; there is no local variable X1
test1 <- function() {
    ids <- sample(LETTERS[1:20], cN, replace = TRUE)
    message("parallel")
    tmp <- bplapply(LETTERS[1:20], function(id) {
        y <- X[, ids %in% id, drop = FALSE]
        return(apply(y, 1, sum, na.rm = TRUE) / sum(y))
    })
    return(tmp)
}
# test2: a local X1 is created, but the workers still read the global X
test2 <- function() {
    X1 <- X
    ids <- sample(LETTERS[1:20], cN, replace = TRUE)
    message("parallel")
    tmp <- bplapply(LETTERS[1:20], function(id) {
        y <- X[, ids %in% id, drop = FALSE]
        return(apply(y, 1, sum, na.rm = TRUE) / sum(y))
    })
    return(tmp)
}
# test3: X1 is created and then removed before the parallel call
test3 <- function() {
    X1 <- X
    rm(X1)
    ids <- sample(LETTERS[1:20], cN, replace = TRUE)
    message("parallel")
    tmp <- bplapply(LETTERS[1:20], function(id) {
        y <- X[, ids %in% id, drop = FALSE]
        return(apply(y, 1, sum, na.rm = TRUE) / sum(y))
    })
    return(tmp)
}
# test4: the workers read the local X1 (the case I actually need)
test4 <- function() {
    X1 <- X
    ids <- sample(LETTERS[1:20], cN, replace = TRUE)
    message("parallel")
    tmp <- bplapply(LETTERS[1:20], function(id) {
        y <- X1[, ids %in% id, drop = FALSE]
        return(apply(y, 1, sum, na.rm = TRUE) / sum(y))
    })
    return(tmp)
}
message("test1")
print(system.time(res <- test1()))
message("test2")
print(system.time(res <- test2()))
message("test3")
print(system.time(res <- test3()))
message("test4")
print(system.time(res <- test4()))

And the output:

Loading required package: BiocParallel
test1
parallel
   user  system elapsed
  0.064   0.066   0.603
test2
parallel
   user  system elapsed
  6.302  12.067  18.534
test3
parallel
   user  system elapsed
  0.052   0.059   0.549
test4
parallel
   user  system elapsed
  5.608  13.019  19.130

And the session info:

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)

Matrix products: default
BLAS: /.../pkg/R/3.5.1/centos6/lib64/R/lib/libRblas.so
LAPACK: /.../pkg/R/3.5.1/centos6/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BiocParallel_1.16.6

loaded via a namespace (and not attached):
[1] compiler_3.5.1 parallel_3.5.1
BiocParallel • 879 views
@martin-morgan-1513
Last seen 6 weeks ago
United States

I don't have enough physical memory to replicate your example, so I used:

rN <- 3000
cN <- 50000
X <- matrix(rnorm(rN*cN),ncol=cN)

Use rowSums() rather than apply(); it is much faster:

library(bench)
mark(
    apply(X, 1, sum, na.rm =TRUE),
    rowSums(X, na.rm = TRUE),
    max_iterations = 10
)

leading to more than 10x higher throughput (iterations per second, the `itr/sec` column):

# A tibble: 2 x 13
  expression                          min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 apply(X, 1, sum, na.rm = TRUE)    8.27s    8.27s     0.121        NA    0.242
2 rowSums(X, na.rm = TRUE)       664.78ms 664.78ms     1.50         NA    0    
# … with 7 more variables: n_itr <int>, n_gc <dbl>, total_time <bch:tm>,
#   result <list>, memory <list>, time <list>, gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
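
Before swapping one call for the other, it can be reassuring to check that they agree. A minimal sketch on a small stand-in matrix (the name X_small, its dimensions, and the injected NA values are assumptions for illustration, not part of the benchmark above):

```r
# Small stand-in matrix; the dimensions are illustrative only.
set.seed(1)
X_small <- matrix(rnorm(200 * 300), ncol = 300)
X_small[sample(length(X_small), 50)] <- NA  # a few missing values

by_apply   <- apply(X_small, 1, sum, na.rm = TRUE)
by_rowSums <- rowSums(X_small, na.rm = TRUE)

# Compare the two results up to floating-point rounding
same <- isTRUE(all.equal(by_apply, by_rowSums))
print(same)
```

Both calls return one sum per row, so rowSums() can replace apply(y, 1, sum, na.rm = TRUE) inside the worker function without changing the result.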

Depending on your workflow, it might make sense to use the weirdly named rowsum() function on the transpose of your data; of course transposing such a large matrix could be very expensive.

ids <- factor(
    sample(LETTERS[1:20], cN, replace = TRUE),
    levels = LETTERS[1:20]
)
Xt <- t(X)
rowsum(Xt, ids)
rowsum(rowSums(Xt), ids)
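
To make the connection to the original per-group loop explicit, here is a small sketch that computes the same quantity both ways. The small dimensions, the name X_small, and the use of rpois() counts (chosen so every group total is positive) are assumptions for illustration only:

```r
set.seed(1)
rN <- 50
cN <- 400
X_small <- matrix(rpois(rN * cN, 2), ncol = cN)
ids <- factor(
    sample(LETTERS[1:20], cN, replace = TRUE),
    levels = LETTERS[1:20]
)

# Per-group loop, as in the question (with rowSums() in place of apply())
by_loop <- lapply(levels(ids), function(id) {
    y <- X_small[, ids %in% id, drop = FALSE]
    rowSums(y, na.rm = TRUE) / sum(y)
})
names(by_loop) <- levels(ids)

# One pass: per-group sums for every row, then divide by each group's total
group_sums <- rowsum(t(X_small), ids)            # one row per group, rN columns
by_rowsum  <- group_sums / rowSums(group_sums)   # row g divided by total of group g
```

Row "A" of by_rowsum should then match by_loop[["A"]], and so on for the other groups, with no parallel evaluation involved.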

And I guess this is single cell data, and the matrix is actually sparse; using Matrix::sparseMatrix() and some of the single cell tools outlined in the Orchestrating Single Cell Analysis book might be helpful.
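
As a rough sketch of the sparse route, assuming the data really are mostly zeros: the per-group row sums can be written as a product of the sparse data matrix with a sparse column-to-group indicator matrix. The names X_dense, X_sparse, and G, the dimensions, and the 5% density are all made up for illustration:

```r
library(Matrix)  # recommended package, shipped with standard R installations

set.seed(1)
rN <- 100
cN <- 400
# Illustrative count matrix with roughly 5% non-zero entries
X_dense <- matrix(0, nrow = rN, ncol = cN)
nz <- sample(length(X_dense), 2000)
X_dense[nz] <- rpois(2000, 3) + 1
X_sparse <- Matrix(X_dense, sparse = TRUE)

ids <- factor(
    sample(LETTERS[1:20], cN, replace = TRUE),
    levels = LETTERS[1:20]
)

# Sparse indicator: entry (j, g) is 1 when column j of X belongs to group g
G <- sparseMatrix(
    i = seq_len(cN), j = as.integer(ids), x = 1,
    dims = c(cN, nlevels(ids))
)

# rN x 20 matrix of per-group row sums, computed without densifying X_sparse
group_sums_sparse <- as.matrix(X_sparse %*% G)
```

Only the small rN x 20 result is converted to a dense matrix here; the large input stays sparse throughout.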

With your tests, I have

> message("test1")
test1
> print(system.time(res <- test1()))
parallel
   user  system elapsed 
  9.938   2.366   1.761 
> message("test2")
test2
> print(system.time(res <- test2()))
parallel
   user  system elapsed 
 13.345   3.155   1.801 
> message("test3")
test3
> print(system.time(res <- test3()))
parallel
   user  system elapsed 
 12.473   2.900   1.782 
> message("test4")
test4
> print(system.time(res <- test4()))
parallel
   user  system elapsed 
 10.877   2.521   1.777

and I am not sure what you are seeing that is weird? It might be that the relative cost of communication (e.g., of the result) is large, so that only a few cores are working on the computation... Can you be more explicit about what you are seeing, and provide sessionInfo()? Here's mine

> sessionInfo()
R version 4.0.3 Patched (2020-10-13 r79345)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Users/ma38727/bin/R-4-0-branch/lib/libRblas.dylib
LAPACK: /Users/ma38727/bin/R-4-0-branch/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BiocParallel_1.24.1

loaded via a namespace (and not attached):
[1] compiler_4.0.3        BiocManager_1.30.10.7 parallel_4.0.3
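
One way to probe the communication-cost idea is to look at how large a worker function becomes when it is serialized. In R, a closure is serialized together with its enclosing environment, unless that environment is the global environment or a package namespace. The sketch below uses hypothetical helpers (serialized_size, make_worker_plain, make_worker_with_local) that are not part of BiocParallel, and whether a given BiocParallel version actually serializes the function for MulticoreParam workers is an assumption that has not been established here:

```r
# Size in bytes of a function once serialized, as would happen if it were
# sent to a worker over a connection (an assumption about the backend).
serialized_size <- function(f) length(serialize(f, NULL))

# Hypothetical helpers: each returned closure's enclosing environment is the
# frame of the function that created it.
make_worker_plain <- function() {
    function(id) id
}
make_worker_with_local <- function() {
    X1 <- numeric(1e6)   # roughly 8 MB local variable in the enclosing frame
    function(id) length(X1)
}

size_plain <- serialized_size(make_worker_plain())
size_local <- serialized_size(make_worker_with_local())
print(c(plain = size_plain, with_local = size_local))
```

If the second size is larger by about the size of the local variable, then any backend that ships the function to its workers also ships that variable with every task. This is a general property of R closures, not a confirmed diagnosis of the timings reported in the question.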

Thank you for the reply! And yes, this is for sc/sn datasets. Using rowSums is faster than apply, thanks for the suggestion. I modified the original post to include the running times and sessionInfo, and I also changed cN to 5000. The weird part is that test2 and test4 take a long time, and increasing cN appears to have an exponential impact on their running time. However, this does not seem to happen on your system, which prompted me to test on a newer version of R (3.6). There the problem did NOT appear: all tests took a similar amount of time and ran faster!
