I need to run a parallel computation over a large, read-only local variable. I am using 'MulticoreParam', expecting the forked workers to share memory for efficiency, but the performance looks very strange to me. I ran some tests which confused me further: test1 works as expected; test4 is what I am actually aiming for, but it starts the workers without fully using them (only 2~3 cores are busy). test2 and test3 provide some hints, but I cannot figure out what is going on.
Any help is appreciated!
UPDATE: The phenomenon below does not show up in R 3.6.1 or R 4.0.3, as noted in the comment below.
require(BiocParallel)
register(MulticoreParam(20))

rN <- 30000
cN <- 5000
X <- matrix(rnorm(rN * cN), ncol = cN)  # large read-only matrix to be used by the workers

## test1: the worker function reads the global X directly
test1 <- function() {
  ids <- sample(LETTERS[1:20], cN, replace = TRUE)
  message("parallel")
  tmp <- bplapply(LETTERS[1:20], function(id) {
    y <- X[, ids %in% id, drop = FALSE]
    return(apply(y, 1, sum, na.rm = TRUE) / sum(y))
  })
  return(tmp)
}
## test2: a local copy X1 <- X exists in the calling frame, but the worker still reads the global X
test2 <- function() {
  X1 <- X
  ids <- sample(LETTERS[1:20], cN, replace = TRUE)
  message("parallel")
  tmp <- bplapply(LETTERS[1:20], function(id) {
    y <- X[, ids %in% id, drop = FALSE]
    return(apply(y, 1, sum, na.rm = TRUE) / sum(y))
  })
  return(tmp)
}
## test3: the local copy is created and removed again before bplapply; the worker reads X
test3 <- function() {
  X1 <- X
  rm(X1)
  ids <- sample(LETTERS[1:20], cN, replace = TRUE)
  message("parallel")
  tmp <- bplapply(LETTERS[1:20], function(id) {
    y <- X[, ids %in% id, drop = FALSE]
    return(apply(y, 1, sum, na.rm = TRUE) / sum(y))
  })
  return(tmp)
}
## test4: the worker function reads the local copy X1 (the intended usage)
test4 <- function() {
  X1 <- X
  ids <- sample(LETTERS[1:20], cN, replace = TRUE)
  message("parallel")
  tmp <- bplapply(LETTERS[1:20], function(id) {
    y <- X1[, ids %in% id, drop = FALSE]
    return(apply(y, 1, sum, na.rm = TRUE) / sum(y))
  })
  return(tmp)
}
message("test1")
print(system.time(res <- test1()))
message("test2")
print(system.time(res <- test2()))
message("test3")
print(system.time(res <- test3()))
message("test4")
print(system.time(res <- test4()))
And the output:
Loading required package: BiocParallel
test1
parallel
   user  system elapsed
  0.064   0.066   0.603
test2
parallel
   user  system elapsed
  6.302  12.067  18.534
test3
parallel
   user  system elapsed
  0.052   0.059   0.549
test4
parallel
   user  system elapsed
  5.608  13.019  19.130
And the session info:
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)
Matrix products: default
BLAS: /.../pkg/R/3.5.1/centos6/lib64/R/lib/libRblas.so
LAPACK: /.../pkg/R/3.5.1/centos6/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] BiocParallel_1.16.6
loaded via a namespace (and not attached):
[1] compiler_3.5.1 parallel_3.5.1
Thank you for the reply! And yes, it is for sc/sn datasets. Using rowSums is indeed faster than apply, thanks for the suggestion. I have modified the original post to include the running times and sessionInfo, and I also changed cN to 5000. The weird part is test2 and test4, which take a long time; increasing cN has an exponential impact on their running time. However, that does not seem to happen on your system, which prompted me to test on a newer version of R (3.6), and there the weird behaviour is gone: all four tests take a similar amount of time and run faster!
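For reference, this is a minimal sketch of the rowSums() variant I tried, reusing the X, cN, and registered MulticoreParam from above (the function name test1_rowsums is just illustrative):

## Same computation as test1, but with rowSums() instead of the per-row apply()
test1_rowsums <- function() {
  ids <- sample(LETTERS[1:20], cN, replace = TRUE)
  tmp <- bplapply(LETTERS[1:20], function(id) {
    y <- X[, ids %in% id, drop = FALSE]
    rowSums(y, na.rm = TRUE) / sum(y)
  })
  return(tmp)
}
print(system.time(res <- test1_rowsums()))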