Same script taking much longer after R upgrade to 4.0

Hello.

Recently I upgraded R from 3.6 to 4.0, and also upgraded RStudio.

I have a script with two nested for loops, which run about 30 and 60,000 iterations respectively. Before the upgrade the script took about 10 minutes, but now it takes more than an hour.

I'm running exactly the same script with exactly the same input. Unfortunately, removing R 4.0 did not solve the problem; it still takes much longer than 10 minutes.

Is anybody else experiencing something similar? Is there any way to solve this problem?

Thank you.

Does this have anything to do with Bioconductor? This sounds like a local setup issue rather than a problem with any Bioconductor package.

I guess you're right that this post is not appropriate here. My mistake. Thanks, all.

We do not know whether it is relevant or not, due to the lack of information. Can you elaborate, please? We are happy to help.

Thanks for the kind words. It looked inappropriate because I found out the problem is not related to any Bioconductor package, only to base R functions.

For those (if any) who want to see the problem, the code is as follows; "final" is a matrix with about 60,000 rows and 98 columns.

aa = matrix(NA, nrow = nrow(final), ncol = 34)

st = Sys.time()
for (j in 1:32) {
    for (i in 1:nrow(final)) {
        if (final[i, 3*j] > 1.5 & final[i, 3*j+1] < 0.05 & final[i, 3*j+2] < 0.05) {
            aa[i, j+2] = 1
        } else if (final[i, 3*j] < -1 & final[i, 3*j+1] < 0.05 & final[i, 3*j+2] < 0.05) {
            aa[i, j+2] = 2
        }
    }
}
ed = Sys.time()
ed - st

As you can see, there is nothing but base R functions inside the for loops. Some people recommend avoiding for loops, but I don't know whether there is a way to avoid them in this case.

Thank you.

I see what you mean - no problem. lapply() can be quicker than a for loop, but it may take a while to become familiar with how lapply() works.

You could also 'parallelise' your for loop with foreach - I have a brief intro to parallel processing (including foreach) on Biostars: Tutorial: Parallel processing in R
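
For illustration, a minimal sketch of the lapply() idea (assuming the final and aa objects from your question, and a single column block j):

## classify the rows for one column block j with lapply()
## instead of the explicit inner for loop
j <- 1
codes <- lapply(seq_len(nrow(final)), function(i) {
    if (final[i, 3*j] > 1.5 && final[i, 3*j+1] < 0.05 && final[i, 3*j+2] < 0.05) {
        1
    } else if (final[i, 3*j] < -1 && final[i, 3*j+1] < 0.05 && final[i, 3*j+2] < 0.05) {
        2
    } else {
        NA
    }
})
aa[, j + 2] <- unlist(codes)

Note that this still performs one subsetting call per row, so any speed gain over the for loop will be modest; the big win comes from vectorizing, as discussed further down the thread.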

Thank you very much for your kind reply. I will take a look at both.

Thank you again.

Sorry to bother you, but when I tried the foreach function, there didn't seem to be much of a time difference.

Here is the code I used:

require(doParallel)

cores <- detectCores()

system <- Sys.info()['sysname']

cl <- NULL
if (system == 'Windows') {
    ## Windows: parallelise via a PSOCK cluster
    cl <- makeCluster(getOption('cl.cores', cores))
    registerDoParallel(cl)
    on.exit(stopCluster(cl))
} else {
    ## Unix-alikes: fork-based workers
    options('mc.cores' = cores)
    registerDoParallel(cores)
}

"cores" variable gave 8. Below is the same code with above but changed the j from 1:32 to 1:2. First, without using foreach;

aa = matrix(NA, nrow = nrow(final), ncol = 34)

st = Sys.time()
for (j in 1:2) {
    for (i in 1:nrow(final)) {
        if (final[i, 3*j] > 1.5 & final[i, 3*j+1] < 0.05 & final[i, 3*j+2] < 0.05) {
            aa[i, j+2] = 1
        } else if (final[i, 3*j] < -1 & final[i, 3*j+1] < 0.05 & final[i, 3*j+2] < 0.05) {
            aa[i, j+2] = 2
        } else {
            aa[i, j+2] = 0
        }
    }
}
ed = Sys.time()
ed - st

This gave 5.34 minutes.

Second, using the foreach function:

st = Sys.time()
foreach (j = 1:2, .combine = cbind) %dopar% {
    foreach (i = 1:nrow(final), .combine = c) %dopar% {
        if (final[i, 3*j] > 1.5 & final[i, 3*j+1] < 0.05 & final[i, 3*j+2] < 0.05) {
            1
        } else if (final[i, 3*j] < -1 & final[i, 3*j+1] < 0.05 & final[i, 3*j+2] < 0.05) {
            2
        } else {
            0
        }
    }
}
ed = Sys.time()
ed - st

This gave 5.63 minutes.

Is there anything I did wrong?

Thank you very much.

The inner loop is a series of tests. You perform the test for each row (one R expression per row, so nrow(final) expressions in total), but the values could be computed in a 'vectorized' fashion with a single R expression. For instance

for (i in 1:nrow(final))
    final[i, 3 * j] > 1.5

calculates the same values as

final[, 3 * j] > 1.5

but does it much more efficiently. So I revised your code to

for (j in 1:32){
    test1 <- final[, 3 * j] > 1.5
    test2 <- final[, 3 * j + 1] < 0.05
    test3 <- final[, 3 * j + 2] < 0.05

    test4 <- test1 & test2 & test3
    test5 <- !test4 & (final[, 3 * j] < -1 & test2 & test3)

    aa[test4, j + 2] = 1
    aa[test5, j + 2] = 2
}

Make sure I have not made a mistake! This will be much faster.

I looked at this a little more carefully. Here's your original function:

f1 <- function(final) {
    aa = matrix(NA, nrow = nrow(final), ncol=34)
    for (j in 1:32){
        for (i in 1:nrow(final)){
            if (final[i,3*j] > 1.5 &
                final[i,3*j+1] < 0.05 &
                final[i,3*j+2] < 0.05) {
                aa[i,j+2] = 1
            } else if (final[i,3*j] < -1 &
                       final[i,3*j+1] < 0.05 &
                       final[i,3*j+2] < 0.05) {
                aa[i,j+2] = 2
            } else {
                aa[i,j+2] = 0
            }
        }
    }
    aa
}

and my (updated) implementation:

f2 <- function(final) {
    aa = matrix(NA, nrow = nrow(final), ncol = 34)
    for (j in 1:32) {
        test1 <- final[, 3 * j] > 1.5
        test2 <- final[, 3 * j + 1] < 0.05
        test3 <- final[, 3 * j + 2] < 0.05
        test4 <- final[, 3 * j] < -1
        test5 <- test1 & test2 & test3
        test6 <- !test5 & (test4 & test2 & test3)
        test7 <- !(test5 | test6)
        aa[test5, j + 2] = 1
        aa[test6, j + 2] = 2
        aa[test7, j + 2] = 0
    }
    aa
}

Here's some indication that I'm doing the right thing

set.seed(123); nrow = 60000; ncol = 98
m <- matrix(rnorm(nrow * ncol), nrow)
identical(f1(m), f2(m))
## [1] TRUE

And that I'm doing it quite a bit faster

microbenchmark::microbenchmark(f1(m), f2(m), times = 5)
## Unit: milliseconds
##   expr       min        lq      mean    median       uq       max neval cld
##  f1(m) 3627.0438 3752.4945 3741.0400 3768.0311 3773.083 3784.5472     5   b
##  f2(m)  315.7972  315.8579  327.9973  319.2417  339.672  349.4176     5  a 

But even your iteration is taking 3.7 seconds, whereas you mention 'minutes'.

So I guess that your data structure is a 'data.frame', and not a 'matrix'. Subsetting a data.frame is very expensive. Here's a smaller example

set.seed(123); nrow = 10; ncol = 98
m <- matrix(rnorm(nrow * ncol), nrow)
df <- as.data.frame(m)
microbenchmark::microbenchmark(f1(m), f1(df), times = 5)
## Unit: microseconds
##    expr       min        lq       mean    median        uq        max neval cld
##   f1(m)   574.814   578.177   614.1724   588.725   608.123    721.023     5  a 
##  f1(df) 41884.143 44706.265 58805.1530 47080.256 47826.073 112529.028     5   b

Ouch! Exploring a little bit (with nrow = 1000 and using system.time())

set.seed(123); nrow = 1000; ncol = 98
m <- matrix(rnorm(nrow * ncol), nrow)
df <- as.data.frame(m)
system.time(f1(df))
##    user  system elapsed
##   4.239   0.033   4.275

I can believe that row-wise updating of a data.frame can take minutes.

So the messages are (a) use appropriate data structures (matrix) for the computation; (b) 'vectorize' instead of 'iterate'.

And finally, comparing R 3.6.3 and R 4.0.2 I find under 3.6.3

> set.seed(123); nrow = 1000; ncol = 98
> m <- matrix(rnorm(nrow * ncol), nrow)
> df <- as.data.frame(m)
> system.time(f1(df))
   user  system elapsed
  4.239   0.033   4.275

and under 4.0.2

> set.seed(123); nrow = 1000; ncol = 98
> m <- matrix(rnorm(nrow * ncol), nrow)
> df <- as.data.frame(m)
> system.time(f1(df))
   user  system elapsed
  4.481   0.032   4.520

So essentially the same time. I'd guess that your data.frame() under R 3.6.3 has 'factor' columns, whereas under 4.0.2 it does not, because of changes in the way R creates data.frame objects and the stringsAsFactors default. But your data shouldn't be represented as a data.frame anyway...
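
As a quick illustration of that default change (a minimal sketch, separate from the timings above):

df <- data.frame(x = c("a", "b"))
class(df$x)
## R 3.6.x: "factor"    (stringsAsFactors defaulted to TRUE)
## R 4.0.x: "character" (stringsAsFactors defaults to FALSE)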

You were correct: my "final" was in fact a data.frame, not a matrix. I didn't know subsetting a data.frame was this expensive. However, as far as I know it is really common and convenient to use a data.frame, since it can handle different types of data (character, numeric, factor, etc.) by column. Do we always have to suffer long processing times when we use a data.frame?

By the way, thank you for your insights and the (probably) large effort of answering my post.

f2() is fast even with a data.frame

> set.seed(123); nrow = 60000; ncol = 98
> m <- matrix(rnorm(nrow * ncol), nrow)
> df <- as.data.frame(m)
> system.time(f2(df))
   user  system elapsed
  0.226   0.033   0.259

so writing 'vectorized' code instead of iterations is a very useful skill to learn.
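
As one more sketch of the vectorized style (again assuming the final and aa objects from the question), the classification for a single column block j can also be written with nested ifelse() calls:

## one vectorized expression per column block, no inner loop;
## works whether 'final' is a matrix or a data.frame
j <- 1
aa[, j + 2] <- ifelse(
    final[, 3*j] > 1.5 & final[, 3*j+1] < 0.05 & final[, 3*j+2] < 0.05, 1,
    ifelse(final[, 3*j] < -1 & final[, 3*j+1] < 0.05 & final[, 3*j+2] < 0.05, 2, 0))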

Even if final is a data.frame, it would make sense to write a function like f1() or f2() that operates on just the numeric part, coercing to a matrix, e.g.,

f1a <- function(final) {
    final <- as.matrix(final)   ## coerce once, up front
    f1(final)                   ## rest is the same as f1()
}

and pass in only the columns that you know will be used (all numeric) as the argument to f1a().
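
For example, with the m and df objects from earlier in the thread (where every column is numeric), a quick check might look like:

aa <- f1a(df)        ## fast, even though df is a data.frame
identical(aa, f1(m)) ## same result as the matrix version

Just remember that the 3 * j indexing assumes a particular column layout, so pass the numeric columns as one contiguous block.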

Thank you for the advice, and thank you again for all the replies.
