Hello.
Recently I upgraded R from 3.6 to 4.0, and also upgraded RStudio.
I have a script with two for loops, which run 30 and 60,000 iterations respectively. Before the upgrade it took about 10 minutes, but now it takes more than an hour.
I'm running exactly the same script with exactly the same input. Unfortunately, removing R 4.0 did not solve the problem; it still takes much longer than 10 minutes.
Is anybody experiencing something similar? Is there any way to solve this problem?
Thank you.
Does this have anything to do with Bioconductor? This sounds like a personal setup issue rather than anything to do with any Bioconductor packages.
I guess it's correct that this post is not appropriate. My mistake. Thanks all.
We do not know if it is relevant or not due to lack of information. Can you elaborate, please? I / we are happy to help.
Thanks for the kind words. It seemed not appropriate because I found out it is not related to any Bioconductor package, only to base R functions.
For those (if any) who want to see the problem, the code is as follows; "final" is a matrix with about 60,000 rows and 98 columns.
As you can see, there is nothing more than base R functions inside the for loops. Some people recommend avoiding for loops, but I don't know whether there is a way to avoid them in this case.
Thank you.
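A simplified stand-in for the loop (the real column indices and cutoff are placeholders, but the shape is the same: an outer loop over conditions and an inner loop testing every row):

```r
# Stand-in for the real loop: "final" has ~60000 rows and 98 columns;
# the column numbers and the 0.5 cutoff below are placeholders.
set.seed(42)
final <- matrix(rnorm(60000 * 98), nrow = 60000, ncol = 98)
hits  <- numeric(nrow(final))        # per-row counter updated inside the loop

for (j in 1:30) {                    # outer loop: 30 iterations
  for (i in 1:nrow(final)) {         # inner loop: one test per row
    if (final[i, j] > 0.5) {
      hits[i] <- hits[i] + 1
    }
  }
}
```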
I see what you mean - no problem.
lapply() can work quicker than a for loop, but it may take a while to become familiar with how lapply() works. You could also 'parallelise' your for loop with foreach - I have a brief intro to parallel processing (including foreach) on Biostars: Tutorial: Parallel processing in R.

Thank you very much for your kind reply. I will take a look at both.
Thank you again.
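For concreteness, a toy version of the lapply() rewrite suggested above (sqrt() stands in for the real loop body):

```r
x <- 1:10

# for loop: fill a preallocated result vector element by element
res_for <- numeric(length(x))
for (i in seq_along(x)) {
  res_for[i] <- sqrt(x[i])
}

# lapply(): the same computation, returned as a list, then flattened
res_lapply <- unlist(lapply(x, sqrt))
```

(In practice the big wins usually come from vectorizing the body itself, e.g. `sqrt(x)` in one call, rather than from lapply() per se.)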
Sorry to bother you, but when I tried the foreach function, there doesn't seem to be much time difference.
Here is the code I used:
The "cores" variable gave 8. Below is the same code as above, but with j changed from 1:32 to 1:2. First, without using foreach:
This gave 5.34 mins.
Second, using the foreach function:
This gave 5.63 mins.
Is there anything I did wrong?
Thank you very much.
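One general note: when each iteration does relatively little work, the overhead of sending tasks to workers and collecting results can cancel the speedup, which may be why foreach shows no improvement here. A base-R sketch of the same pattern using the parallel package (the function body and sizes are made up for illustration):

```r
library(parallel)

heavy_step <- function(j) {        # placeholder for one outer-loop iteration
  sum(sqrt(seq_len(1e5)) + j)
}

# mclapply() forks workers on Linux/macOS (and falls back to serial on
# Windows); with cheap iterations the fork/collect overhead can eat the gain.
res_serial   <- lapply(1:8, heavy_step)
res_parallel <- mclapply(1:8, heavy_step, mc.cores = 2)
```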
The inner loop is a series of tests. You perform the test for each row (one R expression per line, so nrow expressions), but the values could be computed in a 'vectorized' fashion with a single R expression. For instance
calculates the same values as
but does it much more efficiently. So I revised your code to
Make sure I have not made a mistake! This will be much faster.
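As a toy illustration of "one vectorized expression instead of nrow expressions" (data made up):

```r
m <- matrix(rnorm(10 * 4), ncol = 4)

# one vectorized expression...
v1 <- m[, 1] > 0

# ...calculates the same values as nrow separate expressions:
v2 <- logical(nrow(m))
for (i in seq_len(nrow(m))) v2[i] <- m[i, 1] > 0
```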
I looked at this a little more carefully. Here's your original function:
and my (updated) implementation:
Here's some indication that I'm doing the right thing
And that I'm doing it quite a bit faster
But even your iteration is taking 3.7 seconds, whereas you mention 'minutes'.
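The shape of that comparison, with stand-in functions since the originals are not shown here (f1() iterates, f2() vectorizes; the real bodies differ):

```r
f1 <- function(m) {                 # stand-in for the iterative version
  out <- numeric(nrow(m))
  for (i in seq_len(nrow(m))) {
    if (m[i, 1] > 0) out[i] <- m[i, 2]
  }
  out
}

f2 <- function(m) {                 # stand-in for the vectorized version
  out <- numeric(nrow(m))
  idx <- m[, 1] > 0
  out[idx] <- m[idx, 2]
  out
}

set.seed(1)
m <- matrix(rnorm(60000 * 2), ncol = 2)
identical(f1(m), f2(m))                 # TRUE: same answer
system.time(for (k in 1:10) f1(m))      # iterative: noticeably slower
system.time(for (k in 1:10) f2(m))      # vectorized: much faster
```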
So I guess that your data structure is a 'data.frame', and not a 'matrix'. Subsetting a data.frame is very expensive. Here's a smaller example
Ouch! Exploring a little bit (with nrow = 1000 and using system.time()) I can believe that it can take minutes to update a data.frame row by row.
So the messages are (a) use appropriate data structures (matrix) for the computation; (b) 'vectorize' instead of 'iterate'.
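A small demonstration of that cost difference (sizes are illustrative):

```r
n  <- 2000
df <- data.frame(x = rnorm(n), y = numeric(n))
m  <- cbind(x = df$x, y = numeric(n))

# row-wise update of a data.frame: each `df[i, "y"] <- ...` dispatches
# through the data.frame replacement method, which is expensive
t_df <- system.time(for (i in seq_len(n)) df[i, "y"] <- df[i, "x"] + 1)

# the same update on a matrix is far cheaper
t_m <- system.time(for (i in seq_len(n)) m[i, "y"] <- m[i, "x"] + 1)

t_df["elapsed"]; t_m["elapsed"]   # matrix is typically orders of magnitude faster
```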
And finally, comparing R 3.6.3 and R 4.0.2 I find under 3.6.3
and under 4.0.2
So essentially the same time. I'd guess that your data.frame under R 3.6.3 has 'factor' columns, whereas under 4.0.2 it does not, because of changes in the way R creates data.frame objects and the stringsAsFactors default. But your data shouldn't be represented as a data.frame anyway...

You were correct that my "final" was in fact a data.frame, not a matrix. I didn't know subsetting a data.frame was this expensive. However, as far as I know, it is really common and convenient to use a data.frame, since it can handle different types of data (character, numeric, factor, etc.) by column. So do we always have to accept long processing times when we use a data.frame?
By the way, thank you for your insights and (probably) large efforts to answer my post.
f2() is fast even with a data.frame
so writing 'vectorized' code instead of iterations is a very useful skill to learn.
Even if final is a data.frame, it would make sense to write a function like f1() or f2() that operates on just the numeric part, coercing to a matrix, e.g.,
and pass in only the columns that you know will be used (all numeric) as the argument to f1a().

Thank you for the advice. Thank you for the replies, again.
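For reference, the coercion pattern described above might look like this (the column names and the body of f1a() are made up for illustration):

```r
# Mixed-type data.frame: only the numeric columns take part in the computation
final <- data.frame(
  id = paste0("g", 1:5),
  a  = c(1, -2, 3, -4, 5),
  b  = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

f1a <- function(m) {               # operates on a plain numeric matrix
  out <- numeric(nrow(m))
  idx <- m[, "a"] > 0
  out[idx] <- m[idx, "b"]
  out
}

# coerce just the numeric columns once, then compute on the matrix
result <- f1a(as.matrix(final[, c("a", "b")]))
result                             # 10  0 30  0 50
```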