On 01/16/2012 08:00 AM, Simon Urbanek wrote:
> On Jan 16, 2012, at 9:02 AM, Prashantha Hebbar wrote:
>
>> Hello friends,
>> I was trying to parallelize a function using mclapply(), but I find
>> that lapply() executes in less time than mclapply(). Here are the
>> system times for both functions.
>> > library(ShortRead)
>> > library(multicore)
>> > fqFiles <- list.files("./test")
>> > system.time(lapply(fqFiles, function(fqFiles) {
>> +     readsFq <- readFastq(dirPath = "./test", pattern = fqFiles)
>> + }))
>>    user  system elapsed
>>   0.399   0.021   0.419
>> > system.time(mclapply(fqFiles, function(fqFiles) {
>> +     readsFq <- readFastq(dirPath = "./test", pattern = fqFiles)
>> + }, mc.cores = 3))
>>    user  system elapsed
>>   0.830   0.151   0.261
>>
>> Since the ./test directory contains three fastq files, I used
>> mc.cores = 3.
>>
>> Here is my mpstat output for mclapply():
>>
>> 04:47:55 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle  intr/s
>> 04:47:56 PM  all   13.86    0.00    1.37    0.00    0.00    0.00    0.00   84.77 1023.23
>> 04:47:56 PM    0   21.21    0.00    2.02    0.00    0.00    0.00    0.00   76.77 1011.11
>> 04:47:56 PM    1   33.00    0.00    2.00    0.00    0.00    0.00    0.00   65.00    9.09
>> 04:47:56 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    0.00
>> 04:47:56 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    3.03
>> 04:47:56 PM    4    3.03    0.00    2.02    0.00    0.00    0.00    0.00   94.95    0.00
>> 04:47:56 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    0.00
>> 04:47:56 PM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    0.00
>> 04:47:56 PM    7   53.00    0.00    4.00    0.00    0.00    0.00    0.00   43.00    0.00
>>
>> Can you please suggest why mclapply() has taken more time than
>> lapply()?
In case it's not clear, the system.time() 'elapsed' time shows that
mclapply() *is* faster overall ('wall clock'): I would have only 0.261
seconds to go for coffee, compared to 0.419 with lapply().
As Simon suggests, a much more common paradigm is to put more work into
the function evaluated by lapply(), e.g., calculating qa(), and then
return the result of the computation, typically a much smaller summary
of the bigger data. Even in this case, your computer will need enough
memory to hold all the fastq data at once; for some purposes it will
make more sense to use FastqSampler and FastqStreamer to iterate over
your file.
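A minimal sketch of that pattern, assuming the 'parallel' package that
ships with R (its mclapply() takes the same arguments used above). To
stay self-contained it writes three small text files and uses
readLines() as a stand-in for readFastq(); the helper summarize_file()
and the file names are hypothetical, not part of ShortRead.

```r
library(parallel)

# Hypothetical stand-in data: three small text files in a temporary directory.
dir <- tempfile("fq")
dir.create(dir)
for (i in 1:3) {
    writeLines(rep("ACGTACGT", 1000 * i),
               file.path(dir, sprintf("s%d.txt", i)))
}
files <- list.files(dir, full.names = TRUE)

# Do the heavy work inside the worker and return only a small summary,
# so that little data has to be serialized back to the master process.
summarize_file <- function(f) {
    x <- readLines(f)  # stand-in for readFastq()
    c(records = length(x), bases = sum(nchar(x)))
}

# mclapply() forks on Unix-alikes; on Windows mc.cores must be 1.
cores <- if (.Platform$OS.type == "windows") 1L else 3L
res <- mclapply(files, summarize_file, mc.cores = cores)
names(res) <- basename(files)
res
```

Each worker sends back a two-element vector rather than the full file
contents, so the cost of returning results stays small however large
the input files are.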
Martin
>
> multicore is designed for parallel *computing*, which is not what you
> are doing. For serial tasks (like yours) it will always be slower,
> because it needs to a) spawn processes, b) read the data (serially,
> since you use the same location), c) serialize all the data and send
> it to the master process, and d) unserialize and concatenate all the
> data in the master process into a list. If you run lapply() it does
> only b), which in your case is not the slowest part. Using multicore
> only makes sense if you actually perform computations (or any
> parallel task).
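One way to see the overhead described in steps a) to d) is to time two
kinds of task side by side. This is only a sketch, assuming the
'parallel' package bundled with R; the functions big_result() and
cpu_bound() are hypothetical examples, and the timings will differ from
machine to machine.

```r
library(parallel)

# mclapply() forks on Unix-alikes; on Windows mc.cores must be 1.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Task 1: trivial work, large result. Forking plus serializing the
# result back to the master can outweigh any gain from parallelism.
big_result <- function(i) seq_len(1e6) + i

# Task 2: real computation, tiny result. Here forking can pay off.
cpu_bound <- function(i) {
    s <- 0
    for (k in seq_len(5e5)) s <- s + sqrt(k + i)
    s
}

# Wall-clock time of an expression, as reported by system.time().
elapsed <- function(expr) system.time(expr)[["elapsed"]]

timings <- c(
    big_serial   = elapsed(lapply(1:4, big_result)),
    big_parallel = elapsed(mclapply(1:4, big_result, mc.cores = cores)),
    cpu_serial   = elapsed(lapply(1:4, cpu_bound)),
    cpu_parallel = elapsed(mclapply(1:4, cpu_bound, mc.cores = cores))
)
print(timings)

# Both approaches return the same values; only the cost differs.
same <- identical(lapply(1:4, cpu_bound),
                  mclapply(1:4, cpu_bound, mc.cores = cores))
```

Comparing the 'elapsed' values for the two task types gives a rough
feel for when the fork and serialization costs are worth paying.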
>
> Cheers,
> Simon
>
>
>> Thanking you in anticipation.
>> Regards,
>> Prashantha
>> Prashantha Hebbar Kiradi,
>>
>> E-mail: prashantha.hebbar at dasmaninstitute.org
>>
>> _______________________________________________
>> R-sig-hpc mailing list
>> R-sig-hpc at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793