Help on alternative and efficient data frame manipulation
Julie Zhu ★ 4.3k
@julie-zhu-3596
Last seen 5 months ago
United States
Hi,

I have a data frame with 5000 columns and 16000 rows. I would like to convert every value x in columns 4 to 5000 to 1 if x > 0. The following code works, but it is very slow. Are there more efficient ways to modify a large number of entries in a data frame? Many thanks for your kind help!

id <- 4:ncol(mydata)
for (i in id) { mydata[mydata[, i] > 0, i] <- 1 }

Best regards,

Julie
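For anyone who wants to reproduce the problem, a minimal sketch of the setup (the dimensions are scaled down and the random values are an illustrative assumption; only `mydata` and the 4:ncol(mydata) column range come from the question):

set.seed(1)
## stand-in for the 16000 x 5000 data frame described above
mydata <- data.frame(matrix(rnorm(200 * 200), nrow = 200))
id <- 4:ncol(mydata)

## the loop from the question, applied to the sketch data
for (i in id) { mydata[mydata[, i] > 0, i] <- 1 }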
convert • 799 views
@steve-lianoglou-2771
Last seen 14 months ago
United States
Hi,

You might have better results if you treat the columns of the data.frame as a list, so something like:

for (i in 4:ncol(mydata)) {
  mydata[[i]] <- ifelse(mydata[[i]] > 0, 1, mydata[[i]])
}

## Or, what if you convert to a matrix?
m <- as.matrix(mydata[, -(1:4)])
m[m > 0] <- 1
ans <- cbind(mydata[, 1:4], as.data.frame(m))

Are any of those better?

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
  | Memorial Sloan-Kettering Cancer Center
  | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
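One detail worth noting when trying the matrix version: the question asks for columns 4 through the last one, so only the first three columns should be left untouched. A minimal sketch under that assumption:

id <- 4:ncol(mydata)
m  <- as.matrix(mydata[, id])   # columns 4 .. ncol(mydata)
m[m > 0] <- 1                   # one vectorized replacement instead of a per-column loop
ans <- cbind(mydata[, 1:3], as.data.frame(m))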
Thanks, Steve. The matrix version is definitely faster. I will also try the list approach to see whether it speeds things up further.

Best regards,

Julie
Steve,

Converting to a matrix resulted in a much larger increase in speed than treating the columns as a list. Here are the comparison results for a 100 by 100 data frame:

id <- 4:ncol(mydata)

system.time(for (i in id) {
  mydata[[i]] <- ifelse(mydata[[i]] > 0, 1, mydata[[i]])
})
   user  system elapsed
  0.034   0.000   0.037

system.time(for (i in id) {
  mydata[, i] <- ifelse(mydata[, i] > 0, 1, mydata[, i])
})
   user  system elapsed
  0.038   0.003   0.042

system.time({
  m <- as.matrix(mydata[, -(id)])
  m[m > 0] <- 1
  ans <- cbind(mydata[, 1:4], as.data.frame(m))
})
   user  system elapsed
  0.006   0.000   0.009

Many thanks for your great suggestions!

Best regards,

Julie
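The 100 by 100 test runs in milliseconds, so for a comparison closer to the real problem size the same code can be run on a sketch data frame with the dimensions from the question. The random counts below are an illustrative assumption, and this takes a few GB of working memory during the conversion, so scale the dimensions down if needed:

set.seed(1)
## 16000 x 5000 stand-in data frame of non-negative counts
big <- as.data.frame(matrix(rpois(16000 * 5000, 1), nrow = 16000))
id  <- 4:ncol(big)

system.time({
  m <- as.matrix(big[, id])
  m[m > 0] <- 1
  ans <- cbind(big[, 1:3], as.data.frame(m))
})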
Hi Julie,

Cool ... I guess the conversion to an intermediate matrix will take more temporary memory, but if you can afford it, then great.

That having been said, it looks like there's a small bug in your test code, no? See this part here:

> id <- 4:ncol(mydata)
[snip]
> system.time({
>   m <- as.matrix(mydata[, -(id)])
>   m[m > 0] <- 1
>   ans <- cbind(mydata[, 1:4], as.data.frame(m))
> })
>    user  system elapsed
>   0.006   0.000   0.009

It looks like your temporary `m` matrix is the opposite of what you want, no? Shouldn't the assignment to `m` be:

m <- as.matrix(mydata[, id])

maybe? Not sure how much of a difference this will make in the timings, but perhaps it's something worth checking ... I reckon both methods are fast enough that the differences between the two aren't worth stressing over either way.

> Many thanks for your great suggestions!

Sure thing ... glad that it was helpful,

-steve
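To see the difference between the two subscripts on a toy example (the 6-column data frame here is an illustrative assumption):

df <- data.frame(matrix(1:18, nrow = 3))   # 3 rows, 6 columns: X1 .. X6
id <- 4:ncol(df)

colnames(df[, -(id)])   # "X1" "X2" "X3"  -- the columns that should stay untouched
colnames(df[, id])      # "X4" "X5" "X6"  -- the columns that should be binarized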
Steve,

Thanks for spotting the bug! Luckily, the code applied to the real data is correct. As you suspected, fixing the bug did not change the timings much.

Thanks again!

Best regards,

Julie