Merge dataframes
1
0
Entering edit mode
@joao-daniel-nunes-duarte-4710
Last seen 9.7 years ago
Hello, I am having some problems to use the 'merge' function. I'm not sure if I got its working right. What I want to do is: 1) Suppose I have a dataframe like: height width 1 1.1 2.3 2 2.1 2.5 3 1.8 1.9 4 1.6 2.1 5 1.8 2.4 2) And I generate a second dataframe sampled from this one, like: height width 1 1.1 2.3 3 1.8 1.9 5 1.8 2.4 3) Next, I add a new variable from this dataframe: height width color 1 1.1 2.3 red 3 1.8 1.9 red 5 1.8 2.4 blue 4) So, I want to merge those dataframes, so that the new variable, color, is binded to the first dataframe. Of course some cases won't have value for it, since I generated this variable in a smaller dataframe. In those cases I want the value to be NA. The result dataframe should be: height width color 1 1.1 2.3 red 2 2.1 2.5 NA 3 1.8 1.9 red 4 1.6 2.1 NA 5 1.8 2.4 blue I have written some codes, but they're not working properly. The new variable has its values mixed up, and they do not correspond to its row.names. # Generate the first dataframe data1 <- data.frame(height=rnorm(20,3,0.2),width=rnorm(20,2,0.5)) # Sample a smaller dataframe from data1 data2 <- data1[sample(1:20,15,replace=F),] # Generate the new variable color <- sample(c("red","blue"),15,replace=T) # Bind the new variable to data2 data2 <- cbind(data2, color) # Merge the data1 and data2$color by row.names, and force it to has the same values that data1. Next it generates a new dataframe where column 1 is the row.name, and then sort it by the row.name from data1. data.frame(merge(data1,data2$color, by=0, all.x=T),row.names=1)[row.names(data1),] I'm not sure what am I doing wrong. Can anyone see where the mistake is? Thank you! Cheers, Joao D. [[alternative HTML version deleted]]
• 1.0k views
ADD COMMENT
0
Entering edit mode
@joao-daniel-nunes-duarte-4710
Last seen 9.7 years ago
Hi everybody, Paniagua explanation and solution worked very good! Thank you a lot! I got the code to what I need. As Paniagua said, I was losing row.names when I merged a data.frame with a factor object. Instead of merging data1 to data2$color, I merged it to data2["color"]. It preserves row.names. When I do: d1 <- merge(data1, data2["color"], by=0, all.x=T) It creates a new variable on column 1, with the original row.names, so I have to transform this data.frame in a new one, using this variable as the row.names: d2 <- data.frame(d1, row.names=1) It does the job, but row.names are not in the same order as in data1, so I reorder it: d2[row.names(data1),] And it gives me the dataframe data1, binded to the new variable color, with the NAs in the proper position. Thank you all! Cheers, Joao D. Em 07/10/2011, ?s 20:36, Paniagua, Eric escreveu: > Hi Joao, > > To save space, I'm not going to paste the data frames in. Just > start a session, run your original code, and then try out the > commands below. > > The problem is a simple type confusion. If you check the class of > data2$color, you'll see it's not a data frame: > >> class(data2$color) > [1] "factor" > > The problem is that a factor is not a data frame, but merge expects > data frames as arguments. If either one is not a data frame, merge > will coerce the type if possible. In this case, that means that > passing data2$color to merge is equivalent to this: > >> merge(data1, as.data.frame(data2$color), by=0, all.x=T) > > If you inspect the coerced version of the argument, you'll see that > new row names have been assigned during coercion. Well, technically > not "new" ones because a factor does not have row names. In other > words, the row names were lost when you used the '$'. Just run the > following to see: > >> data2$color >> as.data.frame(data2$color) > > The solution is to just subset the data frame instead of pulling out > the factor. If you replace data2$color in your code with > data2["color"], your code behaves as intended. > > A good first clue in discovering this issue is the fact that in your > original code, you end up in the <na> elements all at the bottom, > rows 16-20, every time. The odds of this specific arrangement > happening any one particular time are 1 in 15504, or about > 0.00645%. Coincidentally, the 15 non <na> values are occupying rows > 1-15, making the hypothesis that your rows are being relabeled > somewhere along the way much more probable. And, in this case, > correct. > > Underlying that insight or intuition is the much more general > debugging approach of actually checking and looking at every > intermediate value in your computation that you can. And, of > course, reading the documentation :) > > For instance, check out the class and value of data2["color"]: > >> class(data2["color"]) > [1] "data.frame" > > You can also see that data2["color"] preserves the row names: > >> data2["color"] > > You can try this to verify that with my solution the values work out > correctly (though I'm sure there's a better way to do it directly in > R): > >> data3 <- merge(data1, data2["color"], by=0, all.x=T) >> data3 >> data4 <- data.frame(data3, row.names=1) >> data4 >> data4[row.names(data1),] >> data5 <- merge(data1, data4, by=0) >> data5 >> data6 <- data.frame(data5, row.names=1) >> data6 >> data6[row.names(data1),] > > (Of course, I'm using way more variables there than are required for > the computation. Consider their addition as purely didactic.) > > Anyway, I hope that helps! Happy scripting. > > Best, > Eric > > ________________________________________ > From: bioconductor-bounces at r-project.org [bioconductor-bounces at r-project.org > ] on behalf of Jo?o Daniel Nunes Duarte [jdanielnd at gmail.com] > Sent: Friday, October 07, 2011 6:32 PM > Subject: [BioC] Merge dataframes > > Hello, > > I am having some problems to use the 'merge' function. I'm not sure > if I got > its working right. > > What I want to do is: > > 1) Suppose I have a dataframe like: > > height width > 1 1.1 2.3 > 2 2.1 2.5 > 3 1.8 1.9 > 4 1.6 2.1 > 5 1.8 2.4 > > 2) And I generate a second dataframe sampled from this one, like: > > height width > 1 1.1 2.3 > 3 1.8 1.9 > 5 1.8 2.4 > > 3) Next, I add a new variable from this dataframe: > > height width color > 1 1.1 2.3 red > 3 1.8 1.9 red > 5 1.8 2.4 blue > > 4) So, I want to merge those dataframes, so that the new variable, > color, is > binded to the first dataframe. Of course some cases won't have value > for it, > since I generated this variable in a smaller dataframe. In those > cases I > want the value to be NA. The result dataframe should be: > > height width color > 1 1.1 2.3 red > 2 2.1 2.5 NA > 3 1.8 1.9 red > 4 1.6 2.1 NA > 5 1.8 2.4 blue > > I have written some codes, but they're not working properly. The new > variable has its values mixed up, and they do not correspond to its > row.names. > > # Generate the first dataframe > data1 <- data.frame(height=rnorm(20,3,0.2),width=rnorm(20,2,0.5)) > # Sample a smaller dataframe from data1 > data2 <- data1[sample(1:20,15,replace=F),] > # Generate the new variable > color <- sample(c("red","blue"),15,replace=T) > # Bind the new variable to data2 > data2 <- cbind(data2, color) > # Merge the data1 and data2$color by row.names, and force it to has > the same > values that data1. Next it generates a new dataframe where column 1 > is the > row.name, and then sort it by the row.name from data1. > data.frame(merge(data1,data2$color, by=0, > all.x=T),row.names=1)[row.names(data1),] > > I'm not sure what am I doing wrong. Can anyone see where the mistake > is? > > Thank you! > > Cheers, > > Joao D. > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
Hi Joao, A simple one line solution using two functions from the 'caroline' CRAN package could suffice as well: nerge(list(data=data1, col=nv(color, rownames(data2))), all.x=T) nv : named vector nerge : named merge Dave On Sat, 8 Oct 2011, "Jo?o Daniel N. Duarte" wrote: > Hi everybody, > > Paniagua explanation and solution worked very good! Thank you a lot! > I got the code to what I need. > > As Paniagua said, I was losing row.names when I merged a data.frame with a > factor object. Instead of merging data1 to data2$color, I merged it to > data2["color"]. It preserves row.names. > > When I do: > > d1 <- merge(data1, data2["color"], by=0, all.x=T) > > It creates a new variable on column 1, with the original row.names, so I have > to transform this data.frame in a new one, using this variable as the > row.names: > > d2 <- data.frame(d1, row.names=1) > > It does the job, but row.names are not in the same order as in data1, so I > reorder it: > > d2[row.names(data1),] > > And it gives me the dataframe data1, binded to the new variable color, with > the NAs in the proper position. > > Thank you all! > > Cheers, > > Joao D. > > Em 07/10/2011, ?s 20:36, Paniagua, Eric escreveu: > >> Hi Joao, >> >> To save space, I'm not going to paste the data frames in. Just start a >> session, run your original code, and then try out the commands below. >> >> The problem is a simple type confusion. If you check the class of >> data2$color, you'll see it's not a data frame: >> >>> class(data2$color) >> [1] "factor" >> >> The problem is that a factor is not a data frame, but merge expects data >> frames as arguments. If either one is not a data frame, merge will coerce >> the type if possible. In this case, that means that passing data2$color to >> merge is equivalent to this: >> >>> merge(data1, as.data.frame(data2$color), by=0, all.x=T) >> >> If you inspect the coerced version of the argument, you'll see that new row >> names have been assigned during coercion. Well, technically not "new" ones >> because a factor does not have row names. In other words, the row names >> were lost when you used the '$'. Just run the following to see: >> >>> data2$color >>> as.data.frame(data2$color) >> >> The solution is to just subset the data frame instead of pulling out the >> factor. If you replace data2$color in your code with data2["color"], your >> code behaves as intended. >> >> A good first clue in discovering this issue is the fact that in your >> original code, you end up in the <na> elements all at the bottom, rows >> 16-20, every time. The odds of this specific arrangement happening any one >> particular time are 1 in 15504, or about 0.00645%. Coincidentally, the 15 >> non <na> values are occupying rows 1-15, making the hypothesis that your >> rows are being relabeled somewhere along the way much more probable. And, >> in this case, correct. >> >> Underlying that insight or intuition is the much more general debugging >> approach of actually checking and looking at every intermediate value in >> your computation that you can. And, of course, reading the documentation >> :) >> >> For instance, check out the class and value of data2["color"]: >> >>> class(data2["color"]) >> [1] "data.frame" >> >> You can also see that data2["color"] preserves the row names: >> >>> data2["color"] >> >> You can try this to verify that with my solution the values work out >> correctly (though I'm sure there's a better way to do it directly in R): >> >>> data3 <- merge(data1, data2["color"], by=0, all.x=T) >>> data3 >>> data4 <- data.frame(data3, row.names=1) >>> data4 >>> data4[row.names(data1),] >>> data5 <- merge(data1, data4, by=0) >>> data5 >>> data6 <- data.frame(data5, row.names=1) >>> data6 >>> data6[row.names(data1),] >> >> (Of course, I'm using way more variables there than are required for the >> computation. Consider their addition as purely didactic.) >> >> Anyway, I hope that helps! Happy scripting. >> >> Best, >> Eric >> >> ________________________________________ >> From: bioconductor-bounces at r-project.org >> [bioconductor-bounces at r-project.org] on behalf of Jo?o Daniel Nunes Duarte >> [jdanielnd at gmail.com] >> Sent: Friday, October 07, 2011 6:32 PM >> Subject: [BioC] Merge dataframes >> >> Hello, >> >> I am having some problems to use the 'merge' function. I'm not sure if I >> got >> its working right. >> >> What I want to do is: >> >> 1) Suppose I have a dataframe like: >> >> height width >> 1 1.1 2.3 >> 2 2.1 2.5 >> 3 1.8 1.9 >> 4 1.6 2.1 >> 5 1.8 2.4 >> >> 2) And I generate a second dataframe sampled from this one, like: >> >> height width >> 1 1.1 2.3 >> 3 1.8 1.9 >> 5 1.8 2.4 >> >> 3) Next, I add a new variable from this dataframe: >> >> height width color >> 1 1.1 2.3 red >> 3 1.8 1.9 red >> 5 1.8 2.4 blue >> >> 4) So, I want to merge those dataframes, so that the new variable, color, >> is >> binded to the first dataframe. Of course some cases won't have value for >> it, >> since I generated this variable in a smaller dataframe. In those cases I >> want the value to be NA. The result dataframe should be: >> >> height width color >> 1 1.1 2.3 red >> 2 2.1 2.5 NA >> 3 1.8 1.9 red >> 4 1.6 2.1 NA >> 5 1.8 2.4 blue >> >> I have written some codes, but they're not working properly. The new >> variable has its values mixed up, and they do not correspond to its >> row.names. >> >> # Generate the first dataframe >> data1 <- data.frame(height=rnorm(20,3,0.2),width=rnorm(20,2,0.5)) >> # Sample a smaller dataframe from data1 >> data2 <- data1[sample(1:20,15,replace=F),] >> # Generate the new variable >> color <- sample(c("red","blue"),15,replace=T) >> # Bind the new variable to data2 >> data2 <- cbind(data2, color) >> # Merge the data1 and data2$color by row.names, and force it to has the >> same >> values that data1. Next it generates a new dataframe where column 1 is the >> row.name, and then sort it by the row.name from data1. >> data.frame(merge(data1,data2$color, by=0, >> all.x=T),row.names=1)[row.names(data1),] >> >> I'm not sure what am I doing wrong. Can anyone see where the mistake is? >> >> Thank you! >> >> Cheers, >> >> Joao D. >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY

Login before adding your answer.

Traffic: 382 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6