Merge dataframes

0

Entering edit mode

João Daniel Nunes Duarte ▴ 30

@joao-daniel-nunes-duarte-4710

Last seen 9.7 years ago

Hello, I am having some problems to use the 'merge' function. I'm not sure if I got its working right. What I want to do is: 1) Suppose I have a dataframe like: height width 1 1.1 2.3 2 2.1 2.5 3 1.8 1.9 4 1.6 2.1 5 1.8 2.4 2) And I generate a second dataframe sampled from this one, like: height width 1 1.1 2.3 3 1.8 1.9 5 1.8 2.4 3) Next, I add a new variable from this dataframe: height width color 1 1.1 2.3 red 3 1.8 1.9 red 5 1.8 2.4 blue 4) So, I want to merge those dataframes, so that the new variable, color, is binded to the first dataframe. Of course some cases won't have value for it, since I generated this variable in a smaller dataframe. In those cases I want the value to be NA. The result dataframe should be: height width color 1 1.1 2.3 red 2 2.1 2.5 NA 3 1.8 1.9 red 4 1.6 2.1 NA 5 1.8 2.4 blue I have written some codes, but they're not working properly. The new variable has its values mixed up, and they do not correspond to its row.names. # Generate the first dataframe data1 <- data.frame(height=rnorm(20,3,0.2),width=rnorm(20,2,0.5)) # Sample a smaller dataframe from data1 data2 <- data1[sample(1:20,15,replace=F),] # Generate the new variable color <- sample(c("red","blue"),15,replace=T) # Bind the new variable to data2 data2 <- cbind(data2, color) # Merge the data1 and data2$color by row.names, and force it to has the same values that data1. Next it generates a new dataframe where column 1 is the row.name, and then sort it by the row.name from data1. data.frame(merge(data1,data2$color, by=0, all.x=T),row.names=1)[row.names(data1),] I'm not sure what am I doing wrong. Can anyone see where the mistake is? Thank you! Cheers, Joao D. [[alternative HTML version deleted]]

• 1.0k views

ADD COMMENT • link 12.6 years ago João Daniel Nunes Duarte ▴ 30

0

Entering edit mode

João Daniel Nunes Duarte ▴ 30

@joao-daniel-nunes-duarte-4710

Last seen 9.7 years ago

Hi everybody, Paniagua explanation and solution worked very good! Thank you a lot! I got the code to what I need. As Paniagua said, I was losing row.names when I merged a data.frame with a factor object. Instead of merging data1 to data2$color, I merged it to data2["color"]. It preserves row.names. When I do: d1 <- merge(data1, data2["color"], by=0, all.x=T) It creates a new variable on column 1, with the original row.names, so I have to transform this data.frame in a new one, using this variable as the row.names: d2 <- data.frame(d1, row.names=1) It does the job, but row.names are not in the same order as in data1, so I reorder it: d2[row.names(data1),] And it gives me the dataframe data1, binded to the new variable color, with the NAs in the proper position. Thank you all! Cheers, Joao D. Em 07/10/2011, ?s 20:36, Paniagua, Eric escreveu: > Hi Joao, > > To save space, I'm not going to paste the data frames in. Just > start a session, run your original code, and then try out the > commands below. > > The problem is a simple type confusion. If you check the class of > data2$color, you'll see it's not a data frame: > >> class(data2$color) > [1] "factor" > > The problem is that a factor is not a data frame, but merge expects > data frames as arguments. If either one is not a data frame, merge > will coerce the type if possible. In this case, that means that > passing data2$color to merge is equivalent to this: > >> merge(data1, as.data.frame(data2$color), by=0, all.x=T) > > If you inspect the coerced version of the argument, you'll see that > new row names have been assigned during coercion. Well, technically > not "new" ones because a factor does not have row names. In other > words, the row names were lost when you used the '$'. Just run the > following to see: > >> data2$color >> as.data.frame(data2$color) > > The solution is to just subset the data frame instead of pulling out > the factor. If you replace data2$color in your code with > data2["color"], your code behaves as intended. > > A good first clue in discovering this issue is the fact that in your > original code, you end up in the <na> elements all at the bottom, > rows 16-20, every time. The odds of this specific arrangement > happening any one particular time are 1 in 15504, or about > 0.00645%. Coincidentally, the 15 non <na> values are occupying rows > 1-15, making the hypothesis that your rows are being relabeled > somewhere along the way much more probable. And, in this case, > correct. > > Underlying that insight or intuition is the much more general > debugging approach of actually checking and looking at every > intermediate value in your computation that you can. And, of > course, reading the documentation :) > > For instance, check out the class and value of data2["color"]: > >> class(data2["color"]) > [1] "data.frame" > > You can also see that data2["color"] preserves the row names: > >> data2["color"] > > You can try this to verify that with my solution the values work out > correctly (though I'm sure there's a better way to do it directly in > R): > >> data3 <- merge(data1, data2["color"], by=0, all.x=T) >> data3 >> data4 <- data.frame(data3, row.names=1) >> data4 >> data4[row.names(data1),] >> data5 <- merge(data1, data4, by=0) >> data5 >> data6 <- data.frame(data5, row.names=1) >> data6 >> data6[row.names(data1),] > > (Of course, I'm using way more variables there than are required for > the computation. Consider their addition as purely didactic.) > > Anyway, I hope that helps! Happy scripting. > > Best, > Eric > > ________________________________________ > From: bioconductor-bounces at r-project.org [bioconductor-bounces at r-project.org > ] on behalf of Jo?o Daniel Nunes Duarte [jdanielnd at gmail.com] > Sent: Friday, October 07, 2011 6:32 PM > Subject: [BioC] Merge dataframes > > Hello, > > I am having some problems to use the 'merge' function. I'm not sure > if I got > its working right. > > What I want to do is: > > 1) Suppose I have a dataframe like: > > height width > 1 1.1 2.3 > 2 2.1 2.5 > 3 1.8 1.9 > 4 1.6 2.1 > 5 1.8 2.4 > > 2) And I generate a second dataframe sampled from this one, like: > > height width > 1 1.1 2.3 > 3 1.8 1.9 > 5 1.8 2.4 > > 3) Next, I add a new variable from this dataframe: > > height width color > 1 1.1 2.3 red > 3 1.8 1.9 red > 5 1.8 2.4 blue > > 4) So, I want to merge those dataframes, so that the new variable, > color, is > binded to the first dataframe. Of course some cases won't have value > for it, > since I generated this variable in a smaller dataframe. In those > cases I > want the value to be NA. The result dataframe should be: > > height width color > 1 1.1 2.3 red > 2 2.1 2.5 NA > 3 1.8 1.9 red > 4 1.6 2.1 NA > 5 1.8 2.4 blue > > I have written some codes, but they're not working properly. The new > variable has its values mixed up, and they do not correspond to its > row.names. > > # Generate the first dataframe > data1 <- data.frame(height=rnorm(20,3,0.2),width=rnorm(20,2,0.5)) > # Sample a smaller dataframe from data1 > data2 <- data1[sample(1:20,15,replace=F),] > # Generate the new variable > color <- sample(c("red","blue"),15,replace=T) > # Bind the new variable to data2 > data2 <- cbind(data2, color) > # Merge the data1 and data2$color by row.names, and force it to has > the same > values that data1. Next it generates a new dataframe where column 1 > is the > row.name, and then sort it by the row.name from data1. > data.frame(merge(data1,data2$color, by=0, > all.x=T),row.names=1)[row.names(data1),] > > I'm not sure what am I doing wrong. Can anyone see where the mistake > is? > > Thank you! > > Cheers, > > Joao D. > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.6 years ago João Daniel Nunes Duarte ▴ 30

0

Entering edit mode

Hi Joao, A simple one line solution using two functions from the 'caroline' CRAN package could suffice as well: nerge(list(data=data1, col=nv(color, rownames(data2))), all.x=T) nv : named vector nerge : named merge Dave On Sat, 8 Oct 2011, "Jo?o Daniel N. Duarte" wrote: > Hi everybody, > > Paniagua explanation and solution worked very good! Thank you a lot! > I got the code to what I need. > > As Paniagua said, I was losing row.names when I merged a data.frame with a > factor object. Instead of merging data1 to data2$color, I merged it to > data2["color"]. It preserves row.names. > > When I do: > > d1 <- merge(data1, data2["color"], by=0, all.x=T) > > It creates a new variable on column 1, with the original row.names, so I have > to transform this data.frame in a new one, using this variable as the > row.names: > > d2 <- data.frame(d1, row.names=1) > > It does the job, but row.names are not in the same order as in data1, so I > reorder it: > > d2[row.names(data1),] > > And it gives me the dataframe data1, binded to the new variable color, with > the NAs in the proper position. > > Thank you all! > > Cheers, > > Joao D. > > Em 07/10/2011, ?s 20:36, Paniagua, Eric escreveu: > >> Hi Joao, >> >> To save space, I'm not going to paste the data frames in. Just start a >> session, run your original code, and then try out the commands below. >> >> The problem is a simple type confusion. If you check the class of >> data2$color, you'll see it's not a data frame: >> >>> class(data2$color) >> [1] "factor" >> >> The problem is that a factor is not a data frame, but merge expects data >> frames as arguments. If either one is not a data frame, merge will coerce >> the type if possible. In this case, that means that passing data2$color to >> merge is equivalent to this: >> >>> merge(data1, as.data.frame(data2$color), by=0, all.x=T) >> >> If you inspect the coerced version of the argument, you'll see that new row >> names have been assigned during coercion. Well, technically not "new" ones >> because a factor does not have row names. In other words, the row names >> were lost when you used the '$'. Just run the following to see: >> >>> data2$color >>> as.data.frame(data2$color) >> >> The solution is to just subset the data frame instead of pulling out the >> factor. If you replace data2$color in your code with data2["color"], your >> code behaves as intended. >> >> A good first clue in discovering this issue is the fact that in your >> original code, you end up in the <na> elements all at the bottom, rows >> 16-20, every time. The odds of this specific arrangement happening any one >> particular time are 1 in 15504, or about 0.00645%. Coincidentally, the 15 >> non <na> values are occupying rows 1-15, making the hypothesis that your >> rows are being relabeled somewhere along the way much more probable. And, >> in this case, correct. >> >> Underlying that insight or intuition is the much more general debugging >> approach of actually checking and looking at every intermediate value in >> your computation that you can. And, of course, reading the documentation >> :) >> >> For instance, check out the class and value of data2["color"]: >> >>> class(data2["color"]) >> [1] "data.frame" >> >> You can also see that data2["color"] preserves the row names: >> >>> data2["color"] >> >> You can try this to verify that with my solution the values work out >> correctly (though I'm sure there's a better way to do it directly in R): >> >>> data3 <- merge(data1, data2["color"], by=0, all.x=T) >>> data3 >>> data4 <- data.frame(data3, row.names=1) >>> data4 >>> data4[row.names(data1),] >>> data5 <- merge(data1, data4, by=0) >>> data5 >>> data6 <- data.frame(data5, row.names=1) >>> data6 >>> data6[row.names(data1),] >> >> (Of course, I'm using way more variables there than are required for the >> computation. Consider their addition as purely didactic.) >> >> Anyway, I hope that helps! Happy scripting. >> >> Best, >> Eric >> >> ________________________________________ >> From: bioconductor-bounces at r-project.org >> [bioconductor-bounces at r-project.org] on behalf of Jo?o Daniel Nunes Duarte >> [jdanielnd at gmail.com] >> Sent: Friday, October 07, 2011 6:32 PM >> Subject: [BioC] Merge dataframes >> >> Hello, >> >> I am having some problems to use the 'merge' function. I'm not sure if I >> got >> its working right. >> >> What I want to do is: >> >> 1) Suppose I have a dataframe like: >> >> height width >> 1 1.1 2.3 >> 2 2.1 2.5 >> 3 1.8 1.9 >> 4 1.6 2.1 >> 5 1.8 2.4 >> >> 2) And I generate a second dataframe sampled from this one, like: >> >> height width >> 1 1.1 2.3 >> 3 1.8 1.9 >> 5 1.8 2.4 >> >> 3) Next, I add a new variable from this dataframe: >> >> height width color >> 1 1.1 2.3 red >> 3 1.8 1.9 red >> 5 1.8 2.4 blue >> >> 4) So, I want to merge those dataframes, so that the new variable, color, >> is >> binded to the first dataframe. Of course some cases won't have value for >> it, >> since I generated this variable in a smaller dataframe. In those cases I >> want the value to be NA. The result dataframe should be: >> >> height width color >> 1 1.1 2.3 red >> 2 2.1 2.5 NA >> 3 1.8 1.9 red >> 4 1.6 2.1 NA >> 5 1.8 2.4 blue >> >> I have written some codes, but they're not working properly. The new >> variable has its values mixed up, and they do not correspond to its >> row.names. >> >> # Generate the first dataframe >> data1 <- data.frame(height=rnorm(20,3,0.2),width=rnorm(20,2,0.5)) >> # Sample a smaller dataframe from data1 >> data2 <- data1[sample(1:20,15,replace=F),] >> # Generate the new variable >> color <- sample(c("red","blue"),15,replace=T) >> # Bind the new variable to data2 >> data2 <- cbind(data2, color) >> # Merge the data1 and data2$color by row.names, and force it to has the >> same >> values that data1. Next it generates a new dataframe where column 1 is the >> row.name, and then sort it by the row.name from data1. >> data.frame(merge(data1,data2$color, by=0, >> all.x=T),row.names=1)[row.names(data1),] >> >> I'm not sure what am I doing wrong. Can anyone see where the mistake is? >> >> Thank you! >> >> Cheers, >> >> Joao D. >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.6 years ago David Schruth ▴ 30

Login before adding your answer.