Quickest way to convert IDs in a data frame?

Entering edit mode

enricoferrero ▴ 660

@enricoferrero-6037

Last seen 2.4 years ago

Switzerland

Hello, I often have data frames where I need to perform ID conversions on one or more of the columns while preserving the order of the rows, e.g.: GeneSymbol Value1 Value2 GS1 2.5 0.1 GS2 3 0.2 .. And I want to obtain: GeneSymbol EntrezGeneID Value1 Value2 GS1 EG1 2.5 0.1 GS2 EG2 3 0.2 .. What I've done so far was to create a function that uses org.Hs.eg.db to loop over the rows of the column and does the conversion: library(org.Hs.eg.db) alias2EG <- function(x) { for (i in 1:length(x)) { if (!is.na(x[i])) { repl <- org.Hs.egALIAS2EG[[x[i]]][1] if (!is.null(repl)) { x[i] <- repl } else { x[i] <- NA } } } return(x) } and then call the function like this: df$EntrezGeneID <- alias2GS(df$GeneSymbol) This works well, but gets very slow when I need to do multiple conversions on large datasets. Is there any way I can achieve the same result but in a quicker, more efficient way? Thank you. -- Enrico Ferrero PhD Student Department of Genetics Cambridge Systems Biology Centre University of Cambridge [[alternative HTML version deleted]]

• 1.8k views

ADD COMMENT • link updated 10.8 years ago by Thomas Girke ★ 1.7k • written 10.8 years ago by enricoferrero ▴ 660

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 10 hours ago

United States

Hi Enrico, On 7/25/2013 11:35 AM, Enrico Ferrero wrote: > Hello, > > I often have data frames where I need to perform ID conversions on one or > more of the columns while preserving the order of the rows, e.g.: > > GeneSymbol Value1 Value2 > GS1 2.5 0.1 > GS2 3 0.2 > .. > > And I want to obtain: > > GeneSymbol EntrezGeneID Value1 Value2 > GS1 EG1 2.5 0.1 > GS2 EG2 3 0.2 > .. > > What I've done so far was to create a function that uses org.Hs.eg.db to > loop over the rows of the column and does the conversion: > > library(org.Hs.eg.db) > alias2EG<- function(x) { > for (i in 1:length(x)) { > if (!is.na(x[i])) { > repl<- org.Hs.egALIAS2EG[[x[i]]][1] > if (!is.null(repl)) { > x[i]<- repl > } > else { > x[i]<- NA > } > } > } > return(x) > } I should first note that gene symbols are not unique, so you are taking a chance on your mappings. Is there no other annotation for your data? In addition, you should note that it is almost always better to think of objects as vectors and matrices in R, rather than as things that need to be looped over (e.g., R isn't Perl or C). first.two <- select(org.Hs.eg.db, as.character(df$GeneSymbol), "ENTREZID", "SYMBOL") Note that there used to be a warning or an error (don't remember which) when you did something like this, stating that gene symbols are not unique, and that you shouldn't do this sort of thing. Apparently this warning has been removed, but the issue remains valid. ## check yourself all.equal(df$GeneSymbol, first.two$SYMBOL) ## if true, proceed df <- data.frame(first.two, df[,-1]) Best, Jim > > and then call the function like this: > > df$EntrezGeneID<- alias2GS(df$GeneSymbol) > > This works well, but gets very slow when I need to do multiple conversions > on large datasets. > > Is there any way I can achieve the same result but in a quicker, more > efficient way? > > Thank you. > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 10.8 years ago James W. MacDonald 65k

Entering edit mode

Dear James, Thanks very much for your prompt reply. I knew the problem was the for loop and the select function is indeed a lot faster than that and works perfectly with toy data. However, this is what happens when I try to use it with real data: > test <- select(org.Hs.eg.db, keys=df$GeneSymbol, keytype="ALIAS", cols=c("SYMBOL","ENTREZID","ENSEMBL")) Warning message: In .generateExtraRows(tab, keys, jointype) : 'select' and duplicate query keys resulted in 1:many mapping between keys and return rows which is probably the warning you mentioned. The real problem is that the number of rows is now different for the 2 objects: > nrow(df); nrow(test) [1] 573 [1] 201 So I obviously can't put the new data into the original df. My impression is that when the 1 to many mapping arises, the select functions exits, with that warning message. As a result, my test object is incomplete. On top of that, and I can't really explain this, the row positions are messed up, e.g. > all.equal(df[100,],test[100,]) returns FALSE. How can I work around this? Thanks a lot! Best, On 25 July 2013 16:58, James W. MacDonald <jmacdon at="" uw.edu=""> wrote: > Hi Enrico, > > > On 7/25/2013 11:35 AM, Enrico Ferrero wrote: >> >> Hello, >> >> I often have data frames where I need to perform ID conversions on one or >> more of the columns while preserving the order of the rows, e.g.: >> >> GeneSymbol Value1 Value2 >> GS1 2.5 0.1 >> GS2 3 0.2 >> .. >> >> And I want to obtain: >> >> GeneSymbol EntrezGeneID Value1 Value2 >> GS1 EG1 2.5 0.1 >> GS2 EG2 3 0.2 >> .. >> >> What I've done so far was to create a function that uses org.Hs.eg.db to >> loop over the rows of the column and does the conversion: >> >> library(org.Hs.eg.db) >> alias2EG<- function(x) { >> for (i in 1:length(x)) { >> if (!is.na(x[i])) { >> repl<- org.Hs.egALIAS2EG[[x[i]]][1] >> if (!is.null(repl)) { >> x[i]<- repl >> } >> else { >> x[i]<- NA >> } >> } >> } >> return(x) >> } > > > I should first note that gene symbols are not unique, so you are taking a > chance on your mappings. Is there no other annotation for your data? > > In addition, you should note that it is almost always better to think of > objects as vectors and matrices in R, rather than as things that need to be > looped over (e.g., R isn't Perl or C). > > first.two <- select(org.Hs.eg.db, as.character(df$GeneSymbol), "ENTREZID", > "SYMBOL") > > Note that there used to be a warning or an error (don't remember which) when > you did something like this, stating that gene symbols are not unique, and > that you shouldn't do this sort of thing. Apparently this warning has been > removed, but the issue remains valid. > > ## check yourself > > all.equal(df$GeneSymbol, first.two$SYMBOL) > > ## if true, proceed > > df <- data.frame(first.two, df[,-1]) > > Best, > > Jim > > > >> >> and then call the function like this: >> >> df$EntrezGeneID<- alias2GS(df$GeneSymbol) >> >> This works well, but gets very slow when I need to do multiple conversions >> on large datasets. >> >> Is there any way I can achieve the same result but in a quicker, more >> efficient way? >> >> Thank you. >> > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > -- Enrico Ferrero PhD Student Department of Genetics Cambridge Systems Biology Centre University of Cambridge

ADD REPLY • link 10.8 years ago enricoferrero ▴ 660

Entering edit mode

Hi Enrico, On 7/25/2013 12:56 PM, Enrico Ferrero wrote: > Dear James, > > Thanks very much for your prompt reply. > I knew the problem was the for loop and the select function is indeed > a lot faster than that and works perfectly with toy data. > > However, this is what happens when I try to use it with real data: > >> test<- select(org.Hs.eg.db, keys=df$GeneSymbol, keytype="ALIAS", cols=c("SYMBOL","ENTREZID","ENSEMBL")) > Warning message: > In .generateExtraRows(tab, keys, jointype) : > 'select' and duplicate query keys resulted in 1:many mapping between > keys and return rows > > which is probably the warning you mentioned. That's not the warning I mentioned, but it does point out the same issue, which is that there is a one to many mapping between symbol and entrez gene ID. So now you have to decide if you want to be naive (or stupid, depending on your perspective) or not. You could just cover your eyes and do this: first.two <- first.two[!duplicated(first.two$SYMBOL),] which will choose for you the first symbol -> gene ID mapping and nuke the rest. That's nice and quick, but you are making huge assumptions. Or you could decide to be a bit more sophisticated and do something like thelst <- tapply(1:nrow(first.two), first.two$SYMBOL, function(x) first.two[x,]) At this point you can take a look at e.g., thelst[1:10] to see what we just did thelst <- do.call("rbind", lapply(thelst, function(x) c(x[1,1], paste(x[,2], collapse = "|"))) and here you can look at head(thelst). Then you can check to ensure that the first column of thelst is identical to the first column of df, and proceed as before. But there is still the problem of the multiple mappings. As an example: > thelst[1:5] $HBD SYMBOL ENTREZID 2535 HBD 3045 2536 HBD 100187828 $KIR3DL3 SYMBOL ENTREZID 17513 KIR3DL3 115653 17514 KIR3DL3 100133046 > mget(as.character(thelst[[1]][,2]), org.Hs.egGENENAME) $`3045` [1] "hemoglobin, delta" $`100187828` [1] "hypophosphatemic bone disease" > mget(as.character(thelst[[2]][,2]), org.Hs.egGENENAME) $`115653` [1] "killer cell immunoglobulin-like receptor, three domains, long cytoplasmic tail, 3" $`100133046` [1] "killer cell immunoglobulin-like receptor three domains long cytoplasmic tail 3" So HBD is the gene symbol for two different genes! If this gene symbol is in your data, you will now have attributed your data to two genes that apparently are not remotely similar. if KIR3DL3 is in your data, then it worked out OK for that gene. Best, Jim > > The real problem is that the number of rows is now different for the 2 objects: >> nrow(df); nrow(test) > [1] 573 > [1] 201 > > So I obviously can't put the new data into the original df. My > impression is that when the 1 to many mapping arises, the select > functions exits, with that warning message. As a result, my test > object is incomplete. > > On top of that, and I can't really explain this, the row positions are > messed up, e.g. > >> all.equal(df[100,],test[100,]) > returns FALSE. > > How can I work around this? > > Thanks a lot! > > Best, > > On 25 July 2013 16:58, James W. MacDonald<jmacdon at="" uw.edu=""> wrote: >> Hi Enrico, >> >> >> On 7/25/2013 11:35 AM, Enrico Ferrero wrote: >>> Hello, >>> >>> I often have data frames where I need to perform ID conversions on one or >>> more of the columns while preserving the order of the rows, e.g.: >>> >>> GeneSymbol Value1 Value2 >>> GS1 2.5 0.1 >>> GS2 3 0.2 >>> .. >>> >>> And I want to obtain: >>> >>> GeneSymbol EntrezGeneID Value1 Value2 >>> GS1 EG1 2.5 0.1 >>> GS2 EG2 3 0.2 >>> .. >>> >>> What I've done so far was to create a function that uses org.Hs.eg.db to >>> loop over the rows of the column and does the conversion: >>> >>> library(org.Hs.eg.db) >>> alias2EG<- function(x) { >>> for (i in 1:length(x)) { >>> if (!is.na(x[i])) { >>> repl<- org.Hs.egALIAS2EG[[x[i]]][1] >>> if (!is.null(repl)) { >>> x[i]<- repl >>> } >>> else { >>> x[i]<- NA >>> } >>> } >>> } >>> return(x) >>> } >> >> I should first note that gene symbols are not unique, so you are taking a >> chance on your mappings. Is there no other annotation for your data? >> >> In addition, you should note that it is almost always better to think of >> objects as vectors and matrices in R, rather than as things that need to be >> looped over (e.g., R isn't Perl or C). >> >> first.two<- select(org.Hs.eg.db, as.character(df$GeneSymbol), "ENTREZID", >> "SYMBOL") >> >> Note that there used to be a warning or an error (don't remember which) when >> you did something like this, stating that gene symbols are not unique, and >> that you shouldn't do this sort of thing. Apparently this warning has been >> removed, but the issue remains valid. >> >> ## check yourself >> >> all.equal(df$GeneSymbol, first.two$SYMBOL) >> >> ## if true, proceed >> >> df<- data.frame(first.two, df[,-1]) >> >> Best, >> >> Jim >> >> >> >>> and then call the function like this: >>> >>> df$EntrezGeneID<- alias2GS(df$GeneSymbol) >>> >>> This works well, but gets very slow when I need to do multiple conversions >>> on large datasets. >>> >>> Is there any way I can achieve the same result but in a quicker, more >>> efficient way? >>> >>> Thank you. >>> >> -- >> James W. MacDonald, M.S. >> Biostatistician >> University of Washington >> Environmental and Occupational Health Sciences >> 4225 Roosevelt Way NE, # 100 >> Seattle WA 98105-6099 >> > > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD REPLY • link 10.8 years ago James W. MacDonald 65k

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 10 hours ago

United States

Hi Enrico, Please don't take things off-list (e.g., use reply-all). On 7/25/2013 2:17 PM, Enrico Ferrero wrote: > Hi James, > > Thanks very much for your help. > There is an issue that needs to be solved before thinking about what's > the best approach in my opinion. > > I don't understand why, but the object created with the call to select > (test in my example, first.two in yours) has a different number of > rows from the original object (df in my example). Specifically it has > *less* rows. If all symbols were converted to all possible Entrez IDs, > I would expect it to have more rows, not less. To me, it looks like > not all rows are looked up and returned. > > Do you see what I mean? Sure. You could be using outdated gene symbols. Or perhaps you are using a mixture of symbols and aliases. Which is even cooler than just all symbols: > symb <- c(Rkeys(org.Hs.egSYMBOL)[1:10], Rkeys(org.Hs.egALIAS2EG)[31:45]) > symb [1] "A1BG" "A2M" "A2MP1" "NAT1" "NAT2" "AACP" [7] "SERPINA3" "AADAC" "AAMP" "AANAT" "AAMP" "AANAT" [13] "DSPS" "SNAT" "AARS" "CMT2N" "AAV" "AAVS1" [19] "ABAT" "GABA-AT" "GABAT" "NPD009" "ABC-1" "ABC1" [25] "ABCA1" > select(org.Hs.eg.db, symb, "ENTREZID","SYMBOL") SYMBOL ENTREZID 1 A1BG 1 2 A2M 2 3 A2MP1 3 4 NAT1 9 5 NAT2 10 6 AACP 11 7 SERPINA3 12 8 AADAC 13 9 AAMP 14 10 AANAT 15 11 AAMP 14 12 AANAT 15 13 DSPS <na> 14 SNAT <na> 15 AARS 16 16 CMT2N <na> 17 AAV <na> 18 AAVS1 17 19 ABAT 18 20 GABA-AT <na> 21 GABAT <na> 22 NPD009 <na> 23 ABC-1 <na> 24 ABC1 <na> 25 ABCA1 19 > select(org.Hs.eg.db, symb, "ENTREZID","ALIAS") ALIAS ENTREZID 1 A1BG 1 2 A2M 2 3 A2MP1 3 4 NAT1 9 5 NAT1 1982 6 NAT1 6530 7 NAT1 10991 8 NAT2 10 9 NAT2 81539 10 AACP 11 11 SERPINA3 12 12 AADAC 13 13 AAMP 14 14 AANAT 15 15 DSPS 15 16 SNAT 15 17 AARS 16 18 CMT2N 16 19 AAV 17 20 AAVS1 17 21 ABAT 18 22 GABA-AT 18 23 GABAT 18 24 NPD009 18 25 ABC-1 19 26 ABC1 19 27 ABC1 63897 28 ABCA1 19 Warning message: In .generateExtraRows(tab, keys, jointype) : 'select' and duplicate query keys resulted in 1:many mapping between keys and return rows > mget(c("1982","6530","10991"), org.Hs.egGENENAME) $`1982` [1] "eukaryotic translation initiation factor 4 gamma, 2" $`6530` [1] "solute carrier family 6 (neurotransmitter transporter, noradrenalin), member 2" $`10991` [1] "solute carrier family 38, member 3" Best, Jim > > On 25 July 2013 18:17, James W. MacDonald<jmacdon at="" uw.edu=""> wrote: >> Hi Enrico, >> >> >> On 7/25/2013 12:56 PM, Enrico Ferrero wrote: >>> Dear James, >>> >>> Thanks very much for your prompt reply. >>> I knew the problem was the for loop and the select function is indeed >>> a lot faster than that and works perfectly with toy data. >>> >>> However, this is what happens when I try to use it with real data: >>> >>>> test<- select(org.Hs.eg.db, keys=df$GeneSymbol, keytype="ALIAS", >>>> cols=c("SYMBOL","ENTREZID","ENSEMBL")) >>> Warning message: >>> In .generateExtraRows(tab, keys, jointype) : >>> 'select' and duplicate query keys resulted in 1:many mapping between >>> keys and return rows >>> >>> which is probably the warning you mentioned. >> >> That's not the warning I mentioned, but it does point out the same issue, >> which is that there is a one to many mapping between symbol and entrez gene >> ID. >> >> So now you have to decide if you want to be naive (or stupid, depending on >> your perspective) or not. You could just cover your eyes and do this: >> >> first.two<- first.two[!duplicated(first.two$SYMBOL),] >> >> which will choose for you the first symbol -> gene ID mapping and nuke the >> rest. That's nice and quick, but you are making huge assumptions. >> >> Or you could decide to be a bit more sophisticated and do something like >> >> thelst<- tapply(1:nrow(first.two), first.two$SYMBOL, function(x) >> first.two[x,]) >> >> At this point you can take a look at e.g., thelst[1:10] to see what we just >> did >> >> thelst<- do.call("rbind", lapply(thelst, function(x) c(x[1,1], paste(x[,2], >> collapse = "|"))) >> >> and here you can look at head(thelst). >> >> Then you can check to ensure that the first column of thelst is identical to >> the first column of df, and proceed as before. >> >> But there is still the problem of the multiple mappings. As an example: >> >>> thelst[1:5] >> $HBD >> SYMBOL ENTREZID >> 2535 HBD 3045 >> 2536 HBD 100187828 >> >> $KIR3DL3 >> SYMBOL ENTREZID >> 17513 KIR3DL3 115653 >> 17514 KIR3DL3 100133046 >> >>> mget(as.character(thelst[[1]][,2]), org.Hs.egGENENAME) >> $`3045` >> [1] "hemoglobin, delta" >> >> $`100187828` >> [1] "hypophosphatemic bone disease" >> >>> mget(as.character(thelst[[2]][,2]), org.Hs.egGENENAME) >> $`115653` >> [1] "killer cell immunoglobulin-like receptor, three domains, long >> cytoplasmic tail, 3" >> >> $`100133046` >> [1] "killer cell immunoglobulin-like receptor three domains long cytoplasmic >> tail 3" >> >> >> So HBD is the gene symbol for two different genes! If this gene symbol is in >> your data, you will now have attributed your data to two genes that >> apparently are not remotely similar. if KIR3DL3 is in your data, then it >> worked out OK for that gene. >> >> Best, >> >> Jim >> >> >> >> >> >>> The real problem is that the number of rows is now different for the 2 >>> objects: >>>> nrow(df); nrow(test) >>> [1] 573 >>> [1] 201 >>> >>> So I obviously can't put the new data into the original df. My >>> impression is that when the 1 to many mapping arises, the select >>> functions exits, with that warning message. As a result, my test >>> object is incomplete. >>> >>> On top of that, and I can't really explain this, the row positions are >>> messed up, e.g. >>> >>>> all.equal(df[100,],test[100,]) >>> returns FALSE. >>> >>> How can I work around this? >>> >>> Thanks a lot! >>> >>> Best, >>> >>> On 25 July 2013 16:58, James W. MacDonald<jmacdon at="" uw.edu=""> wrote: >>>> Hi Enrico, >>>> >>>> >>>> On 7/25/2013 11:35 AM, Enrico Ferrero wrote: >>>>> Hello, >>>>> >>>>> I often have data frames where I need to perform ID conversions on one >>>>> or >>>>> more of the columns while preserving the order of the rows, e.g.: >>>>> >>>>> GeneSymbol Value1 Value2 >>>>> GS1 2.5 0.1 >>>>> GS2 3 0.2 >>>>> .. >>>>> >>>>> And I want to obtain: >>>>> >>>>> GeneSymbol EntrezGeneID Value1 Value2 >>>>> GS1 EG1 2.5 0.1 >>>>> GS2 EG2 3 0.2 >>>>> .. >>>>> >>>>> What I've done so far was to create a function that uses org.Hs.eg.db to >>>>> loop over the rows of the column and does the conversion: >>>>> >>>>> library(org.Hs.eg.db) >>>>> alias2EG<- function(x) { >>>>> for (i in 1:length(x)) { >>>>> if (!is.na(x[i])) { >>>>> repl<- org.Hs.egALIAS2EG[[x[i]]][1] >>>>> if (!is.null(repl)) { >>>>> x[i]<- repl >>>>> } >>>>> else { >>>>> x[i]<- NA >>>>> } >>>>> } >>>>> } >>>>> return(x) >>>>> } >>>> >>>> I should first note that gene symbols are not unique, so you are taking a >>>> chance on your mappings. Is there no other annotation for your data? >>>> >>>> In addition, you should note that it is almost always better to think of >>>> objects as vectors and matrices in R, rather than as things that need to >>>> be >>>> looped over (e.g., R isn't Perl or C). >>>> >>>> first.two<- select(org.Hs.eg.db, as.character(df$GeneSymbol), "ENTREZID", >>>> "SYMBOL") >>>> >>>> Note that there used to be a warning or an error (don't remember which) >>>> when >>>> you did something like this, stating that gene symbols are not unique, >>>> and >>>> that you shouldn't do this sort of thing. Apparently this warning has >>>> been >>>> removed, but the issue remains valid. >>>> >>>> ## check yourself >>>> >>>> all.equal(df$GeneSymbol, first.two$SYMBOL) >>>> >>>> ## if true, proceed >>>> >>>> df<- data.frame(first.two, df[,-1]) >>>> >>>> Best, >>>> >>>> Jim >>>> >>>> >>>> >>>>> and then call the function like this: >>>>> >>>>> df$EntrezGeneID<- alias2GS(df$GeneSymbol) >>>>> >>>>> This works well, but gets very slow when I need to do multiple >>>>> conversions >>>>> on large datasets. >>>>> >>>>> Is there any way I can achieve the same result but in a quicker, more >>>>> efficient way? >>>>> >>>>> Thank you. >>>>> >>>> -- >>>> James W. MacDonald, M.S. >>>> Biostatistician >>>> University of Washington >>>> Environmental and Occupational Health Sciences >>>> 4225 Roosevelt Way NE, # 100 >>>> Seattle WA 98105-6099 >>>> >>> >> -- >> James W. MacDonald, M.S. >> Biostatistician >> University of Washington >> Environmental and Occupational Health Sciences >> 4225 Roosevelt Way NE, # 100 >> Seattle WA 98105-6099 >> > > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 10.8 years ago James W. MacDonald 65k

Entering edit mode

Hi Enrico, On 07/25/2013 01:20 PM, James W. MacDonald wrote: > Hi Enrico, > > Please don't take things off-list (e.g., use reply-all). > > > On 7/25/2013 2:17 PM, Enrico Ferrero wrote: >> Hi James, >> >> Thanks very much for your help. >> There is an issue that needs to be solved before thinking about what's >> the best approach in my opinion. >> >> I don't understand why, but the object created with the call to select >> (test in my example, first.two in yours) has a different number of >> rows from the original object (df in my example). Specifically it has >> *less* rows. I'm surprised it has less rows. It can definitely have more, when some of the keys passed to select() are mapped to more than 1 row, but my understanding was that select() would propagate unmapped keys to the output by placing them in rows stuffed with NAs. So maybe I misunderstood how select() works, or its behavior was changed, or there is a bug somewhere. Could you please send the code that allows us to reproduce this? Thanks. H. > If all symbols were converted to all possible Entrez IDs, >> I would expect it to have more rows, not less. To me, it looks like >> not all rows are looked up and returned. >> >> Do you see what I mean? > > Sure. You could be using outdated gene symbols. Or perhaps you are using > a mixture of symbols and aliases. Which is even cooler than just all > symbols: > > > symb <- c(Rkeys(org.Hs.egSYMBOL)[1:10], Rkeys(org.Hs.egALIAS2EG)[31:45]) > > symb > [1] "A1BG" "A2M" "A2MP1" "NAT1" "NAT2" "AACP" > [7] "SERPINA3" "AADAC" "AAMP" "AANAT" "AAMP" "AANAT" > [13] "DSPS" "SNAT" "AARS" "CMT2N" "AAV" "AAVS1" > [19] "ABAT" "GABA-AT" "GABAT" "NPD009" "ABC-1" "ABC1" > [25] "ABCA1" > > select(org.Hs.eg.db, symb, "ENTREZID","SYMBOL") > SYMBOL ENTREZID > 1 A1BG 1 > 2 A2M 2 > 3 A2MP1 3 > 4 NAT1 9 > 5 NAT2 10 > 6 AACP 11 > 7 SERPINA3 12 > 8 AADAC 13 > 9 AAMP 14 > 10 AANAT 15 > 11 AAMP 14 > 12 AANAT 15 > 13 DSPS <na> > 14 SNAT <na> > 15 AARS 16 > 16 CMT2N <na> > 17 AAV <na> > 18 AAVS1 17 > 19 ABAT 18 > 20 GABA-AT <na> > 21 GABAT <na> > 22 NPD009 <na> > 23 ABC-1 <na> > 24 ABC1 <na> > 25 ABCA1 19 > > select(org.Hs.eg.db, symb, "ENTREZID","ALIAS") > ALIAS ENTREZID > 1 A1BG 1 > 2 A2M 2 > 3 A2MP1 3 > 4 NAT1 9 > 5 NAT1 1982 > 6 NAT1 6530 > 7 NAT1 10991 > 8 NAT2 10 > 9 NAT2 81539 > 10 AACP 11 > 11 SERPINA3 12 > 12 AADAC 13 > 13 AAMP 14 > 14 AANAT 15 > 15 DSPS 15 > 16 SNAT 15 > 17 AARS 16 > 18 CMT2N 16 > 19 AAV 17 > 20 AAVS1 17 > 21 ABAT 18 > 22 GABA-AT 18 > 23 GABAT 18 > 24 NPD009 18 > 25 ABC-1 19 > 26 ABC1 19 > 27 ABC1 63897 > 28 ABCA1 19 > Warning message: > In .generateExtraRows(tab, keys, jointype) : > 'select' and duplicate query keys resulted in 1:many mapping between > keys and return rows > > mget(c("1982","6530","10991"), org.Hs.egGENENAME) > $`1982` > [1] "eukaryotic translation initiation factor 4 gamma, 2" > > $`6530` > [1] "solute carrier family 6 (neurotransmitter transporter, > noradrenalin), member 2" > > $`10991` > [1] "solute carrier family 38, member 3" > > Best, > > Jim > >> >> On 25 July 2013 18:17, James W. MacDonald<jmacdon at="" uw.edu=""> wrote: >>> Hi Enrico, >>> >>> >>> On 7/25/2013 12:56 PM, Enrico Ferrero wrote: >>>> Dear James, >>>> >>>> Thanks very much for your prompt reply. >>>> I knew the problem was the for loop and the select function is indeed >>>> a lot faster than that and works perfectly with toy data. >>>> >>>> However, this is what happens when I try to use it with real data: >>>> >>>>> test<- select(org.Hs.eg.db, keys=df$GeneSymbol, keytype="ALIAS", >>>>> cols=c("SYMBOL","ENTREZID","ENSEMBL")) >>>> Warning message: >>>> In .generateExtraRows(tab, keys, jointype) : >>>> 'select' and duplicate query keys resulted in 1:many mapping >>>> between >>>> keys and return rows >>>> >>>> which is probably the warning you mentioned. >>> >>> That's not the warning I mentioned, but it does point out the same >>> issue, >>> which is that there is a one to many mapping between symbol and >>> entrez gene >>> ID. >>> >>> So now you have to decide if you want to be naive (or stupid, >>> depending on >>> your perspective) or not. You could just cover your eyes and do this: >>> >>> first.two<- first.two[!duplicated(first.two$SYMBOL),] >>> >>> which will choose for you the first symbol -> gene ID mapping and >>> nuke the >>> rest. That's nice and quick, but you are making huge assumptions. >>> >>> Or you could decide to be a bit more sophisticated and do something like >>> >>> thelst<- tapply(1:nrow(first.two), first.two$SYMBOL, function(x) >>> first.two[x,]) >>> >>> At this point you can take a look at e.g., thelst[1:10] to see what >>> we just >>> did >>> >>> thelst<- do.call("rbind", lapply(thelst, function(x) c(x[1,1], >>> paste(x[,2], >>> collapse = "|"))) >>> >>> and here you can look at head(thelst). >>> >>> Then you can check to ensure that the first column of thelst is >>> identical to >>> the first column of df, and proceed as before. >>> >>> But there is still the problem of the multiple mappings. As an example: >>> >>>> thelst[1:5] >>> $HBD >>> SYMBOL ENTREZID >>> 2535 HBD 3045 >>> 2536 HBD 100187828 >>> >>> $KIR3DL3 >>> SYMBOL ENTREZID >>> 17513 KIR3DL3 115653 >>> 17514 KIR3DL3 100133046 >>> >>>> mget(as.character(thelst[[1]][,2]), org.Hs.egGENENAME) >>> $`3045` >>> [1] "hemoglobin, delta" >>> >>> $`100187828` >>> [1] "hypophosphatemic bone disease" >>> >>>> mget(as.character(thelst[[2]][,2]), org.Hs.egGENENAME) >>> $`115653` >>> [1] "killer cell immunoglobulin-like receptor, three domains, long >>> cytoplasmic tail, 3" >>> >>> $`100133046` >>> [1] "killer cell immunoglobulin-like receptor three domains long >>> cytoplasmic >>> tail 3" >>> >>> >>> So HBD is the gene symbol for two different genes! If this gene >>> symbol is in >>> your data, you will now have attributed your data to two genes that >>> apparently are not remotely similar. if KIR3DL3 is in your data, then it >>> worked out OK for that gene. >>> >>> Best, >>> >>> Jim >>> >>> >>> >>> >>> >>>> The real problem is that the number of rows is now different for the 2 >>>> objects: >>>>> nrow(df); nrow(test) >>>> [1] 573 >>>> [1] 201 >>>> >>>> So I obviously can't put the new data into the original df. My >>>> impression is that when the 1 to many mapping arises, the select >>>> functions exits, with that warning message. As a result, my test >>>> object is incomplete. >>>> >>>> On top of that, and I can't really explain this, the row positions are >>>> messed up, e.g. >>>> >>>>> all.equal(df[100,],test[100,]) >>>> returns FALSE. >>>> >>>> How can I work around this? >>>> >>>> Thanks a lot! >>>> >>>> Best, >>>> >>>> On 25 July 2013 16:58, James W. MacDonald<jmacdon at="" uw.edu=""> wrote: >>>>> Hi Enrico, >>>>> >>>>> >>>>> On 7/25/2013 11:35 AM, Enrico Ferrero wrote: >>>>>> Hello, >>>>>> >>>>>> I often have data frames where I need to perform ID conversions on >>>>>> one >>>>>> or >>>>>> more of the columns while preserving the order of the rows, e.g.: >>>>>> >>>>>> GeneSymbol Value1 Value2 >>>>>> GS1 2.5 0.1 >>>>>> GS2 3 0.2 >>>>>> .. >>>>>> >>>>>> And I want to obtain: >>>>>> >>>>>> GeneSymbol EntrezGeneID Value1 Value2 >>>>>> GS1 EG1 2.5 0.1 >>>>>> GS2 EG2 3 0.2 >>>>>> .. >>>>>> >>>>>> What I've done so far was to create a function that uses >>>>>> org.Hs.eg.db to >>>>>> loop over the rows of the column and does the conversion: >>>>>> >>>>>> library(org.Hs.eg.db) >>>>>> alias2EG<- function(x) { >>>>>> for (i in 1:length(x)) { >>>>>> if (!is.na(x[i])) { >>>>>> repl<- org.Hs.egALIAS2EG[[x[i]]][1] >>>>>> if (!is.null(repl)) { >>>>>> x[i]<- repl >>>>>> } >>>>>> else { >>>>>> x[i]<- NA >>>>>> } >>>>>> } >>>>>> } >>>>>> return(x) >>>>>> } >>>>> >>>>> I should first note that gene symbols are not unique, so you are >>>>> taking a >>>>> chance on your mappings. Is there no other annotation for your data? >>>>> >>>>> In addition, you should note that it is almost always better to >>>>> think of >>>>> objects as vectors and matrices in R, rather than as things that >>>>> need to >>>>> be >>>>> looped over (e.g., R isn't Perl or C). >>>>> >>>>> first.two<- select(org.Hs.eg.db, as.character(df$GeneSymbol), >>>>> "ENTREZID", >>>>> "SYMBOL") >>>>> >>>>> Note that there used to be a warning or an error (don't remember >>>>> which) >>>>> when >>>>> you did something like this, stating that gene symbols are not unique, >>>>> and >>>>> that you shouldn't do this sort of thing. Apparently this warning has >>>>> been >>>>> removed, but the issue remains valid. >>>>> >>>>> ## check yourself >>>>> >>>>> all.equal(df$GeneSymbol, first.two$SYMBOL) >>>>> >>>>> ## if true, proceed >>>>> >>>>> df<- data.frame(first.two, df[,-1]) >>>>> >>>>> Best, >>>>> >>>>> Jim >>>>> >>>>> >>>>> >>>>>> and then call the function like this: >>>>>> >>>>>> df$EntrezGeneID<- alias2GS(df$GeneSymbol) >>>>>> >>>>>> This works well, but gets very slow when I need to do multiple >>>>>> conversions >>>>>> on large datasets. >>>>>> >>>>>> Is there any way I can achieve the same result but in a quicker, more >>>>>> efficient way? >>>>>> >>>>>> Thank you. >>>>>> >>>>> -- >>>>> James W. MacDonald, M.S. >>>>> Biostatistician >>>>> University of Washington >>>>> Environmental and Occupational Health Sciences >>>>> 4225 Roosevelt Way NE, # 100 >>>>> Seattle WA 98105-6099 >>>>> >>>> >>> -- >>> James W. MacDonald, M.S. >>> Biostatistician >>> University of Washington >>> Environmental and Occupational Health Sciences >>> 4225 Roosevelt Way NE, # 100 >>> Seattle WA 98105-6099 >>> >> >> > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319