Quickest way to convert IDs in a data frame?
enricoferrero ▴ 660
@enricoferrero-6037
Last seen 3.1 years ago
Switzerland
Hello,

I often have data frames where I need to perform ID conversions on one or more of the columns while preserving the order of the rows, e.g.:

    GeneSymbol  Value1  Value2
    GS1         2.5     0.1
    GS2         3       0.2
    ..

And I want to obtain:

    GeneSymbol  EntrezGeneID  Value1  Value2
    GS1         EG1           2.5     0.1
    GS2         EG2           3       0.2
    ..

What I've done so far is to write a function that uses org.Hs.eg.db to loop over the elements of the column and convert them one at a time:

    library(org.Hs.eg.db)

    alias2EG <- function(x) {
      for (i in seq_along(x)) {
        if (!is.na(x[i])) {
          # keep only the first Entrez Gene ID listed for this alias
          repl <- org.Hs.egALIAS2EG[[x[i]]][1]
          if (!is.null(repl)) {
            x[i] <- repl
          } else {
            x[i] <- NA
          }
        }
      }
      return(x)
    }

and then call the function like this:

    df$EntrezGeneID <- alias2EG(df$GeneSymbol)

This works well, but gets very slow when I need to do multiple conversions on large datasets.

Is there any way I can achieve the same result in a quicker, more efficient way?

Thank you.

--
Enrico Ferrero
PhD Student
Department of Genetics
Cambridge Systems Biology Centre
University of Cambridge
@james-w-macdonald-5106
Last seen 8 hours ago
United States
Hi Enrico,

On 7/25/2013 11:35 AM, Enrico Ferrero wrote:
> What I've done so far was to create a function that uses org.Hs.eg.db to
> loop over the rows of the column and does the conversion:
> [...]

I should first note that gene symbols are not unique, so you are taking a chance on your mappings. Is there no other annotation for your data?

In addition, it is almost always better to think of objects in R as vectors and matrices, rather than as things that need to be looped over (R isn't Perl or C). A single call to select() converts the whole column at once:

    first.two <- select(org.Hs.eg.db, as.character(df$GeneSymbol), "ENTREZID", "SYMBOL")

Note that there used to be a warning or an error (I don't remember which) when you did something like this, stating that gene symbols are not unique and that you shouldn't do this sort of thing. Apparently this warning has been removed, but the issue remains valid.

    ## check yourself
    all.equal(df$GeneSymbol, first.two$SYMBOL)

    ## if true, proceed
    df <- data.frame(first.two, df[,-1])

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
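
The vectorised, order-preserving lookup described above can be sketched in base R with match(). This is only an illustration under stated assumptions: the table `map` is a made-up stand-in for whatever select() returns, and the symbols and IDs are the placeholder names from the question, not real annotation.

```r
# Hypothetical lookup table standing in for the result of select();
# the symbols and IDs are placeholders, not real annotation.
map <- data.frame(
  SYMBOL   = c("GS2", "GS1", "GS3"),
  ENTREZID = c("EG2", "EG1", "EG3"),
  stringsAsFactors = FALSE
)

# Toy version of the data frame from the question, with one symbol
# that is absent from the lookup table (GS4) and one repeated (GS1).
df <- data.frame(
  GeneSymbol = c("GS1", "GS2", "GS4", "GS1"),
  Value1     = c(2.5, 3, 1.2, 0.7),
  Value2     = c(0.1, 0.2, 0.3, 0.4),
  stringsAsFactors = FALSE
)

# match() gives, for each symbol in df, the row of map that holds it
# (NA when the symbol is absent), so the row order of df is untouched.
idx <- match(df$GeneSymbol, map$SYMBOL)
df$EntrezGeneID <- map$ENTREZID[idx]

# Put the new column next to the symbol column.
df <- df[, c("GeneSymbol", "EntrezGeneID", "Value1", "Value2")]
print(df)
```

Because match() returns only the first hit, this silently picks one ID when a symbol maps to several, so it carries the same caveat about non-unique symbols raised in the answer.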
Dear James,

Thanks very much for your prompt reply. I knew the problem was the for loop; the select function is indeed a lot faster and works perfectly with toy data.

However, this is what happens when I try to use it with real data:

    > test <- select(org.Hs.eg.db, keys=df$GeneSymbol, keytype="ALIAS", cols=c("SYMBOL","ENTREZID","ENSEMBL"))
    Warning message:
    In .generateExtraRows(tab, keys, jointype) :
      'select' and duplicate query keys resulted in 1:many mapping between
      keys and return rows

which is probably the warning you mentioned.

The real problem is that the number of rows is now different for the two objects:

    > nrow(df); nrow(test)
    [1] 573
    [1] 201

So I obviously can't put the new data into the original df. My impression is that when the 1:many mapping arises, the select function exits with that warning message. As a result, my test object is incomplete.

On top of that, and I can't really explain this, the row positions are messed up, e.g.

    > all.equal(df[100,],test[100,])

returns FALSE.

How can I work around this?

Thanks a lot!

Best,

--
Enrico Ferrero
Hi Enrico,

On 7/25/2013 12:56 PM, Enrico Ferrero wrote:
> However, this is what happens when I try to use it with real data:
>
>> test <- select(org.Hs.eg.db, keys=df$GeneSymbol, keytype="ALIAS", cols=c("SYMBOL","ENTREZID","ENSEMBL"))
> Warning message:
> In .generateExtraRows(tab, keys, jointype) :
>   'select' and duplicate query keys resulted in 1:many mapping between
>   keys and return rows
>
> which is probably the warning you mentioned.

That's not the warning I mentioned, but it does point out the same issue, which is that there is a one-to-many mapping between symbol and Entrez Gene ID.

So now you have to decide if you want to be naive (or stupid, depending on your perspective) or not. You could just cover your eyes and do this:

    first.two <- first.two[!duplicated(first.two$SYMBOL),]

which will choose for you the first symbol -> gene ID mapping and nuke the rest. That's nice and quick, but you are making huge assumptions.

Or you could decide to be a bit more sophisticated and do something like

    thelst <- tapply(1:nrow(first.two), first.two$SYMBOL, function(x) first.two[x,])

At this point you can take a look at e.g. thelst[1:10] to see what we just did. Then collapse each group into one row, with the IDs pasted together:

    thelst <- do.call("rbind", lapply(thelst, function(x) c(x[1,1], paste(x[,2], collapse = "|"))))

and here you can look at head(thelst).

Then you can check that the first column of thelst is identical to the first column of df, and proceed as before.

But there is still the problem of the multiple mappings. As an example (taken from the list form of thelst, before the rbind step):

    > thelst[1:5]
    $HBD
         SYMBOL  ENTREZID
    2535    HBD      3045
    2536    HBD 100187828

    $KIR3DL3
           SYMBOL  ENTREZID
    17513 KIR3DL3    115653
    17514 KIR3DL3 100133046

    > mget(as.character(thelst[[1]][,2]), org.Hs.egGENENAME)
    $`3045`
    [1] "hemoglobin, delta"

    $`100187828`
    [1] "hypophosphatemic bone disease"

    > mget(as.character(thelst[[2]][,2]), org.Hs.egGENENAME)
    $`115653`
    [1] "killer cell immunoglobulin-like receptor, three domains, long cytoplasmic tail, 3"

    $`100133046`
    [1] "killer cell immunoglobulin-like receptor three domains long cytoplasmic tail 3"

So HBD is the gene symbol for two different genes! If this gene symbol is in your data, you will now have attributed your data to two genes that apparently are not remotely similar. If KIR3DL3 is in your data, then it worked out OK for that gene.

Best,

Jim
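
The collapse step above can also be written with split() and vapply(), which avoids building a list of data frames. The sketch below is an illustration only: the `first.two` table is rebuilt by hand from the IDs shown in the output above (plus A1BG -> 1 from elsewhere in the thread), and the `AMBIGUOUS` flag and the `symbols` vector are additions for this example, not part of the original answer.

```r
# Hand-built stand-in for the select() result, using IDs quoted in the thread.
first.two <- data.frame(
  SYMBOL   = c("HBD", "HBD", "KIR3DL3", "KIR3DL3", "A1BG"),
  ENTREZID = c("3045", "100187828", "115653", "100133046", "1"),
  stringsAsFactors = FALSE
)

# Group the Entrez IDs by symbol, then paste each group into one string.
ids_by_symbol <- split(first.two$ENTREZID, first.two$SYMBOL)
collapsed     <- vapply(ids_by_symbol, paste, character(1), collapse = "|")
n_ids         <- lengths(ids_by_symbol)

# Hypothetical symbol column from a data frame, in its original row order.
symbols <- c("A1BG", "KIR3DL3", "HBD", "A1BG")

# Index the collapsed vector by name: one output row per input row,
# in the same order, with a flag for symbols that map to several IDs.
annot <- data.frame(
  SYMBOL    = symbols,
  ENTREZID  = unname(collapsed[symbols]),
  AMBIGUOUS = unname(n_ids[symbols] > 1),
  stringsAsFactors = FALSE
)
print(annot)
```

Keeping the flag makes it easy to review the ambiguous symbols by hand later instead of silently accepting the first mapping.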
@james-w-macdonald-5106
Last seen 8 hours ago
United States
Hi Enrico,

Please don't take things off-list (e.g., use reply-all).

On 7/25/2013 2:17 PM, Enrico Ferrero wrote:
> I don't understand why, but the object created with the call to select
> (test in my example, first.two in yours) has a different number of
> rows from the original object (df in my example). Specifically it has
> *less* rows. If all symbols were converted to all possible Entrez IDs,
> I would expect it to have more rows, not less. To me, it looks like
> not all rows are looked up and returned.
>
> Do you see what I mean?

Sure. You could be using outdated gene symbols. Or perhaps you are using a mixture of symbols and aliases. Which is even cooler than just all symbols:

    > symb <- c(Rkeys(org.Hs.egSYMBOL)[1:10], Rkeys(org.Hs.egALIAS2EG)[31:45])
    > symb
     [1] "A1BG"     "A2M"      "A2MP1"    "NAT1"     "NAT2"     "AACP"
     [7] "SERPINA3" "AADAC"    "AAMP"     "AANAT"    "AAMP"     "AANAT"
    [13] "DSPS"     "SNAT"     "AARS"     "CMT2N"    "AAV"      "AAVS1"
    [19] "ABAT"     "GABA-AT"  "GABAT"    "NPD009"   "ABC-1"    "ABC1"
    [25] "ABCA1"

    > select(org.Hs.eg.db, symb, "ENTREZID","SYMBOL")
         SYMBOL ENTREZID
    1      A1BG        1
    2       A2M        2
    3     A2MP1        3
    4      NAT1        9
    5      NAT2       10
    6      AACP       11
    7  SERPINA3       12
    8     AADAC       13
    9      AAMP       14
    10    AANAT       15
    11     AAMP       14
    12    AANAT       15
    13     DSPS     <NA>
    14     SNAT     <NA>
    15     AARS       16
    16    CMT2N     <NA>
    17      AAV     <NA>
    18    AAVS1       17
    19     ABAT       18
    20  GABA-AT     <NA>
    21    GABAT     <NA>
    22   NPD009     <NA>
    23    ABC-1     <NA>
    24     ABC1     <NA>
    25    ABCA1       19

    > select(org.Hs.eg.db, symb, "ENTREZID","ALIAS")
          ALIAS ENTREZID
    1      A1BG        1
    2       A2M        2
    3     A2MP1        3
    4      NAT1        9
    5      NAT1     1982
    6      NAT1     6530
    7      NAT1    10991
    8      NAT2       10
    9      NAT2    81539
    10     AACP       11
    11 SERPINA3       12
    12    AADAC       13
    13     AAMP       14
    14    AANAT       15
    15     DSPS       15
    16     SNAT       15
    17     AARS       16
    18    CMT2N       16
    19      AAV       17
    20    AAVS1       17
    21     ABAT       18
    22  GABA-AT       18
    23    GABAT       18
    24   NPD009       18
    25    ABC-1       19
    26     ABC1       19
    27     ABC1    63897
    28    ABCA1       19
    Warning message:
    In .generateExtraRows(tab, keys, jointype) :
      'select' and duplicate query keys resulted in 1:many mapping between
      keys and return rows

    > mget(c("1982","6530","10991"), org.Hs.egGENENAME)
    $`1982`
    [1] "eukaryotic translation initiation factor 4 gamma, 2"

    $`6530`
    [1] "solute carrier family 6 (neurotransmitter transporter, noradrenalin), member 2"

    $`10991`
    [1] "solute carrier family 38, member 3"

Best,

Jim
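
One possible way to handle a mixture of official symbols and aliases, sketched here as an assumption rather than a recommendation from the thread, is a two-pass lookup: take an exact official-symbol hit first, and fall back to the alias table only when the alias points to exactly one gene. The two small tables are hand-copied subsets of the output above; the key "XYZ" and all variable names are hypothetical.

```r
# Subsets of the two select() results shown above, typed in by hand.
symbol_map <- data.frame(
  SYMBOL   = c("A1BG", "NAT1", "AANAT", "ABCA1"),
  ENTREZID = c("1", "9", "15", "19"),
  stringsAsFactors = FALSE
)
alias_map <- data.frame(
  ALIAS    = c("NAT1", "NAT1", "NAT1", "NAT1", "DSPS", "SNAT", "ABC1", "ABC1", "ABCA1"),
  ENTREZID = c("9", "1982", "6530", "10991", "15", "15", "19", "63897", "19"),
  stringsAsFactors = FALSE
)

# Keys to convert; "XYZ" is a made-up key that is in neither table.
keys <- c("NAT1", "DSPS", "ABC1", "A1BG", "XYZ")

# Pass 1: exact match on the official symbol.
entrez <- symbol_map$ENTREZID[match(keys, symbol_map$SYMBOL)]

# Pass 2: for keys still unmapped, accept an alias only if it maps to one ID.
alias_ids    <- split(alias_map$ENTREZID, alias_map$ALIAS)
unique_alias <- names(alias_ids)[lengths(alias_ids) == 1]
todo         <- is.na(entrez) & keys %in% unique_alias
entrez[todo] <- unlist(alias_ids[keys[todo]], use.names = FALSE)

result <- data.frame(key = keys, ENTREZID = entrez, stringsAsFactors = FALSE)
print(result)
```

With this rule NAT1 resolves to 9 through its official symbol, DSPS resolves to 15 through its single alias, and ABC1 stays NA because its alias is shared by two genes.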
Hi Enrico,

On 07/25/2013 01:20 PM, James W. MacDonald wrote:
> On 7/25/2013 2:17 PM, Enrico Ferrero wrote:
>> I don't understand why, but the object created with the call to select
>> (test in my example, first.two in yours) has a different number of
>> rows from the original object (df in my example). Specifically it has
>> *less* rows.

I'm surprised it has fewer rows. It can definitely have more, when some of the keys passed to select() are mapped to more than one row, but my understanding was that select() would propagate unmapped keys to the output by placing them in rows stuffed with NAs. So maybe I misunderstood how select() works, or its behavior was changed, or there is a bug somewhere. Could you please send the code that allows us to reproduce this?

Thanks.

H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
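
Before looking for a bug, it can help to compare the input keys with the returned rows. The sketch below uses invented data: `keys` and `res` are hypothetical stand-ins for the symbol column and the select() result, and the row counts are not the ones reported in the thread. It only shows which base R checks separate the possible causes (repeated input keys, keys missing from the result, keys returned with NA).

```r
# Hypothetical input keys, with one repeated key (GS2).
keys <- c("GS1", "GS2", "GS2", "GS3", "GS4")

# Hypothetical lookup result: one row per distinct key that came back.
res <- data.frame(
  ALIAS    = c("GS1", "GS2", "GS3"),
  ENTREZID = c("EG1", "EG2", NA),
  stringsAsFactors = FALSE
)

diagnostics <- list(
  n_keys            = length(keys),                       # rows in the input
  n_unique_keys     = length(unique(keys)),               # distinct keys asked for
  n_result_rows     = nrow(res),                          # rows that came back
  dropped           = setdiff(unique(keys), res$ALIAS),   # keys absent from the result
  unmapped          = res$ALIAS[is.na(res$ENTREZID)],     # keys returned with NA
  repeated_in_input = unique(keys[duplicated(keys)])      # keys given more than once
)
str(diagnostics)
```

If `n_result_rows` is close to `n_unique_keys` rather than `n_keys`, repeated keys in the input are a likely explanation for the shorter result; whether select() actually behaves that way in a given release is something to confirm against its documentation.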
Hi, Herv?, that's exactly what I'm trying to say. Attached to this email is a tab delimited file with two columns of GeneSymbols (or Aliases), and here is some simple code to reproduce the unexpected behaviour: library(org.Hs.eg.db) mydf <- read.table("testdata.txt", sep="\t", header=TRUE, as.is=TRUE) mytest <- select(org.Hs.eg.db, key=mydf$GeneSymbol1, keytype="ALIAS", cols=c("SYMBOL","ENTREZID","ENSEMBL")) # check that mytest has less rows than mydf nrow(mydf) nrow(mytest) # pick a random row: they don't match mydf[250,] mytest[250,] Ideally, mytest should have the same number and position of rows of mydf so that I can then cbind them. If mytest has more rows because of multiple mappings that's also fine: I can always use merge(mydf, mytest), right? Thanks a lot to both for your help, it's very appreciated. Best, On 25 July 2013 21:32, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: > Hi Enrico, > > > On 07/25/2013 01:20 PM, James W. MacDonald wrote: >> >> Hi Enrico, >> >> Please don't take things off-list (e.g., use reply-all). >> >> >> On 7/25/2013 2:17 PM, Enrico Ferrero wrote: >>> >>> Hi James, >>> >>> Thanks very much for your help. >>> There is an issue that needs to be solved before thinking about what's >>> the best approach in my opinion. >>> >>> I don't understand why, but the object created with the call to select >>> (test in my example, first.two in yours) has a different number of >>> rows from the original object (df in my example). Specifically it has >>> *less* rows. > > > I'm surprised it has less rows. It can definitely have more, when some > of the keys passed to select() are mapped to more than 1 row, but my > understanding was that select() would propagate unmapped keys to the > output by placing them in rows stuffed with NAs. So maybe I > misunderstood how select() works, or its behavior was changed, or > there is a bug somewhere. Could you please send the code that allows > us to reproduce this? Thanks. > > H. 
>
>
>> If all symbols were converted to all possible Entrez IDs,
>>>
>>> I would expect it to have more rows, not less. To me, it looks like
>>> not all rows are looked up and returned.
>>>
>>> Do you see what I mean?
>>
>>
>> Sure. You could be using outdated gene symbols. Or perhaps you are using
>> a mixture of symbols and aliases. Which is even cooler than just all
>> symbols:
>>
>> > symb <- c(Rkeys(org.Hs.egSYMBOL)[1:10],
>>             Rkeys(org.Hs.egALIAS2EG)[31:45])
>> > symb
>>  [1] "A1BG"     "A2M"      "A2MP1"    "NAT1"     "NAT2"     "AACP"
>>  [7] "SERPINA3" "AADAC"    "AAMP"     "AANAT"    "AAMP"     "AANAT"
>> [13] "DSPS"     "SNAT"     "AARS"     "CMT2N"    "AAV"      "AAVS1"
>> [19] "ABAT"     "GABA-AT"  "GABAT"    "NPD009"   "ABC-1"    "ABC1"
>> [25] "ABCA1"
>> > select(org.Hs.eg.db, symb, "ENTREZID","SYMBOL")
>>      SYMBOL ENTREZID
>> 1      A1BG        1
>> 2       A2M        2
>> 3     A2MP1        3
>> 4      NAT1        9
>> 5      NAT2       10
>> 6      AACP       11
>> 7  SERPINA3       12
>> 8     AADAC       13
>> 9      AAMP       14
>> 10    AANAT       15
>> 11     AAMP       14
>> 12    AANAT       15
>> 13     DSPS     <NA>
>> 14     SNAT     <NA>
>> 15     AARS       16
>> 16    CMT2N     <NA>
>> 17      AAV     <NA>
>> 18    AAVS1       17
>> 19     ABAT       18
>> 20  GABA-AT     <NA>
>> 21    GABAT     <NA>
>> 22   NPD009     <NA>
>> 23    ABC-1     <NA>
>> 24     ABC1     <NA>
>> 25    ABCA1       19
>> > select(org.Hs.eg.db, symb, "ENTREZID","ALIAS")
>>       ALIAS ENTREZID
>> 1      A1BG        1
>> 2       A2M        2
>> 3     A2MP1        3
>> 4      NAT1        9
>> 5      NAT1     1982
>> 6      NAT1     6530
>> 7      NAT1    10991
>> 8      NAT2       10
>> 9      NAT2    81539
>> 10     AACP       11
>> 11 SERPINA3       12
>> 12    AADAC       13
>> 13     AAMP       14
>> 14    AANAT       15
>> 15     DSPS       15
>> 16     SNAT       15
>> 17     AARS       16
>> 18    CMT2N       16
>> 19      AAV       17
>> 20    AAVS1       17
>> 21     ABAT       18
>> 22  GABA-AT       18
>> 23    GABAT       18
>> 24   NPD009       18
>> 25    ABC-1       19
>> 26     ABC1       19
>> 27     ABC1    63897
>> 28    ABCA1       19
>> Warning message:
>> In .generateExtraRows(tab, keys, jointype) :
>>   'select' and duplicate query keys resulted in 1:many mapping between
>>   keys and return rows
>> > mget(c("1982","6530","10991"), org.Hs.egGENENAME)
>> $`1982`
>> [1] "eukaryotic translation initiation factor 4 gamma, 2"
>>
>> $`6530`
>> [1] "solute carrier family 6 (neurotransmitter transporter, noradrenalin), member 2"
>>
>> $`10991`
>> [1] "solute carrier family 38, member 3"
>>
>> Best,
>>
>> Jim
>>
>>>
>>> On 25 July 2013 18:17, James W. MacDonald <jmacdon at uw.edu> wrote:
>>>>
>>>> Hi Enrico,
>>>>
>>>>
>>>> On 7/25/2013 12:56 PM, Enrico Ferrero wrote:
>>>>>
>>>>> Dear James,
>>>>>
>>>>> Thanks very much for your prompt reply.
>>>>> I knew the problem was the for loop and the select function is indeed
>>>>> a lot faster than that and works perfectly with toy data.
>>>>>
>>>>> However, this is what happens when I try to use it with real data:
>>>>>
>>>>> > test<- select(org.Hs.eg.db, keys=df$GeneSymbol, keytype="ALIAS",
>>>>>                 cols=c("SYMBOL","ENTREZID","ENSEMBL"))
>>>>>
>>>>> Warning message:
>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>   'select' and duplicate query keys resulted in 1:many mapping between
>>>>>   keys and return rows
>>>>>
>>>>> which is probably the warning you mentioned.
>>>>
>>>>
>>>> That's not the warning I mentioned, but it does point out the same issue,
>>>> which is that there is a one to many mapping between symbol and entrez gene
>>>> ID.
>>>>
>>>> So now you have to decide if you want to be naive (or stupid, depending on
>>>> your perspective) or not. You could just cover your eyes and do this:
>>>>
>>>> first.two<- first.two[!duplicated(first.two$SYMBOL),]
>>>>
>>>> which will choose for you the first symbol -> gene ID mapping and nuke the
>>>> rest. That's nice and quick, but you are making huge assumptions.
>>>>
>>>> Or you could decide to be a bit more sophisticated and do something like
>>>>
>>>> thelst<- tapply(1:nrow(first.two), first.two$SYMBOL, function(x)
>>>>                 first.two[x,])
>>>>
>>>> At this point you can take a look at e.g., thelst[1:10] to see what we just
>>>> did
>>>>
>>>> thelst<- do.call("rbind", lapply(thelst, function(x) c(x[1,1],
>>>>                  paste(x[,2], collapse = "|"))))
>>>>
>>>> and here you can look at head(thelst).
>>>>
>>>> Then you can check to ensure that the first column of thelst is identical to
>>>> the first column of df, and proceed as before.
>>>>
>>>> But there is still the problem of the multiple mappings. As an example:
>>>>
>>>> > thelst[1:5]
>>>>
>>>> $HBD
>>>>      SYMBOL  ENTREZID
>>>> 2535    HBD      3045
>>>> 2536    HBD 100187828
>>>>
>>>> $KIR3DL3
>>>>        SYMBOL  ENTREZID
>>>> 17513 KIR3DL3    115653
>>>> 17514 KIR3DL3 100133046
>>>>
>>>> > mget(as.character(thelst[[1]][,2]), org.Hs.egGENENAME)
>>>>
>>>> $`3045`
>>>> [1] "hemoglobin, delta"
>>>>
>>>> $`100187828`
>>>> [1] "hypophosphatemic bone disease"
>>>>
>>>> > mget(as.character(thelst[[2]][,2]), org.Hs.egGENENAME)
>>>>
>>>> $`115653`
>>>> [1] "killer cell immunoglobulin-like receptor, three domains, long cytoplasmic tail, 3"
>>>>
>>>> $`100133046`
>>>> [1] "killer cell immunoglobulin-like receptor three domains long cytoplasmic tail 3"
>>>>
>>>>
>>>> So HBD is the gene symbol for two different genes! If this gene symbol is in
>>>> your data, you will now have attributed your data to two genes that
>>>> apparently are not remotely similar. If KIR3DL3 is in your data, then it
>>>> worked out OK for that gene.
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>
>>>>> The real problem is that the number of rows is now different for the 2
>>>>> objects:
>>>>>
>>>>> > nrow(df); nrow(test)
>>>>>
>>>>> [1] 573
>>>>> [1] 201
>>>>>
>>>>> So I obviously can't put the new data into the original df. My
>>>>> impression is that when the 1 to many mapping arises, the select
>>>>> function exits, with that warning message. As a result, my test
>>>>> object is incomplete.
>>>>>
>>>>> On top of that, and I can't really explain this, the row positions are
>>>>> messed up, e.g.
>>>>>
>>>>> > all.equal(df[100,],test[100,])
>>>>>
>>>>> returns FALSE.
>>>>>
>>>>> How can I work around this?
>>>>>
>>>>> Thanks a lot!
>>>>>
>>>>> Best,
>>>>>
>>>> --
>>>> James W. MacDonald, M.S.
>>>> Biostatistician
>>>> University of Washington
>>>> Environmental and Occupational Health Sciences
>>>> 4225 Roosevelt Way NE, # 100
>>>> Seattle WA 98105-6099
>>>>
>>>
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O.
Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 -- Enrico Ferrero PhD Student Steve Russell Lab - Department of Genetics FlyChip - Cambridge Systems Biology Centre University of Cambridge e.ferrero at gen.cam.ac.uk http://flypress.gen.cam.ac.uk/ -------------- next part -------------- GeneSymbol1 GeneSymbol2 HTR1A NA HTR1B NA HTR1D NA HTR1E NA HTR1F NA HTR2A NA HTR2B NA HTR2C NA NA NA HTR4 NA HTR5A NA HTR5BP NA HTR6 NA HTR7 NA ALOX5 ALOX5AP ALOX5 NA ADCY5 NA ACACA NA ACACB NA SLC33A1 NA ADA NA ADK NA SLC1A5 NA GABRA1 NA CHRNA1 NA GLRA1 NA CHRNA10 NA GABRA2 NA GLRA2 NA CHRNA2 NA GABRA3 NA GLRA3 NA CHRNA3 NA GABRA4 NA CHRNA4 NA GABRA5 NA CHRNA5 NA GABRA6 NA CHRNA6 NA CHRNA7 NA CHRNA9 NA ADRA1D NA ADRA1B NA ADRA2A NA ADRA2B NA ADRA2C NA NA ADM NA ADM2 NA CALCA NA CALCB NA IAPP AKR1B1 NA ACE NA ACE2 NA APLNR APLN APLNR NA TNPO1 NA AQP1 NA AQP10 NA AQP2 NA AQP3 NA AQP5 NA AQP6 NA AQP7 NA AQP8 NA ASIC1 NA ASIC2 NA ASIC3 NA ADORA1 NA ADORA2A NA ADORA2B NA ADORA3 NA SLC6A14 NA AGTR1 AGT AGTR1 NA AGTR2 AGT AGTR2 NA AXL GAS6 AXL PROS1 SLC12A2 NA NMBR NA NMBR GRP NMBR NMB GRPR NA GRPR GRP GRPR NMB BRS3 NA GLRB NA GABRB1 NA GABRB2 NA GABRB3 NA ADRB1 NA ADRB2 NA ADRB3 NA SLC6A12 NA LTB4R NA LTB4R2 NA BDKRB1 KNG1 BDKRB1 NA BDKRB2 KNG1 BDKRB2 NA C3AR1 C3 C3AR1 NA C5AR1 NA C5AR1 C5 C5AR1 RPS19 C5AR2 C5 C5AR2 NA ANO1 NA NA CALM2 CASP1 NA CASP2 NA CASP3 NA CASP5 NA CASP6 NA CASP8 NA CASP9 NA CASR NA CACNA1S NA CACNA1C NA CACNA1D NA CACNA1F NA CACNA1A NA CACNA1B NA CACNA1E NA CACNA1G NA CACNA1H NA CACNA1I NA CATSPER1 NA CNR1 NA CNR2 NA CCKAR NA CCKBR NA CCKBR CCK CCR1 CCL13 CCR1 CCL8 CCR1 NA CCR1 CCL14 CCR1 CCL15 CCR1 CCL23 CCR1 CCL3 CCR1 CCL5 CCR1 CCL7 CCR1 CCL4 CCBP2 CCL27 CCBP2 ENC1 CCR2 NA CCR2 CCL11 CCR2 CCL13 CCR2 CCL16 CCR2 CCL2 CCR2 CCL7 CCR2 CCL8 CCR2 CCL24 CCR2 CCL26 CCR3 NA CCR3 CCL13 CCR3 CCL15 CCR3 CCL2 CCR3 CCL24 CCR3 CCL26 CCR3 ENC1 CCR3 CCL5 CCR3 CCL7 CCR3 CCL8 CCR3 CCL11 CCR3 CXCL10 CCR3 CXCL11 CCR3 
CXCL9 CCR4 NA CCR4 CCL17 CCR4 CCL22 CCR5 NA CCR5 CCL11 CCR5 CCL13 CCR5 CCL14 CCR5 CCL16 CCR5 CCL2 CCR5 CCL3 CCR5 CCL4 CCR5 CCL5 CCR5 CCL8 CCR5 CCL7 CCR6 DEFB4A CCR6 CCL20 CCR6 NA CCR7 CCL19 CCR7 CCL21 CCR7 NA CCR8 CCL17 CCR8 CCL1 CCR8 CCL16 CCR8 CCL4 CCR8 NA CCBP2 CCL25 CCBP2 NA CCRL2 NA CCRL2 CCL19 CFTR NA CMKLR1 NA CMKLR1 RARRES2 SLC44A1 NA SLC5A7 NA CNTFR CNTF NA CNTF CLCN1 NA CLCN2 NA CLCN3 NA CLCN4 NA CLCN6 NA CLCN7 NA CLCNKA NA CLCNKB NA MERTK GAS6 CNGA1 NA CNGA3 NA CSF1R CSF3 CSF1R CSF2 CSF1R CSF1 CSF2RA CSF2 NR1I3 NA PTGS1 NA PTGS2 NA CRHR1 CRH CRHR1 NA CRHR1 UCN CRHR1 UTS2 CRHR2 CRH CRHR2 NA CRHR2 UCN CRHR2 UTS2 CRHR2 UCN3 CALCR ADM CALCR ADM2 CALCR CALCA CALCR IAPP CALCR CALCB CALCR NA GJC3 NA GJB7 NA GJB2 NA GJB6 NA GJB4 NA GJB3 NA GJB5 NA GJD3 NA GJB1 NA GJD2 NA GJA4 NA GJA5 NA GJD4 NA GJA1 NA GJC1 NA GJA3 NA GJC2 NA GJA8 NA GJA9 NA CXCR1 YARS CXCR1 CXCL1 CXCR1 CXCL6 CXCR1 IL8 CXCR1 NA CXCR2 NA CXCR2 CXCL1 CXCR2 CXCL2 CXCR2 CXCL3 CXCR2 CXCL5 CXCR2 CXCL6 CXCR2 PPBP CXCR2 IL8 CXCR3 CCL11 CXCR3 CCL13 CXCR3 CCL19 CXCR3 CCL20 CXCR3 CCL5 CXCR3 CCL7 CXCR3 CXCL10 CXCR3 CXCL11 CXCR3 CXCL9 CXCR3 CXCL12 CXCR3 NA CXCR4 NA CXCR4 CXCL12 CXCR5 CXCL13 CXCR6 CXCL16 CXCR6 NA CXCR7 NA CXCR7 CXCL12 CX3CR1 CX3CL1 CX3CR1 NA CYSLTR1 NA CYSLTR2 NA CBS NA CTH NA SLC6A3 NA GABRD NA OPRD1 NA OPRD1 POMC OPRD1 PDYN OPRD1 PENK NAPEPLD NA DAGLA NA DPEP1 NA MVD NA PTGDR NA PTGDR2 NA DRD1 NA DRD2 NA DRD3 NA DRD4 NA DRD5 NA NT5E NA ECE1 NA EGFR AREG EGFR BTC EGFR EGF EGFR EPGN EGFR EREG EGFR HBEGF EGFR TGFA GABRE NA PTGER1 NA PTGER2 NA PTGER3 NA PTGER4 NA SLC29A1 NA SLC29A2 NA ESR1 NA ESR2 NA ESRRA NA ESRRB NA ESRRG NA EDNRA EDN1 EDNRA EDN2 EDNRA NA EDNRB NA EDNRB EDN3 SLC1A3 NA SLC1A2 NA SLC1A1 NA SLC1A6 NA SLC1A7 NA EPOR EPO EPOR NA EPOR IL1RN NR1H4 NA FDPS NA FAAH NA FAAH2 NA FFAR1 NA FFAR2 NA FFAR3 NA O3FAR1 NA FLT1 VEGFA FLT1 VEGFB FLT3 FLT3LG FLT4 VEGFC FLT4 FIGF FLT4 PDGFC SLC19A1 NA FPR1 NA FPR1 ANXA1 FPR2 NA FPR2 ANXA1 FPR3 ANXA1 FPR3 HEBP1 FPR3 NA FPR3 MT-RNR2 PTGFR NA FSHR 
CGA FSHR NA FZD6 WNT3A FZD6 WNT4 FZD6 WNT5A GABBR1 NA GALR1 NA GALR2 NA GABRG1 NA GABRG2 NA GABRG3 NA SLC6A1 NA SLC6A13 NA GFRA1 NA GGPS1 NA GHSR NA GHSR GHRL GHRHR NA GIPR GIP GIPR NA GLP1R NA GLP1R GCG GLP2R GCG GRIA1 NA GRIA2 NA GRIA3 NA GRIA4 NA GCGR GCG GCGR NA NR3C1 NA SLC2A1 NA SLC2A2 NA SLC2A3 NA SLC2A4 NA GRIK1 NA GRIK2 NA GRIK3 NA GRIK4 NA GRIK5 NA GRIN1 NA GRIN2A NA GRIN2B NA GRIN2C NA GRIN2D NA SLC6A9 NA SLC6A2 NA GNRHR2 NA GNRHR NA GNRHR GNRH1 GNRHR GNRH2 GPBAR1 NA GPER NA GPRC6A NA CSF3R CSF3 NA CSF2 GHR CSH1 GHR GH1 GHR GH2 GHR IL36RN GUCY2C NA HCAR1 NA HCAR2 NA HCAR3 NA HCN1 NA HCN2 NA HCN3 NA HCN4 NA HNF4A NA HDC NA DHX8 NA DHX15 NA HRH3 NA HRH4 NA HRH4 CCL16 HVCN1 NA HMGCR NA BAI1 NA GPR119 NA GPR12 NA GPR132 NA GPR143 NA GPR17 NA GPR183 NA GPR18 NA GPR1 NA GPR1 RARRES2 GPR32 NA GPR34 NA GPR35 NA GPR37 NA GPR3 NA GPR55 NA GPR63 NA GPR75 CCL5 GPR84 NA GPR87 NA LGR4 RSPO1 LGR4 RSPO2 LGR4 RSPO3 LGR4 RSPO4 LGR5 RSPO2 LGR5 RSPO1 LGR5 RSPO3 LGR5 RSPO4 LGR6 RSPO3 LGR6 RSPO1 LGR6 RSPO2 LGR6 RSPO4 MAS1 AGT MRGPRD NA MRGPRX1 NA MRGPRX2 NA MRGPRX2 CORT NOS2 NA NAAA NA IGF1R IGF1 IGF1R IGF2 INSR INS NA IFNA10 NA IFNA1 NA IFNA14 NA IFNA16 NA IFNA17 NA IFNA2 NA IFNA21 NA IFNA4 NA IFNA5 NA IFNA6 NA IFNA7 NA IFNA8 NA IFNB1 NA IFNK NA IFNW1 NA IFNG NA IL10 NA IL11 IL11RA IL11 NA IL12B IL12RB1 IL12B IL13RA1 IL13 IL13RA2 IL13 NA IL15 IL15RA IL15 NA IL17C NA IL17A NA IL17F IL18R1 IL18 NA IL18 IL1RL1 IL33 IL1RL2 IL36A IL1RL2 IL36B IL1RL2 IL36G IL1R1 IL1A IL1R1 IL1B IL1R1 NA IL1R1 IL1RN NA IL1A NA IL1B NA IL1RN NA IL19 NA IL20 NA IL24 IL21R IL21 NA IL22 IL23R IL12A NA IL12A NA IL17B NA C19orf10 NA IL28A NA IL28B NA IL29 NA IL2 IL2RA IL2 IL2RB IL15 IL2RB IL2 NA IL31 IL31RA IL31 NA IL33 NA IL36A NA IL36B NA IL36G NA IL36RN NA IL3 IL3RA IL3 IL4R IL13 IL4R IL4 NA IL4 NA IL13 NA IL5 IL5RA IL5 IL6R IL6 NA IL6 IL7R IL7 NA IL7 IL9R IL9 NA IL9 P2RY10 NA PTGIR NA ITPR1 NA ITPR2 NA ITPR3 NA IDI1 NA OPRK1 PDYN OPRK1 NA OPRK1 POMC SLC12A4 NA SLC12A5 NA SLC12A6 NA SLC12A7 NA SLC12A1 
NA KDR VEGFA KDR VEGFC KDR PDGFC KISS1R NA KISS1R KISS1 KCNK10 NA KCNK12 NA KCNK13 NA KCNK15 NA KCNK18 NA KCNK2 NA KCNK3 NA KCNK4 NA KCNK9 NA KCNMA1 NA KCNN1 NA KCNN2 NA KCNN3 NA KCNN4 NA KCNT1 NA KCNT2 NA KCNU1 NA KCNJ1 NA KCNJ2 NA KCNJ12 NA KCNJ4 NA KCNJ14 NA KCNJ3 NA KCNJ6 NA KCNJ9 NA KCNJ5 NA KCNJ10 NA KCNJ8 NA KCNJ11 NA KCNJ13 NA KCNH1 NA KCNA1 NA KCNH2 NA KCNH6 NA KCNA2 NA KCNH8 NA KCNH3 NA KCNA3 NA KCNA4 NA KCNA5 NA KCNA6 NA KCNA7 NA KCNA10 NA KCNB1 NA KCNB2 NA KCNC1 NA KCNC2 NA KCNC3 NA KCNC4 NA KCND1 NA KCNQ1 NA KCNQ2 NA KCNQ3 NA KCNQ4 NA KCNQ5 NA LEPR LEP NA CTF1 NA LIF NA OSM LIFR CTF1 LIFR LIF LIFR OSM LTA4H NA LHCGR CGA LHCGR LHB LHCGR NA NR1H3 NA NR1H2 NA LPAR1 NA LPAR2 NA LPAR3 NA LPAR4 NA LPAR5 NA LPAR6 NA PAH NA TPH1 NA TDO2 NA TH NA MST1R MST1 SLC47A2 NA MCHR1 NA MCHR1 PMCH MCHR2 NA MCHR2 PMCH MC1R POMC MC1R NA MC2R NA MC3R POMC MC3R NA MC3R AGRP MC4R POMC MC4R NA MC4R AGRP MC5R POMC MC5R NA MC5R AGRP MET HGF MVK NA GRM1 NA GRM2 NA GRM3 NA GRM4 NA GRM5 NA GRM6 NA GRM7 NA GRM8 NA NR3C2 NA SLC25A4 NA SLC25A1 NA MMP13 NA MMP2 NA MGLL NA MLNR NA MLNR MLN CHRM1 NA CHRM2 NA CHRM3 NA CHRM4 NA CHRM5 NA MTNR1A NA MTNR1B NA SLC47A1 NA OPRM1 NA OPRM1 PDYN OPRM1 PENK MYLK CALM2 MYLK2 CALM2 SLC12A3 NA NALCN NA SCN1A NA SCN2A NA SCN3A NA SCN4A NA SCN5A NA SCN8A NA SCN9A NA SCN10A NA SCN11A NA NGFR BDNF NGFR NTF3 NGFR NTF4 NGFR NGF NOS1 NA NTRK3 NTF3 MME NA TACR1 NA TACR1 TAC4 TACR1 TAC1 TACR1 TAC3 TACR2 NA TACR2 TAC4 TACR2 TAC1 TACR2 TAC3 TACR3 NA TACR3 TAC4 TACR3 TAC1 TACR3 TAC3 NOD1 NA NOD2 NA NLRP1 NA NMUR1 NMS NMUR1 NMU NMUR1 NA NMUR2 NMS NMUR2 NMU NMUR2 NA OPRL1 NA OPRL1 PNOC NPBWR1 NA NPBWR1 NPB NPBWR1 NPW NPBWR2 NA NPBWR2 NPB NPBWR2 NPW NPFFR1 NA NPFFR1 NPFF NPFFR1 NPVF NPFFR1 PPY NPFFR2 NA NPFFR2 NPFF NPFFR2 PPY NPFFR2 NPVF NPR1 NPPA NPR1 NPPB NPR1 NA NPR2 NPPC NPR2 NA NPR3 NA NPR3 OSTN NPSR1 NPS NPSR1 NA NTSR1 NA NTSR1 NTS NTSR2 NA NTSR2 NTS SLCO1A2 NA SLCO1B1 NA SLCO1B3 NA SLCO1C1 NA SLCO2A1 NA SLCO2B1 NA SLCO3A1 NA SLCO4A1 NA SLCO4C1 NA OSMR OSM ODC1 
NA OXTR NA OXTR OXT OXTR AVP OXER1 NA OXGR1 NA HCRTR1 NA HCRTR1 HCRT HCRTR2 NA HCRTR2 HCRT P2RX1 NA P2RX3 NA P2RX7 NA P2RY11 NA P2RY12 NA P2RY13 NA P2RY14 NA P2RY1 NA P2RY2 NA P2RY4 NA P2RY6 NA ADCYAP1R1 NA ADCYAP1R1 ADCYAP1 ADCYAP1R1 VIP PTAFR NA F2R NA F2RL1 NA F2RL3 NA PDE1A CALM2 PDE1A NA PDE1B CALM2 PDE1B NA PDE1C CALM2 PDE1C NA PDE2A NA PDE3A NA PDE3B NA PDE4A NA PDE4B NA PDE4C NA PDE4D NA PDE5A NA PDE7A NA PDE7B NA PDE8A NA PDE8B NA PDE9A NA SLC15A1 NA SLC15A2 NA SLC15A3 NA SLC15A4 NA PPARA NA PPARD NA PPARG NA SFI1 NA PMVK NA GABRP NA PROKR1 NA PROKR1 PROK1 PROKR1 PROK2 PROKR2 NA PROKR2 PROK1 PROKR2 PROK2 SLC29A4 NA PLD2 ARF1 PLD2 NA NR1I2 NA PGR NA SLC6A7 NA PRKCB NA PRKCZ NA PRKG1 NA SLC36A1 NA SLC36A2 NA SLC46A1 NA PRLHR PRLH PRLHR NA PRLHR PTHLH PRLHR NPY PTH1R NA PTH1R PTHLH PTH1R PTENP1 PTH2R NA PTH2R PTHLH PTH2R PTENP1 PANX1 NA PANX2 NA PANX3 NA QRFPR QRFP QRFPR NA RORA NA RORB NA RORC NA RARA NA RARB NA RARG NA RXRA NA RXRB NA RXRG NA NR1D1 NA NR1D2 NA RHAG NA RHCG NA GABRR1 NA GABRR2 NA GABRR3 NA RXFP1 NA RXFP1 RLN1 RXFP1 RLN2 RXFP1 RLN3 RXFP1 RLN RXFP2 NA RXFP2 RLN1 RXFP2 RLN2 RXFP2 RLN3 RXFP2 INSL3 RXFP2 RLN RXFP3 NA RXFP3 RLN3 RXFP3 INSL5 RXFP4 NA RXFP4 RLN3 RXFP4 INSL5 RYR1 NA RYR2 NA RYR3 NA S1PR1 NA S1PR2 NA S1PR3 NA S1PR4 NA S1PR5 NA AHCY NA SCTR FLNB SCTR VIP SCTR NA SLC6A4 NA SLC5A1 NA SLC5A2 NA SLC5A11 NA SLC5A8 NA SLC5A3 NA SLC38A1 NA SLC38A2 NA SLC38A3 NA SLC38A4 NA SLC38A5 NA SLC10A1 NA SLC10A2 NA SLC23A2 NA FDFT1 NA SSTR1 NA SSTR1 CORT SSTR1 SST SSTR2 NA SSTR2 CORT SSTR2 SST SSTR3 NA SSTR3 CORT SSTR3 SST SSTR4 NA SSTR4 CORT SSTR4 SST SSTR5 NA SSTR5 CORT SSTR5 SST SUCNR1 NA TAAR1 NA TEK ANGPT1 TEK ANGPT4 NR2C2 NA GABRQ NA SLC19A2 NA SLC19A3 NA MPL THPO NA TSLP THRA NA THRB NA TLR2 NA TLR3 NA TLR4 NA TLR5 NA TLR7 NA TLR8 NA TLR9 NA TNFRSF10A TNFSF10 TNFRSF10B TNFSF10 TNFRSF11A TNFSF11 BTF3P11 TNFSF11 TNFRSF25 TNFSF12 TNFRSF25 TNFSF15 TNFRSF12A TNFSF12 TNFRSF13B TNFSF13B TNFRSF13C TNFSF13B TNFRSF14 BTLA TNFRSF14 LTA TNFRSF14 TNFSF14 
TNFRSF17 TNFSF13 TNFRSF17 TNFSF13B TNFRSF18 TNFSF18 TNFRSF1A LTA TNFRSF1A TNF TNFRSF1B LTA TNFRSF1B TNF LTBR TNFSF14 LTBR LTB TNFRSF4 TNFSF4 CD40 CD40LG FAS FASLG CD27 CD70 TNFRSF8 TNFSF8 TNFRSF9 TNFSF9 TBXA2R NA TRHR NA TRPA1 NA TRPC1 NA TRPC3 NA TRPC4 NA TRPC5 NA TRPC6 NA TRPM2 NA TRPM1 NA CLU NA TRPM3 NA TRPM4 NA TRPM5 NA TRPM6 NA TRPM7 NA TRPM8 NA MCOLN3 NA PKD2 NA PKD2L1 NA TRPV1 NA TRPV2 NA TRPV3 NA TRPV4 NA TRPV5 NA TRPV6 NA TSHR NA TSHR TSHB TYRO3 GAS6 TYRO3 PROS1 UTS2R NA UTS2R UTS2 UTS2R UTS2D AKT1 NA ERBB4 BTC ERBB4 EREG ERBB4 HBEGF ERBB4 NRG1 ERBB4 NRG2 ERBB4 NRG3 ERBB4 NRG4 ERBB3 NRG1 ERBB3 NRG2 SLC18A3 NA SLC32A1 NA SLC18A1 NA SLC18A2 NA CYP27B1 NA KIT KITLG VIPR1 NA VIPR1 GHRH VIPR1 ADCYAP1 VIPR1 VIP VIPR2 NA VIPR2 ADCYAP1 VIPR2 VIP AVPR1A NA AVPR1A AVP AVPR1A OXT AVPR1B NA AVPR1B AVP AVPR1B OXT AVPR2 NA AVPR2 AVP AVPR2 OXT XCR1 XCL2 XCR1 NA XCR1 XCL1 NPY1R NA NPY1R NPY NPY2R NA NPY2R NPY NPY2R PYY PPYR1 NA PPYR1 NPY PPYR1 PPY PPYR1 PYY NPY5R NA NPY5R NPY NPY5R PPY NPY5R PYY ZACN NA
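On the cbind-versus-merge question in the message above, a minimal base R sketch with toy data may help. Everything in it is an assumption made for illustration: `mydf` and `lookup` are made-up stand-ins (the `lookup` table only mimics the shape of a select() result, with no row for the NA key and two rows for a key with a 1:many mapping), and collapsing multiple IDs with "|" follows the suggestion quoted earlier in the thread rather than anything select() does by itself. The sketch contrasts merge(), which returns every key/ID pair sorted by key, with split() plus match(), which keeps one row per input row in the original order.

```r
# Toy input: repeated keys, an NA key, and values that must keep their order.
mydf <- data.frame(
  GeneSymbol1 = c("GS2", NA, "GS1", "GS2", "GS3"),
  Value1      = c(3, 1.5, 2.5, 4, 0.7),
  stringsAsFactors = FALSE
)

# Hypothetical lookup shaped like a select() result:
# no row for the NA key, two rows for GS2, and an unmapped key (GS3).
lookup <- data.frame(
  ALIAS    = c("GS1", "GS2", "GS2", "GS3"),
  ENTREZID = c("EG1", "EG2a", "EG2b", NA),
  stringsAsFactors = FALSE
)

# merge() keeps every key/ID pair, drops the NA-key row and sorts by key,
# so the result no longer lines up row by row with mydf.
merged <- merge(mydf, lookup, by.x = "GeneSymbol1", by.y = "ALIAS")

# Order-preserving alternative: collapse the IDs of each key into one string...
ids_by_key <- split(lookup$ENTREZID, lookup$ALIAS)
collapsed <- vapply(ids_by_key, function(ids) {
  ids <- unique(ids[!is.na(ids)])
  if (length(ids) == 0) NA_character_ else paste(ids, collapse = "|")
}, character(1))

# ...then look each original key up by position; NA keys simply stay NA.
annotated <- mydf
annotated$EntrezGeneID <- unname(collapsed[match(mydf$GeneSymbol1, names(collapsed))])

print(annotated)
```

Whether collapsing is acceptable depends on the downstream analysis; it keeps the ambiguity visible in the table instead of silently picking the first ID.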
ADD REPLY
On 07/25/2013 02:06 PM, Enrico Ferrero wrote:
> Hi,
>
> Hervé, that's exactly what I'm trying to say.
>
> Attached to this email is a tab delimited file with two columns of
> GeneSymbols (or Aliases), and here is some simple code to reproduce
> the unexpected behaviour:
>
> library(org.Hs.eg.db)
> mydf <- read.table("testdata.txt", sep="\t", header=TRUE, as.is=TRUE)
> mytest <- select(org.Hs.eg.db, key=mydf$GeneSymbol1, keytype="ALIAS",
>                  cols=c("SYMBOL","ENTREZID","ENSEMBL"))
> # check that mytest has less rows than mydf
> nrow(mydf)
> nrow(mytest)

I see. You have a lot of NAs in the vector of keys you're passing to select():

  > table(is.na(mydf$GeneSymbol1))

  FALSE  TRUE
   1018    64

As indicated by the warning I get (you should have gotten it too), select() removes those NAs from the input:

  > mytest <- select(org.Hs.eg.db, key=mydf$GeneSymbol1, keytype="ALIAS",
                     cols=c("SYMBOL","ENTREZID","ENSEMBL"))
  Warning messages:
  1: In .select(x, keys, cols, keytype, jointype = jointype) :
    'NA' keys have been removed
  2: In .generateExtraRows(tab, keys, jointype) :
    'select' and duplicate query keys resulted in 1:many mapping between
    keys and return rows

which explains why the output has less rows than the length of the input.

Trying this again with a vector of keys of length 3:

  > select(org.Hs.eg.db, key=c("HTR1E", NA, "HTR1F"), keytype="ALIAS",
           cols=c("SYMBOL","ENTREZID","ENSEMBL"))
    ALIAS SYMBOL ENTREZID         ENSEMBL
  1 HTR1E  HTR1E     3354 ENSG00000168830
  2 HTR1F  HTR1F     3355 ENSG00000179097
  Warning message:
  In .select(x, keys, cols, keytype, jointype = jointype) :
    'NA' keys have been removed

> # pick a random row: they don't match
> mydf[250,]
> mytest[250,]

As a general rule, you cannot expect the rows of the output produced by select() to match the vector of keys passed to it, unless you know your keys are mapped to at most 1 value, which is not the case here.

Preserving positions between the input and output of select() could have been achieved by returning a DataFrame instead of a data.frame, and by using list-like columns, but I think what drove the current design was the desire to keep the returned object as simple as possible.

Cheers,
H.

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
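The list-like-column idea mentioned in this reply can be sketched in base R. This is only an illustration under stated assumptions: `res` is a hand-typed stand-in for a select() result (the IDs are copied from outputs shown in this thread, not queried from org.Hs.eg.db), and the names `keys`, `ids_by_key` and `out` are made up for the example. It is not how select() is implemented.

```r
# Keys as they might appear in a data frame column: an NA and a repeated key.
keys <- c("HTR1E", NA, "AGT", "HTR1F", "AGT")

# Hand-typed stand-in for a select() result: the NA key is gone and
# the key with a 1:many mapping (AGT) contributes two rows.
res <- data.frame(
  ALIAS    = c("HTR1E", "AGT", "AGT", "HTR1F"),
  ENTREZID = c("3354", "183", "189", "3355"),
  stringsAsFactors = FALSE
)

# Group the IDs by key, then pull one list element per original key.
ids_by_key <- split(res$ENTREZID, res$ALIAS)
idx <- match(keys, names(ids_by_key))
ids <- lapply(idx, function(i) if (is.na(i)) NA_character_ else ids_by_key[[i]])

# One row per input key, in input order; ENTREZID is a list column.
out <- data.frame(ALIAS = keys, stringsAsFactors = FALSE)
out$ENTREZID <- I(ids)

print(out)
lengths(out$ENTREZID)  # number of IDs attached to each row
```

Because `out` has exactly one row per element of `keys`, it can be bound column-wise to the original data frame; the cost is that the ID column is a list and has to be unpacked before most downstream steps.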
ADD REPLY
0
Entering edit mode
Hi Enrico and Herve,

This has to do with duplicate entries, but only when the duplicate entry
maps to many ENTREZID:

    > select(org.Hs.eg.db, rep("ADORA2A", 4), "ENTREZID", "ALIAS")
        ALIAS ENTREZID
    1 ADORA2A      135
    2 ADORA2A      135
    3 ADORA2A      135
    4 ADORA2A      135

    > select(org.Hs.eg.db, rep("AGT", 4), "ENTREZID", "ALIAS")
      ALIAS ENTREZID
    1   AGT      183
    2   AGT      189
    Warning message:
    In .generateExtraRows(tab, keys, jointype) :
      'select' and duplicate query keys resulted in 1:many mapping between
    keys and return rows

    > select(org.Hs.eg.db, "AGT", "ENTREZID", "ALIAS")
      ALIAS ENTREZID
    1   AGT      183
    2   AGT      189
    Warning message:
    In .generateExtraRows(tab, keys, jointype) :
      'select' resulted in 1:many mapping between keys and return rows

So in the instances where a gene symbol maps to more than one ENTREZID,
the output gets truncated, whereas if it is a one-to-one mapping, it
does not.

Best,

Jim

On 7/25/2013 5:06 PM, Enrico Ferrero wrote:
> Hi,
>
> Hervé, that's exactly what I'm trying to say.
>
> Attached to this email is a tab delimited file with two columns of
> GeneSymbols (or Aliases), and here is some simple code to reproduce
> the unexpected behaviour:
>
> library(org.Hs.eg.db)
> mydf <- read.table("testdata.txt", sep="\t", header=TRUE, as.is=TRUE)
> mytest <- select(org.Hs.eg.db, key=mydf$GeneSymbol1, keytype="ALIAS",
>                  cols=c("SYMBOL","ENTREZID","ENSEMBL"))
> # check that mytest has less rows than mydf
> nrow(mydf)
> nrow(mytest)
> # pick a random row: they don't match
> mydf[250,]
> mytest[250,]
>
> Ideally, mytest should have the same number and position of rows of
> mydf so that I can then cbind them.
> If mytest has more rows because of multiple mappings that's also fine:
> I can always use merge(mydf, mytest), right?
>
> Thanks a lot to both for your help, it's very appreciated.
> Best,
>
> On 25 July 2013 21:32, Hervé Pagès <hpages at fhcrc.org> wrote:
>> Hi Enrico,
>>
>> On 07/25/2013 01:20 PM, James W. MacDonald wrote:
>>> Hi Enrico,
>>>
>>> Please don't take things off-list (e.g., use reply-all).
>>>
>>> On 7/25/2013 2:17 PM, Enrico Ferrero wrote:
>>>> Hi James,
>>>>
>>>> Thanks very much for your help.
>>>> There is an issue that needs to be solved before thinking about what's
>>>> the best approach in my opinion.
>>>>
>>>> I don't understand why, but the object created with the call to select
>>>> (test in my example, first.two in yours) has a different number of
>>>> rows from the original object (df in my example). Specifically it has
>>>> *less* rows.
>>
>> I'm surprised it has less rows. It can definitely have more, when some
>> of the keys passed to select() are mapped to more than 1 row, but my
>> understanding was that select() would propagate unmapped keys to the
>> output by placing them in rows stuffed with NAs. So maybe I
>> misunderstood how select() works, or its behavior was changed, or
>> there is a bug somewhere. Could you please send the code that allows
>> us to reproduce this? Thanks.
>>
>> H.
>>
>>>> If all symbols were converted to all possible Entrez IDs,
>>>> I would expect it to have more rows, not less. To me, it looks like
>>>> not all rows are looked up and returned.
>>>>
>>>> Do you see what I mean?
>>>
>>> Sure. You could be using outdated gene symbols. Or perhaps you are using
>>> a mixture of symbols and aliases. Which is even cooler than just all
>>> symbols:
>>>
>>> > symb <- c(Rkeys(org.Hs.egSYMBOL)[1:10],
>>>             Rkeys(org.Hs.egALIAS2EG)[31:45])
>>> > symb
>>>  [1] "A1BG"     "A2M"      "A2MP1"    "NAT1"     "NAT2"     "AACP"
>>>  [7] "SERPINA3" "AADAC"    "AAMP"     "AANAT"    "AAMP"     "AANAT"
>>> [13] "DSPS"     "SNAT"     "AARS"     "CMT2N"    "AAV"      "AAVS1"
>>> [19] "ABAT"     "GABA-AT"  "GABAT"    "NPD009"   "ABC-1"    "ABC1"
>>> [25] "ABCA1"
>>> > select(org.Hs.eg.db, symb, "ENTREZID","SYMBOL")
>>>      SYMBOL ENTREZID
>>> 1      A1BG        1
>>> 2       A2M        2
>>> 3     A2MP1        3
>>> 4      NAT1        9
>>> 5      NAT2       10
>>> 6      AACP       11
>>> 7  SERPINA3       12
>>> 8     AADAC       13
>>> 9      AAMP       14
>>> 10    AANAT       15
>>> 11     AAMP       14
>>> 12    AANAT       15
>>> 13     DSPS     <NA>
>>> 14     SNAT     <NA>
>>> 15     AARS       16
>>> 16    CMT2N     <NA>
>>> 17      AAV     <NA>
>>> 18    AAVS1       17
>>> 19     ABAT       18
>>> 20  GABA-AT     <NA>
>>> 21    GABAT     <NA>
>>> 22   NPD009     <NA>
>>> 23    ABC-1     <NA>
>>> 24     ABC1     <NA>
>>> 25    ABCA1       19
>>> > select(org.Hs.eg.db, symb, "ENTREZID","ALIAS")
>>>       ALIAS ENTREZID
>>> 1      A1BG        1
>>> 2       A2M        2
>>> 3     A2MP1        3
>>> 4      NAT1        9
>>> 5      NAT1     1982
>>> 6      NAT1     6530
>>> 7      NAT1    10991
>>> 8      NAT2       10
>>> 9      NAT2    81539
>>> 10     AACP       11
>>> 11 SERPINA3       12
>>> 12    AADAC       13
>>> 13     AAMP       14
>>> 14    AANAT       15
>>> 15     DSPS       15
>>> 16     SNAT       15
>>> 17     AARS       16
>>> 18    CMT2N       16
>>> 19      AAV       17
>>> 20    AAVS1       17
>>> 21     ABAT       18
>>> 22  GABA-AT       18
>>> 23    GABAT       18
>>> 24   NPD009       18
>>> 25    ABC-1       19
>>> 26     ABC1       19
>>> 27     ABC1    63897
>>> 28    ABCA1       19
>>> Warning message:
>>> In .generateExtraRows(tab, keys, jointype) :
>>>   'select' and duplicate query keys resulted in 1:many mapping between
>>> keys and return rows
>>> > mget(c("1982","6530","10991"), org.Hs.egGENENAME)
>>> $`1982`
>>> [1] "eukaryotic translation initiation factor 4 gamma, 2"
>>>
>>> $`6530`
>>> [1] "solute carrier family 6 (neurotransmitter transporter,
>>> noradrenalin), member 2"
>>>
>>> $`10991`
>>> [1] "solute carrier family 38, member 3"
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>> On 25 July 2013 18:17, James W. MacDonald <jmacdon at uw.edu> wrote:
>>>>> Hi Enrico,
>>>>>
>>>>> On 7/25/2013 12:56 PM, Enrico Ferrero wrote:
>>>>>> Dear James,
>>>>>>
>>>>>> Thanks very much for your prompt reply.
>>>>>> I knew the problem was the for loop and the select function is indeed
>>>>>> a lot faster than that and works perfectly with toy data.
>>>>>>
>>>>>> However, this is what happens when I try to use it with real data:
>>>>>>
>>>>>> > test <- select(org.Hs.eg.db, keys=df$GeneSymbol, keytype="ALIAS",
>>>>>> +                cols=c("SYMBOL","ENTREZID","ENSEMBL"))
>>>>>> Warning message:
>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>>   'select' and duplicate query keys resulted in 1:many mapping
>>>>>> between keys and return rows
>>>>>>
>>>>>> which is probably the warning you mentioned.
>>>>>
>>>>> That's not the warning I mentioned, but it does point out the same
>>>>> issue, which is that there is a one to many mapping between symbol
>>>>> and entrez gene ID.
>>>>>
>>>>> So now you have to decide if you want to be naive (or stupid,
>>>>> depending on your perspective) or not. You could just cover your
>>>>> eyes and do this:
>>>>>
>>>>> first.two <- first.two[!duplicated(first.two$SYMBOL),]
>>>>>
>>>>> which will choose for you the first symbol -> gene ID mapping and
>>>>> nuke the rest. That's nice and quick, but you are making huge
>>>>> assumptions.
>>>>>
>>>>> Or you could decide to be a bit more sophisticated and do something
>>>>> like
>>>>>
>>>>> thelst <- tapply(1:nrow(first.two), first.two$SYMBOL, function(x)
>>>>>                  first.two[x,])
>>>>>
>>>>> At this point you can take a look at e.g., thelst[1:10] to see what
>>>>> we just did
>>>>>
>>>>> thelst <- do.call("rbind", lapply(thelst, function(x) c(x[1,1],
>>>>>                   paste(x[,2], collapse = "|"))))
>>>>>
>>>>> and here you can look at head(thelst).
>>>>>
>>>>> Then you can check to ensure that the first column of thelst is
>>>>> identical to the first column of df, and proceed as before.
>>>>>
>>>>> But there is still the problem of the multiple mappings. As an
>>>>> example:
>>>>>
>>>>> > thelst[1:5]
>>>>> $HBD
>>>>>      SYMBOL  ENTREZID
>>>>> 2535    HBD      3045
>>>>> 2536    HBD 100187828
>>>>>
>>>>> $KIR3DL3
>>>>>        SYMBOL  ENTREZID
>>>>> 17513 KIR3DL3    115653
>>>>> 17514 KIR3DL3 100133046
>>>>>
>>>>> > mget(as.character(thelst[[1]][,2]), org.Hs.egGENENAME)
>>>>> $`3045`
>>>>> [1] "hemoglobin, delta"
>>>>>
>>>>> $`100187828`
>>>>> [1] "hypophosphatemic bone disease"
>>>>>
>>>>> > mget(as.character(thelst[[2]][,2]), org.Hs.egGENENAME)
>>>>> $`115653`
>>>>> [1] "killer cell immunoglobulin-like receptor, three domains, long
>>>>> cytoplasmic tail, 3"
>>>>>
>>>>> $`100133046`
>>>>> [1] "killer cell immunoglobulin-like receptor three domains long
>>>>> cytoplasmic tail 3"
>>>>>
>>>>> So HBD is the gene symbol for two different genes! If this gene
>>>>> symbol is in your data, you will now have attributed your data to
>>>>> two genes that apparently are not remotely similar. If KIR3DL3 is in
>>>>> your data, then it worked out OK for that gene.
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>> The real problem is that the number of rows is now different for the 2
>>>>>> objects:
>>>>>>
>>>>>> > nrow(df); nrow(test)
>>>>>> [1] 573
>>>>>> [1] 201
>>>>>>
>>>>>> So I obviously can't put the new data into the original df. My
>>>>>> impression is that when the 1 to many mapping arises, the select
>>>>>> functions exits, with that warning message. As a result, my test
>>>>>> object is incomplete.
>>>>>>
>>>>>> On top of that, and I can't really explain this, the row positions are
>>>>>> messed up, e.g.
>>>>>>
>>>>>> > all.equal(df[100,],test[100,])
>>>>>>
>>>>>> returns FALSE.
>>>>>>
>>>>>> How can I work around this?
>>>>>>
>>>>>> Thanks a lot!
>>>>>>
>>>>>> Best,

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
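
For readers following along, here is a minimal sketch of one way to get exactly one ID per input row, in the original order, once a lookup table is available. It is not taken from the thread: the small `map` data frame is a hand-built stand-in for what select() returns (using the AGT and ADORA2A rows shown above), and `df` is a made-up example. It keeps only the first hit for each key, so the caveat about ambiguous symbols still applies.

```r
# Toy stand-in for the table returned by select(); the ALIAS/ENTREZID
# pairs are copied from the output quoted in this thread.
map <- data.frame(
  ALIAS    = c("ADORA2A", "AGT", "AGT"),
  ENTREZID = c("135", "183", "189"),
  stringsAsFactors = FALSE
)

# Hypothetical input whose row order must be preserved; it contains a
# repeated key, an NA and a key with no mapping.
df <- data.frame(
  GeneSymbol = c("AGT", "ADORA2A", NA, "AGT", "NOTAGENE"),
  Value1     = c(2.5, 3, 1, 4, 0.5),
  stringsAsFactors = FALSE
)

# match() gives, for every element of df$GeneSymbol, the position of its
# FIRST hit in map$ALIAS (NA when there is none). The result therefore has
# nrow(df) elements, in the same order as df.
idx <- match(df$GeneSymbol, map$ALIAS)
df$EntrezGeneID <- map$ENTREZID[idx]
print(df)
```

With real data, `map` would presumably come from a single select() call on the unique, non-NA keys, so that duplicated keys never reach select() in the first place. More recent AnnotationDbi releases also provide mapIds(), which returns one value per key and may be worth a look.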
ADD REPLY
0
Entering edit mode
Hi James,

You're right.

It's actually both: NAs *and* duplicated keys that are mapped to
more than 1 row are removed from the input. I don't think this
is documented.

I wonder if select() behavior couldn't be a little bit simpler by
either preserving or removing all duplicated keys, and not just some
of them (on a somewhat arbitrary criterion).

Thanks,
H.

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
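
Since NAs and repeated keys are what make the row counts diverge, one possible workaround is to collapse the lookup table to one row per key first and only then index it by the original keys. The sketch below is an illustration under assumptions, not the behaviour of select() itself: `map` is again a hand-built stand-in for select() output, `keys` is a made-up input vector, and joining multiple IDs with "|" follows the idea suggested earlier in the thread.

```r
# Toy stand-in for select() output (pairs copied from this thread).
map <- data.frame(
  ALIAS    = c("ADORA2A", "AGT", "AGT"),
  ENTREZID = c("135", "183", "189"),
  stringsAsFactors = FALSE
)

# Hypothetical input keys: one repeated, one NA.
keys <- c("AGT", "ADORA2A", NA, "AGT")

# Group the Entrez IDs by alias, then join each group into one string,
# so every alias occupies exactly one slot.
hits      <- split(map$ENTREZID, map$ALIAS)
collapsed <- vapply(hits, paste, character(1), collapse = "|")

# Look each input key up by name; NA and unmapped keys give NA, and the
# result always has length(keys) elements in the input order.
pos <- match(keys, names(collapsed))
out <- data.frame(
  key   = keys,
  ids   = unname(collapsed)[pos],
  n_ids = unname(lengths(hits))[pos],  # > 1 flags an ambiguous key
  stringsAsFactors = FALSE
)
print(out)
```

The `n_ids` column makes it easy to review the ambiguous keys by hand instead of silently taking the first hit.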
ADD REPLY
0
Entering edit mode
Hi both,

Thanks for your insights, this is extremely interesting!

While I (kind of) understand why NAs get removed, deliberately
truncating the output that way is probably not what most people expect.
It may be worth considering filing a bug report for this?

This also brings me back to my original question: what's the simplest
and most efficient way to create an exact copy of a column containing
converted IDs in a data.frame? I'm surprised there doesn't seem to be an
easy ready-to-go solution, as I would imagine it is a rather common task
to perform. As I mentioned in my first post, the for loop function
works, but it's highly inefficient.

Any help is greatly appreciated, thank you.

Best,
>>>>>>> >>>>>>> Best, >>>>>>> >>>>>>> Jim >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> The real problem is that the number of rows is now different for >>>>>>>> the 2 >>>>>>>> objects: >>>>>>>>> >>>>>>>>> nrow(df); nrow(test) >>>>>>>> >>>>>>>> [1] 573 >>>>>>>> [1] 201 >>>>>>>> >>>>>>>> So I obviously can't put the new data into the original df. My >>>>>>>> impression is that when the 1 to many mapping arises, the select >>>>>>>> functions exits, with that warning message. As a result, my test >>>>>>>> object is incomplete. >>>>>>>> >>>>>>>> On top of that, and I can't really explain this, the row >>>>>>>> positions are >>>>>>>> messed up, e.g. >>>>>>>> >>>>>>>>> all.equal(df[100,],test[100,]) >>>>>>>> >>>>>>>> returns FALSE. >>>>>>>> >>>>>>>> How can I work around this? >>>>>>>> >>>>>>>> Thanks a lot! >>>>>>>> >>>>>>>> Best, >>>>>>>> >>>>>>>> On 25 July 2013 16:58, James W. MacDonald<jmacdon at="" uw.edu=""> wrote: >>>>>>>>> >>>>>>>>> Hi Enrico, >>>>>>>>> >>>>>>>>> >>>>>>>>> On 7/25/2013 11:35 AM, Enrico Ferrero wrote: >>>>>>>>>> >>>>>>>>>> Hello, >>>>>>>>>> >>>>>>>>>> I often have data frames where I need to perform ID conversions on >>>>>>>>>> one >>>>>>>>>> or >>>>>>>>>> more of the columns while preserving the order of the rows, e.g.: >>>>>>>>>> >>>>>>>>>> GeneSymbol Value1 Value2 >>>>>>>>>> GS1 2.5 0.1 >>>>>>>>>> GS2 3 0.2 >>>>>>>>>> .. >>>>>>>>>> >>>>>>>>>> And I want to obtain: >>>>>>>>>> >>>>>>>>>> GeneSymbol EntrezGeneID Value1 Value2 >>>>>>>>>> GS1 EG1 2.5 0.1 >>>>>>>>>> GS2 EG2 3 0.2 >>>>>>>>>> .. 
>>>>>>>>>> >>>>>>>>>> What I've done so far was to create a function that uses >>>>>>>>>> org.Hs.eg.db to >>>>>>>>>> loop over the rows of the column and does the conversion: >>>>>>>>>> >>>>>>>>>> library(org.Hs.eg.db) >>>>>>>>>> alias2EG<- function(x) { >>>>>>>>>> for (i in 1:length(x)) { >>>>>>>>>> if (!is.na(x[i])) { >>>>>>>>>> repl<- org.Hs.egALIAS2EG[[x[i]]][1] >>>>>>>>>> if (!is.null(repl)) { >>>>>>>>>> x[i]<- repl >>>>>>>>>> } >>>>>>>>>> else { >>>>>>>>>> x[i]<- NA >>>>>>>>>> } >>>>>>>>>> } >>>>>>>>>> } >>>>>>>>>> return(x) >>>>>>>>>> } >>>>>>>>> >>>>>>>>> >>>>>>>>> I should first note that gene symbols are not unique, so you are >>>>>>>>> taking a >>>>>>>>> chance on your mappings. Is there no other annotation for your >>>>>>>>> data? >>>>>>>>> >>>>>>>>> In addition, you should note that it is almost always better to >>>>>>>>> think of >>>>>>>>> objects as vectors and matrices in R, rather than as things that >>>>>>>>> need to >>>>>>>>> be >>>>>>>>> looped over (e.g., R isn't Perl or C). >>>>>>>>> >>>>>>>>> first.two<- select(org.Hs.eg.db, as.character(df$GeneSymbol), >>>>>>>>> "ENTREZID", >>>>>>>>> "SYMBOL") >>>>>>>>> >>>>>>>>> Note that there used to be a warning or an error (don't remember >>>>>>>>> which) >>>>>>>>> when >>>>>>>>> you did something like this, stating that gene symbols are not >>>>>>>>> unique, >>>>>>>>> and >>>>>>>>> that you shouldn't do this sort of thing. Apparently this >>>>>>>>> warning has >>>>>>>>> been >>>>>>>>> removed, but the issue remains valid. 
>>>>>>>>> >>>>>>>>> ## check yourself >>>>>>>>> >>>>>>>>> all.equal(df$GeneSymbol, first.two$SYMBOL) >>>>>>>>> >>>>>>>>> ## if true, proceed >>>>>>>>> >>>>>>>>> df<- data.frame(first.two, df[,-1]) >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> >>>>>>>>> Jim >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> and then call the function like this: >>>>>>>>>> >>>>>>>>>> df$EntrezGeneID<- alias2GS(df$GeneSymbol) >>>>>>>>>> >>>>>>>>>> This works well, but gets very slow when I need to do multiple >>>>>>>>>> conversions >>>>>>>>>> on large datasets. >>>>>>>>>> >>>>>>>>>> Is there any way I can achieve the same result but in a >>>>>>>>>> quicker, more >>>>>>>>>> efficient way? >>>>>>>>>> >>>>>>>>>> Thank you. >>>>>>>>>> >>>>>>>>> -- >>>>>>>>> James W. MacDonald, M.S. >>>>>>>>> Biostatistician >>>>>>>>> University of Washington >>>>>>>>> Environmental and Occupational Health Sciences >>>>>>>>> 4225 Roosevelt Way NE, # 100 >>>>>>>>> Seattle WA 98105-6099 >>>>>>>>> >>>>>>> -- >>>>>>> James W. MacDonald, M.S. >>>>>>> Biostatistician >>>>>>> University of Washington >>>>>>> Environmental and Occupational Health Sciences >>>>>>> 4225 Roosevelt Way NE, # 100 >>>>>>> Seattle WA 98105-6099 >>>>>>> >>>>>> >>>> -- >>>> Hervé Pagès >>>> >>>> Program in Computational Biology >>>> Division of Public Health Sciences >>>> Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Ave. N, M1-B514 >>>> P.O. Box 19024 >>>> Seattle, WA 98109-1024 >>>> >>>> E-mail: hpages at fhcrc.org >>>> Phone: (206) 667-5791 >>>> Fax: (206) 667-1319 >>> >>> >>> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. 
Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 -- Enrico Ferrero PhD Student Steve Russell Lab - Department of Genetics FlyChip - Cambridge Systems Biology Centre University of Cambridge e.ferrero at gen.cam.ac.uk http://flypress.gen.cam.ac.uk/
On 07/25/2013 03:54 PM, Enrico Ferrero wrote:
> Hi both,
>
> Thanks for your insights, this is extremely interesting!
>
> While I (kind of) understand why NAs get removed, deliberately
> truncating the output that way is probably not what most people
> expect. It may be worth considering filing a bug report for this?
>
> This also brings me back to my original question: what's the simplest
> and most efficient way to create an exact copy of a column containing
> converted IDs in a data.frame?
>
> I'm surprised there doesn't seem to be an easy ready-to-go solution,
> as I would imagine it is a rather common task to perform.

There is no ready-to-go solution, because, as Jim pointed out, the
problem of the multiple mappings cannot be solved in a meaningful
way without some extra knowledge. It's not a limitation of the software,
it's a problem inherent to the nature of the data itself.

However, the 1st thing you can do to reduce the number of multiple
mappings is to request only the columns you are interested in.
For example:

  > library(org.Hs.eg.db)

  > select(org.Hs.eg.db, key="ALOX5", keytype="ALIAS",
           cols=c("SYMBOL","ENTREZID","ENSEMBL"))
    ALIAS SYMBOL ENTREZID         ENSEMBL
  1 ALOX5  ALOX5      240 ENSG00000012779
  2 ALOX5  ALOX5      240 ENSG00000262552

  > select(org.Hs.eg.db, key="ALOX5", keytype="ALIAS", cols="ENTREZID")
    ALIAS ENTREZID
  1 ALOX5      240

ALOX5 is mapped to 2 Ensembl ids, but only to one Entrez id. So by
requesting only the ENTREZID, ALOX5 does not generate 2 rows anymore.

Now a *blunt* approach to get rid of all keys with multiple mappings
is to treat them as if they had no mapping (this avoids having to
choose a particular row for the key, convenient but of course not
satisfactory). The way to do this is to do a little bit of preprocessing
of the 'key' vector and a little bit of post-processing of the
data.frame returned by select():

  library(org.Hs.eg.db)

  mydf <- read.table("testdata.txt", sep="\t", header=TRUE, as.is=TRUE)

  mykeys0 <- mydf$GeneSymbol1
  mykeys <- unique(mykeys0[!is.na(mykeys0)])
  mytest <- select(org.Hs.eg.db, key=mykeys, keytype="ALIAS",
                   cols="ENTREZID")

  is_multiple_mapping <- duplicated(mytest$ALIAS) |
                         duplicated(mytest$ALIAS, fromLast=TRUE)
  mytest0 <- mytest[!is_multiple_mapping, ]
  mytest0 <- mytest0[match(mykeys0, mytest0$ALIAS), ]
  mytest0$ALIAS <- mykeys0
  rownames(mytest0) <- NULL

Each row in 'mytest0' faces the corresponding key in 'mykeys0'.

Cheers,
H.

> As I
> mentioned in my first post, the for loop function works, but it's
> highly inefficient.
>
> Any help is greatly appreciated, thank you.
>
> Best,

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
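For readers who want to try the "blunt" approach without the annotation package installed, here is a minimal sketch in base R. It is only an illustration under stated assumptions: the 'mapping' data.frame is a made-up stand-in for what select() would return, the gene names and IDs are invented, and 'map_unique' is a hypothetical helper name, not part of AnnotationDbi. When running against org.Hs.eg.db itself, note that, as far as I know, recent AnnotationDbi releases spell the select() argument 'columns' rather than 'cols'.

```r
# Toy stand-in for the data.frame that select() would return.
# The aliases and IDs below are made up for illustration only.
mapping <- data.frame(
  ALIAS    = c("GS1", "GS2", "GS2", "GS3"),
  ENTREZID = c("EG1", "EG2a", "EG2b", "EG3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns one value per input key, in input order.
# Keys that are NA, unmapped, or mapped to more than one ID come back as NA.
map_unique <- function(keys, mapping, from = "ALIAS", to = "ENTREZID") {
  mapping <- unique(mapping[, c(from, to)])
  # Flag every row whose key occurs more than once (1:many mappings).
  ambiguous <- duplicated(mapping[[from]]) |
    duplicated(mapping[[from]], fromLast = TRUE)
  mapping <- mapping[!ambiguous, , drop = FALSE]
  # match() keeps the length and order of 'keys', so the result can be
  # assigned straight into the original data.frame.
  mapping[[to]][match(keys, mapping[[from]])]
}

df <- data.frame(
  GeneSymbol = c("GS3", "GS1", NA, "GS2", "GS9", "GS1"),
  Value1     = c(2.5, 3, 1, 4, 0.7, 3),
  stringsAsFactors = FALSE
)
df$EntrezGeneID <- map_unique(df$GeneSymbol, mapping)
print(df)
```

The new column has exactly one entry per row of 'df', so no cbind() or row-count check is needed. The price is that ambiguous symbols (GS2 here) are silently turned into NA, which is the same trade-off described in the post above.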
On 07/25/2013 04:45 PM, Hervé Pagès wrote:
> There is no ready-to-go solution, because, as Jim pointed out, the
> problem of the multiple mappings cannot be solved in a meaningful
> way without some extra knowledge. It's not a limitation of the software,
> it's a problem inherent to the nature of the data itself.

Hello everyone,

Sorry that I saw this thread so late. Basically, select() does *try* to
keep your initial keys and map them each to an equivalent number of
unique values. We did actually anticipate that people would *want* to
cbind() their results. But as you discovered there are many
circumstances where the data make this kind of behavior impossible.

So passing in NAs as keys, for example, can't ever find anything
meaningful. Those will simply have to be removed before we can proceed.
And it is also impossible to maintain a 1:1 mapping if you retrieve
fields that have many to one relationships with your initial keys (also
seen here).

For convenience, when this kind of 1:1 output is already impossible (as
it is for most of your examples), select will also try to simplify the
output by removing rows that are identical all the way across, etc.

My aim was that select should try to do the most reasonable thing
possible based on the data we have in each case. The rationale is that
in the case where there are 1:many mappings, you should not be planning
to bind those directly onto any other data.frames anyways (as this
circumstance would require you to call merge() instead). So in that
case, non-destructive simplification seems beneficial.

I hope this clarifies things,

Marc
3 A2MP1 3 >>>>>>> 4 NAT1 9 >>>>>>> 5 NAT1 1982 >>>>>>> 6 NAT1 6530 >>>>>>> 7 NAT1 10991 >>>>>>> 8 NAT2 10 >>>>>>> 9 NAT2 81539 >>>>>>> 10 AACP 11 >>>>>>> 11 SERPINA3 12 >>>>>>> 12 AADAC 13 >>>>>>> 13 AAMP 14 >>>>>>> 14 AANAT 15 >>>>>>> 15 DSPS 15 >>>>>>> 16 SNAT 15 >>>>>>> 17 AARS 16 >>>>>>> 18 CMT2N 16 >>>>>>> 19 AAV 17 >>>>>>> 20 AAVS1 17 >>>>>>> 21 ABAT 18 >>>>>>> 22 GABA-AT 18 >>>>>>> 23 GABAT 18 >>>>>>> 24 NPD009 18 >>>>>>> 25 ABC-1 19 >>>>>>> 26 ABC1 19 >>>>>>> 27 ABC1 63897 >>>>>>> 28 ABCA1 19 >>>>>>> Warning message: >>>>>>> In .generateExtraRows(tab, keys, jointype) : >>>>>>> 'select' and duplicate query keys resulted in 1:many mapping >>>>>>> between >>>>>>> keys and return rows >>>>>>> > mget(c("1982","6530","10991"), org.Hs.egGENENAME) >>>>>>> $`1982` >>>>>>> [1] "eukaryotic translation initiation factor 4 gamma, 2" >>>>>>> >>>>>>> $`6530` >>>>>>> [1] "solute carrier family 6 (neurotransmitter transporter, >>>>>>> noradrenalin), member 2" >>>>>>> >>>>>>> $`10991` >>>>>>> [1] "solute carrier family 38, member 3" >>>>>>> >>>>>>> Best, >>>>>>> >>>>>>> Jim >>>>>>> >>>>>>>> On 25 July 2013 18:17, James W. MacDonald<jmacdon at="" uw.edu=""> wrote: >>>>>>>>> >>>>>>>>> Hi Enrico, >>>>>>>>> >>>>>>>>> >>>>>>>>> On 7/25/2013 12:56 PM, Enrico Ferrero wrote: >>>>>>>>>> >>>>>>>>>> Dear James, >>>>>>>>>> >>>>>>>>>> Thanks very much for your prompt reply. >>>>>>>>>> I knew the problem was the for loop and the select function is >>>>>>>>>> indeed >>>>>>>>>> a lot faster than that and works perfectly with toy data. 
>>>>>>>>>> >>>>>>>>>> However, this is what happens when I try to use it with real >>>>>>>>>> data: >>>>>>>>>> >>>>>>>>>>> test<- select(org.Hs.eg.db, keys=df$GeneSymbol, >>>>>>>>>>> keytype="ALIAS", >>>>>>>>>>> cols=c("SYMBOL","ENTREZID","ENSEMBL")) >>>>>>>>>> >>>>>>>>>> Warning message: >>>>>>>>>> In .generateExtraRows(tab, keys, jointype) : >>>>>>>>>> 'select' and duplicate query keys resulted in 1:many >>>>>>>>>> mapping >>>>>>>>>> between >>>>>>>>>> keys and return rows >>>>>>>>>> >>>>>>>>>> which is probably the warning you mentioned. >>>>>>>>> >>>>>>>>> >>>>>>>>> That's not the warning I mentioned, but it does point out the >>>>>>>>> same >>>>>>>>> issue, >>>>>>>>> which is that there is a one to many mapping between symbol and >>>>>>>>> entrez gene >>>>>>>>> ID. >>>>>>>>> >>>>>>>>> So now you have to decide if you want to be naive (or stupid, >>>>>>>>> depending on >>>>>>>>> your perspective) or not. You could just cover your eyes and >>>>>>>>> do this: >>>>>>>>> >>>>>>>>> first.two<- first.two[!duplicated(first.two$SYMBOL),] >>>>>>>>> >>>>>>>>> which will choose for you the first symbol -> gene ID mapping and >>>>>>>>> nuke the >>>>>>>>> rest. That's nice and quick, but you are making huge assumptions. >>>>>>>>> >>>>>>>>> Or you could decide to be a bit more sophisticated and do >>>>>>>>> something like >>>>>>>>> >>>>>>>>> thelst<- tapply(1:nrow(first.two), first.two$SYMBOL, function(x) >>>>>>>>> first.two[x,]) >>>>>>>>> >>>>>>>>> At this point you can take a look at e.g., thelst[1:10] to see >>>>>>>>> what >>>>>>>>> we just >>>>>>>>> did >>>>>>>>> >>>>>>>>> thelst<- do.call("rbind", lapply(thelst, function(x) c(x[1,1], >>>>>>>>> paste(x[,2], >>>>>>>>> collapse = "|"))) >>>>>>>>> >>>>>>>>> and here you can look at head(thelst). >>>>>>>>> >>>>>>>>> Then you can check to ensure that the first column of thelst is >>>>>>>>> identical to >>>>>>>>> the first column of df, and proceed as before. 
>>>>>>>>> >>>>>>>>> But there is still the problem of the multiple mappings. As an >>>>>>>>> example: >>>>>>>>> >>>>>>>>>> thelst[1:5] >>>>>>>>> >>>>>>>>> $HBD >>>>>>>>> SYMBOL ENTREZID >>>>>>>>> 2535 HBD 3045 >>>>>>>>> 2536 HBD 100187828 >>>>>>>>> >>>>>>>>> $KIR3DL3 >>>>>>>>> SYMBOL ENTREZID >>>>>>>>> 17513 KIR3DL3 115653 >>>>>>>>> 17514 KIR3DL3 100133046 >>>>>>>>> >>>>>>>>>> mget(as.character(thelst[[1]][,2]), org.Hs.egGENENAME) >>>>>>>>> >>>>>>>>> $`3045` >>>>>>>>> [1] "hemoglobin, delta" >>>>>>>>> >>>>>>>>> $`100187828` >>>>>>>>> [1] "hypophosphatemic bone disease" >>>>>>>>> >>>>>>>>>> mget(as.character(thelst[[2]][,2]), org.Hs.egGENENAME) >>>>>>>>> >>>>>>>>> $`115653` >>>>>>>>> [1] "killer cell immunoglobulin-like receptor, three domains, >>>>>>>>> long >>>>>>>>> cytoplasmic tail, 3" >>>>>>>>> >>>>>>>>> $`100133046` >>>>>>>>> [1] "killer cell immunoglobulin-like receptor three domains long >>>>>>>>> cytoplasmic >>>>>>>>> tail 3" >>>>>>>>> >>>>>>>>> >>>>>>>>> So HBD is the gene symbol for two different genes! If this gene >>>>>>>>> symbol is in >>>>>>>>> your data, you will now have attributed your data to two genes >>>>>>>>> that >>>>>>>>> apparently are not remotely similar. if KIR3DL3 is in your data, >>>>>>>>> then it >>>>>>>>> worked out OK for that gene. >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> >>>>>>>>> Jim >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> The real problem is that the number of rows is now different for >>>>>>>>>> the 2 >>>>>>>>>> objects: >>>>>>>>>>> >>>>>>>>>>> nrow(df); nrow(test) >>>>>>>>>> >>>>>>>>>> [1] 573 >>>>>>>>>> [1] 201 >>>>>>>>>> >>>>>>>>>> So I obviously can't put the new data into the original df. My >>>>>>>>>> impression is that when the 1 to many mapping arises, the select >>>>>>>>>> functions exits, with that warning message. As a result, my test >>>>>>>>>> object is incomplete. >>>>>>>>>> >>>>>>>>>> On top of that, and I can't really explain this, the row >>>>>>>>>> positions are >>>>>>>>>> messed up, e.g. 
>>>>>>>>>> >>>>>>>>>>> all.equal(df[100,],test[100,]) >>>>>>>>>> >>>>>>>>>> returns FALSE. >>>>>>>>>> >>>>>>>>>> How can I work around this? >>>>>>>>>> >>>>>>>>>> Thanks a lot! >>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> >>>>>>>>>> On 25 July 2013 16:58, James W. MacDonald<jmacdon at="" uw.edu=""> >>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> Hi Enrico, >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On 7/25/2013 11:35 AM, Enrico Ferrero wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hello, >>>>>>>>>>>> >>>>>>>>>>>> I often have data frames where I need to perform ID >>>>>>>>>>>> conversions on >>>>>>>>>>>> one >>>>>>>>>>>> or >>>>>>>>>>>> more of the columns while preserving the order of the rows, >>>>>>>>>>>> e.g.: >>>>>>>>>>>> >>>>>>>>>>>> GeneSymbol Value1 Value2 >>>>>>>>>>>> GS1 2.5 0.1 >>>>>>>>>>>> GS2 3 0.2 >>>>>>>>>>>> .. >>>>>>>>>>>> >>>>>>>>>>>> And I want to obtain: >>>>>>>>>>>> >>>>>>>>>>>> GeneSymbol EntrezGeneID Value1 Value2 >>>>>>>>>>>> GS1 EG1 2.5 0.1 >>>>>>>>>>>> GS2 EG2 3 0.2 >>>>>>>>>>>> .. >>>>>>>>>>>> >>>>>>>>>>>> What I've done so far was to create a function that uses >>>>>>>>>>>> org.Hs.eg.db to >>>>>>>>>>>> loop over the rows of the column and does the conversion: >>>>>>>>>>>> >>>>>>>>>>>> library(org.Hs.eg.db) >>>>>>>>>>>> alias2EG<- function(x) { >>>>>>>>>>>> for (i in 1:length(x)) { >>>>>>>>>>>> if (!is.na(x[i])) { >>>>>>>>>>>> repl<- org.Hs.egALIAS2EG[[x[i]]][1] >>>>>>>>>>>> if (!is.null(repl)) { >>>>>>>>>>>> x[i]<- repl >>>>>>>>>>>> } >>>>>>>>>>>> else { >>>>>>>>>>>> x[i]<- NA >>>>>>>>>>>> } >>>>>>>>>>>> } >>>>>>>>>>>> } >>>>>>>>>>>> return(x) >>>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I should first note that gene symbols are not unique, so you >>>>>>>>>>> are >>>>>>>>>>> taking a >>>>>>>>>>> chance on your mappings. Is there no other annotation for your >>>>>>>>>>> data? 
>>>>>>>>>>> >>>>>>>>>>> In addition, you should note that it is almost always better to >>>>>>>>>>> think of >>>>>>>>>>> objects as vectors and matrices in R, rather than as things >>>>>>>>>>> that >>>>>>>>>>> need to >>>>>>>>>>> be >>>>>>>>>>> looped over (e.g., R isn't Perl or C). >>>>>>>>>>> >>>>>>>>>>> first.two<- select(org.Hs.eg.db, as.character(df$GeneSymbol), >>>>>>>>>>> "ENTREZID", >>>>>>>>>>> "SYMBOL") >>>>>>>>>>> >>>>>>>>>>> Note that there used to be a warning or an error (don't >>>>>>>>>>> remember >>>>>>>>>>> which) >>>>>>>>>>> when >>>>>>>>>>> you did something like this, stating that gene symbols are not >>>>>>>>>>> unique, >>>>>>>>>>> and >>>>>>>>>>> that you shouldn't do this sort of thing. Apparently this >>>>>>>>>>> warning has >>>>>>>>>>> been >>>>>>>>>>> removed, but the issue remains valid. >>>>>>>>>>> >>>>>>>>>>> ## check yourself >>>>>>>>>>> >>>>>>>>>>> all.equal(df$GeneSymbol, first.two$SYMBOL) >>>>>>>>>>> >>>>>>>>>>> ## if true, proceed >>>>>>>>>>> >>>>>>>>>>> df<- data.frame(first.two, df[,-1]) >>>>>>>>>>> >>>>>>>>>>> Best, >>>>>>>>>>> >>>>>>>>>>> Jim >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> and then call the function like this: >>>>>>>>>>>> >>>>>>>>>>>> df$EntrezGeneID<- alias2GS(df$GeneSymbol) >>>>>>>>>>>> >>>>>>>>>>>> This works well, but gets very slow when I need to do multiple >>>>>>>>>>>> conversions >>>>>>>>>>>> on large datasets. >>>>>>>>>>>> >>>>>>>>>>>> Is there any way I can achieve the same result but in a >>>>>>>>>>>> quicker, more >>>>>>>>>>>> efficient way? >>>>>>>>>>>> >>>>>>>>>>>> Thank you. >>>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> James W. MacDonald, M.S. >>>>>>>>>>> Biostatistician >>>>>>>>>>> University of Washington >>>>>>>>>>> Environmental and Occupational Health Sciences >>>>>>>>>>> 4225 Roosevelt Way NE, # 100 >>>>>>>>>>> Seattle WA 98105-6099 >>>>>>>>>>> >>>>>>>>> -- >>>>>>>>> James W. MacDonald, M.S. 
>>>>>>>>> Biostatistician >>>>>>>>> University of Washington >>>>>>>>> Environmental and Occupational Health Sciences >>>>>>>>> 4225 Roosevelt Way NE, # 100 >>>>>>>>> Seattle WA 98105-6099 >>>>>>>>> >>>>>>>> >>>>>> -- >>>>>> Hervé Pagès >>>>>> >>>>>> Program in Computational Biology >>>>>> Division of Public Health Sciences >>>>>> Fred Hutchinson Cancer Research Center >>>>>> 1100 Fairview Ave. N, M1-B514 >>>>>> P.O. Box 19024 >>>>>> Seattle, WA 98109-1024 >>>>>> >>>>>> E-mail: hpages at fhcrc.org >>>>>> Phone: (206) 667-5791 >>>>>> Fax: (206) 667-1319 >>>>> >>>>> >>>>> >>>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages at fhcrc.org >>> Phone: (206) 667-5791 >>> Fax: (206) 667-1319 >> >> >> >
Hi Marc,

On 07/26/2013 12:57 PM, Marc Carlson wrote:
...
> Hello everyone,
>
> Sorry that I saw this thread so late. Basically, select() does *try* to
> keep your initial keys and map them each to an equivalent number of
> unique values. We did actually anticipate that people would *want* to
> cbind() their results.
>
> But as you discovered there are many circumstances where the data make
> this kind of behavior impossible.
>
> So passing in NAs as keys for example can't ever find anything
> meaningful. Those will simply have to be removed before we can
> proceed. And, it is also impossible to maintain a 1:1 mapping if you
> retrieve fields that have many to one relationships with your initial
> keys (also seen here).
>
> For convenience, when this kind of 1:1 output is already impossible (as
> it is for most of your examples), select will also try to simplify the
> output by removing rows that are identical all the way across etc.
>
> My aim was that select should try to do the most reasonable thing
> possible based on the data we have in each case. The rationale is that
> in the case where there are 1:many mappings, you should not be planning
> to bind those directly onto any other data.frames anyways (as this
> circumstance would require you to call merge() instead). So in that
> case, non-destructive simplification seems beneficial.

Other tools in our infrastructure use an extra argument to pick one result in case of multiple mappings, e.g. findOverlaps() has the 'select' argument with possible values "all", "first", "last", and "arbitrary". nearest() and family also have this argument, and it accepts similar values.

Couldn't select() use a similar approach? The default should be "all" so the current behavior is preserved, but if it's something else then the returned data.frame should align with the input.

Thanks,
H.

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319

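To make the suggestion above concrete, here is a minimal base-R sketch of what "pick one mapping and stay aligned with the input" could look like. It does not call select() at all: `map` is a toy stand-in for a select() result, and `pick_one()` and its `how` argument are hypothetical names invented for this illustration, not part of AnnotationDbi.

```r
# Toy stand-in for a select() result with a 1:many mapping (AGT -> 183, 189).
# In real use this table would come from the annotation package.
map <- data.frame(ALIAS    = c("ADORA2A", "AGT", "AGT"),
                  ENTREZID = c("135", "183", "189"),
                  stringsAsFactors = FALSE)

# Hypothetical helper (an assumption for illustration): return exactly one
# ID per input key, in input order, with NA for NA or unmapped keys.
pick_one <- function(keys, map, how = c("first", "last")) {
  how <- match.arg(how)
  # match() always returns the first hit, so reverse the table to get the last
  if (how == "last") map <- map[rev(seq_len(nrow(map))), ]
  map$ENTREZID[match(keys, map$ALIAS)]
}

keys      <- c("AGT", "ADORA2A", NA, "AGT", "NOT_A_GENE")
first_ids <- pick_one(keys, map, "first")
last_ids  <- pick_one(keys, map, "last")
```

Because the result has one element per input key, it can be assigned straight into a column of the original data frame. For real annotation packages, later AnnotationDbi releases provide mapIds(), whose multiVals argument (for example multiVals = "first") serves a similar purpose; check the documentation of your installed version.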
Hi everybody,

Marc, thanks for clarifying things. The behaviour of the select() function is absolutely sensible. Maybe the documentation should state explicitly that, when working with data frames, the user is expected to use merge() in conjunction with it. I also agree with Hervé that having options to tweak and customize the output would be an extremely positive thing and a step in the right direction. In addition to a "select" argument, one could also think of a "remove.na.rows" argument that is either TRUE or FALSE. But then again, using merge() after select() already deals with these issues quite well.

What I think should be investigated more closely at the moment is the unexpected behaviour select() exhibits when one SYMBOL or ALIAS (and potentially other types of ID, I don't know) maps to more than one ENTREZID. As exemplified by James' code below, this causes the output to be truncated, and I highly doubt this is intentional:

> select(org.Hs.eg.db, rep("ADORA2A", 4), "ENTREZID", "ALIAS")
    ALIAS ENTREZID
1 ADORA2A      135
2 ADORA2A      135
3 ADORA2A      135
4 ADORA2A      135

> select(org.Hs.eg.db, rep("AGT", 4), "ENTREZID", "ALIAS")
  ALIAS ENTREZID
1   AGT      183
2   AGT      189
Warning message:
In .generateExtraRows(tab, keys, jointype) :
  'select' and duplicate query keys resulted in 1:many mapping between
  keys and return rows

It would be great to have your views on this.

Best,

Enrico Ferrero
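The merge()-after-select() approach mentioned in this post can be sketched roughly as follows. This is only a minimal base R illustration under some assumptions: `lookup` is a hand-built stand-in for the data frame select() would return (the ENTREZIDs for AGT and ADORA2A are the ones shown in this thread, while "GS3" is a made-up unmapped symbol), and the `.row` helper column is a hypothetical device for restoring the input order, since merge() does not preserve it.

```r
# Toy input: a repeated key ("AGT") and an unmapped symbol ("GS3", made up).
mydf <- data.frame(GeneSymbol = c("AGT", "ADORA2A", "GS3", "AGT"),
                   Value1 = c(2.5, 3, 1.2, 0.7),
                   stringsAsFactors = FALSE)

# Stand-in for the deduplicated select() result (1:many for AGT).
lookup <- data.frame(ALIAS = c("ADORA2A", "AGT", "AGT"),
                     ENTREZID = c("135", "183", "189"),
                     stringsAsFactors = FALSE)

# Remember the original row order, because merge() reorders rows.
mydf$.row <- seq_len(nrow(mydf))

# Left join: all.x = TRUE keeps unmapped symbols, with NA as their ENTREZID.
merged <- merge(mydf, lookup, by.x = "GeneSymbol", by.y = "ALIAS",
                all.x = TRUE, sort = FALSE)

# Restore the input order; 1:many keys now span several consecutive rows.
merged <- merged[order(merged$.row, merged$ENTREZID), ]
rownames(merged) <- NULL
merged$.row <- NULL
merged
```

With this toy data, each AGT row of `mydf` expands into two rows (one per ENTREZID), and the unmapped symbol is kept with NA, so nothing is silently lost.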
On 07/27/2013 06:40 AM, Enrico Ferrero wrote: > Hi everybody, > > Marc, thanks for clarifying things. The behaviour of the select() > function is absolutely sensible. Maybe it should be made explicit > somewhere in the documentation that, when working with data frames, > the user is expected to use the merge() function in conjunction with > it. I also agree with Herv? that having options to tweak and customize > the output would be an extremely positive thing and a step in the > right direction. In addition to a "select" argument, one can also > think of a "remove.na.rows" that evaluates to either TRUE or FALSE. > But then again, using merge() after select() already deals with these > issues quite well. > > What I think should be investigated more closely at the moment is the > unexpected behaviour select() exhibits when one SYMBOL or ALIAS (and > potentially other types of ID, I don't know) maps to more than one > ENTREZID. As exemplified by James' code below, this causes the output > to be truncated, and I highly doubt this is intentional: > >> select(org.Hs.eg.db, rep("ADORA2A", 4), "ENTREZID", "ALIAS") > ALIAS ENTREZID > 1 ADORA2A 135 > 2 ADORA2A 135 > 3 ADORA2A 135 > 4 ADORA2A 135 > >> select(org.Hs.eg.db, rep("AGT", 4), "ENTREZID", "ALIAS") > ALIAS ENTREZID > 1 AGT 183 > 2 AGT 189 > > Warning message: > In .generateExtraRows(tab, keys, jointype) : > 'select' and duplicate query keys resulted in 1:many mapping between > keys and return rows > > > It would be great to have your views on this. > Best, Hi Enrico, My view on this is the same one that I presented above. Basically you seem to have misunderstood what select is doing in this case. So clearly I need to explain things a bit better in the documentation. But what is happening is in fact completely intentional, and it happens every time there is at least one "many to one" relationship requested by select. 
The presence of a many to one relationship means that select no longer has any chance of giving you a data.frame back that has the same height as the length of your keys. So instead of attempting to keep your repeated keys and matching them perfectly (which is no longer possible), select assumes that you know what you are doing and instead it just simplifies the result by removing all duplicated rows from the result. This is why your result appears "truncated". It's because there really was no point in keeping the initial pattern from the keys you passed in (as the data shape makes it impossible to do this anyways). The result you get in your 2nd case is actually the same exact information content as if we had tried to duplicate rows to match your repeated input. The only actual difference here is that there is no way for select to know how you intended to repeat the symbol "AGT" to match the two entrez gene IDs to the initial four "AGT" symbols that you passed in. For this example, did you want AGT repeated 4 times (with two repeats each of the two entrez gene IDs)? Or did you maybe want it repeated 8 times (with 4 repeats of each entrez gene ID)? And what should we have done if you had repeated the symbol "AGT" 5 times in the input instead? How are we supposed to format the output in that case? I hope you can see why in this case we have to just give you the data as it is. In this circumstance we just can't guess anymore about how you want it presented. So instead of guessing we just return all the data "as is" and give you a warning. So it's not actually true that the 2nd case you presented is "truncated". It's actually true instead that the 1st case data has just been repeated in an effort to make your life easier. But when the data is complicated by many to one relationships, we just can't know anymore what you will want to do for formatting it. 
We have tried to be very accommodating with select for people who request simple 1:1 relationships because we recognize that this is a common use case and we can see a straightforward way to make things easier for that common use case. But select is not really meant to be a data formatting function. It's really intended to be a data retrieval function. R already has a lot of great functions for data formatting already (like merge and the subset operators etc.), and these are already more flexible and better suited for tasks like that. Marc > On 26 July 2013 21:46, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: >> Hi Marc, >> >> On 07/26/2013 12:57 PM, Marc Carlson wrote: >> ... >> >>> Hello everyone, >>> >>> Sorry that I saw this thread so late. Basically, select() does *try* to >>> keep your initial keys and map them each to an equivalent number of >>> unique values. We did actually anticipate that people would *want* to >>> cbind() their results. >>> >>> But as you discovered there are many circumstances where the data make >>> this kind of behavior impossible. >>> >>> So passing in NAs as keys for example can't ever find anything >>> meaningful. Those will simply have to be removed before we can >>> proceed. And, it is also impossible to maintain a 1:1 mapping if you >>> retrieve fields that have many to one relationships with your initial >>> keys (also seen here). >>> >>> For convenience, when this kind of 1:1 output is already impossible (as >>> it is for most of your examples), select will also try to simplify the >>> output by removing rows that are identical all the way across etc.. >>> >>> My aim was that select should try to do the most reasonable thing >>> possible based on the data we have in each case. The rationale is that >>> in the case where there are 1:many mappings, you should not be planning >>> to bind those directly onto any other data.frames anyways (as this >>> circumstance would require you to call merge() instead). 
So in that >>> case, non-destructive simplification seems beneficial. >> >> Other tools in our infrastructure use an extra argument to pick-up 1 >> thing in case of multiple mapping e.g. findOverlaps() has the 'select' >> argument with possible values "all", "first", "last", and "arbitrary". >> Also nearest() and family have this argument and it accepts similar >> values. >> >> Couldn't select() use a similar approach? The default should be "all" >> so the current behavior is preserved but if it's something else then >> the returned data.frame should align with the input. >> >> Thanks, >> >> H. >> >> >>> I hope this clarifies things, >>> >>> >>> Marc >>> >>> >>> >>>>> As I >>>>> mentioned in my first post, the for loop function works, but it's >>>>> highly inefficient. >>>>> >>>>> Any help is greatly appreciated, thank you. >>>>> >>>>> Best, >>>>> >>>>> >>>>> >>>>> On 25 July 2013 23:18, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: >>>>>> Hi James, >>>>>> >>>>>> You're right. >>>>>> >>>>>> It's actually both: NAs *and* duplicated keys that are mapped to >>>>>> more than 1 row are removed from the input. I don't think this >>>>>> is documented. >>>>>> >>>>>> I wonder if select() behavior couldn't be a little bit simpler by >>>>>> either preserving or removing all duplicated keys, and not just some >>>>>> of them (on a somewhat arbitrary criteria). >>>>>> >>>>>> Thanks, >>>>>> H. >>>>>> >>>>>> >>>>>> >>>>>> On 07/25/2013 02:57 PM, James W. 
MacDonald wrote: >>>>>>> >>>>>>> Hi Enrico and Herve, >>>>>>> >>>>>>> This has to do with duplicate entries, but only when the duplicate >>>>>>> entry >>>>>>> maps to many ENTREZID: >>>>>>> >>>>>>> > select(org.Hs.eg.db, rep("ADORA2A", 4), "ENTREZID", "ALIAS") >>>>>>> ALIAS ENTREZID >>>>>>> 1 ADORA2A 135 >>>>>>> 2 ADORA2A 135 >>>>>>> 3 ADORA2A 135 >>>>>>> 4 ADORA2A 135 >>>>>>> >>>>>>> > select(org.Hs.eg.db, rep("AGT", 4), "ENTREZID", "ALIAS") >>>>>>> ALIAS ENTREZID >>>>>>> 1 AGT 183 >>>>>>> 2 AGT 189 >>>>>>> Warning message: >>>>>>> In .generateExtraRows(tab, keys, jointype) : >>>>>>> 'select' and duplicate query keys resulted in 1:many mapping >>>>>>> between >>>>>>> keys and return rows >>>>>>> >>>>>>> > select(org.Hs.eg.db, "AGT", "ENTREZID", "ALIAS") >>>>>>> ALIAS ENTREZID >>>>>>> 1 AGT 183 >>>>>>> 2 AGT 189 >>>>>>> Warning message: >>>>>>> In .generateExtraRows(tab, keys, jointype) : >>>>>>> 'select' resulted in 1:many mapping between keys and return rows >>>>>>> >>>>>>> >>>>>>> So in the instances where a gene symbol maps to more than one >>>>>>> ENTREZID, >>>>>>> the output gets truncated, whereas if it is a one-to-one mapping, it >>>>>>> does not. >>>>>>> >>>>>>> Best, >>>>>>> >>>>>>> Jim >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 7/25/2013 5:06 PM, Enrico Ferrero wrote: >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> Herv?, that's exactly what I'm trying to say. 
>>>>>>>> >>>>>>>> Attached to this email is a tab delimited file with two columns of >>>>>>>> GeneSymbols (or Aliases), and here is some simple code to reproduce >>>>>>>> the unexpected behaviour: >>>>>>>> >>>>>>>> library(org.Hs.eg.db) >>>>>>>> mydf<- read.table("testdata.txt", sep="\t", header=TRUE, as.is=TRUE) >>>>>>>> mytest<- select(org.Hs.eg.db, key=mydf$GeneSymbol1, keytype="ALIAS", >>>>>>>> cols=c("SYMBOL","ENTREZID","ENSEMBL")) >>>>>>>> # check that mytest has less rows than mydf >>>>>>>> nrow(mydf) >>>>>>>> nrow(mytest) >>>>>>>> # pick a random row: they don't match >>>>>>>> mydf[250,] >>>>>>>> mytest[250,] >>>>>>>> >>>>>>>> Ideally, mytest should have the same number and position of rows of >>>>>>>> mydf so that I can then cbind them. >>>>>>>> If mytest has more rows because of multiple mappings that's also >>>>>>>> fine: >>>>>>>> I can always use merge(mydf, mytest), right? >>>>>>>> >>>>>>>> Thanks a lot to both for your help, it's very appreciated. >>>>>>>> Best, >>>>>>>> >>>>>>>> >>>>>>>> On 25 July 2013 21:32, Hervé Pagès<hpages at="" fhcrc.org=""> wrote: >>>>>>>>> >>>>>>>>> Hi Enrico, >>>>>>>>> >>>>>>>>> >>>>>>>>> On 07/25/2013 01:20 PM, James W. MacDonald wrote: >>>>>>>>>> >>>>>>>>>> Hi Enrico, >>>>>>>>>> >>>>>>>>>> Please don't take things off-list (e.g., use reply-all). >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 7/25/2013 2:17 PM, Enrico Ferrero wrote: >>>>>>>>>>> >>>>>>>>>>> Hi James, >>>>>>>>>>> >>>>>>>>>>> Thanks very much for your help. >>>>>>>>>>> There is an issue that needs to be solved before thinking about >>>>>>>>>>> what's >>>>>>>>>>> the best approach in my opinion. >>>>>>>>>>> >>>>>>>>>>> I don't understand why, but the object created with the call to >>>>>>>>>>> select >>>>>>>>>>> (test in my example, first.two in yours) has a different number of >>>>>>>>>>> rows from the original object (df in my example). Specifically >>>>>>>>>>> it has >>>>>>>>>>> *less* rows. >>>>>>>>> >>>>>>>>> >>>>>>>>> I'm surprised it has less rows. 
It can definitely have more, when >>>>>>>>> some >>>>>>>>> of the keys passed to select() are mapped to more than 1 row, but my >>>>>>>>> understanding was that select() would propagate unmapped keys to the >>>>>>>>> output by placing them in rows stuffed with NAs. So maybe I >>>>>>>>> misunderstood how select() works, or its behavior was changed, or >>>>>>>>> there is a bug somewhere. Could you please send the code that allows >>>>>>>>> us to reproduce this? Thanks. >>>>>>>>> >>>>>>>>> H. >>>>>>>>> >>>>>>>>> >>>>>>>>>> If all symbols were converted to all possible Entrez IDs, >>>>>>>>>>> >>>>>>>>>>> I would expect it to have more rows, not less. To me, it looks >>>>>>>>>>> like >>>>>>>>>>> not all rows are looked up and returned. >>>>>>>>>>> >>>>>>>>>>> Do you see what I mean? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Sure. You could be using outdated gene symbols. Or perhaps you are >>>>>>>>>> using >>>>>>>>>> a mixture of symbols and aliases. Which is even cooler than just >>>>>>>>>> all >>>>>>>>>> symbols: >>>>>>>>>> >>>>>>>>>> > symb<- c(Rkeys(org.Hs.egSYMBOL)[1:10], >>>>>>>>>> Rkeys(org.Hs.egALIAS2EG)[31:45]) >>>>>>>>>> > symb >>>>>>>>>> [1] "A1BG" "A2M" "A2MP1" "NAT1" "NAT2" "AACP" >>>>>>>>>> [7] "SERPINA3" "AADAC" "AAMP" "AANAT" "AAMP" "AANAT" >>>>>>>>>> [13] "DSPS" "SNAT" "AARS" "CMT2N" "AAV" "AAVS1" >>>>>>>>>> [19] "ABAT" "GABA-AT" "GABAT" "NPD009" "ABC-1" "ABC1" >>>>>>>>>> [25] "ABCA1" >>>>>>>>>> > select(org.Hs.eg.db, symb, "ENTREZID","SYMBOL") >>>>>>>>>> SYMBOL ENTREZID >>>>>>>>>> 1 A1BG 1 >>>>>>>>>> 2 A2M 2 >>>>>>>>>> 3 A2MP1 3 >>>>>>>>>> 4 NAT1 9 >>>>>>>>>> 5 NAT2 10 >>>>>>>>>> 6 AACP 11 >>>>>>>>>> 7 SERPINA3 12 >>>>>>>>>> 8 AADAC 13 >>>>>>>>>> 9 AAMP 14 >>>>>>>>>> 10 AANAT 15 >>>>>>>>>> 11 AAMP 14 >>>>>>>>>> 12 AANAT 15 >>>>>>>>>> 13 DSPS<na> >>>>>>>>>> 14 SNAT<na> >>>>>>>>>> 15 AARS 16 >>>>>>>>>> 16 CMT2N<na> >>>>>>>>>> 17 AAV<na> >>>>>>>>>> 18 AAVS1 17 >>>>>>>>>> 19 ABAT 18 >>>>>>>>>> 20 GABA-AT<na> >>>>>>>>>> 21 GABAT<na> >>>>>>>>>> 22 NPD009<na> 
>>>>>>>>>> 23    ABC-1     <NA>
>>>>>>>>>> 24     ABC1     <NA>
>>>>>>>>>> 25    ABCA1       19
>>>>>>>>>> > select(org.Hs.eg.db, symb, "ENTREZID","ALIAS")
>>>>>>>>>>       ALIAS ENTREZID
>>>>>>>>>> 1      A1BG        1
>>>>>>>>>> 2       A2M        2
>>>>>>>>>> 3     A2MP1        3
>>>>>>>>>> 4      NAT1        9
>>>>>>>>>> 5      NAT1     1982
>>>>>>>>>> 6      NAT1     6530
>>>>>>>>>> 7      NAT1    10991
>>>>>>>>>> 8      NAT2       10
>>>>>>>>>> 9      NAT2    81539
>>>>>>>>>> 10     AACP       11
>>>>>>>>>> 11 SERPINA3       12
>>>>>>>>>> 12    AADAC       13
>>>>>>>>>> 13     AAMP       14
>>>>>>>>>> 14    AANAT       15
>>>>>>>>>> 15     DSPS       15
>>>>>>>>>> 16     SNAT       15
>>>>>>>>>> 17     AARS       16
>>>>>>>>>> 18    CMT2N       16
>>>>>>>>>> 19      AAV       17
>>>>>>>>>> 20    AAVS1       17
>>>>>>>>>> 21     ABAT       18
>>>>>>>>>> 22  GABA-AT       18
>>>>>>>>>> 23    GABAT       18
>>>>>>>>>> 24   NPD009       18
>>>>>>>>>> 25    ABC-1       19
>>>>>>>>>> 26     ABC1       19
>>>>>>>>>> 27     ABC1    63897
>>>>>>>>>> 28    ABCA1       19
>>>>>>>>>> Warning message:
>>>>>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>>>>>>   'select' and duplicate query keys resulted in 1:many mapping
>>>>>>>>>>   between keys and return rows
>>>>>>>>>> > mget(c("1982","6530","10991"), org.Hs.egGENENAME)
>>>>>>>>>> $`1982`
>>>>>>>>>> [1] "eukaryotic translation initiation factor 4 gamma, 2"
>>>>>>>>>>
>>>>>>>>>> $`6530`
>>>>>>>>>> [1] "solute carrier family 6 (neurotransmitter transporter,
>>>>>>>>>> noradrenalin), member 2"
>>>>>>>>>>
>>>>>>>>>> $`10991`
>>>>>>>>>> [1] "solute carrier family 38, member 3"
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Jim
>>>>>>>>>>
>>>>>>>>>>> On 25 July 2013 18:17, James W. MacDonald <jmacdon at uw.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Enrico,
>>>>>>>>>>>>
>>>>>>>>>>>> On 7/25/2013 12:56 PM, Enrico Ferrero wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dear James,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks very much for your prompt reply.
>>>>>>>>>>>>> I knew the problem was the for loop and the select function is
>>>>>>>>>>>>> indeed a lot faster than that and works perfectly with toy data.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, this is what happens when I try to use it with real
>>>>>>>>>>>>> data:
>>>>>>>>>>>>>
>>>>>>>>>>>>> > test <- select(org.Hs.eg.db, keys=df$GeneSymbol,
>>>>>>>>>>>>>                  keytype="ALIAS",
>>>>>>>>>>>>>                  cols=c("SYMBOL","ENTREZID","ENSEMBL"))
>>>>>>>>>>>>>
>>>>>>>>>>>>> Warning message:
>>>>>>>>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>>>>>>>>>   'select' and duplicate query keys resulted in 1:many mapping
>>>>>>>>>>>>>   between keys and return rows
>>>>>>>>>>>>>
>>>>>>>>>>>>> which is probably the warning you mentioned.
>>>>>>>>>>>>
>>>>>>>>>>>> That's not the warning I mentioned, but it does point out the
>>>>>>>>>>>> same issue, which is that there is a one to many mapping between
>>>>>>>>>>>> symbol and entrez gene ID.
>>>>>>>>>>>>
>>>>>>>>>>>> So now you have to decide if you want to be naive (or stupid,
>>>>>>>>>>>> depending on your perspective) or not. You could just cover your
>>>>>>>>>>>> eyes and do this:
>>>>>>>>>>>>
>>>>>>>>>>>> first.two <- first.two[!duplicated(first.two$SYMBOL),]
>>>>>>>>>>>>
>>>>>>>>>>>> which will choose for you the first symbol -> gene ID mapping and
>>>>>>>>>>>> nuke the rest. That's nice and quick, but you are making huge
>>>>>>>>>>>> assumptions.
>>>>>>>>>>>>
>>>>>>>>>>>> Or you could decide to be a bit more sophisticated and do
>>>>>>>>>>>> something like
>>>>>>>>>>>>
>>>>>>>>>>>> thelst <- tapply(1:nrow(first.two), first.two$SYMBOL,
>>>>>>>>>>>>                  function(x) first.two[x,])
>>>>>>>>>>>>
>>>>>>>>>>>> At this point you can take a look at e.g., thelst[1:10] to see
>>>>>>>>>>>> what we just did
>>>>>>>>>>>>
>>>>>>>>>>>> thelst <- do.call("rbind", lapply(thelst, function(x)
>>>>>>>>>>>>     c(x[1,1], paste(x[,2], collapse = "|"))))
>>>>>>>>>>>>
>>>>>>>>>>>> and here you can look at head(thelst).
>>>>>>>>>>>>
>>>>>>>>>>>> Then you can check to ensure that the first column of thelst is
>>>>>>>>>>>> identical to the first column of df, and proceed as before.
>>>>>>>>>>>>
>>>>>>>>>>>> But there is still the problem of the multiple mappings. As an
>>>>>>>>>>>> example:
>>>>>>>>>>>>
>>>>>>>>>>>> > thelst[1:5]
>>>>>>>>>>>>
>>>>>>>>>>>> $HBD
>>>>>>>>>>>>      SYMBOL  ENTREZID
>>>>>>>>>>>> 2535    HBD      3045
>>>>>>>>>>>> 2536    HBD 100187828
>>>>>>>>>>>>
>>>>>>>>>>>> $KIR3DL3
>>>>>>>>>>>>        SYMBOL  ENTREZID
>>>>>>>>>>>> 17513 KIR3DL3    115653
>>>>>>>>>>>> 17514 KIR3DL3 100133046
>>>>>>>>>>>>
>>>>>>>>>>>> > mget(as.character(thelst[[1]][,2]), org.Hs.egGENENAME)
>>>>>>>>>>>>
>>>>>>>>>>>> $`3045`
>>>>>>>>>>>> [1] "hemoglobin, delta"
>>>>>>>>>>>>
>>>>>>>>>>>> $`100187828`
>>>>>>>>>>>> [1] "hypophosphatemic bone disease"
>>>>>>>>>>>>
>>>>>>>>>>>> > mget(as.character(thelst[[2]][,2]), org.Hs.egGENENAME)
>>>>>>>>>>>>
>>>>>>>>>>>> $`115653`
>>>>>>>>>>>> [1] "killer cell immunoglobulin-like receptor, three domains,
>>>>>>>>>>>> long cytoplasmic tail, 3"
>>>>>>>>>>>>
>>>>>>>>>>>> $`100133046`
>>>>>>>>>>>> [1] "killer cell immunoglobulin-like receptor three domains long
>>>>>>>>>>>> cytoplasmic tail 3"
>>>>>>>>>>>>
>>>>>>>>>>>> So HBD is the gene symbol for two different genes! If this gene
>>>>>>>>>>>> symbol is in your data, you will now have attributed your data to
>>>>>>>>>>>> two genes that apparently are not remotely similar. If KIR3DL3 is
>>>>>>>>>>>> in your data, then it worked out OK for that gene.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>>
>>>>>>>>>>>> Jim
>>>>>>>>>>>>
>>>>>>>>>>>>> The real problem is that the number of rows is now different for
>>>>>>>>>>>>> the 2 objects:
>>>>>>>>>>>>>
>>>>>>>>>>>>> > nrow(df); nrow(test)
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] 573
>>>>>>>>>>>>> [1] 201
>>>>>>>>>>>>>
>>>>>>>>>>>>> So I obviously can't put the new data into the original df. My
>>>>>>>>>>>>> impression is that when the 1 to many mapping arises, the select
>>>>>>>>>>>>> function exits, with that warning message. As a result, my test
>>>>>>>>>>>>> object is incomplete.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On top of that, and I can't really explain this, the row
>>>>>>>>>>>>> positions are messed up, e.g.
>>>>>>>>>>>>>
>>>>>>>>>>>>> > all.equal(df[100,],test[100,])
>>>>>>>>>>>>>
>>>>>>>>>>>>> returns FALSE.
>>>>>>>>>>>>>
>>>>>>>>>>>>> How can I work around this?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks a lot!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Hervé Pagès
>>>>>>>>>
>>>>>>>>> Program in Computational Biology
>>>>>>>>> Division of Public Health Sciences
>>>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>>>>> P.O. Box 19024
>>>>>>>>> Seattle, WA 98109-1024
>>>>>>>>>
>>>>>>>>> E-mail: hpages at fhcrc.org
>>>>>>>>> Phone: (206) 667-5791
>>>>>>>>> Fax: (206) 667-1319
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
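
For readers following along, the two options James describes in the quoted text (keep the first ID per symbol, or collapse all IDs into one "|"-separated string) can be sketched in base R without a loop. This is only an illustration under stated assumptions: `map` is a made-up stand-in for the data.frame that select() would return, the symbols and IDs are hypothetical, and `EntrezFirst`, `EntrezAll` and `collapse_ids` are names chosen here. Looking values up with match() or by name keeps the result aligned with the rows of `df`, including repeated and unmapped symbols.

```r
# Hypothetical stand-in for the output of select(); not real annotation data.
map <- data.frame(
  SYMBOL   = c("GS1", "GS2", "GS2", "GS4"),
  ENTREZID = c("101", "202", "203", NA),
  stringsAsFactors = FALSE
)

# Toy input: GS1 is repeated and GS3 has no mapping at all.
df <- data.frame(
  GeneSymbol = c("GS2", "GS1", "GS3", "GS1"),
  Value1     = c(3, 2.5, 1, 4),
  stringsAsFactors = FALSE
)

# Option 1: keep only the first ID listed for each symbol.
first_map <- map[!duplicated(map$SYMBOL), ]
df$EntrezFirst <- first_map$ENTREZID[match(df$GeneSymbol, first_map$SYMBOL)]

# Option 2: collapse every ID for a symbol into one "|"-separated string.
collapse_ids <- function(ids) {
  ids <- ids[!is.na(ids)]
  if (length(ids) == 0) NA_character_ else paste(ids, collapse = "|")
}
all_map <- vapply(split(map$ENTREZID, map$SYMBOL), collapse_ids, character(1))
df$EntrezAll <- unname(all_map[df$GeneSymbol])

print(df)
```

Either way `df` keeps its original number and order of rows; the caveat raised in the thread still applies, since option 1 silently picks one of several possible genes.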
Hi Marc,

Thanks for taking the time to explain this thoroughly, it is now clear and my original problem can be considered solved.

For the sake of discussion, I'd just like to add something (which doesn't necessarily need a follow up): at least to me, select() seems to have a rather arbitrary behaviour. I don't completely see the point in returning 4 rows in the 1st scenario then. Why not just return 1 row and have a consistent behaviour all around (remove all NAs and all duplicates in the input)?

Best,

On 29 July 2013 19:00, Marc Carlson <mcarlson at fhcrc.org> wrote:
> On 07/27/2013 06:40 AM, Enrico Ferrero wrote:
>>
>> Hi everybody,
>>
>> Marc, thanks for clarifying things. The behaviour of the select()
>> function is absolutely sensible. Maybe it should be made explicit
>> somewhere in the documentation that, when working with data frames,
>> the user is expected to use the merge() function in conjunction with
>> it. I also agree with Hervé that having options to tweak and customize
>> the output would be an extremely positive thing and a step in the
>> right direction. In addition to a "select" argument, one can also
>> think of a "remove.na.rows" that evaluates to either TRUE or FALSE.
>> But then again, using merge() after select() already deals with these
>> issues quite well.
>>
>> What I think should be investigated more closely at the moment is the
>> unexpected behaviour select() exhibits when one SYMBOL or ALIAS (and
>> potentially other types of ID, I don't know) maps to more than one
>> ENTREZID.
>> As exemplified by James' code below, this causes the output
>> to be truncated, and I highly doubt this is intentional:
>>
>> > select(org.Hs.eg.db, rep("ADORA2A", 4), "ENTREZID", "ALIAS")
>>
>>     ALIAS ENTREZID
>> 1 ADORA2A      135
>> 2 ADORA2A      135
>> 3 ADORA2A      135
>> 4 ADORA2A      135
>>
>> > select(org.Hs.eg.db, rep("AGT", 4), "ENTREZID", "ALIAS")
>>
>>   ALIAS ENTREZID
>> 1   AGT      183
>> 2   AGT      189
>>
>> Warning message:
>> In .generateExtraRows(tab, keys, jointype) :
>>   'select' and duplicate query keys resulted in 1:many mapping between
>>   keys and return rows
>>
>> It would be great to have your views on this.
>> Best,
>
> Hi Enrico,
>
> My view on this is the same one that I presented above. Basically you seem
> to have misunderstood what select is doing in this case. So clearly I need
> to explain things a bit better in the documentation. But what is happening
> is in fact completely intentional, and it happens every time there is at
> least one "many to one" relationship requested by select. The presence of a
> many to one relationship means that select no longer has any chance of
> giving you a data.frame back that has the same height as the length of your
> keys. So instead of attempting to keep your repeated keys and matching them
> perfectly (which is no longer possible), select assumes that you know what
> you are doing and instead it just simplifies the result by removing all
> duplicated rows from the result. This is why your result appears
> "truncated". It's because there really was no point in keeping the initial
> pattern from the keys you passed in (as the data shape makes it impossible
> to do this anyways).
>
> The result you get in your 2nd case is actually the same exact information
> content as if we had tried to duplicate rows to match your repeated input.
> The only actual difference here is that there is no way for select to know
> how you intended to repeat the symbol "AGT" to match the two entrez gene IDs
> to the initial four "AGT" symbols that you passed in. For this example, did
> you want AGT repeated 4 times (with two repeats each of the two entrez gene
> IDs)? Or did you maybe want it repeated 8 times (with 4 repeats of each
> entrez gene ID)? And what should we have done if you had repeated the
> symbol "AGT" 5 times in the input instead? How are we supposed to format
> the output in that case? I hope you can see why in this case we have to
> just give you the data as it is. In this circumstance we just can't guess
> anymore about how you want it presented. So instead of guessing we just
> return all the data "as is" and give you a warning. So it's not actually
> true that the 2nd case you presented is "truncated". It's actually true
> instead that the 1st case data has just been repeated in an effort to make
> your life easier. But when the data is complicated by many to one
> relationships, we just can't know anymore what you will want to do for
> formatting it.
>
> We have tried to be very accommodating with select for people who request
> simple 1:1 relationships because we recognize that this is a common use case
> and we can see a straightforward way to make things easier for that common
> use case. But select is not really meant to be a data formatting function.
> It's really intended to be a data retrieval function. R already has a lot
> of great functions for data formatting already (like merge and the subset
> operators etc.), and these are already more flexible and better suited for
> tasks like that.
>
> Marc
>
>> On 26 July 2013 21:46, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>
>>> Hi Marc,
>>>
>>> On 07/26/2013 12:57 PM, Marc Carlson wrote:
>>> ...
>>>
>>>> Hello everyone,
>>>>
>>>> Sorry that I saw this thread so late.
>>>> Basically, select() does *try* to
>>>> keep your initial keys and map them each to an equivalent number of
>>>> unique values. We did actually anticipate that people would *want* to
>>>> cbind() their results.
>>>>
>>>> But as you discovered there are many circumstances where the data make
>>>> this kind of behavior impossible.
>>>>
>>>> So passing in NAs as keys for example can't ever find anything
>>>> meaningful. Those will simply have to be removed before we can
>>>> proceed. And, it is also impossible to maintain a 1:1 mapping if you
>>>> retrieve fields that have many to one relationships with your initial
>>>> keys (also seen here).
>>>>
>>>> For convenience, when this kind of 1:1 output is already impossible (as
>>>> it is for most of your examples), select will also try to simplify the
>>>> output by removing rows that are identical all the way across etc..
>>>>
>>>> My aim was that select should try to do the most reasonable thing
>>>> possible based on the data we have in each case. The rationale is that
>>>> in the case where there are 1:many mappings, you should not be planning
>>>> to bind those directly onto any other data.frames anyways (as this
>>>> circumstance would require you to call merge() instead). So in that
>>>> case, non-destructive simplification seems beneficial.
>>>
>>> Other tools in our infrastructure use an extra argument to pick-up 1
>>> thing in case of multiple mapping e.g. findOverlaps() has the 'select'
>>> argument with possible values "all", "first", "last", and "arbitrary".
>>> Also nearest() and family have this argument and it accepts similar
>>> values.
>>>
>>> Couldn't select() use a similar approach? The default should be "all"
>>> so the current behavior is preserved but if it's something else then
>>> the returned data.frame should align with the input.
>>>
>>> Thanks,
>>>
>>> H.
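
To make the division of labour in the quoted messages concrete (select() retrieves, merge() formats), here is a rough base R sketch. It is an illustration, not how select() works internally: `map` is a hand-written stand-in for what select() might return for unique keys, reusing the AGT and ADORA2A IDs quoted in this thread, while `FOO`, `row_id` and `first_hit` are names invented here. The last step imitates the "first"-style pick suggested in the quote by keeping one row per input row.

```r
# Toy input: AGT appears twice, FOO is a made-up symbol with no mapping.
df <- data.frame(
  GeneSymbol = c("AGT", "ADORA2A", "AGT", "FOO"),
  Value1     = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Hand-written stand-in for a select() result on unique(df$GeneSymbol);
# the IDs are the ones printed for AGT and ADORA2A in this thread.
map <- data.frame(
  ALIAS    = c("AGT", "AGT", "ADORA2A"),
  ENTREZID = c("183", "189", "135"),
  stringsAsFactors = FALSE
)

# Remember the input order, because merge() sorts on the key by default.
df$row_id <- seq_len(nrow(df))

# Left join: every input row is kept, and 1:many keys expand into several rows.
merged <- merge(df, map, by.x = "GeneSymbol", by.y = "ALIAS", all.x = TRUE)
merged <- merged[order(merged$row_id, merged$ENTREZID), ]
rownames(merged) <- NULL

# "first"-style pick: one row per input row, aligned with df again.
first_hit <- merged[!duplicated(merged$row_id), ]
rownames(first_hit) <- NULL

print(merged)
print(first_hit)
```

`merged` has more rows than `df` because AGT maps to two IDs, whereas `first_hit` has exactly one row per input row and so could be bound back onto the original columns.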
>>>
>>>> I hope this clarifies things,
>>>>
>>>> Marc
>>>>
>>>>>> As I
>>>>>> mentioned in my first post, the for loop function works, but it's
>>>>>> highly inefficient.
>>>>>>
>>>>>> Any help is greatly appreciated, thank you.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> On 25 July 2013 23:18, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>>>>>
>>>>>>> Hi James,
>>>>>>>
>>>>>>> You're right.
>>>>>>>
>>>>>>> It's actually both: NAs *and* duplicated keys that are mapped to
>>>>>>> more than 1 row are removed from the input. I don't think this
>>>>>>> is documented.
>>>>>>>
>>>>>>> I wonder if select() behavior couldn't be a little bit simpler by
>>>>>>> either preserving or removing all duplicated keys, and not just some
>>>>>>> of them (on a somewhat arbitrary criteria).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> H.
>>>>>>>
>>>>>>> On 07/25/2013 02:57 PM, James W. MacDonald wrote:
>>>>>>>>
>>>>>>>> Hi Enrico and Herve,
>>>>>>>>
>>>>>>>> This has to do with duplicate entries, but only when the duplicate
>>>>>>>> entry maps to many ENTREZID:
>>>>>>>>
>>>>>>>> > select(org.Hs.eg.db, rep("ADORA2A", 4), "ENTREZID", "ALIAS")
>>>>>>>>     ALIAS ENTREZID
>>>>>>>> 1 ADORA2A      135
>>>>>>>> 2 ADORA2A      135
>>>>>>>> 3 ADORA2A      135
>>>>>>>> 4 ADORA2A      135
>>>>>>>>
>>>>>>>> > select(org.Hs.eg.db, rep("AGT", 4), "ENTREZID", "ALIAS")
>>>>>>>>   ALIAS ENTREZID
>>>>>>>> 1   AGT      183
>>>>>>>> 2   AGT      189
>>>>>>>> Warning message:
>>>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>>>>   'select' and duplicate query keys resulted in 1:many mapping
>>>>>>>>   between keys and return rows
>>>>>>>>
>>>>>>>> > select(org.Hs.eg.db, "AGT", "ENTREZID", "ALIAS")
>>>>>>>>   ALIAS ENTREZID
>>>>>>>> 1   AGT      183
>>>>>>>> 2   AGT      189
>>>>>>>> Warning message:
>>>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>>>>   'select' resulted in 1:many mapping between keys and return
>>>>>>>>   rows
>>>>>>>>
>>>>>>>> So in the instances where a gene symbol
>>>>>>>> maps to more than one ENTREZID, the output gets truncated, whereas
>>>>>>>> if it is a one-to-one mapping, it does not.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Jim
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> ## check yourself >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> all.equal(df$GeneSymbol, first.two$SYMBOL) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> ## if true, proceed >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> df<- data.frame(first.two, df[,-1]) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Jim >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> and then call the function like this: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> df$EntrezGeneID<- alias2GS(df$GeneSymbol) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This works well, but gets very slow when I need to do >>>>>>>>>>>>>>>> multiple >>>>>>>>>>>>>>>> conversions >>>>>>>>>>>>>>>> on large datasets. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Is there any way I can achieve the same result but in a >>>>>>>>>>>>>>>> quicker, more >>>>>>>>>>>>>>>> efficient way? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thank you. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> James W. MacDonald, M.S. >>>>>>>>>>>>>>> Biostatistician >>>>>>>>>>>>>>> University of Washington >>>>>>>>>>>>>>> Environmental and Occupational Health Sciences >>>>>>>>>>>>>>> 4225 Roosevelt Way NE, # 100 >>>>>>>>>>>>>>> Seattle WA 98105-6099 >>>>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> James W. MacDonald, M.S. >>>>>>>>>>>>> Biostatistician >>>>>>>>>>>>> University of Washington >>>>>>>>>>>>> Environmental and Occupational Health Sciences >>>>>>>>>>>>> 4225 Roosevelt Way NE, # 100 >>>>>>>>>>>>> Seattle WA 98105-6099 >>>>>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Hervé Pagès >>>>>>>>>> >>>>>>>>>> Program in Computational Biology >>>>>>>>>> Division of Public Health Sciences >>>>>>>>>> Fred Hutchinson Cancer Research Center >>>>>>>>>> 1100 Fairview Ave. N, M1-B514 >>>>>>>>>> P.O. 
Box 19024 >>>>>>>>>> Seattle, WA 98109-1024 >>>>>>>>>> >>>>>>>>>> E-mail: hpages at fhcrc.org >>>>>>>>>> Phone: (206) 667-5791 >>>>>>>>>> Fax: (206) 667-1319 >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> -- >>>>>>> Hervé Pagès >>>>>>> >>>>>>> Program in Computational Biology >>>>>>> Division of Public Health Sciences >>>>>>> Fred Hutchinson Cancer Research Center >>>>>>> 1100 Fairview Ave. N, M1-B514 >>>>>>> P.O. Box 19024 >>>>>>> Seattle, WA 98109-1024 >>>>>>> >>>>>>> E-mail: hpages at fhcrc.org >>>>>>> Phone: (206) 667-5791 >>>>>>> Fax: (206) 667-1319 >>>>>> >>>>>> >>>>>> >>>>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages at fhcrc.org >>> Phone: (206) 667-5791 >>> Fax: (206) 667-1319 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> > -- Enrico Ferrero PhD Student Steve Russell Lab - Department of Genetics FlyChip - Cambridge Systems Biology Centre University of Cambridge e.ferrero at gen.cam.ac.uk http://flypress.gen.cam.ac.uk/
Well, when I initially implemented it, that is exactly what it did.

But then others said that they had many lists of IDs and that they wanted any repeated keys to be respected as valid input, so that the output would match up with what was initially asked for. I told them that while I could do that, it would ONLY really work in cases where everything asked for mapped 1:1 with the initial set of keys coming in.

So this special 1st case is just here for convenience. If you call select() and it does not throw any warnings, then you know that you can just cbind() the results onto your starting keys.

But if you see that warning, it means that the result has a different shape than the keys you started with.

Marc

On 07/29/2013 11:32 AM, Enrico Ferrero wrote:
> Hi Marc,
>
> Thanks for taking the time to explain this thoroughly, it is now clear
> and my original problem can be considered solved.
>
> For the sake of discussion, I'd just like to add something (which
> doesn't necessarily need a follow up): at least to me, select() seems
> to have a rather arbitrary behaviour. I don't completely see the point
> in returning 4 rows in the 1st scenario then. Why not just return 1
> row and have a consistent behaviour all around (remove all NAs and all
> duplicates in the input)?
>
> Best,
Hi Marc,

On 07/29/2013 02:29 PM, Marc Carlson wrote:
> Well when I initially implemented it, that is exactly what it did.
>
> But then others said that they had many lists of IDs and that they
> wanted to have any repeated keys respected as a valid input, so that the
> output would match up with what was initially asked for. And I said to
> these people that while I could do that, it would ONLY really work in
> cases where everything asked for mapped 1:1 with the initial set of keys
> coming in.
>
> So this special 1st case is just here for convenience. If you call
> select() and it does not throw any warnings, then you know that you can
> just cbind() the results onto your starting keys.
>
> But if you see that warning it means that your data has a different
> shape than your keys started as.

It's good to have the warning for interactive use. Thanks!

It's not so convenient when select() needs to be integrated in robust
code e.g. in a pipeline. Since the developer of the pipeline cannot
assume that the data.frame will have the same shape as the keys, then
s/he needs to be able to handle the general case, or, if s/he doesn't
want to do that, s/he needs to raise an error in case select() returns
a 1:many mapping. Having the warning doesn't make it easy/natural to
detect this situation programmatically though.

That brings me back to my original proposal of having an extra arg
e.g. 'if.multiple.mapping' that would support "all", "arbitrary",
"none" and now "error". Only "all" and "error" could be supported as
a first step. I think that would already be very useful.

Just throwing some ideas here. I know you're already very busy and I
don't mean to put more things in your plate.

Thanks!
H.

>
> Marc
>
> On 07/29/2013 11:32 AM, Enrico Ferrero wrote:
>> Hi Marc,
>>
>> Thanks for taking the time to explain this thoroughly, it is now clear
>> and my original problem can be considered solved.
>>
>> For the sake of discussion, I'd just like to add something (which
>> doesn't necessarily need a follow up): at least to me, select() seems
>> to have a rather arbitrary behaviour. I don't completely see the point
>> in returning 4 rows in the 1st scenario then. Why not just return 1
>> row and have a consistent behaviour all around (remove all NAs and all
>> duplicates in the input)?
>>
>> Best,
>>
>> On 29 July 2013 19:00, Marc Carlson <mcarlson at fhcrc.org> wrote:
>>> On 07/27/2013 06:40 AM, Enrico Ferrero wrote:
>>>> Hi everybody,
>>>>
>>>> Marc, thanks for clarifying things. The behaviour of the select()
>>>> function is absolutely sensible. Maybe it should be made explicit
>>>> somewhere in the documentation that, when working with data frames,
>>>> the user is expected to use the merge() function in conjunction with
>>>> it. I also agree with Hervé that having options to tweak and customize
>>>> the output would be an extremely positive thing and a step in the
>>>> right direction. In addition to a "select" argument, one can also
>>>> think of a "remove.na.rows" that evaluates to either TRUE or FALSE.
>>>> But then again, using merge() after select() already deals with these
>>>> issues quite well.
>>>>
>>>> What I think should be investigated more closely at the moment is the
>>>> unexpected behaviour select() exhibits when one SYMBOL or ALIAS (and
>>>> potentially other types of ID, I don't know) maps to more than one
>>>> ENTREZID.
>>>> As exemplified by James' code below, this causes the output
>>>> to be truncated, and I highly doubt this is intentional:
>>>>
>>>> > select(org.Hs.eg.db, rep("ADORA2A", 4), "ENTREZID", "ALIAS")
>>>>     ALIAS ENTREZID
>>>> 1 ADORA2A      135
>>>> 2 ADORA2A      135
>>>> 3 ADORA2A      135
>>>> 4 ADORA2A      135
>>>>
>>>> > select(org.Hs.eg.db, rep("AGT", 4), "ENTREZID", "ALIAS")
>>>>   ALIAS ENTREZID
>>>> 1   AGT      183
>>>> 2   AGT      189
>>>>
>>>> Warning message:
>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>   'select' and duplicate query keys resulted in 1:many mapping between
>>>>   keys and return rows
>>>>
>>>> It would be great to have your views on this.
>>>> Best,
>>>
>>> Hi Enrico,
>>>
>>> My view on this is the same one that I presented above. Basically you
>>> seem to have misunderstood what select is doing in this case. So clearly
>>> I need to explain things a bit better in the documentation. But what is
>>> happening is in fact completely intentional, and it happens every time
>>> there is at least one "many to one" relationship requested by select.
>>> The presence of a many to one relationship means that select no longer
>>> has any chance of giving you a data.frame back that has the same height
>>> as the length of your keys. So instead of attempting to keep your
>>> repeated keys and matching them perfectly (which is no longer possible),
>>> select assumes that you know what you are doing and instead it just
>>> simplifies the result by removing all duplicated rows from the result.
>>> This is why your result appears "truncated". It's because there really
>>> was no point in keeping the initial pattern from the keys you passed in
>>> (as the data shape makes it impossible to do this anyways).
>>>
>>> The result you get in your 2nd case is actually the same exact
>>> information content as if we had tried to duplicate rows to match your
>>> repeated input.
>>> The only actual difference here is that there is no way for select to
>>> know how you intended to repeat the symbol "AGT" to match the two entrez
>>> gene IDs to the initial four "AGT" symbols that you passed in. For this
>>> example, did you want AGT repeated 4 times (with two repeats each of the
>>> two entrez gene IDs)? Or did you maybe want it repeated 8 times (with 4
>>> repeats of each entrez gene ID)? And what should we have done if you had
>>> repeated the symbol "AGT" 5 times in the input instead? How are we
>>> supposed to format the output in that case? I hope you can see why in
>>> this case we have to just give you the data as it is. In this
>>> circumstance we just can't guess anymore about how you want it
>>> presented. So instead of guessing we just return all the data "as is"
>>> and give you a warning. So it's not actually true that the 2nd case you
>>> presented is "truncated". It's actually true instead that the 1st case
>>> data has just been repeated in an effort to make your life easier. But
>>> when the data is complicated by many to one relationships, we just can't
>>> know anymore what you will want to do for formatting it.
>>>
>>> We have tried to be very accommodating with select for people who
>>> request simple 1:1 relationships because we recognize that this is a
>>> common use case and we can see a straightforward way to make things
>>> easier for that common use case. But select is not really meant to be a
>>> data formatting function. It's really intended to be a data retrieval
>>> function. R already has a lot of great functions for data formatting
>>> already (like merge and the subset operators etc.), and these are
>>> already more flexible and better suited for tasks like that.
>>>
>>> Marc
>>>
>>>> On 26 July 2013 21:46, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>>> Hi Marc,
>>>>>
>>>>> On 07/26/2013 12:57 PM, Marc Carlson wrote:
>>>>> ...
>>>>>
>>>>>> Hello everyone,
>>>>>>
>>>>>> Sorry that I saw this thread so late. Basically, select() does *try*
>>>>>> to keep your initial keys and map them each to an equivalent number of
>>>>>> unique values. We did actually anticipate that people would *want* to
>>>>>> cbind() their results.
>>>>>>
>>>>>> But as you discovered there are many circumstances where the data make
>>>>>> this kind of behavior impossible.
>>>>>>
>>>>>> So passing in NAs as keys for example can't ever find anything
>>>>>> meaningful. Those will simply have to be removed before we can
>>>>>> proceed. And, it is also impossible to maintain a 1:1 mapping if you
>>>>>> retrieve fields that have many to one relationships with your initial
>>>>>> keys (also seen here).
>>>>>>
>>>>>> For convenience, when this kind of 1:1 output is already impossible
>>>>>> (as it is for most of your examples), select will also try to simplify
>>>>>> the output by removing rows that are identical all the way across etc..
>>>>>>
>>>>>> My aim was that select should try to do the most reasonable thing
>>>>>> possible based on the data we have in each case. The rationale is that
>>>>>> in the case where there are 1:many mappings, you should not be planning
>>>>>> to bind those directly onto any other data.frames anyways (as this
>>>>>> circumstance would require you to call merge() instead). So in that
>>>>>> case, non-destructive simplification seems beneficial.
>>>>>
>>>>> Other tools in our infrastructure use an extra argument to pick up 1
>>>>> thing in case of multiple mapping e.g. findOverlaps() has the 'select'
>>>>> argument with possible values "all", "first", "last", and "arbitrary".
>>>>> Also nearest() and family have this argument and it accepts similar
>>>>> values.
>>>>>
>>>>> Couldn't select() use a similar approach? The default should be "all"
>>>>> so the current behavior is preserved but if it's something else then
>>>>> the returned data.frame should align with the input.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> H.
>>>>>
>>>>>> I hope this clarifies things,
>>>>>>
>>>>>> Marc
>>>>>>
>>>>>>>> As I mentioned in my first post, the for loop function works, but
>>>>>>>> it's highly inefficient.
>>>>>>>>
>>>>>>>> Any help is greatly appreciated, thank you.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> On 25 July 2013 23:18, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>>>>>>> Hi James,
>>>>>>>>>
>>>>>>>>> You're right.
>>>>>>>>>
>>>>>>>>> It's actually both: NAs *and* duplicated keys that are mapped to
>>>>>>>>> more than 1 row are removed from the input. I don't think this
>>>>>>>>> is documented.
>>>>>>>>>
>>>>>>>>> I wonder if select() behavior couldn't be a little bit simpler by
>>>>>>>>> either preserving or removing all duplicated keys, and not just
>>>>>>>>> some of them (on a somewhat arbitrary criteria).
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> H.
>>>>>>>>>
>>>>>>>>> On 07/25/2013 02:57 PM, James W. MacDonald wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Enrico and Herve,
>>>>>>>>>>
>>>>>>>>>> This has to do with duplicate entries, but only when the duplicate
>>>>>>>>>> entry maps to many ENTREZID:
>>>>>>>>>>
>>>>>>>>>> > select(org.Hs.eg.db, rep("ADORA2A", 4), "ENTREZID", "ALIAS")
>>>>>>>>>>     ALIAS ENTREZID
>>>>>>>>>> 1 ADORA2A      135
>>>>>>>>>> 2 ADORA2A      135
>>>>>>>>>> 3 ADORA2A      135
>>>>>>>>>> 4 ADORA2A      135
>>>>>>>>>>
>>>>>>>>>> > select(org.Hs.eg.db, rep("AGT", 4), "ENTREZID", "ALIAS")
>>>>>>>>>>   ALIAS ENTREZID
>>>>>>>>>> 1   AGT      183
>>>>>>>>>> 2   AGT      189
>>>>>>>>>> Warning message:
>>>>>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>>>>>>   'select' and duplicate query keys resulted in 1:many mapping
>>>>>>>>>>   between keys and return rows
>>>>>>>>>>
>>>>>>>>>> > select(org.Hs.eg.db, "AGT", "ENTREZID", "ALIAS")
>>>>>>>>>>   ALIAS ENTREZID
>>>>>>>>>> 1   AGT      183
>>>>>>>>>> 2   AGT      189
>>>>>>>>>> Warning message:
>>>>>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>>>>>>   'select' resulted in 1:many mapping between keys and return rows
>>>>>>>>>>
>>>>>>>>>> So in the instances where a gene symbol maps to more than one
>>>>>>>>>> ENTREZID, the output gets truncated, whereas if it is a one-to-one
>>>>>>>>>> mapping, it does not.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Jim
>>>>>>>>>>
>>>>>>>>>> On 7/25/2013 5:06 PM, Enrico Ferrero wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Hervé, that's exactly what I'm trying to say.
>>>>>>>>>>>
>>>>>>>>>>> Attached to this email is a tab delimited file with two columns of
>>>>>>>>>>> GeneSymbols (or Aliases), and here is some simple code to reproduce
>>>>>>>>>>> the unexpected behaviour:
>>>>>>>>>>>
>>>>>>>>>>> library(org.Hs.eg.db)
>>>>>>>>>>> mydf <- read.table("testdata.txt", sep="\t", header=TRUE, as.is=TRUE)
>>>>>>>>>>> mytest <- select(org.Hs.eg.db, key=mydf$GeneSymbol1, keytype="ALIAS",
>>>>>>>>>>>                  cols=c("SYMBOL","ENTREZID","ENSEMBL"))
>>>>>>>>>>> # check that mytest has less rows than mydf
>>>>>>>>>>> nrow(mydf)
>>>>>>>>>>> nrow(mytest)
>>>>>>>>>>> # pick a random row: they don't match
>>>>>>>>>>> mydf[250,]
>>>>>>>>>>> mytest[250,]
>>>>>>>>>>>
>>>>>>>>>>> Ideally, mytest should have the same number and position of rows of
>>>>>>>>>>> mydf so that I can then cbind them.
>>>>>>>>>>> If mytest has more rows because of multiple mappings that's also
>>>>>>>>>>> fine: I can always use merge(mydf, mytest), right?
>>>>>>>>>>>
>>>>>>>>>>> Thanks a lot to both for your help, it's very appreciated.
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> On 25 July 2013 21:32, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Enrico,
>>>>>>>>>>>>
>>>>>>>>>>>> On 07/25/2013 01:20 PM, James W. MacDonald wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Enrico,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please don't take things off-list (e.g., use reply-all).
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/25/2013 2:17 PM, Enrico Ferrero wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi James,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks very much for your help.
>>>>>>>>>>>>>> There is an issue that needs to be solved before thinking about
>>>>>>>>>>>>>> what's the best approach in my opinion.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't understand why, but the object created with the call to
>>>>>>>>>>>>>> select (test in my example, first.two in yours) has a different
>>>>>>>>>>>>>> number of rows from the original object (df in my example).
>>>>>>>>>>>>>> Specifically it has *less* rows.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm surprised it has less rows. It can definitely have more, when
>>>>>>>>>>>> some of the keys passed to select() are mapped to more than 1 row,
>>>>>>>>>>>> but my understanding was that select() would propagate unmapped
>>>>>>>>>>>> keys to the output by placing them in rows stuffed with NAs. So
>>>>>>>>>>>> maybe I misunderstood how select() works, or its behavior was
>>>>>>>>>>>> changed, or there is a bug somewhere. Could you please send the
>>>>>>>>>>>> code that allows us to reproduce this? Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>> H.
>>>>>>>>>>>>
>>>>>>>>>>>>>> If all symbols were converted to all possible Entrez IDs,
>>>>>>>>>>>>>> I would expect it to have more rows, not less. To me, it looks
>>>>>>>>>>>>>> like not all rows are looked up and returned.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you see what I mean?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sure. You could be using outdated gene symbols. Or perhaps you
>>>>>>>>>>>>> are using a mixture of symbols and aliases.
>>>>>>>>>>>>> Which is even cooler than just all symbols:
>>>>>>>>>>>>>
>>>>>>>>>>>>> > symb <- c(Rkeys(org.Hs.egSYMBOL)[1:10],
>>>>>>>>>>>>>             Rkeys(org.Hs.egALIAS2EG)[31:45])
>>>>>>>>>>>>> > symb
>>>>>>>>>>>>>  [1] "A1BG"     "A2M"      "A2MP1"    "NAT1"     "NAT2"     "AACP"
>>>>>>>>>>>>>  [7] "SERPINA3" "AADAC"    "AAMP"     "AANAT"    "AAMP"     "AANAT"
>>>>>>>>>>>>> [13] "DSPS"     "SNAT"     "AARS"     "CMT2N"    "AAV"      "AAVS1"
>>>>>>>>>>>>> [19] "ABAT"     "GABA-AT"  "GABAT"    "NPD009"   "ABC-1"    "ABC1"
>>>>>>>>>>>>> [25] "ABCA1"
>>>>>>>>>>>>> > select(org.Hs.eg.db, symb, "ENTREZID", "SYMBOL")
>>>>>>>>>>>>>      SYMBOL ENTREZID
>>>>>>>>>>>>> 1      A1BG        1
>>>>>>>>>>>>> 2       A2M        2
>>>>>>>>>>>>> 3     A2MP1        3
>>>>>>>>>>>>> 4      NAT1        9
>>>>>>>>>>>>> 5      NAT2       10
>>>>>>>>>>>>> 6      AACP       11
>>>>>>>>>>>>> 7  SERPINA3       12
>>>>>>>>>>>>> 8     AADAC       13
>>>>>>>>>>>>> 9      AAMP       14
>>>>>>>>>>>>> 10    AANAT       15
>>>>>>>>>>>>> 11     AAMP       14
>>>>>>>>>>>>> 12    AANAT       15
>>>>>>>>>>>>> 13     DSPS     <NA>
>>>>>>>>>>>>> 14     SNAT     <NA>
>>>>>>>>>>>>> 15     AARS       16
>>>>>>>>>>>>> 16    CMT2N     <NA>
>>>>>>>>>>>>> 17      AAV     <NA>
>>>>>>>>>>>>> 18    AAVS1       17
>>>>>>>>>>>>> 19     ABAT       18
>>>>>>>>>>>>> 20  GABA-AT     <NA>
>>>>>>>>>>>>> 21    GABAT     <NA>
>>>>>>>>>>>>> 22   NPD009     <NA>
>>>>>>>>>>>>> 23    ABC-1     <NA>
>>>>>>>>>>>>> 24     ABC1     <NA>
>>>>>>>>>>>>> 25    ABCA1       19
>>>>>>>>>>>>> > select(org.Hs.eg.db, symb, "ENTREZID", "ALIAS")
>>>>>>>>>>>>>       ALIAS ENTREZID
>>>>>>>>>>>>> 1      A1BG        1
>>>>>>>>>>>>> 2       A2M        2
>>>>>>>>>>>>> 3     A2MP1        3
>>>>>>>>>>>>> 4      NAT1        9
>>>>>>>>>>>>> 5      NAT1     1982
>>>>>>>>>>>>> 6      NAT1     6530
>>>>>>>>>>>>> 7      NAT1    10991
>>>>>>>>>>>>> 8      NAT2       10
>>>>>>>>>>>>> 9      NAT2    81539
>>>>>>>>>>>>> 10     AACP       11
>>>>>>>>>>>>> 11 SERPINA3       12
>>>>>>>>>>>>> 12    AADAC       13
>>>>>>>>>>>>> 13     AAMP       14
>>>>>>>>>>>>> 14    AANAT       15
>>>>>>>>>>>>> 15     DSPS       15
>>>>>>>>>>>>> 16     SNAT       15
>>>>>>>>>>>>> 17     AARS       16
>>>>>>>>>>>>> 18    CMT2N       16
>>>>>>>>>>>>> 19      AAV       17
>>>>>>>>>>>>> 20    AAVS1       17
>>>>>>>>>>>>> 21     ABAT       18
>>>>>>>>>>>>> 22  GABA-AT       18
>>>>>>>>>>>>> 23    GABAT       18
>>>>>>>>>>>>> 24   NPD009       18
>>>>>>>>>>>>> 25    ABC-1       19
>>>>>>>>>>>>> 26     ABC1       19
>>>>>>>>>>>>> 27     ABC1    63897
>>>>>>>>>>>>> 28    ABCA1       19
>>>>>>>>>>>>> Warning message:
>>>>>>>>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>>>>>>>>>   'select' and duplicate query keys resulted in 1:many mapping
>>>>>>>>>>>>>   between keys and return rows
>>>>>>>>>>>>> > mget(c("1982","6530","10991"), org.Hs.egGENENAME)
>>>>>>>>>>>>> $`1982`
>>>>>>>>>>>>> [1] "eukaryotic translation initiation factor 4 gamma, 2"
>>>>>>>>>>>>>
>>>>>>>>>>>>> $`6530`
>>>>>>>>>>>>> [1] "solute carrier family 6 (neurotransmitter transporter, noradrenalin), member 2"
>>>>>>>>>>>>>
>>>>>>>>>>>>> $`10991`
>>>>>>>>>>>>> [1] "solute carrier family 38, member 3"
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jim
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 25 July 2013 18:17, James W. MacDonald <jmacdon at uw.edu> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Enrico,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/25/2013 12:56 PM, Enrico Ferrero wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dear James,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks very much for your prompt reply.
>>>>>>>>>>>>>>>> I knew the problem was the for loop and the select function is
>>>>>>>>>>>>>>>> indeed a lot faster than that and works perfectly with toy data.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> However, this is what happens when I try to use it with real
>>>>>>>>>>>>>>>> data:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > test <- select(org.Hs.eg.db, keys=df$GeneSymbol,
>>>>>>>>>>>>>>>>                  keytype="ALIAS",
>>>>>>>>>>>>>>>>                  cols=c("SYMBOL","ENTREZID","ENSEMBL"))
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Warning message:
>>>>>>>>>>>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>>>>>>>>>>>>   'select' and duplicate query keys resulted in 1:many mapping
>>>>>>>>>>>>>>>>   between keys and return rows
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> which is probably the warning you mentioned.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That's not the warning I mentioned, but it does point out the
>>>>>>>>>>>>>>> same issue, which is that there is a one to many mapping between
>>>>>>>>>>>>>>> symbol and entrez gene ID.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So now you have to decide if you want to be naive (or stupid,
>>>>>>>>>>>>>>> depending on your perspective) or not. You could just cover your
>>>>>>>>>>>>>>> eyes and do this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> first.two <- first.two[!duplicated(first.two$SYMBOL),]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> which will choose for you the first symbol -> gene ID mapping
>>>>>>>>>>>>>>> and nuke the rest. That's nice and quick, but you are making
>>>>>>>>>>>>>>> huge assumptions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Or you could decide to be a bit more sophisticated and do
>>>>>>>>>>>>>>> something like
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> thelst <- tapply(1:nrow(first.two), first.two$SYMBOL,
>>>>>>>>>>>>>>>                  function(x) first.two[x,])
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> At this point you can take a look at e.g., thelst[1:10] to see
>>>>>>>>>>>>>>> what we just did
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> thelst <- do.call("rbind", lapply(thelst, function(x)
>>>>>>>>>>>>>>>     c(x[1,1], paste(x[,2], collapse = "|"))))
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and here you can look at head(thelst).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Then you can check to ensure that the first column of thelst is
>>>>>>>>>>>>>>> identical to the first column of df, and proceed as before.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But there is still the problem of the multiple mappings. As an
>>>>>>>>>>>>>>> example:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > thelst[1:5]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> $HBD
>>>>>>>>>>>>>>>      SYMBOL  ENTREZID
>>>>>>>>>>>>>>> 2535    HBD      3045
>>>>>>>>>>>>>>> 2536    HBD 100187828
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> $KIR3DL3
>>>>>>>>>>>>>>>        SYMBOL  ENTREZID
>>>>>>>>>>>>>>> 17513 KIR3DL3    115653
>>>>>>>>>>>>>>> 17514 KIR3DL3 100133046
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > mget(as.character(thelst[[1]][,2]), org.Hs.egGENENAME)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> $`3045`
>>>>>>>>>>>>>>> [1] "hemoglobin, delta"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> $`100187828`
>>>>>>>>>>>>>>> [1] "hypophosphatemic bone disease"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > mget(as.character(thelst[[2]][,2]), org.Hs.egGENENAME)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> $`115653`
>>>>>>>>>>>>>>> [1] "killer cell immunoglobulin-like receptor, three domains, long cytoplasmic tail, 3"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> $`100133046`
>>>>>>>>>>>>>>> [1] "killer cell immunoglobulin-like receptor three domains long cytoplasmic tail 3"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So HBD is the gene symbol for two different genes! If this gene
>>>>>>>>>>>>>>> symbol is in your data, you will now have attributed your data
>>>>>>>>>>>>>>> to two genes that apparently are not remotely similar. If
>>>>>>>>>>>>>>> KIR3DL3 is in your data, then it worked out OK for that gene.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Jim
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The real problem is that the number of rows is now different
>>>>>>>>>>>>>>>> for the 2 objects:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > nrow(df); nrow(test)
>>>>>>>>>>>>>>>> [1] 573
>>>>>>>>>>>>>>>> [1] 201
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So I obviously can't put the new data into the original df. My
>>>>>>>>>>>>>>>> impression is that when the 1 to many mapping arises, the
>>>>>>>>>>>>>>>> select function exits, with that warning message. As a result,
>>>>>>>>>>>>>>>> my test object is incomplete.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On top of that, and I can't really explain this, the row
>>>>>>>>>>>>>>>> positions are messed up, e.g.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > all.equal(df[100,],test[100,])
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> returns FALSE.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> How can I work around this?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks a lot!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 25 July 2013 16:58, James W.
>>>>>>>>>>>>>>>> MacDonald <jmacdon at uw.edu> wrote:
>>>>>>>>>>>>>>>>> [...]

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
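The 'if.multiple.mapping' idea discussed in this post is a proposal, not an existing select() argument. A pipeline could enforce a similar policy itself on whatever two-column mapping table it gets back. The following is only a minimal base-R sketch under that assumption: the function name map_ids, its arguments, and the policy names are hypothetical, and the toy table merely stands in for select() output (the IDs are the ones quoted in the thread).

```r
# Hypothetical helper, NOT part of AnnotationDbi: align a key -> value
# mapping table with the input keys under an explicit multi-mapping policy.
map_ids <- function(keys, map, key_col, val_col,
                    if_multiple = c("error", "first", "all")) {
  if_multiple <- match.arg(if_multiple)
  keys <- as.character(keys)
  # Drop unmapped rows and exact duplicate rows from the mapping table
  map <- unique(map[!is.na(map[[val_col]]), c(key_col, val_col)])
  # Keys that occur more than once in the table, i.e. 1:many mappings
  multi <- unique(map[[key_col]][duplicated(map[[key_col]])])
  hit <- intersect(multi, keys)
  if (if_multiple == "error" && length(hit) > 0) {
    stop("keys with more than one mapping: ", paste(hit, collapse = ", "))
  }
  if (if_multiple == "all") {
    # One row per (input position, mapped value); unmapped keys stay as NA
    out <- merge(data.frame(pos = seq_along(keys), key = keys,
                            stringsAsFactors = FALSE),
                 map, by.x = "key", by.y = key_col, all.x = TRUE)
    out <- out[order(out$pos), ]
    rownames(out) <- NULL
    return(out)
  }
  # "first" (or "error" with no conflicts): one value per key, input order kept
  map[[val_col]][match(keys, map[[key_col]])]
}

# Toy table standing in for select() output
toy <- data.frame(ALIAS = c("ADORA2A", "AGT", "AGT"),
                  ENTREZID = c("135", "183", "189"),
                  stringsAsFactors = FALSE)
keys <- c("AGT", "ADORA2A", NA, "AGT")

first_hit <- map_ids(keys, toy, "ALIAS", "ENTREZID", if_multiple = "first")
all_hits  <- map_ids(keys, toy, "ALIAS", "ENTREZID", if_multiple = "all")
```

With "first" the result has exactly one element per input key, so it can be assigned straight into a data frame column, at the cost of silently picking one of several genes. With "all" the pos column records which input row each mapping belongs to, which is the information needed for a later merge().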
Thomas Girke ★ 1.7k
@thomas-girke-993
Last seen 9 months ago
United States
A very generic and efficient solution to accomplish this in R is usually to make use of a named vector. Here is an example:

## Sample data frame
df <- data.frame(ID=paste("g", 1:10, sep=""), t1=rnorm(10), t2=rnorm(10))
df
    ID          t1          t2
1   g1  0.84906257 -1.10046605
2   g2 -1.29354187 -0.05610518
3   g3  1.00362290 -0.82640813
4   g4  1.61035832 -1.04016446
5   g5  0.23232417 -0.11921920
6   g6 -1.89920999 -1.38235047
7   g7 -0.34786030 -0.16438477
8   g8 -1.28758867 -1.06968997
9   g9 -0.71510804 -3.42711282
10 g10 -0.02800613  0.01825634

## Sample lookup vector for whatever IDs
lookup <- paste("g", sample(21:30), sep="")
names(lookup) <- paste("g", sample(1:10), sep="")
lookup
   g5    g4   g10    g6    g9    g1    g3    g7    g8    g2
"g23" "g30" "g25" "g27" "g29" "g21" "g22" "g24" "g26" "g28"

## Replace column with new IDs in proper order
df[,"ID"] <- lookup[as.character(df$ID)]
df
    ID          t1          t2
1  g21  0.84906257 -1.10046605
2  g28 -1.29354187 -0.05610518
3  g22  1.00362290 -0.82640813
4  g30  1.61035832 -1.04016446
5  g23  0.23232417 -0.11921920
6  g27 -1.89920999 -1.38235047
7  g24 -0.34786030 -0.16438477
8  g26 -1.28758867 -1.06968997
9  g29 -0.71510804 -3.42711282
10 g25 -0.02800613  0.01825634

Thomas

On Thu, Jul 25, 2013 at 10:54:25PM +0000, Enrico Ferrero wrote:
> Hi both,
>
> Thanks for your insights, this is extremely interesting!
>
> While I (kind of) understand why NAs get removed, deliberately
> truncating the output that way is probably not what most people
> expect. It may be worth considering filing a bug report for this?
>
> This also brings me back to my original question: what's the simplest
> and most efficient way to create an exact copy of a column containing
> converted IDs in a data.frame?
>
> I'm surprised there doesn't seem to be an easy ready-to-go solution,
> as I would imagine it is a rather common task to perform. As I
> mentioned in my first post, the for loop function works, but it's
> highly inefficient.
>
> Any help is greatly appreciated, thank you.
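As a rough sketch of how the named-vector lookup in this answer might be combined with the one-to-many issue discussed earlier in the thread: the example below builds the lookup from a small mapping table and keeps the row count and row order of the original data frame, with NA for unmapped or missing symbols. The `map` table is a hand-written stand-in for the output of select() (a few ALIAS/ENTREZID pairs copied from the output quoted in this thread), and the names `map`, `first_id`, `all_ids`, `EntrezFirst` and `EntrezAll` are made up for illustration.

```r
## Toy mapping table standing in for the result of select();
## the rows are copied from output shown earlier in this thread
map <- data.frame(
  ALIAS    = c("A1BG", "NAT2", "NAT2", "ABC1", "ABC1"),
  ENTREZID = c("1", "10", "81539", "19", "63897"),
  stringsAsFactors = FALSE
)

## Data frame to annotate: has a repeated symbol, an unmapped symbol and an NA
df <- data.frame(
  GeneSymbol = c("NAT2", "A1BG", "NOT_A_GENE", NA, "NAT2", "ABC1"),
  Value1     = c(2.5, 3, 1.2, 0.7, 4.1, 0.3),
  stringsAsFactors = FALSE
)

## Option 1: named vector holding only the first ID per alias
## (quick, but it silently drops the other mappings)
keep <- !duplicated(map$ALIAS)
first_id <- setNames(map$ENTREZID[keep], map$ALIAS[keep])
df$EntrezFirst <- unname(first_id[df$GeneSymbol])

## Option 2: named vector holding all IDs per alias, collapsed with "|"
all_ids <- vapply(split(map$ENTREZID, map$ALIAS), paste, character(1),
                  collapse = "|")
df$EntrezAll <- unname(all_ids[df$GeneSymbol])

df
```

Because the lookup vector is indexed by name, each row of df gets exactly one value, so the result can be assigned straight back as a new column without merge() or cbind(). Whether taking the first ID or collapsing all of them is acceptable depends on the data; the ambiguity of symbols and aliases raised above still applies.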