biomart to a data.frame

Assa Yeroslaviz ★ 1.5k

@assa-yeroslaviz-1597

Germany

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: https://stat.ethz.ch/pipermail/bioconductor/attachments/20120125/40a31f06/attachment.ksh

ADD COMMENT • link updated 14.0 years ago by Steve Lianoglou ★ 13k • written 14.0 years ago by Assa Yeroslaviz ★ 1.5k

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

United States

Hi Assa,

Sorry for top posting.

Your intuition is correct: you should not be querying biomart inside a for loop. The idea is to create one query for all of your protein IDs, and query it once.

This is how you might go about it. First, let's look at the protein IDs you already seem to have somewhere:

> 45  FBpp0070037
> 46  FBpp0070039;FBpp0070040
> 47  FBpp0070041;FBpp0070042;FBpp0070043
> 48  FBpp0070044;FBpp0110571

It seems you have multiple IDs jammed into one column of a data.frame maybe? The rows which have more than one ID (e.g. "FBpp0070039;FBpp0070040") will have to be split up so that each row (or element in a vector) only has one ID. Look into using `strsplit`.

You will need to get a character vector of protein ids -- one protein per bin, it might look like so:

    pids <- c('FBpp0070037', 'FBpp0070039', 'FBpp0070040', 'FBpp0070041',
              'FBpp0070042', 'FBpp0070043')

Now ... you're basically done. Let's rig up an object to query biomart with:

    library(biomaRt)
    mart <- useMart('ensembl', dataset='dmelanogaster_gene_ensembl')
    ans <- getBM(attributes=c("flybase_translation_id", "flybase_gene_id",
                              "flybasename_gene"),
                 filters="flybase_translation_id", values=pids,
                 mart=mart)

Your answer will look like so:

      flybase_translation_id flybase_gene_id flybasename_gene
    1            FBpp0070037     FBgn0010215        alpha-Cat
    2            FBpp0070039     FBgn0052230          CG32230
    3            FBpp0070040     FBgn0052230          CG32230
    4            FBpp0070041     FBgn0000258        CkIIalpha
    5            FBpp0070042     FBgn0000258        CkIIalpha
    6            FBpp0070043     FBgn0000258        CkIIalpha

Now you're left with figuring out what to do with multiple "flybase_translation_id"s that map to the same "flybasename_gene".

You would have to do this anyway, but the key point here is that you can now do it without querying biomart in a loop.

HTH,
-steve

> For each of these protein Ids (FBpp...), I would like to extract the gene
> id (FBgn...) in a third column. The output table should look like this:
>
> 45  FBpp0070037                          FBgn001234
> 46  FBpp0070039;FBpp0070040              FBgn00094432;FBgn002345
> 47  FBpp0070041;FBpp0070042;FBpp0070043  FBgn0001936;FBgn000102;FBgn004527
> 48  FBpp0070044;FBpp0110571              FBgn0097234;FBgn00183
> ...
>
> I was thinking of using biomaRt, but I could not find a way of automating it
> for all the protein ids in a line.
>
> What I have done so far is this for loop:
>
> for(i in 1:dim(data)[1]){
>   temp=unlist(strsplit(data[i,2],";"))
>   temp= gsub("REV__", "", temp)
>   result=getBM(attributes=c("flybase_translation_id","flybase_gene_id","flybasename_gene"),
>                filters="flybase_translation_id",values=temp, mart=mart)
>   charresult =""
>   for (j in 1:length(result[[1]])) {
> #   charresult<-paste(charresult,">", result[[1]][j],":",result[[2]][j], "\t", sep="")
>     charresult<-paste(charresult, result[[2]][j], ";", sep="")
>   }
>   out<-"CompleteResults.txt"
>   cat("line: ", i-1,"\t", "was written to ");cat(out);cat("\n")
>   write.table(paste(i-1, charresult, sep="\t"),out, sep="\t", quote=F,
>               col.names=F, row.names=F,append=T)
> }
>
> What I am doing is converting the string of FBpp Ids into a character
> vector and then running each line through the getBM command. First, I think
> it is a bad idea, as I am using a loop to query an online database, but I
> don't have a better option at the moment.
>
> The second problem is that it just takes a lot of time.
>
> I would appreciate your ideas, if there is a better/faster way of doing it.
>
> Thanks A.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
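
The splitting step suggested in this reply (one ID per element, via `strsplit`) might be sketched as follows. This is only an illustration under assumptions: the table name `tab`, its column names, and the `REV__` prefix on the second row are hypothetical stand-ins for the real data; the prefix handling mirrors the `gsub("REV__", "", ...)` call in the quoted loop.

```r
# Hypothetical stand-in for the real table: one row per line, ids joined by ";"
tab <- data.frame(
  Id   = 45:48,
  FBpp = c("FBpp0070037",
           "REV__FBpp0070039;FBpp0070040",
           "FBpp0070041;FBpp0070042;FBpp0070043",
           "FBpp0070044;FBpp0110571"),
  stringsAsFactors = FALSE
)

# Split every row on ";" and flatten to a single character vector
pids <- unlist(strsplit(tab$FBpp, ";", fixed = TRUE))

# Strip the "REV__" prefix, as the original loop does, and drop duplicates
pids <- unique(sub("REV__", "", pids, fixed = TRUE))

print(pids)
```

The resulting `pids` vector is what would be passed once as `values=pids` to `getBM`, instead of one call per row.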

ADD COMMENT • link 14.0 years ago Steve Lianoglou ★ 13k

Hi Steve,

thanks for the help.

I know about the strsplit function, and I used it to split each row on its own by the ';' symbol. The problem I have is that I need to keep the information of each row in the row (or at least to put it back after the biomaRt extraction).

The table I have contains not only the protein IDs but also a lot of other stuff, which is connected to each of the proteins. This is why I need to know which proteins came from which line (Id).

It would be nice if it were possible to do it as you suggested: take all the protein IDs, write them into one vector and run them through biomaRt. But then I would like to be able to put them back together in a row-wise fashion, like I suggested at the beginning.

Thanks again
Assa

ADD REPLY • link 14.0 years ago Assa Yeroslaviz ★ 1.5k

Hi Assa,

you can try this:

    con <- textConnection(data2seperate)
    # split on ";"; fill=TRUE pads rows that have fewer ids than the widest row
    seperatedData <- read.table(con, sep=";", stringsAsFactors=FALSE, fill=TRUE)

It's nearly the same as the strsplit function, but you get a table as output, sorted by your input.

I hope this helps.

Best
Basti
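
A small offline sketch of the `read.table` idea on the four example rows from the thread. Two things here are assumptions rather than part of the suggestion above: `fill = TRUE` is passed so that rows with fewer ids are padded instead of raising an error, and the column count is taken from `count.fields()` because `read.table` otherwise infers it from the first five lines only.

```r
# The four example rows from the thread, one string per table row
data2seperate <- c("FBpp0070037",
                   "FBpp0070039;FBpp0070040",
                   "FBpp0070041;FBpp0070042;FBpp0070043",
                   "FBpp0070044;FBpp0110571")

# Widest row decides how many columns are needed
con <- textConnection(data2seperate)
n <- max(count.fields(con, sep = ";"))
close(con)

# One column per id position; short rows are padded
seperatedData <- read.table(text = data2seperate, sep = ";",
                            stringsAsFactors = FALSE, fill = TRUE,
                            col.names = paste0("V", seq_len(n)))

print(dim(seperatedData))
```

Row `i` of `seperatedData` corresponds to row `i` of the input, so the row association is kept, at the cost of a wide table with padded cells.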

ADD REPLY • link 14.0 years ago Sebastian Thieme ▴ 60

On 01/26/2012 08:28 AM, Assa Yeroslaviz wrote:
> It will be nice if there was a possibility to do it as you suggested. Take
> all the Protein IDs, write them into one vector and run them with biomaRt.
> But than I would like to be able to put them back together in a row-wise
> fashion like I suggested at the beginning.

Hi

Please allow me to jump in:

If I understand your question correctly, then there is no other (easy) solution than querying biomart inside a loop.

The problem is not the Bioconductor package biomaRt, but the actual biomart server behind the scenes: apparently there is no way to preserve the order of the input (or keep duplicates, or indicate which id does not have a result, etc).

I recently asked the biomart folks about this issue, and the answer was that I need to post-process the output to get my original order back - I was lazy, and I queried the server in a loop (in my defense: it was only a handful of ids).

Regards, Hans
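
The post-processing mentioned here (getting the original order back, keeping duplicates, and flagging ids without a result) can be sketched with base R's `match()`. The `ans` data.frame below is a hand-written stand-in for a `getBM` result, not real server output, and the id `FBpp9999999` is a made-up example of an id with no hit.

```r
# Query vector: contains a duplicate and one (hypothetical) id with no result
pids <- c("FBpp0070037", "FBpp0070039", "FBpp0070040",
          "FBpp0070037", "FBpp9999999")

# Stand-in for the getBM() result: different order, no duplicates, no misses
ans <- data.frame(
  flybase_translation_id = c("FBpp0070040", "FBpp0070037", "FBpp0070039"),
  flybase_gene_id        = c("FBgn0052230", "FBgn0010215", "FBgn0052230"),
  stringsAsFactors = FALSE
)

# Position of each queried id in the result; NA where there was no hit
idx <- match(pids, ans$flybase_translation_id)

# One row per queried id, in query order
ordered <- data.frame(query           = pids,
                      flybase_gene_id = ans$flybase_gene_id[idx],
                      stringsAsFactors = FALSE)
print(ordered)
```

`match()` only returns the first hit per id, so if one protein id can map to several result rows, `merge()` on the id column would be the safer choice.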

ADD REPLY • link 14.0 years ago Hotz, Hans-Rudolf ▴ 400

On 01/26/2012 01:24 AM, Hans-Rudolf Hotz wrote:
> If I understand your question correctly, then there is no other (easy)
> solution than querying biomart inside a loop.
>
> The problem is not the Bioconductor package biomaRt, but the actual
> biomart server behind the scenes: apparently there is no way to preserve
> the order of the input (or keep duplicates, or indicate which id does
> not have a result, etc).

If the original ids are in a data.frame

    df <- data.frame(FBpp=c("FBpp0070037", "FBpp0070039;FBpp0070040",
                            "FBpp0070041;FBpp0070042;FBpp0070043",
                            "FBpp0070044;FBpp0110571"),
                     stringsAsFactors=FALSE)

and the 'split' ids are

    ids <- strsplit(df$FBpp, ";")

then 'map' relates the ids to the row they come from:

    map <- rep(seq_len(nrow(df)), sapply(ids, length))
    names(map) <- unlist(ids)

so after querying biomaRt

    library(biomaRt)
    mart <- useMart('ensembl', dataset='dmelanogaster_gene_ensembl')
    ans <- getBM(attributes=c("flybase_translation_id",
                              "flybase_gene_id", "flybasename_gene"),
                 filters="flybase_translation_id",
                 values=names(map), mart=mart)

and writing a little helper function to 'unsplit' a character vector x into 'collapsed' strings based on a factor f

    strunsplit <- function(x, f, collapse=";") {
        sapply(split(x, f), paste, collapse=collapse)
    }

the original data.frame can be updated as

    FBgn <- strunsplit(ans$flybase_gene_id, map[ans$flybase_translation_id])
    # start with NA so that rows without any result keep a value
    df$FBgn <- NA_character_
    df$FBgn[as.integer(names(FBgn))] <- FBgn

I guess the contortions occur because of the original data.frame. A different representation with the same information, assuming 'Id' is the criterion for joining the FBpp ids in the first place, is

    > df1 <- data.frame(Id=map, FBpp=names(map), row.names=NULL)
    > df1
      Id        FBpp
    1  1 FBpp0070037
    2  2 FBpp0070039
    3  2 FBpp0070040
    ...

Martin

--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
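
The row-wise reassembly described in this reply can be tried without a network connection. In the sketch below, `ans` is a hand-written stand-in for the `getBM` result (gene ids taken from the example output earlier in the thread, rows deliberately shuffled, and no hits for the ids of the last row); the reordering with `order(match(...))` and the `NA_character_` initialisation are additions for illustration, not part of the original recipe.

```r
df <- data.frame(FBpp = c("FBpp0070037", "FBpp0070039;FBpp0070040",
                          "FBpp0070041;FBpp0070042;FBpp0070043",
                          "FBpp0070044;FBpp0110571"),
                 stringsAsFactors = FALSE)

# map: row index of df for every individual protein id
ids <- strsplit(df$FBpp, ";")
map <- rep(seq_len(nrow(df)), sapply(ids, length))
names(map) <- unlist(ids)

# Stand-in for getBM() output: shuffled, and nothing returned for row 4
ans <- data.frame(
  flybase_translation_id = c("FBpp0070043", "FBpp0070037", "FBpp0070040",
                             "FBpp0070039", "FBpp0070041", "FBpp0070042"),
  flybase_gene_id        = c("FBgn0000258", "FBgn0010215", "FBgn0052230",
                             "FBgn0052230", "FBgn0000258", "FBgn0000258"),
  stringsAsFactors = FALSE
)

strunsplit <- function(x, f, collapse = ";") {
  sapply(split(x, f), paste, collapse = collapse)
}

# Put the result rows back into the order of the input ids
ans <- ans[order(match(ans$flybase_translation_id, names(map))), ]

# Collapse gene ids per original row and write them back
FBgn <- strunsplit(ans$flybase_gene_id, map[ans$flybase_translation_id])
df$FBgn <- NA_character_
df$FBgn[as.integer(names(FBgn))] <- FBgn
print(df)
```

Looking up `map` by name returns the first matching element, so this sketch assumes each protein id occurs in only one row of `df`.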

ADD REPLY • link 14.0 years ago Martin Morgan 25k
