How to get position for each gene ID/gene symbol instead of position for each transcript
1
1
Entering edit mode
shirley zhang ★ 1.0k
@shirley-zhang-2038
Last seen 9.6 years ago
Hi Marc, Thanks a lot for your detailed reply. It is very helpful. I tried library "GenomicFeatures" and your code. It works great. I did get position for each Transcript. However I am only interested in obtaining position for each gene ID/gene symbol instead of position for each transcript. For example, if one gene has multiple transcripts, I only want the biggest boundary of each gene which includes all of its transcripts. Does GenomicFeatures or other package can do this? Or I have to compute the gene position by combining all of its corresponding transcripts? Thanks again for your suggestions. Shirley On Thu, Aug 19, 2010 at 2:50 PM, Marc Carlson <mcarlson at="" fhcrc.org=""> wrote: > Hi Shirly, > > That is a very good question! ?Looking at the latest builds of the > annotations (which is what we normally would recommend), the manual page > will now tell you that the source for the latest CHRLOC and CHRLOCEND > information is from here: > > ? ? Mappings were based on data provided by: UCSC Genome > ? ? Bioinformatics (Homo sapiens) > ? ? ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19 > > That means that this information corresponds to the more recent "hg19" > build which in turn will correspond to "build 37". ?So if you want older > builds you could use older versions of the org packages, but I since I > would never in good conscience recommend using stale annotations, I > think that in that case you would be much better off to do the initial > steps I showed you (to get the entrez gene IDs), and THEN using the > GenomicFeatures package, which should allow you to get positions for the > hg18 build mapped onto entrez gene IDs. > > You can see a vignette for GenomicFeatures here: > > http://www.bioconductor.org/packages/release/bioc/html/GenomicFeatur es.html > > You would basically just need to use makeTranscriptDbFromUCSC(), > serialize the result for safekeeping, and then call transcripts() on the > object it gave you. ?Once you had done so, you could simply ask for the > > library(GenomicFeatures) > foo = makeTranscriptDbFromUCSC(genome="hg18",tablename="knownGene") > bar = transcriptsBy(foo, by="gene") > ## At this point you will have ?GRanges object which is really useful > for a lot of great stuff, > ## but since you are only interested in matching the gene_ids to the > positions, > ## you can just pare it down to a much simpler object like so: > > frm = as.data.frame(bar) > > ## The values in "element" will be the names of the elements from the > GRangesList object bar. > ## If you check them you will see that for this UCSC track, they are the > entrez gene IDs that you want to merge with. > > Hope this helps, > > > ?Marc > > > > On 08/18/2010 06:26 PM, shirley zhang wrote: >> Hi Marc, >> >> Thanks a lot for your information. It is very very helpful. May I ask >> one more question? >> >> In my case, ?the snp_position is based on Human Build 36. ?The package >> org.Hs.eg.db was updated in October 2009, so can I assume it is on >> Build 36? >> >> Thanks again, >> Shirely >> >> On Wed, Aug 18, 2010 at 12:56 PM, Marc Carlson <mcarlson at="" fhcrc.org=""> wrote: >> >>> Hi Shirley, >>> >>> You can find this information in an organism package. ?org.Hs.eg.db for >>> humans (for example). >>> >>> What will be harder will be getting all the information from the mixed >>> bag of IDs that you describe. ?But you can still retrieve it for each ID >>> type separately like this example for refseq IDs: >>> >>> ## So for refseq you can start by getting the entrez gene ID >>> library(org.Hs.eg.db) >>> a = c("NM_000014", "NM_000015", "foo") >>> gene_id = unlist2(mget(a, revmap(org.Hs.egREFSEQ), ifnotfound=NA)) >>> gene_id = gene_id[!is.na(gene_id)] >>> >>> ## And then once you have the IDs, you can map them to a position >>> chrloc = toTable(org.Hs.egCHRLOC[gene_id]) >>> chrlocend = toTable(org.Hs.egCHRLOCEND[gene_id]) >>> >>> ## And now because we were careful to make sure that we always have an >>> entrez gene ID, >>> ## you can just merge all the results together to make this easier to >>> look at: >>> merge(merge(data.frame(gene_id=gene_id, refseq=names(gene_id)), chrloc), >>> chrlocend) >>> >>> >>> ## This process can then be repeated for other kinds of IDs etc: >>> b = c("Hs.100217", "Hs.100299") >>> gene_id = unlist2(mget(b, revmap(org.Hs.egUNIGENE), ifnotfound=NA)) >>> gene_id = gene_id[!is.na(gene_id)] >>> >>> chrloc = toTable(org.Hs.egCHRLOC[gene_id]) >>> chrlocend = toTable(org.Hs.egCHRLOCEND[gene_id]) >>> ## etc. >>> >>> Does that help? >>> >>> >>> ?Marc >>> >>> >>> >>> >>> >>> On 08/17/2010 07:26 PM, shirley zhang wrote: >>> >>>> Dear list, >>>> >>>> I have a list of human SNPs and genes/transcripts which are obtained >>>> from different studies. ?I would like to calculate the distance >>>> between these SNPs and genes/transcripts on the same chromosome. >>>> However, the ID for the gene/transcript across studies ?is not >>>> consistent. For example, the ID could be Gene Symbol/Alias (TRIM5l), >>>> RNA nucleotide accession (NM_013387), XM_173221 (its gene symbol is >>>> LOC254266, but its record is an obsolete version), Contig22780_RC, >>>> HSS00001378, Hs.445401 (UniGene entry Hs.445401), etc. >>>> >>>> I would like to retrieve the start and end position for those >>>> gene/transcript. Can anybody help me how to get the position >>>> information based on these different gene/transcript IDs. If I also >>>> need to match all of other IDs to gene symbol/entrezID, what kind of >>>> database or bioConductor package will allow me to do the mapping as >>>> perfect as possible? How about biomaRt package? >>>> >>>> Thanks >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at stat.math.ethz.ch >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> >> >> > >
Organism PROcess biomaRt GenomicFeatures Organism PROcess biomaRt GenomicFeatures • 2.1k views
ADD COMMENT
1
Entering edit mode
@steve-lianoglou-2771
Last seen 14 months ago
United States
Hi Shirley, On Tue, Aug 24, 2010 at 9:07 PM, shirley zhang <shirley0818 at="" gmail.com=""> wrote: > Hi Marc, > > Thanks a lot for your detailed reply. It is very helpful. > > I tried library "GenomicFeatures" and your code. It works great. I did > get position for each Transcript. ?However I am only interested in > obtaining position for each gene ID/gene symbol instead of position > for each transcript. For example, if one gene has multiple > transcripts, I only want the biggest boundary of each gene which > includes all of its transcripts. Does GenomicFeatures or other package > can do this? Or I have to compute the gene position by combining all > of its corresponding transcripts? You can do this pretty "simply" with GenomicFeatures, if you want to stick with that: R> txdb <- loadFeatures('your.transcript.db') R> xcripts <- transcriptsBy(txdb, by='gene') ## This part is really slow -- this will be subject of next email R> gene.bounds <- seqapply(xcripts, reduce) the names() of gene.bounds is the entrez.id of the gene. You can use the org.Hs.eg.db pacakges R> library(org.Hs.eg.db) R> symbols <- mget(names(gene.bounds), org.Hs.egSYMBOL, ifnotfound=NA) symbols will now be a list (names are entrez ids, values are the gene symbols) that you can manipulate in "the standard R way" Hope that helps, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD COMMENT
0
Entering edit mode
Sorry: > You can do this pretty "simply" with GenomicFeatures, if you want to > stick with that: > > R> txdb <- loadFeatures('your.transcript.db') > R> xcripts <- transcriptsBy(txdb, by='gene') > > ## This part is really slow -- this will be subject of next email > R> gene.bounds <- seqapply(xcripts, reduce) Should have used `range` instead of `reduce` here: R> gene.bounds <- seqapply(xcripts, range) The rest is the same ... > the names() of gene.bounds is the entrez.id of the gene. You can use > the org.Hs.eg.db pacakges > > R> library(org.Hs.eg.db) > R> symbols <- mget(names(gene.bounds), org.Hs.egSYMBOL, ifnotfound=NA) > > symbols will now be a list (names are entrez ids, values are the gene > symbols) that you can manipulate in "the standard R way" > > Hope that helps, > > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > ?| Memorial Sloan-Kettering Cancer Center > ?| Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD REPLY
0
Entering edit mode
Thanks Steve. It is exactly what I want. Good night! Shirley On Tue, Aug 24, 2010 at 10:41 PM, Steve Lianoglou <mailinglist.honeypot at="" gmail.com=""> wrote: > Sorry: > >> You can do this pretty "simply" with GenomicFeatures, if you want to >> stick with that: >> >> R> txdb <- loadFeatures('your.transcript.db') >> R> xcripts <- transcriptsBy(txdb, by='gene') >> >> ## This part is really slow -- this will be subject of next email >> R> gene.bounds <- seqapply(xcripts, reduce) > > Should have used `range` instead of `reduce` here: > > R> gene.bounds <- seqapply(xcripts, range) > > The rest is the same ... > >> the names() of gene.bounds is the entrez.id of the gene. You can use >> the org.Hs.eg.db pacakges >> >> R> library(org.Hs.eg.db) >> R> symbols <- mget(names(gene.bounds), org.Hs.egSYMBOL, ifnotfound=NA) >> >> symbols will now be a list (names are entrez ids, values are the gene >> symbols) that you can manipulate in "the standard R way" >> >> Hope that helps, >> >> -steve >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> ?| Memorial Sloan-Kettering Cancer Center >> ?| Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> > > > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > ?| Memorial Sloan-Kettering Cancer Center > ?| Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -- Xiaoling (Shirley) Zhang M.D., Ph.D. (Bioinformatics) Boston University, Boston, MA Tel: (857) 233-9862 Email: zhangxl at bu.edu
ADD REPLY

Login before adding your answer.

Traffic: 1058 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6