get position information for different gene/transcript IDs

0

Entering edit mode

shirley zhang ★ 1.0k

@shirley-zhang-2038

Last seen 9.6 years ago

Dear list, I have a list of human SNPs and genes/transcripts which are obtained from different studies. I would like to calculate the distance between these SNPs and genes/transcripts on the same chromosome. However, the ID for the gene/transcript across studies is not consistent. For example, the ID could be Gene Symbol/Alias (TRIM5l), RNA nucleotide accession (NM_013387), XM_173221 (its gene symbol is LOC254266, but its record is an obsolete version), Contig22780_RC, HSS00001378, Hs.445401 (UniGene entry Hs.445401), etc. I would like to retrieve the start and end position for those gene/transcript. Can anybody help me how to get the position information based on these different gene/transcript IDs. If I also need to match all of other IDs to gene symbol/entrezID, what kind of database or bioConductor package will allow me to do the mapping as perfect as possible? How about biomaRt package? Thanks

biomaRt biomaRt • 931 views

ADD COMMENT • link updated 13.7 years ago by Marc Carlson ★ 7.2k • written 13.7 years ago by shirley zhang ★ 1.0k

0

Entering edit mode

Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 7.7 years ago

United States

Hi Shirley, You can find this information in an organism package. org.Hs.eg.db for humans (for example). What will be harder will be getting all the information from the mixed bag of IDs that you describe. But you can still retrieve it for each ID type separately like this example for refseq IDs: ## So for refseq you can start by getting the entrez gene ID library(org.Hs.eg.db) a = c("NM_000014", "NM_000015", "foo") gene_id = unlist2(mget(a, revmap(org.Hs.egREFSEQ), ifnotfound=NA)) gene_id = gene_id[!is.na(gene_id)] ## And then once you have the IDs, you can map them to a position chrloc = toTable(org.Hs.egCHRLOC[gene_id]) chrlocend = toTable(org.Hs.egCHRLOCEND[gene_id]) ## And now because we were careful to make sure that we always have an entrez gene ID, ## you can just merge all the results together to make this easier to look at: merge(merge(data.frame(gene_id=gene_id, refseq=names(gene_id)), chrloc), chrlocend) ## This process can then be repeated for other kinds of IDs etc: b = c("Hs.100217", "Hs.100299") gene_id = unlist2(mget(b, revmap(org.Hs.egUNIGENE), ifnotfound=NA)) gene_id = gene_id[!is.na(gene_id)] chrloc = toTable(org.Hs.egCHRLOC[gene_id]) chrlocend = toTable(org.Hs.egCHRLOCEND[gene_id]) ## etc. Does that help? Marc On 08/17/2010 07:26 PM, shirley zhang wrote: > Dear list, > > I have a list of human SNPs and genes/transcripts which are obtained > from different studies. I would like to calculate the distance > between these SNPs and genes/transcripts on the same chromosome. > However, the ID for the gene/transcript across studies is not > consistent. For example, the ID could be Gene Symbol/Alias (TRIM5l), > RNA nucleotide accession (NM_013387), XM_173221 (its gene symbol is > LOC254266, but its record is an obsolete version), Contig22780_RC, > HSS00001378, Hs.445401 (UniGene entry Hs.445401), etc. > > I would like to retrieve the start and end position for those > gene/transcript. Can anybody help me how to get the position > information based on these different gene/transcript IDs. If I also > need to match all of other IDs to gene symbol/entrezID, what kind of > database or bioConductor package will allow me to do the mapping as > perfect as possible? How about biomaRt package? > > Thanks > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 13.7 years ago Marc Carlson ★ 7.2k

0

Entering edit mode

Hi Marc, Thanks a lot for your information. It is very very helpful. May I ask one more question? In my case, the snp_position is based on Human Build 36. ?The package org.Hs.eg.db was updated in October 2009, so can I assume it is on Build 36? Thanks again, Shirely On Wed, Aug 18, 2010 at 12:56 PM, Marc Carlson <mcarlson at="" fhcrc.org=""> wrote: > Hi Shirley, > > You can find this information in an organism package. ?org.Hs.eg.db for > humans (for example). > > What will be harder will be getting all the information from the mixed > bag of IDs that you describe. ?But you can still retrieve it for each ID > type separately like this example for refseq IDs: > > ## So for refseq you can start by getting the entrez gene ID > library(org.Hs.eg.db) > a = c("NM_000014", "NM_000015", "foo") > gene_id = unlist2(mget(a, revmap(org.Hs.egREFSEQ), ifnotfound=NA)) > gene_id = gene_id[!is.na(gene_id)] > > ## And then once you have the IDs, you can map them to a position > chrloc = toTable(org.Hs.egCHRLOC[gene_id]) > chrlocend = toTable(org.Hs.egCHRLOCEND[gene_id]) > > ## And now because we were careful to make sure that we always have an > entrez gene ID, > ## you can just merge all the results together to make this easier to > look at: > merge(merge(data.frame(gene_id=gene_id, refseq=names(gene_id)), chrloc), > chrlocend) > > > ## This process can then be repeated for other kinds of IDs etc: > b = c("Hs.100217", "Hs.100299") > gene_id = unlist2(mget(b, revmap(org.Hs.egUNIGENE), ifnotfound=NA)) > gene_id = gene_id[!is.na(gene_id)] > > chrloc = toTable(org.Hs.egCHRLOC[gene_id]) > chrlocend = toTable(org.Hs.egCHRLOCEND[gene_id]) > ## etc. > > Does that help? > > > ?Marc > > > > > > On 08/17/2010 07:26 PM, shirley zhang wrote: >> Dear list, >> >> I have a list of human SNPs and genes/transcripts which are obtained >> from different studies. ?I would like to calculate the distance >> between these SNPs and genes/transcripts on the same chromosome. >> However, the ID for the gene/transcript across studies ?is not >> consistent. For example, the ID could be Gene Symbol/Alias (TRIM5l), >> RNA nucleotide accession (NM_013387), XM_173221 (its gene symbol is >> LOC254266, but its record is an obsolete version), Contig22780_RC, >> HSS00001378, Hs.445401 (UniGene entry Hs.445401), etc. >> >> I would like to retrieve the start and end position for those >> gene/transcript. Can anybody help me how to get the position >> information based on these different gene/transcript IDs. If I also >> need to match all of other IDs to gene symbol/entrezID, what kind of >> database or bioConductor package will allow me to do the mapping as >> perfect as possible? How about biomaRt package? >> >> Thanks >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Xiaoling (Shirley) Zhang M.D., Ph.D. (Bioinformatics) Boston University, Boston, MA Tel: (857) 233-9862 Email: zhangxl at bu.edu

ADD REPLY • link 13.7 years ago shirley zhang ★ 1.0k

0

Entering edit mode

Hi Shirly, That is a very good question! Looking at the latest builds of the annotations (which is what we normally would recommend), the manual page will now tell you that the source for the latest CHRLOC and CHRLOCEND information is from here: Mappings were based on data provided by: UCSC Genome Bioinformatics (Homo sapiens) ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19 That means that this information corresponds to the more recent "hg19" build which in turn will correspond to "build 37". So if you want older builds you could use older versions of the org packages, but I since I would never in good conscience recommend using stale annotations, I think that in that case you would be much better off to do the initial steps I showed you (to get the entrez gene IDs), and THEN using the GenomicFeatures package, which should allow you to get positions for the hg18 build mapped onto entrez gene IDs. You can see a vignette for GenomicFeatures here: http://www.bioconductor.org/packages/release/bioc/html/GenomicFeatures .html You would basically just need to use makeTranscriptDbFromUCSC(), serialize the result for safekeeping, and then call transcripts() on the object it gave you. Once you had done so, you could simply ask for the library(GenomicFeatures) foo = makeTranscriptDbFromUCSC(genome="hg18",tablename="knownGene") bar = transcriptsBy(foo, by="gene") ## At this point you will have GRanges object which is really useful for a lot of great stuff, ## but since you are only interested in matching the gene_ids to the positions, ## you can just pare it down to a much simpler object like so: frm = as.data.frame(bar) ## The values in "element" will be the names of the elements from the GRangesList object bar. ## If you check them you will see that for this UCSC track, they are the entrez gene IDs that you want to merge with. Hope this helps, Marc On 08/18/2010 06:26 PM, shirley zhang wrote: > Hi Marc, > > Thanks a lot for your information. It is very very helpful. May I ask > one more question? > > In my case, the snp_position is based on Human Build 36. The package > org.Hs.eg.db was updated in October 2009, so can I assume it is on > Build 36? > > Thanks again, > Shirely > > On Wed, Aug 18, 2010 at 12:56 PM, Marc Carlson <mcarlson at="" fhcrc.org=""> wrote: > >> Hi Shirley, >> >> You can find this information in an organism package. org.Hs.eg.db for >> humans (for example). >> >> What will be harder will be getting all the information from the mixed >> bag of IDs that you describe. But you can still retrieve it for each ID >> type separately like this example for refseq IDs: >> >> ## So for refseq you can start by getting the entrez gene ID >> library(org.Hs.eg.db) >> a = c("NM_000014", "NM_000015", "foo") >> gene_id = unlist2(mget(a, revmap(org.Hs.egREFSEQ), ifnotfound=NA)) >> gene_id = gene_id[!is.na(gene_id)] >> >> ## And then once you have the IDs, you can map them to a position >> chrloc = toTable(org.Hs.egCHRLOC[gene_id]) >> chrlocend = toTable(org.Hs.egCHRLOCEND[gene_id]) >> >> ## And now because we were careful to make sure that we always have an >> entrez gene ID, >> ## you can just merge all the results together to make this easier to >> look at: >> merge(merge(data.frame(gene_id=gene_id, refseq=names(gene_id)), chrloc), >> chrlocend) >> >> >> ## This process can then be repeated for other kinds of IDs etc: >> b = c("Hs.100217", "Hs.100299") >> gene_id = unlist2(mget(b, revmap(org.Hs.egUNIGENE), ifnotfound=NA)) >> gene_id = gene_id[!is.na(gene_id)] >> >> chrloc = toTable(org.Hs.egCHRLOC[gene_id]) >> chrlocend = toTable(org.Hs.egCHRLOCEND[gene_id]) >> ## etc. >> >> Does that help? >> >> >> Marc >> >> >> >> >> >> On 08/17/2010 07:26 PM, shirley zhang wrote: >> >>> Dear list, >>> >>> I have a list of human SNPs and genes/transcripts which are obtained >>> from different studies. I would like to calculate the distance >>> between these SNPs and genes/transcripts on the same chromosome. >>> However, the ID for the gene/transcript across studies is not >>> consistent. For example, the ID could be Gene Symbol/Alias (TRIM5l), >>> RNA nucleotide accession (NM_013387), XM_173221 (its gene symbol is >>> LOC254266, but its record is an obsolete version), Contig22780_RC, >>> HSS00001378, Hs.445401 (UniGene entry Hs.445401), etc. >>> >>> I would like to retrieve the start and end position for those >>> gene/transcript. Can anybody help me how to get the position >>> information based on these different gene/transcript IDs. If I also >>> need to match all of other IDs to gene symbol/entrezID, what kind of >>> database or bioConductor package will allow me to do the mapping as >>> perfect as possible? How about biomaRt package? >>> >>> Thanks >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > >

ADD REPLY • link 13.7 years ago Marc Carlson ★ 7.2k

Login before adding your answer.