How to know in advance what kind of ID one has in using getGO or GetGene with biomaRt?

0

Entering edit mode

Glynn, Earl ▴ 170

@glynn-earl-952

Last seen 9.6 years ago

I have a list of about 20,000 "Accession Reference" IDs and I want to find corresponding Gene and GO information. The IDs that start with "NM_" all seem to work fine as type="refseq", but others, starting with, "S" or "AB" or "AF" can be found only as type="embl". Those with "XM" seemingly cannot be found. What information is stored in the prefix of an ID? What do NM_, S, AB, AF, or XM mean, and where is information about these prefixes? Does it make sense to have a function that returns the type of an ID? Does it make sense to have biomaRt functions automtically "know" about the various kinds of IDs? I don't see how to vectorize any of this when one must check the type of ID with each call. Below I try "embl" IDs first because after a first pass I know I can only connect about 3,000 out of 20,000 identifiers as "refseq". Overall, trying both "embl" and then "refseq" matches perhaps 90% of the dataset of 20,000, but this doesn't seem very "clean", and perhaps about 1,000 XM probes were never matched: > # Show problem in knowing type of identifier while fetching GO or Gene info > # using biomaRt. efg, 30 Jan 2006 > > library(biomaRt) Loading required package: RMySQL Loading required package: DBI Loading required package: XML Warning message: DLL attempted to change FPU control word from 8001f to 9001f > mart <- martConnect() connected to: ensembl_mart_36 > > # First five "Accession Reference" IDs from CAMDA06-related probe dataset: > # http://ecom2.mwgdna.com/download/arrays/arrays/gene_id/xls/gene_id_hum an_40k_a.xls > # (discard _N or _NN in IDs) > probe.list <- c("NM_001533", "NM_031990", "S76822", "AF232742", "AB035863") > > GeneInfo.List <- NULL > > for (i in 1:length(probe.list)) + { + probe <- probe.list[i] + + # Assume embl ID + GOinfo <- getGO(id=probe,type="embl",species="hsapiens",mart=mart) + if ( (length(GOinfo at table$GOID) == 1) & is.na(GOinfo at table$GOID[1]) ) + { + # IF embl ID fails, try as refseq (perhaps 15% refseqs with NM_ + GOinfo <- getGO( id=probe,type="refseq",species="hsapiens",mart=mart) + GeneInfo <- getGene(id=probe,type="refseq",species="hsapiens",mart=mart) + cat(i, "refseq", probe, "\n") + + + } else { + cat(i, "embl", probe, "\n") + GeneInfo <- getGene(id=probe,type="embl",species="hsapiens",mart=mart) + } + + GeneInfo.List <- rbind( GeneInfo.List, + c(probe, + unlist( GeneInfo at table[c(1,3,4,5,6,7,2)]) )) + + cat(GOinfo at id[1], GOinfo at table$GOID, "\n") + } 1 refseq NM_001533 NM_001533 GO:0000166 GO:0003723 GO:0006397 GO:0005654 GO:0030530 GO:0005634 2 refseq NM_031990 NM_031990 GO:0000166 GO:0005515 GO:0008187 GO:0000398 GO:0008380 GO:0005654 GO:0005730 GO:0030530 GO:0003676 GO:0003723 GO:0006397 GO:0005634 3 embl S76822 S76822 GO:0000287 GO:0004310 GO:0016491 GO:0016740 GO:0006695 GO:0008299 GO:0005783 GO:0016021 4 embl AF232742 AF232742 GO:0003807 GO:0004263 GO:0004295 GO:0008233 GO:0006508 GO:0006954 GO:0007596 GO:0042730 GO:0005615 5 embl AB035863 AB035863 GO:0016874 GO:0008152 GO:0004775 GO:0006099 GO:0006104 GO:0006781 GO:0005739 > > print(GeneInfo.List) symbol band chromosome start end martID [1,] "NM_001533" "HNRPL" "q13.2" "19" "44018883" "44032452" "ENSG00000104824" [2,] "NM_031990" "PTBP1" "p13.3" "19" "748411" "763327" "ENSG00000011304" [3,] "S76822" "FDFT1" "p23.1" "8" "11697664" "11734215" "ENSG00000079459" [4,] "AF232742" "KLKB1" "q35.2" "4" "187523815" "187554773" "ENSG00000164344" [5,] "AB035863" "SUCLA2" "q14.2" "13" "47414793" "47473463" "ENSG00000136143" description [1,] "Heterogeneous nuclear ribonucleoprotein L (hnRNP L). [Source:Uniprot/SWISSPROT;Acc:P14866]" [2,] "Polypyrimidine tract-binding protein 1 (PTB) (Heterogeneous nuclear ribonucleoprotein I) (hnRNP I) (57 kDa RNA-binding protein PPTB-1). [Source:Uniprot/SWISSPROT;Acc:P26599]" [3,] "Squalene synthetase (EC 2.5.1.21) (SQS) (SS) (Farnesyl- diphosphate farnesyltransferase) (FPP:FPP farnesyltransferase). [Source:Uniprot/SWISSPROT;Acc:P37268]" [4,] "Plasma kallikrein precursor (EC 3.4.21.34) (Plasma prekallikrein) (Kininogenin) (Fletcher factor) [Contains: Plasma kallikrein heavy chain; Plasma kallikrein light chain]. [Source:Uniprot/SWISSPROT;Acc:P03952]" [5,] "Succinyl-CoA ligase [ADP-forming] beta-chain, mitochondrial precursor (EC 6.2.1.5) (Succinyl-CoA synthetase, betaA chain) (SCS-betaA) (ATP- specific succinyl-CoA synthetase beta subunit). [Source:Uniprot/SWISSPROT;Acc:Q9P2R7]" > write.csv(GeneInfo.List, row.names=F, file="GeneInfo.csv") > > martDisconnect(mart) efg Bioinformatics Stowers Institute

GO probe biomaRt GO probe biomaRt • 1.1k views

ADD COMMENT • link updated 18.2 years ago by Steffen Durinck ▴ 420 • written 18.2 years ago by Glynn, Earl ▴ 170

0

Entering edit mode

Steffen Durinck ▴ 420

@steffen-durinck-519

Last seen 9.6 years ago

Hi, Ideally your microarray provider should have added an extra column to the excel sheet indicating which from database each identifier comes. This is what I found on "XM_*" IDs on the NCBI website: >RefSeq Model (predicted) Sequence Records from the Human Genome annotation >process Two letters (XM, XP, or XR), an underscore bar, and six >digits, e.g.: >XM_000483 Your XM IDs have an extra underscore and digit, removing that and querying at NCBI gives (for example for ID XM_170432_1) XM_170432 Reports gi|20542485|ref|XM_170432.1|[20542485] This record was removed as a result of standard genome annotation processing. See the genome build documentation at http://www.ncbi.nlm.nih.gov/genome/guide/build.html for further information, or contact info at ncbi.nlm.nih.gov. best, Steffen > I have a list of about 20,000 "Accession Reference" IDs and I want to find > corresponding Gene and GO information. > > The IDs that start with "NM_" all seem to work fine as type="refseq", but > others, starting with, "S" or "AB" or "AF" can be found only as > type="embl". > Those with "XM" seemingly cannot be found. > > What information is stored in the prefix of an ID? What do NM_, S, AB, > AF, > or XM mean, and where is information about these prefixes? > > Does it make sense to have a function that returns the type of an ID? > Does > it make sense to have biomaRt functions automtically "know" about the > various kinds of IDs? I don't see how to vectorize any of this when one > must check the type of ID with each call. > > Below I try "embl" IDs first because after a first pass I know I can only > connect about 3,000 out of 20,000 identifiers as "refseq". Overall, > trying > both "embl" and then "refseq" matches perhaps 90% of the dataset of > 20,000, > but this doesn't seem very "clean", and perhaps about 1,000 XM probes were > never matched: > >> # Show problem in knowing type of identifier while fetching GO or Gene > info >> # using biomaRt. efg, 30 Jan 2006 >> >> library(biomaRt) > Loading required package: RMySQL > Loading required package: DBI > Loading required package: XML > Warning message: > DLL attempted to change FPU control word from 8001f to 9001f >> mart <- martConnect() > connected to: ensembl_mart_36 >> >> # First five "Accession Reference" IDs from CAMDA06-related probe >> dataset: >> # > http://ecom2.mwgdna.com/download/arrays/arrays/gene_id/xls/gene_id_h uman_40k_a.xls >> # (discard _N or _NN in IDs) >> probe.list <- c("NM_001533", "NM_031990", "S76822", "AF232742", > "AB035863") >> >> GeneInfo.List <- NULL >> >> for (i in 1:length(probe.list)) > + { > + probe <- probe.list[i] > + > + # Assume embl ID > + GOinfo <- getGO(id=probe,type="embl",species="hsapiens",mart=mart) > + if ( (length(GOinfo at table$GOID) == 1) & is.na(GOinfo at table$GOID[1]) ) > + { > + # IF embl ID fails, try as refseq (perhaps 15% refseqs with NM_ > + GOinfo <- getGO( > id=probe,type="refseq",species="hsapiens",mart=mart) > + GeneInfo <- > getGene(id=probe,type="refseq",species="hsapiens",mart=mart) > + cat(i, "refseq", probe, "\n") > + > + > + } else { > + cat(i, "embl", probe, "\n") > + GeneInfo <- > getGene(id=probe,type="embl",species="hsapiens",mart=mart) > + } > + > + GeneInfo.List <- rbind( GeneInfo.List, > + c(probe, > + unlist( GeneInfo at table[c(1,3,4,5,6,7,2)]) > )) > + > + cat(GOinfo at id[1], GOinfo at table$GOID, "\n") > + } > 1 refseq NM_001533 > NM_001533 GO:0000166 GO:0003723 GO:0006397 GO:0005654 GO:0030530 > GO:0005634 > 2 refseq NM_031990 > NM_031990 GO:0000166 GO:0005515 GO:0008187 GO:0000398 GO:0008380 > GO:0005654 > GO:0005730 GO:0030530 GO:0003676 GO:0003723 GO:0006397 GO:0005634 > 3 embl S76822 > S76822 GO:0000287 GO:0004310 GO:0016491 GO:0016740 GO:0006695 GO:0008299 > GO:0005783 GO:0016021 > 4 embl AF232742 > AF232742 GO:0003807 GO:0004263 GO:0004295 GO:0008233 GO:0006508 GO:0006954 > GO:0007596 GO:0042730 GO:0005615 > 5 embl AB035863 > AB035863 GO:0016874 GO:0008152 GO:0004775 GO:0006099 GO:0006104 GO:0006781 > GO:0005739 >> >> print(GeneInfo.List) > symbol band chromosome start end > martID > [1,] "NM_001533" "HNRPL" "q13.2" "19" "44018883" "44032452" > "ENSG00000104824" > [2,] "NM_031990" "PTBP1" "p13.3" "19" "748411" "763327" > "ENSG00000011304" > [3,] "S76822" "FDFT1" "p23.1" "8" "11697664" "11734215" > "ENSG00000079459" > [4,] "AF232742" "KLKB1" "q35.2" "4" "187523815" "187554773" > "ENSG00000164344" > [5,] "AB035863" "SUCLA2" "q14.2" "13" "47414793" "47473463" > "ENSG00000136143" > description > [1,] "Heterogeneous nuclear ribonucleoprotein L (hnRNP L). > [Source:Uniprot/SWISSPROT;Acc:P14866]" > [2,] "Polypyrimidine tract-binding protein 1 (PTB) (Heterogeneous nuclear > ribonucleoprotein I) (hnRNP I) (57 kDa RNA-binding protein PPTB-1). > [Source:Uniprot/SWISSPROT;Acc:P26599]" > [3,] "Squalene synthetase (EC 2.5.1.21) (SQS) (SS) (Farnesyl- diphosphate > farnesyltransferase) (FPP:FPP farnesyltransferase). > [Source:Uniprot/SWISSPROT;Acc:P37268]" > [4,] "Plasma kallikrein precursor (EC 3.4.21.34) (Plasma prekallikrein) > (Kininogenin) (Fletcher factor) [Contains: Plasma kallikrein heavy chain; > Plasma kallikrein light chain]. [Source:Uniprot/SWISSPROT;Acc:P03952]" > [5,] "Succinyl-CoA ligase [ADP-forming] beta-chain, mitochondrial > precursor > (EC 6.2.1.5) (Succinyl-CoA synthetase, betaA chain) (SCS-betaA) (ATP- > specific succinyl-CoA synthetase beta subunit). > [Source:Uniprot/SWISSPROT;Acc:Q9P2R7]" >> write.csv(GeneInfo.List, row.names=F, file="GeneInfo.csv") >> >> martDisconnect(mart) > > > efg > Bioinformatics > Stowers Institute > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor >

ADD COMMENT • link 18.2 years ago Steffen Durinck ▴ 420

Login before adding your answer.