Glynn, Earl
I have a list of about 20,000 "Accession Reference" IDs and I want to
find the corresponding Gene and GO information.
The IDs that start with "NM_" all seem to work fine as type="refseq",
but others, starting with "S", "AB", or "AF", can be found only as
type="embl". Those starting with "XM" seemingly cannot be found at all.
What information is stored in the prefix of an ID? What do NM_, S, AB,
AF, and XM mean, and where are these prefixes documented?
Does it make sense to have a function that returns the type of an ID?
Does it make sense to have biomaRt functions automatically "know" about
the various kinds of IDs? I don't see how to vectorize any of this when
one must check the type of ID with each call.
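
To make the idea of a "type of an ID" function concrete, here is a minimal sketch that classifies a whole vector of IDs by prefix in one call. guessIdType is a hypothetical helper, not a biomaRt function. It assumes that RefSeq accessions look like two letters, an underscore, and digits (NM_, XM_, NR_, ...), and that GenBank/EMBL/DDBJ nucleotide accessions look like one letter plus five digits or two letters plus six digits; anything else is returned as NA.

```r
# Hypothetical helper (not part of biomaRt): guess the ID type from its shape.
guessIdType <- function(ids) {
  ids  <- toupper(trimws(ids))
  type <- rep(NA_character_, length(ids))

  # Assumed GenBank/EMBL/DDBJ pattern: 1 letter + 5 digits, or 2 letters + 6 digits,
  # optionally followed by a ".version" suffix
  type[grepl("^([A-Z][0-9]{5}|[A-Z]{2}[0-9]{6})(\\.[0-9]+)?$", ids)] <- "embl"

  # Assumed RefSeq pattern: 2 letters, underscore, digits, optional ".version"
  type[grepl("^[A-Z]{2}_[0-9]+(\\.[0-9]+)?$", ids)] <- "refseq"

  type
}

probe.list <- c("NM_001533", "NM_031990", "S76822", "AF232742", "AB035863")

# One character vector of IDs per guessed type
ids.by.type <- split(probe.list, guessIdType(probe.list))
print(ids.by.type)
```

The prefix only suggests which type to try. It does not guarantee that the mart actually holds the ID, which may be what is happening with the XM IDs.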
Below I try "embl" IDs first because after a first pass I know I can
connect only about 3,000 of the 20,000 identifiers as "refseq". Overall,
trying "embl" and then "refseq" matches perhaps 90% of the 20,000 IDs,
but this doesn't seem very "clean", and about 1,000 XM probes were
never matched:
> # Show problem in knowing type of identifier while fetching GO or Gene info
> # using biomaRt. efg, 30 Jan 2006
>
> library(biomaRt)
Loading required package: RMySQL
Loading required package: DBI
Loading required package: XML
Warning message:
DLL attempted to change FPU control word from 8001f to 9001f
> mart <- martConnect()
connected to: ensembl_mart_36
>
> # First five "Accession Reference" IDs from CAMDA06-related probe dataset:
> # http://ecom2.mwgdna.com/download/arrays/arrays/gene_id/xls/gene_id_human_40k_a.xls
> # (discard _N or _NN in IDs)
> probe.list <- c("NM_001533", "NM_031990", "S76822", "AF232742", "AB035863")
>
> GeneInfo.List <- NULL
>
> for (i in 1:length(probe.list))
+ {
+   probe <- probe.list[i]
+
+   # Assume embl ID
+   GOinfo <- getGO(id=probe, type="embl", species="hsapiens", mart=mart)
+   if ( (length(GOinfo@table$GOID) == 1) && is.na(GOinfo@table$GOID[1]) )
+   {
+     # If embl ID fails, try as refseq (perhaps 15% are refseqs with NM_)
+     GOinfo <- getGO(id=probe, type="refseq", species="hsapiens", mart=mart)
+     GeneInfo <- getGene(id=probe, type="refseq", species="hsapiens", mart=mart)
+     cat(i, "refseq", probe, "\n")
+   } else {
+     cat(i, "embl", probe, "\n")
+     GeneInfo <- getGene(id=probe, type="embl", species="hsapiens", mart=mart)
+   }
+
+   GeneInfo.List <- rbind( GeneInfo.List,
+                           c(probe, unlist( GeneInfo@table[c(1,3,4,5,6,7,2)]) ))
+
+   cat(GOinfo@id[1], GOinfo@table$GOID, "\n")
+ }
1 refseq NM_001533
NM_001533 GO:0000166 GO:0003723 GO:0006397 GO:0005654 GO:0030530 GO:0005634
2 refseq NM_031990
NM_031990 GO:0000166 GO:0005515 GO:0008187 GO:0000398 GO:0008380 GO:0005654 GO:0005730 GO:0030530 GO:0003676 GO:0003723 GO:0006397 GO:0005634
3 embl S76822
S76822 GO:0000287 GO:0004310 GO:0016491 GO:0016740 GO:0006695 GO:0008299 GO:0005783 GO:0016021
4 embl AF232742
AF232742 GO:0003807 GO:0004263 GO:0004295 GO:0008233 GO:0006508 GO:0006954 GO:0007596 GO:0042730 GO:0005615
5 embl AB035863
AB035863 GO:0016874 GO:0008152 GO:0004775 GO:0006099 GO:0006104 GO:0006781 GO:0005739
>
> print(GeneInfo.List)
                 symbol   band    chromosome start       end         martID
[1,] "NM_001533" "HNRPL"  "q13.2" "19"       "44018883"  "44032452"  "ENSG00000104824"
[2,] "NM_031990" "PTBP1"  "p13.3" "19"       "748411"    "763327"    "ENSG00000011304"
[3,] "S76822"    "FDFT1"  "p23.1" "8"        "11697664"  "11734215"  "ENSG00000079459"
[4,] "AF232742"  "KLKB1"  "q35.2" "4"        "187523815" "187554773" "ENSG00000164344"
[5,] "AB035863"  "SUCLA2" "q14.2" "13"       "47414793"  "47473463"  "ENSG00000136143"
     description
[1,] "Heterogeneous nuclear ribonucleoprotein L (hnRNP L). [Source:Uniprot/SWISSPROT;Acc:P14866]"
[2,] "Polypyrimidine tract-binding protein 1 (PTB) (Heterogeneous nuclear ribonucleoprotein I) (hnRNP I) (57 kDa RNA-binding protein PPTB-1). [Source:Uniprot/SWISSPROT;Acc:P26599]"
[3,] "Squalene synthetase (EC 2.5.1.21) (SQS) (SS) (Farnesyl-diphosphate farnesyltransferase) (FPP:FPP farnesyltransferase). [Source:Uniprot/SWISSPROT;Acc:P37268]"
[4,] "Plasma kallikrein precursor (EC 3.4.21.34) (Plasma prekallikrein) (Kininogenin) (Fletcher factor) [Contains: Plasma kallikrein heavy chain; Plasma kallikrein light chain]. [Source:Uniprot/SWISSPROT;Acc:P03952]"
[5,] "Succinyl-CoA ligase [ADP-forming] beta-chain, mitochondrial precursor (EC 6.2.1.5) (Succinyl-CoA synthetase, betaA chain) (SCS-betaA) (ATP-specific succinyl-CoA synthetase beta subunit). [Source:Uniprot/SWISSPROT;Acc:Q9P2R7]"
> write.csv(GeneInfo.List, row.names=F, file="GeneInfo.csv")
>
> martDisconnect(mart)
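
For comparison with the per-ID loop above, here is a rough sketch of the same "try embl, then refseq" fallback done once per type instead of once per ID. Both resolveTypes and toyLookup are hypothetical names. The lookup argument stands for any user-supplied function that takes a vector of IDs and a type and returns TRUE where an ID was found; in practice it would wrap a biomaRt query, and I am only assuming such a query can be made to work on a vector. toyLookup just mimics the behaviour described in this post so the sketch runs without a mart connection.

```r
# Hypothetical helper: try each type in turn, only on IDs not yet resolved.
# `lookup(ids, type)` must return a logical vector, TRUE where the ID was found.
resolveTypes <- function(ids, lookup, types = c("embl", "refseq")) {
  resolved <- rep(NA_character_, length(ids))
  for (type in types) {
    todo <- is.na(resolved)
    if (!any(todo)) break
    found <- lookup(ids[todo], type)
    resolved[todo][found] <- type
  }
  resolved
}

# Stand-in for a real query (assumption for illustration only):
# NM_ IDs are "found" as refseq; S, AB, and AF IDs are "found" as embl.
toyLookup <- function(ids, type) {
  if (type == "refseq") grepl("^NM_", ids) else grepl("^(S|AB|AF)[0-9]", ids)
}

probe.list <- c("NM_001533", "NM_031990", "S76822", "AF232742", "AB035863")
id.types <- resolveTypes(probe.list, toyLookup)
print(data.frame(probe = probe.list, type = id.types))
```

IDs that no type can resolve stay NA, so the unmatched probes are easy to list afterwards.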
efg
Bioinformatics
Stowers Institute