Question

biomart doesn't annotate all the genes queried -CEACAM6

0

Entering edit mode

Julian Lee ▴ 140

@julian-lee-2487

Last seen 9.6 years ago

Dear all, I'm having some problems trying to annotate some genes, eg. CEACAM6. My problem is as follows, Query input = chromosome numbers 1:22, Output attributes = ensembl_gene_id, unigene, chromosome_number,start_position,end_position,hgnc_symbol ##Code >require(biomaRt) >ensembl<-useMart('ensembl') >ensembl<-useDataset('hsapiens_gene_ensembl',mart=ensembl) >ensembl Object of class 'Mart': Using the ensembl BioMart database Using the hsapiens_gene_ensembl dataset ##Build Attributes of Interest a<-c('ensembl_gene_id','unigene','illumina_v2','affy_hg_u133_plus_2',' hgnc_symbol','chromosome_name','start_position','end_position') > a [1] "ensembl_gene_id" "unigene" "illumina_v2" [4] "affy_hg_u133_plus_2" "hgnc_symbol" "chromosome_name" [7] "start_position" "end_position" ##Retrieving chromosome 1:22 from biomart getBM(attributes=a,filters='chromosome_name',values=1:22,mart=ensembl, verbose=T)->mydataset <query virtualschemaname="default" uniquerows="1" count="0" datasetconfigversion="0.6" requestid="biomaRt"> <dataset name="hsapiens_gene_ensembl"><attribute name="ensembl_gene_id"/><attribute name="unigene"/><attribute name="illumina_v2"/><attribute name="affy_hg_u133_plus_2"/><attribute name="hgnc_symbol"/><attribute name="chromosome_name"/><attribute name="start_position"/><attribute name="end_position"/><filter name="chromosome_name" value="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22"/></dataset></query> ##I can't find CEACAM6 in mydataset >grep('CEACAM6',mydataset$hgnc_symbol) integer(0) > grep(c('203757_s_at','211657_at'),mydataset$affy_hg_u133_plus_2) integer(0) ##and the number of affy probes doesn't match to a u133plus2 chip (54,000 probes) >length(unique(mydataset$affy_hg_u133_plus_2)) [1] 23474 ##However, if i'm looking for CEACAM6 using the affy probes, i can find it, > getBM(attributes=a,filters='affy_hg_u133_plus_2',values=c('203757_s_ at','211657_at'),mart=ensembl) ensembl_gene_id unigene illumina_v2 affy_hg_u133_plus_2 hgnc_symbol 1 ENSG00000086548 Hs.602441 ILMN_21866 203757_s_at CEACAM6 2 ENSG00000086548 Hs.466814 ILMN_21866 203757_s_at CEACAM6 3 ENSG00000086548 Hs.602441 ILMN_21866 211657_at CEACAM6 4 ENSG00000086548 Hs.466814 ILMN_21866 211657_at CEACAM6 chromosome_name start_position end_position 1 19 46951341 46967953 2 19 46951341 46967953 3 19 46951341 46967953 4 19 46951341 46967953 ##End Code I'm not too sure what's going on. Why is it when queried with chromosome numbers, CEACAM6 disappears, but when queried with affy_hg_u133_plus_2 probes, it appears. Any help on this would be great. thanks. regards btw- i couldn't find EGFR. as a control, i managed to identify TP53 ##R Code > grep('EGFR',mydataset$hgnc_symbol) integer(0) > mydataset[grep('TP53',mydataset$hgnc_symbol),'hgnc_symbol'] [1] "TP53I13" "TP53I13" "TP53AP1" "TP53AP1" "TP53AP1" "TP53AP1" [7] "TP53BP1" "TP53BP1" "TP53BP1" "TP53BP1" "TP53BP1" "TP53I3" [13] "TP53I3" "TP53I3" "TP53I3" "TP53I3" "TP53" "TP53" [19] "TP53" "TP53INP2" "TP53INP2" "TP53INP2" "TP53BP2" "TP53BP2" [25] "TP53BP2" "TP53BP2" ##end R Code > sessionInfo() R version 2.7.1 (2008-06-23) i486-pc-linux-gnu locale: LC_CTYPE=en_SG.UTF-8;LC_NUMERIC=C;LC_TIME=en_SG.UTF-8;LC_COLLATE=en_SG .UTF-8;LC_MONETARY=C;LC_MESSAGES=en_SG.UTF-8;LC_PAPER=en_SG.UTF-8;LC_N AME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_SG.UTF-8;LC_IDENTI FICATION=C attached base packages: [1] tools stats graphics grDevices utils datasets methods [8] base other attached packages: [1] hgu133plus2.db_2.2.0 illuminaHumanv2ProbeID.db_1.1.1 [3] AnnotationDbi_1.2.2 RSQLite_0.6-9 [5] DBI_0.2-4 Biobase_2.0.1 [7] biomaRt_1.14.1 RCurl_0.9-4 loaded via a namespace (and not attached): [1] XML_1.96-0 -- Julian Lee Bioinformatics Specialist Cellular and Molecular Research National Cancer Center Singapore

Cancer hgu133plus2 annotate affy biomaRt Cancer hgu133plus2 annotate affy biomaRt • 1.1k views

ADD COMMENT • link 15.6 years ago Julian Lee ▴ 140

score 0 · Answer 1 · 2008-09-16

Hi Julian, You could also use either of the following two standard annotation packages for this: org.Hs.eg.db or hgu133plus2.db. It appears that the only field you are looking for that we don't have yet, is the chromosome end position, and even that should be available within days inside of the latest devel release. This won't help with your biomaRt question (and I will leave this question for the true biomaRt experts), but is always good to have multiple options. ;) Marc Julian Lee wrote: > Dear all, > > I'm having some problems trying to annotate some genes, eg. CEACAM6. > > My problem is as follows, > Query input = chromosome numbers 1:22, > Output attributes = ensembl_gene_id, unigene, chromosome_number,start_position,end_position,hgnc_symbol > > > ##Code > >> require(biomaRt) >> ensembl<-useMart('ensembl') >> ensembl<-useDataset('hsapiens_gene_ensembl',mart=ensembl) >> ensembl >> > Object of class 'Mart': > Using the ensembl BioMart database > Using the hsapiens_gene_ensembl dataset > > > ##Build Attributes of Interest > a<-c('ensembl_gene_id','unigene','illumina_v2','affy_hg_u133_plus_2' ,'hgnc_symbol','chromosome_name','start_position','end_position') > >> a >> > [1] "ensembl_gene_id" "unigene" "illumina_v2" > [4] "affy_hg_u133_plus_2" "hgnc_symbol" "chromosome_name" > [7] "start_position" "end_position" > > ##Retrieving chromosome 1:22 from biomart > getBM(attributes=a,filters='chromosome_name',values=1:22,mart=ensemb l,verbose=T)->mydataset > <query virtualschemaname="default" uniquerows="1" count="0" datasetconfigversion="0.6" requestid="biomaRt"> <dataset name="hsapiens_gene_ensembl"><attribute name="ensembl_gene_id"/><attribute name="unigene"/><attribute name="illumina_v2"/><attribute name="affy_hg_u133_plus_2"/><attribute name="hgnc_symbol"/><attribute name="chromosome_name"/><attribute name="start_position"/><attribute name="end_position"/><filter name="chromosome_name" value="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22"/></dataset></query> > > ##I can't find CEACAM6 in mydataset > >> grep('CEACAM6',mydataset$hgnc_symbol) >> > integer(0) > > >> grep(c('203757_s_at','211657_at'),mydataset$affy_hg_u133_plus_2) >> > integer(0) > > > ##and the number of affy probes doesn't match to a u133plus2 chip (54,000 probes) > >> length(unique(mydataset$affy_hg_u133_plus_2)) >> > [1] 23474 > > ##However, if i'm looking for CEACAM6 using the affy probes, i can find it, > >> getBM(attributes=a,filters='affy_hg_u133_plus_2',values=c('203757_s _at','211657_at'),mart=ensembl) >> > ensembl_gene_id unigene illumina_v2 affy_hg_u133_plus_2 hgnc_symbol > 1 ENSG00000086548 Hs.602441 ILMN_21866 203757_s_at CEACAM6 > 2 ENSG00000086548 Hs.466814 ILMN_21866 203757_s_at CEACAM6 > 3 ENSG00000086548 Hs.602441 ILMN_21866 211657_at CEACAM6 > 4 ENSG00000086548 Hs.466814 ILMN_21866 211657_at CEACAM6 > chromosome_name start_position end_position > 1 19 46951341 46967953 > 2 19 46951341 46967953 > 3 19 46951341 46967953 > 4 19 46951341 46967953 > ##End Code > > I'm not too sure what's going on. Why is it when queried with chromosome numbers, CEACAM6 disappears, but when queried with affy_hg_u133_plus_2 probes, it appears. > > Any help on this would be great. thanks. > > regards > > btw- i couldn't find EGFR. as a control, i managed to identify TP53 > > ##R Code > >> grep('EGFR',mydataset$hgnc_symbol) >> > integer(0) > >> mydataset[grep('TP53',mydataset$hgnc_symbol),'hgnc_symbol'] >> > [1] "TP53I13" "TP53I13" "TP53AP1" "TP53AP1" "TP53AP1" "TP53AP1" > [7] "TP53BP1" "TP53BP1" "TP53BP1" "TP53BP1" "TP53BP1" "TP53I3" > [13] "TP53I3" "TP53I3" "TP53I3" "TP53I3" "TP53" "TP53" > [19] "TP53" "TP53INP2" "TP53INP2" "TP53INP2" "TP53BP2" "TP53BP2" > [25] "TP53BP2" "TP53BP2" > ##end R Code > > > >> sessionInfo() >> > R version 2.7.1 (2008-06-23) > i486-pc-linux-gnu > > locale: > LC_CTYPE=en_SG.UTF-8;LC_NUMERIC=C;LC_TIME=en_SG.UTF-8;LC_COLLATE=en_ SG.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_SG.UTF-8;LC_PAPER=en_SG.UTF-8;LC _NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_SG.UTF-8;LC_IDEN TIFICATION=C > > attached base packages: > [1] tools stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] hgu133plus2.db_2.2.0 illuminaHumanv2ProbeID.db_1.1.1 > [3] AnnotationDbi_1.2.2 RSQLite_0.6-9 > [5] DBI_0.2-4 Biobase_2.0.1 > [7] biomaRt_1.14.1 RCurl_0.9-4 > > loaded via a namespace (and not attached): > [1] XML_1.96-0 > > > > > > > > > >

score 0 · Answer 2 · 2008-09-17

Hi marc, thanks. I'm using the hgu133plus2.db package as a control to the biomaRt package. I think biomaRt has plenty of potential and gives users tremendous power in mapping across different organisms, databases etc. I'm quite certain I'll be using some of its functionality in future. regards julian ----- Original Message ----- From: "Marc Carlson" <mcarlson@fhcrc.org> To: "Julian Lee" <julian at="" omniarray.com=""> Cc: bioconductor at stat.math.ethz.ch Sent: Tuesday, September 16, 2008 11:45:15 PM GMT +08:00 Beijing / Chongqing / Hong Kong / Urumqi Subject: Re: [BioC] biomart doesn't annotate all the genes queried -CEACAM6 Hi Julian, You could also use either of the following two standard annotation packages for this: org.Hs.eg.db or hgu133plus2.db. It appears that the only field you are looking for that we don't have yet, is the chromosome end position, and even that should be available within days inside of the latest devel release. This won't help with your biomaRt question (and I will leave this question for the true biomaRt experts), but is always good to have multiple options. ;) Marc Julian Lee wrote: > Dear all, > > I'm having some problems trying to annotate some genes, eg. CEACAM6. > > My problem is as follows, > Query input = chromosome numbers 1:22, > Output attributes = ensembl_gene_id, unigene, chromosome_number,start_position,end_position,hgnc_symbol > > > ##Code > >> require(biomaRt) >> ensembl<-useMart('ensembl') >> ensembl<-useDataset('hsapiens_gene_ensembl',mart=ensembl) >> ensembl >> > Object of class 'Mart': > Using the ensembl BioMart database > Using the hsapiens_gene_ensembl dataset > > > ##Build Attributes of Interest > a<-c('ensembl_gene_id','unigene','illumina_v2','affy_hg_u133_plus_2' ,'hgnc_symbol','chromosome_name','start_position','end_position') > >> a >> > [1] "ensembl_gene_id" "unigene" "illumina_v2" > [4] "affy_hg_u133_plus_2" "hgnc_symbol" "chromosome_name" > [7] "start_position" "end_position" > > ##Retrieving chromosome 1:22 from biomart > getBM(attributes=a,filters='chromosome_name',values=1:22,mart=ensemb l,verbose=T)->mydataset > <query virtualschemaname="default" uniquerows="1" count="0" datasetconfigversion="0.6" requestid="biomaRt"> <dataset name="hsapiens_gene_ensembl"><attribute name="ensembl_gene_id"/><attribute name="unigene"/><attribute name="illumina_v2"/><attribute name="affy_hg_u133_plus_2"/><attribute name="hgnc_symbol"/><attribute name="chromosome_name"/><attribute name="start_position"/><attribute name="end_position"/><filter name="chromosome_name" value="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22"/></dataset></query> > > ##I can't find CEACAM6 in mydataset > >> grep('CEACAM6',mydataset$hgnc_symbol) >> > integer(0) > > >> grep(c('203757_s_at','211657_at'),mydataset$affy_hg_u133_plus_2) >> > integer(0) > > > ##and the number of affy probes doesn't match to a u133plus2 chip (54,000 probes) > >> length(unique(mydataset$affy_hg_u133_plus_2)) >> > [1] 23474 > > ##However, if i'm looking for CEACAM6 using the affy probes, i can find it, > >> getBM(attributes=a,filters='affy_hg_u133_plus_2',values=c('203757_s _at','211657_at'),mart=ensembl) >> > ensembl_gene_id unigene illumina_v2 affy_hg_u133_plus_2 hgnc_symbol > 1 ENSG00000086548 Hs.602441 ILMN_21866 203757_s_at CEACAM6 > 2 ENSG00000086548 Hs.466814 ILMN_21866 203757_s_at CEACAM6 > 3 ENSG00000086548 Hs.602441 ILMN_21866 211657_at CEACAM6 > 4 ENSG00000086548 Hs.466814 ILMN_21866 211657_at CEACAM6 > chromosome_name start_position end_position > 1 19 46951341 46967953 > 2 19 46951341 46967953 > 3 19 46951341 46967953 > 4 19 46951341 46967953 > ##End Code > > I'm not too sure what's going on. Why is it when queried with chromosome numbers, CEACAM6 disappears, but when queried with affy_hg_u133_plus_2 probes, it appears. > > Any help on this would be great. thanks. > > regards > > btw- i couldn't find EGFR. as a control, i managed to identify TP53 > > ##R Code > >> grep('EGFR',mydataset$hgnc_symbol) >> > integer(0) > >> mydataset[grep('TP53',mydataset$hgnc_symbol),'hgnc_symbol'] >> > [1] "TP53I13" "TP53I13" "TP53AP1" "TP53AP1" "TP53AP1" "TP53AP1" > [7] "TP53BP1" "TP53BP1" "TP53BP1" "TP53BP1" "TP53BP1" "TP53I3" > [13] "TP53I3" "TP53I3" "TP53I3" "TP53I3" "TP53" "TP53" > [19] "TP53" "TP53INP2" "TP53INP2" "TP53INP2" "TP53BP2" "TP53BP2" > [25] "TP53BP2" "TP53BP2" > ##end R Code > > > >> sessionInfo() >> > R version 2.7.1 (2008-06-23) > i486-pc-linux-gnu > > locale: > LC_CTYPE=en_SG.UTF-8;LC_NUMERIC=C;LC_TIME=en_SG.UTF-8;LC_COLLATE=en_ SG.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_SG.UTF-8;LC_PAPER=en_SG.UTF-8;LC _NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_SG.UTF-8;LC_IDEN TIFICATION=C > > attached base packages: > [1] tools stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] hgu133plus2.db_2.2.0 illuminaHumanv2ProbeID.db_1.1.1 > [3] AnnotationDbi_1.2.2 RSQLite_0.6-9 > [5] DBI_0.2-4 Biobase_2.0.1 > [7] biomaRt_1.14.1 RCurl_0.9-4 > > loaded via a namespace (and not attached): > [1] XML_1.96-0 > > > > > > > > > > -- Julian Lee Bioinformatics Specialist Cellular and Molecular Research National Cancer Center Singapore