Question: biomart doesn't annotate all the genes queried -CEACAM6
10.0 years ago, Julian Lee wrote:
Dear all,

I'm having some problems trying to annotate some genes, e.g. CEACAM6.

My problem is as follows:
Query input = chromosome numbers 1:22
Output attributes = ensembl_gene_id, unigene, chromosome_name, start_position, end_position, hgnc_symbol

##Code
> require(biomaRt)
> ensembl<-useMart('ensembl')
> ensembl<-useDataset('hsapiens_gene_ensembl',mart=ensembl)
> ensembl
Object of class 'Mart':
 Using the ensembl BioMart database
 Using the hsapiens_gene_ensembl dataset

##Build Attributes of Interest
> a<-c('ensembl_gene_id','unigene','illumina_v2','affy_hg_u133_plus_2','hgnc_symbol','chromosome_name','start_position','end_position')
> a
[1] "ensembl_gene_id"     "unigene"             "illumina_v2"
[4] "affy_hg_u133_plus_2" "hgnc_symbol"         "chromosome_name"
[7] "start_position"      "end_position"

##Retrieving chromosome 1:22 from biomart
> getBM(attributes=a,filters='chromosome_name',values=1:22,mart=ensembl,verbose=T)->mydataset
<query virtualschemaname="default" uniquerows="1" count="0" datasetconfigversion="0.6" requestid="biomaRt">
<dataset name="hsapiens_gene_ensembl">
<attribute name="ensembl_gene_id"/><attribute name="unigene"/><attribute name="illumina_v2"/><attribute name="affy_hg_u133_plus_2"/>
<attribute name="hgnc_symbol"/><attribute name="chromosome_name"/><attribute name="start_position"/><attribute name="end_position"/>
<filter name="chromosome_name" value="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22"/>
</dataset></query>

##I can't find CEACAM6 in mydataset
> grep('CEACAM6',mydataset$hgnc_symbol)
integer(0)
> grep(c('203757_s_at','211657_at'),mydataset$affy_hg_u133_plus_2)
integer(0)

##and the number of affy probes doesn't match a u133plus2 chip (54,000 probes)
> length(unique(mydataset$affy_hg_u133_plus_2))
[1] 23474

##However, if I look for CEACAM6 using the affy probes, I can find it
> getBM(attributes=a,filters='affy_hg_u133_plus_2',values=c('203757_s_at','211657_at'),mart=ensembl)
  ensembl_gene_id   unigene illumina_v2 affy_hg_u133_plus_2 hgnc_symbol
1 ENSG00000086548 Hs.602441  ILMN_21866         203757_s_at     CEACAM6
2 ENSG00000086548 Hs.466814  ILMN_21866         203757_s_at     CEACAM6
3 ENSG00000086548 Hs.602441  ILMN_21866           211657_at     CEACAM6
4 ENSG00000086548 Hs.466814  ILMN_21866           211657_at     CEACAM6
  chromosome_name start_position end_position
1              19       46951341     46967953
2              19       46951341     46967953
3              19       46951341     46967953
4              19       46951341     46967953
##End Code

I'm not sure what's going on. Why does CEACAM6 disappear when I query by chromosome number, but appear when I query with affy_hg_u133_plus_2 probes?

Any help on this would be great. Thanks.

regards

btw, I couldn't find EGFR either. As a control, I managed to identify TP53.

##R Code
> grep('EGFR',mydataset$hgnc_symbol)
integer(0)
> mydataset[grep('TP53',mydataset$hgnc_symbol),'hgnc_symbol']
 [1] "TP53I13"  "TP53I13"  "TP53AP1"  "TP53AP1"  "TP53AP1"  "TP53AP1"
 [7] "TP53BP1"  "TP53BP1"  "TP53BP1"  "TP53BP1"  "TP53BP1"  "TP53I3"
[13] "TP53I3"   "TP53I3"   "TP53I3"   "TP53I3"   "TP53"     "TP53"
[19] "TP53"     "TP53INP2" "TP53INP2" "TP53INP2" "TP53BP2"  "TP53BP2"
[25] "TP53BP2"  "TP53BP2"
##end R Code

> sessionInfo()
R version 2.7.1 (2008-06-23)
i486-pc-linux-gnu

locale:
LC_CTYPE=en_SG.UTF-8;LC_NUMERIC=C;LC_TIME=en_SG.UTF-8;LC_COLLATE=en_SG.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_SG.UTF-8;LC_PAPER=en_SG.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_SG.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] hgu133plus2.db_2.2.0            illuminaHumanv2ProbeID.db_1.1.1
[3] AnnotationDbi_1.2.2             RSQLite_0.6-9
[5] DBI_0.2-4                       Biobase_2.0.1
[7] biomaRt_1.14.1                  RCurl_0.9-4

loaded via a namespace (and not attached):
[1] XML_1.96-0

--
Julian Lee
Bioinformatics Specialist
Cellular and Molecular Research
National Cancer Center Singapore
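A side note on the membership checks in the post: grep() does regular-expression (partial) matching, which is why the TP53 search also returns TP53I13, TP53BP1 and so on, and when it is given a vector of patterns it uses only the first element (with a warning). This does not by itself explain the missing CEACAM6 rows, but exact matching with %in% makes the check easier to trust. The sketch below is a minimal illustration on a toy data frame; `toy` is a hypothetical stand-in for `mydataset`, with values borrowed from the output shown in the post.

```r
# Toy stand-in for `mydataset` (hypothetical; values borrowed from the post).
toy <- data.frame(
  affy_hg_u133_plus_2 = c("203757_s_at", "211657_at", NA, NA),
  hgnc_symbol         = c("CEACAM6", "CEACAM6", "TP53", "TP53I13"),
  stringsAsFactors    = FALSE
)

probes <- c("203757_s_at", "211657_at")

# %in% tests every probe ID for exact membership;
# grep(probes, ...) would silently fall back to probes[1] only.
probe_hits <- which(toy$affy_hg_u133_plus_2 %in% probes)

# Exact symbol match: keeps TP53 but not TP53I13.
tp53_exact <- toy$hgnc_symbol[toy$hgnc_symbol %in% "TP53"]

# Partial (regex) match as used in the post: also picks up TP53I13.
tp53_partial <- toy$hgnc_symbol[grep("TP53", toy$hgnc_symbol)]
```

On the real data the same idea would read `which(mydataset$affy_hg_u133_plus_2 %in% probes)` and `mydataset$hgnc_symbol %in% c("CEACAM6", "EGFR")`.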
10.0 years ago, Marc Carlson (United States) wrote:
Hi Julian,

You could also use either of the following two standard annotation packages for this: org.Hs.eg.db or hgu133plus2.db. It appears that the only field you are looking for that we don't have yet is the chromosome end position, and even that should be available within days in the latest devel release. This won't help with your biomaRt question (I will leave that to the true biomaRt experts), but it is always good to have multiple options. ;)

Marc
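For readers who want to try the annotation-package route mentioned above, here is a minimal sketch. It assumes a current Bioconductor installation in which AnnotationDbi provides select(); the 2008 package versions listed in the sessionInfo predate that interface, so this is not what would have run at the time. The wrapper name `annotate_probes` is hypothetical, and the column choice ("SYMBOL", "ENSEMBL") is just one plausible selection.

```r
# Sketch only: needs the Bioconductor packages AnnotationDbi and hgu133plus2.db.
# `annotate_probes` is a hypothetical helper name, not part of either package.
annotate_probes <- function(probes) {
  if (!requireNamespace("AnnotationDbi", quietly = TRUE) ||
      !requireNamespace("hgu133plus2.db", quietly = TRUE)) {
    stop("Install AnnotationDbi and hgu133plus2.db from Bioconductor first.")
  }
  # Map Affymetrix probe IDs to gene symbols and Ensembl gene IDs.
  AnnotationDbi::select(
    hgu133plus2.db::hgu133plus2.db,
    keys    = probes,
    columns = c("SYMBOL", "ENSEMBL"),
    keytype = "PROBEID"
  )
}

# Usage (not run here): annotate_probes(c("203757_s_at", "211657_at"))
```

Because the lookup is local, it does not depend on how the remote BioMart service handles a large chromosome-wide query.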
10.0 years ago, Julian Lee wrote:
Hi Marc,

Thanks. I'm using the hgu133plus2.db package as a control for the biomaRt package. I think biomaRt has plenty of potential and gives users tremendous power in mapping across different organisms, databases, etc. I'm quite certain I'll be using some of its functionality in future.

regards
julian
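Using one annotation source as a control for another largely comes down to comparing two sets of probe IDs. The sketch below shows one way that comparison might look in base R. Both input vectors are small hypothetical examples; in practice they would come from something like `unique(mydataset$affy_hg_u133_plus_2)` and the probe keys of the chip package.

```r
# Hypothetical inputs: probe IDs returned by biomaRt (may contain NA or "")
# and probe IDs known to the chip annotation package.
from_biomart  <- c("1007_s_at", NA, "", "1053_at", "1007_s_at")
from_chip_pkg <- c("1007_s_at", "1053_at", "203757_s_at", "211657_at")

# Drop missing/empty entries and duplicates before comparing.
from_biomart <- unique(from_biomart[!is.na(from_biomart) & nzchar(from_biomart)])

# Probes the control knows about that the biomaRt result lacks.
missing_in_biomart <- setdiff(from_chip_pkg, from_biomart)

# Fraction of the control's probes that biomaRt returned.
coverage <- length(intersect(from_chip_pkg, from_biomart)) / length(from_chip_pkg)
```

A list like `missing_in_biomart` can then be passed back to getBM() with the 'affy_hg_u133_plus_2' filter, as in the original post, to see whether those probes are retrievable when queried directly.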