Entering edit mode
Julian Lee
▴
140
@julian-lee-2487
Last seen 10.4 years ago
Dear all,
I'm having some problems trying to annotate some genes, eg. CEACAM6.
My problem is as follows,
Query input = chromosome numbers 1:22,
Output attributes = ensembl_gene_id, unigene,
chromosome_number,start_position,end_position,hgnc_symbol
##Code
>require(biomaRt)
>ensembl<-useMart('ensembl')
>ensembl<-useDataset('hsapiens_gene_ensembl',mart=ensembl)
>ensembl
Object of class 'Mart':
Using the ensembl BioMart database
Using the hsapiens_gene_ensembl dataset
##Build Attributes of Interest
a<-c('ensembl_gene_id','unigene','illumina_v2','affy_hg_u133_plus_2','
hgnc_symbol','chromosome_name','start_position','end_position')
> a
[1] "ensembl_gene_id" "unigene" "illumina_v2"
[4] "affy_hg_u133_plus_2" "hgnc_symbol" "chromosome_name"
[7] "start_position" "end_position"
##Retrieving chromosome 1:22 from biomart
getBM(attributes=a,filters='chromosome_name',values=1:22,mart=ensembl,
verbose=T)->mydataset
<query virtualschemaname="default" uniquerows="1" count="0" datasetconfigversion="0.6" requestid="biomaRt"> <dataset name="hsapiens_gene_ensembl"><attribute name="ensembl_gene_id"/><attribute name="unigene"/><attribute name="illumina_v2"/><attribute name="affy_hg_u133_plus_2"/><attribute name="hgnc_symbol"/><attribute name="chromosome_name"/><attribute name="start_position"/><attribute name="end_position"/><filter name="chromosome_name" value="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22"/></dataset></query>
##I can't find CEACAM6 in mydataset
>grep('CEACAM6',mydataset$hgnc_symbol)
integer(0)
> grep(c('203757_s_at','211657_at'),mydataset$affy_hg_u133_plus_2)
integer(0)
##and the number of affy probes doesn't match to a u133plus2 chip
(54,000 probes)
>length(unique(mydataset$affy_hg_u133_plus_2))
[1] 23474
##However, if i'm looking for CEACAM6 using the affy probes, i can
find it,
> getBM(attributes=a,filters='affy_hg_u133_plus_2',values=c('203757_s_
at','211657_at'),mart=ensembl)
ensembl_gene_id unigene illumina_v2 affy_hg_u133_plus_2
hgnc_symbol
1 ENSG00000086548 Hs.602441 ILMN_21866 203757_s_at
CEACAM6
2 ENSG00000086548 Hs.466814 ILMN_21866 203757_s_at
CEACAM6
3 ENSG00000086548 Hs.602441 ILMN_21866 211657_at
CEACAM6
4 ENSG00000086548 Hs.466814 ILMN_21866 211657_at
CEACAM6
chromosome_name start_position end_position
1 19 46951341 46967953
2 19 46951341 46967953
3 19 46951341 46967953
4 19 46951341 46967953
##End Code
I'm not too sure what's going on. Why is it when queried with
chromosome numbers, CEACAM6 disappears, but when queried with
affy_hg_u133_plus_2 probes, it appears.
Any help on this would be great. thanks.
regards
btw- i couldn't find EGFR. as a control, i managed to identify TP53
##R Code
> grep('EGFR',mydataset$hgnc_symbol)
integer(0)
> mydataset[grep('TP53',mydataset$hgnc_symbol),'hgnc_symbol']
[1] "TP53I13" "TP53I13" "TP53AP1" "TP53AP1" "TP53AP1" "TP53AP1"
[7] "TP53BP1" "TP53BP1" "TP53BP1" "TP53BP1" "TP53BP1" "TP53I3"
[13] "TP53I3" "TP53I3" "TP53I3" "TP53I3" "TP53" "TP53"
[19] "TP53" "TP53INP2" "TP53INP2" "TP53INP2" "TP53BP2" "TP53BP2"
[25] "TP53BP2" "TP53BP2"
##end R Code
> sessionInfo()
R version 2.7.1 (2008-06-23)
i486-pc-linux-gnu
locale:
LC_CTYPE=en_SG.UTF-8;LC_NUMERIC=C;LC_TIME=en_SG.UTF-8;LC_COLLATE=en_SG
.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_SG.UTF-8;LC_PAPER=en_SG.UTF-8;LC_N
AME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_SG.UTF-8;LC_IDENTI
FICATION=C
attached base packages:
[1] tools stats graphics grDevices utils datasets
methods
[8] base
other attached packages:
[1] hgu133plus2.db_2.2.0 illuminaHumanv2ProbeID.db_1.1.1
[3] AnnotationDbi_1.2.2 RSQLite_0.6-9
[5] DBI_0.2-4 Biobase_2.0.1
[7] biomaRt_1.14.1 RCurl_0.9-4
loaded via a namespace (and not attached):
[1] XML_1.96-0
--
Julian Lee
Bioinformatics Specialist
Cellular and Molecular Research
National Cancer Center Singapore