Question

Problem with genomicFeatures: id2name

Paul Leo ▴ 970

id2name(txdb, feature.type="cds") and id2name(txdb, feature.type="exon") both return all NAs for Ensembl and RefSeq.

Perhaps the cds_ids don't have names, but the exon result is strange for Ensembl.

Using the.cds <- cds(txdb, columns=c("cds_id","tx_id","tx_name")) takes a *VERY* long time, but perhaps it is not intended for use on a whole-genome scale (often)?

I'm looking for a quick way to map the cds_ids or exon_ids to exon names etc. so I can complete the annotations with biomaRt when needed.

> txdb
TranscriptDb object:
| Db type: TranscriptDb
| Data source: UCSC
| Genome: hg19
| UCSC Table: ensGene
| Type of Gene ID: Ensembl gene ID
| Full dataset: yes
| transcript_nrow: 151222
| exon_nrow: 470051
| cds_nrow: 264558
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2010-09-24 11:00:14 +1000 (Fri, 24 Sep 2010)
| GenomicFeatures version at creation time: 1.1.12
| RSQLite version at creation time: 0.9-2

> the.cds<-cds(txdb)
> the.cds
GRanges with 264558 ranges and 1 elementMetadata value
         seqnames               ranges strand |    cds_id
            <Rle>            <IRanges>  <Rle> | <integer>
     [1]     chr1 [   69091,    70008]      + |     10762
     [2]     chr1 [  367659,   368597]      + |     10763
     [3]     chr1 [  721406,   721912]      + |     10765
     [4]     chr1 [  861322,   861393]      + |     10766
     [5]     chr1 [  865535,   865716]      + |     10767
     [6]     chr1 [  865692,   865716]      + |     10782
     [7]     chr1 [  866419,   866469]      + |     10768
     [8]     chr1 [  871152,   871173]      + |     10772
     [9]     chr1 [  871152,   871276]      + |     10769
     ...      ...                  ...    ... |       ...
[264550]     chrY [26951104, 26951167]      - |    139000
[264551]     chrY [26951604, 26951655]      - |    139001
[264552]     chrY [26952216, 26952307]      - |    139002
[264553]     chrY [26952582, 26952728]      - |    139003
[264554]     chrY [26959330, 26959332]      - |    139004
[264555]     chrY [27184245, 27184263]      - |    139018
[264556]     chrY [27184956, 27185061]      - |    139019
[264557]     chrY [27187916, 27188033]      - |    139020
[264558]     chrY [27190093, 27190170]      - |    139021

seqlengths
      chr1      chr2 ... chr18_gl000207_random
 249250621 243199373 ...                  4262

> ?id2name
> cds.id.to.name<-id2name(txdb, feature.type="cds")
> length(cds.id.to.name)
[1] 264558
> sum(!is.na(cds.id.to.name))
[1] 0   ## ALL NA's

> exon.id.to.name<-id2name(txdb, feature.type="exon")
> exon.id.to.name[40000:40100]
40000 40001 40002 40003 40004 40005 40006 40007 40008 40009 40010 40011 40012
   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
40013 40014 40015 40016 40017 40018 40019 40020 40021 40022 40023 40024 40025
   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
40026 40027 40028 40029 40030 40031 40032 40033 40034 40035 40036 40037 40038
   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
40039 40040 40041 40042 40043 40044 40045 40046 40047 40048 40049 40050 40051
   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
40052 40053 40054 40055 40056 40057 40058 40059 40060 40061 40062 40063 40064
   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
40065 40066 40067 40068 40069 40070 40071 40072 40073 40074 40075 40076 40077
   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
40078 40079 40080 40081 40082 40083 40084 40085 40086 40087 40088 40089 40090
   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
40091 40092 40093 40094 40095 40096 40097 40098 40099 40100
   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
> length(exon.id.to.name)
[1] 470051
> sum(!is.na(exon.id.to.name))
[1] 0
> tx.id.to.n
################# they are all missing same is true for

> sessionInfo()
R version 2.13.0 Under development (unstable) (2010-09-20 r52949)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_AU.UTF-8
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BSgenome.Hsapiens.UCSC.hg19_1.3.16 BSgenome_1.17.7
[3] Biostrings_2.17.47                 GenomicFeatures_1.1.12
[5] GenomicRanges_1.1.25               IRanges_1.7.34
[7] biomaRt_2.5.1

loaded via a namespace (and not attached):
[1] Biobase_2.9.1     DBI_0.2-5         RCurl_1.4-3       RSQLite_0.9-2
[5] rtracklayer_1.9.9 tools_2.13.0      XML_3.1-1

BSgenome biomaRt GenomicFeatures • 1.1k views

updated 14.5 years ago by Marc Carlson ★ 7.2k • written 14.5 years ago by Paul Leo ▴ 970

score 0 · Answer 1 · 2010-09-30

Hi Paul,

The NAs appear because there are no unique IDs (that we can find) for these elements. In practice we almost never get unique IDs for CDS or exons from either Ensembl or UCSC. But there is always hope that this will change in the future.

Marc

On 09/23/2010 06:21 PM, Paul Leo wrote:
> id2name(txdb, feature.type="cds") and id2name(txdb,
> feature.type="exon") both return all NAs for Ensembl and RefSeq.
>
> Perhaps the cds_ids don't have names, but the exon result is strange
> for Ensembl.
>
> Using the.cds<-cds(txdb,columns=c("cds_id","tx_id","tx_name")) takes a
> *VERY* long time, but perhaps it is not intended for use on a whole-genome
> scale (often)?
>
> I'm looking for a quick way to map the cds_ids or exon_ids to exon names
> etc. so I can complete the annotations with biomaRt when needed.
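
Since exon and CDS names are generally not populated, one possible workaround is to key on transcript names instead, which usually are available. The following is only a minimal sketch, not something taken from the thread. It assumes a reasonably recent GenomicFeatures release in which `id2name()` and `cdsBy()` are available, and it uses the small sample database that ships with the package as a stand-in for the full hg19 ensGene database (the sample file name is an assumption about the installed package contents).

```r
library(GenomicFeatures)

# Assumption: this sample database is installed with GenomicFeatures and
# stands in for the full hg19 ensGene database from the question.
txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite",
                         package = "GenomicFeatures")
txdb <- AnnotationDbi::loadDb(txdb_file)

# Named character vector: names are internal tx_ids, values are tx names.
tx_id_to_name <- id2name(txdb, feature.type = "tx")

# Group CDS ranges by transcript; the list names are internal tx_ids.
cds_by_tx <- cdsBy(txdb, by = "tx")

# Replace the internal ids with the external transcript names.
names(cds_by_tx) <- tx_id_to_name[names(cds_by_tx)]
```

The resulting transcript names could then be passed to biomaRt to pull in further annotation, with the CDS ranges for each transcript available under the same name. The same pattern should work with `exonsBy(txdb, by = "tx")` for exons.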