retrieving annotation
Kathi Zarnack ▴ 110
@kathi-zarnack-4596
Last seen 9.6 years ago
Hi,

I wanted to ask whether any of the annotation packages contains information on the transcript biotype (protein-coding, etc). I would like to select only protein-coding isoforms from Ensembl annotation, but I could not find any package that includes this information (otherwise I will get it with biomaRt, I just wondered whether it is already included somewhere).

Also, I tried to download GENCODE annotation using GenomicFeatures, and got the following error:

> test=makeTranscriptDbFromUCSC(genome="hg19", tablename="wgEncodeGencodeManualV3")
Error in tableNames(ucscTableQuery(session, track = track)) :
  error in evaluating the argument 'object' in selecting a method for function 'tableNames': Error in normArgTrack(track, trackids) : Unknown track: Gencode Genes

I tried to get the same table for hg18, but I get only one step further:

> test=makeTranscriptDbFromUCSC(genome="hg18", tablename="wgEncodeGencodeManualV3")
Download the wgEncodeGencodeManualV3 table ... OK
Download the wgEncodeGencodeClassesV3 table ... Error in normArgTable(value, x) :
  unknown table name 'wgEncodeGencodeClassesV3'

Thank you very much for your help,
Kathi

------------------------------------------

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats  graphics  grDevices  utils  datasets  methods
[8] base

other attached packages:
[1] GenomicFeatures_1.14.0 AnnotationDbi_1.24.0   Biobase_2.22.0
[4] GenomicRanges_1.14.3   XVector_0.2.0          IRanges_1.20.5
[7] BiocGenerics_0.8.0     BiocInstaller_1.12.0

loaded via a namespace (and not attached):
 [1] biomaRt_2.18.0     Biostrings_2.30.0  bitops_1.0-6      BSgenome_1.30.0
 [5] DBI_0.2-7          RCurl_1.95-4.1     Rsamtools_1.14.1  RSQLite_0.11.4
 [9] rtracklayer_1.22.0 stats4_3.0.2       tcltk_3.0.2       tools_3.0.2
[13] XML_3.98-1.1       zlibbioc_1.8.0

--
Dr. Kathi Zarnack
Luscombe Group
European Molecular Biology Laboratory
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
email zarnack at ebi.ac.uk
tel +44 1223 494 526
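For reference, the biomaRt fallback mentioned in the question could look roughly like the sketch below. It assumes the biomaRt package is installed and the Ensembl BioMart service is reachable; the dataset name and the helper names `fetch_transcript_biotypes()` and `keep_protein_coding()` are illustrative choices, not something from this thread.

```r
# Sketch only: helper names and the default dataset are assumptions.

# Keep only rows whose transcript_biotype is "protein_coding".
keep_protein_coding <- function(tx) {
  stopifnot(is.data.frame(tx), "transcript_biotype" %in% names(tx))
  keep <- !is.na(tx$transcript_biotype) &
    tx$transcript_biotype == "protein_coding"
  tx[keep, , drop = FALSE]
}

# Download gene ID, transcript ID and transcript biotype from Ensembl BioMart.
fetch_transcript_biotypes <- function(dataset = "hsapiens_gene_ensembl") {
  mart <- biomaRt::useMart("ensembl", dataset = dataset)
  biomaRt::getBM(
    attributes = c("ensembl_gene_id",
                   "ensembl_transcript_id",
                   "transcript_biotype"),
    mart = mart
  )
}

# Usage (needs network access, so it is left commented out):
# tx     <- fetch_transcript_biotypes()
# coding <- keep_protein_coding(tx)
```

The filtering step is kept separate from the download so it can be reused on a table obtained some other way, for example one parsed from a GTF file.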
Annotation biomaRt GenomicFeatures • 1.5k views
@delhommeemblde-3232
Last seen 9.6 years ago
Hej Kathi!

In a different thread (GTF file error when using easyRNAseq), Martin mentioned that you can access Ensembl gff files through AnnotationHub. I have copied part of that answer below and, as you can see, the gene_biotype is part of the annotation:

> library(AnnotationHub)
> hub = AnnotationHub()
> hub$ensembl.release.73.<tab>
hub$ensembl.release.73.fasta. ... [378]
hub$ensembl.release.73.gtf. ... [63]
> xx = hub$ensembl.release.73.gtf.gallus_gallus.Gallus_gallus.Galgal4.73.gtf_0.0.1.RData
> xx
GRanges with 381368 ranges and 12 metadata columns:
      seqnames       ranges strand |         source     type
         <Rle>    <IRanges>  <Rle> |       <factor> <factor>
  [1]        1 [1735, 2449]      + | protein_coding     exon
  [2]        1 [2379, 2449]      + | protein_coding      CDS
          score     phase            gene_id      transcript_id
      <numeric> <integer>        <character>        <character>
  [1]      <NA>      <NA> ENSGALG00000009771 ENSGALT00000015891
  [2]      <NA>         0 ENSGALG00000009771 ENSGALT00000015891
      exon_number   gene_biotype            exon_id         protein_id
        <numeric>    <character>        <character>        <character>
  [1]           1 protein_coding ENSGALE00000301221               <NA>
  [2]           1 protein_coding               <NA> ENSGALP00000015874
        gene_name transcript_name
      <character>     <character>
  [1]        <NA>            <NA>
  [2]        <NA>            <NA>
 [ reached getOption("max.print") -- omitted 9 rows ]
  ---
  seqlengths:
                1              2 ... AADN03010940.1
               NA             NA ...             NA

Hope this helps,

Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

On 7 Nov 2013, at 14:11, Kathi Zarnack <zarnack at ebi.ac.uk> wrote:
> Hi,
>
> I wanted to ask whether any of the annotation packages contains information on the transcript biotype (protein-coding, etc). [...]
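Once such a GRanges object is in hand, selecting records by biotype is a plain subset on one of its metadata columns. The sketch below uses a small toy GRanges standing in for `xx`, and `subset_by_biotype()` is a hypothetical helper, not part of any package. Which column carries the transcript-level biotype (here `source` is used, since it reads protein_coding in the output above) depends on the Ensembl release and is an assumption to check against the GTF documentation.

```r
library(GenomicRanges)

# Hypothetical helper (not from the thread): subset a GRanges of GTF records
# on one metadata column, e.g. "source", "gene_biotype" or "transcript_biotype".
subset_by_biotype <- function(gr, column, biotype = "protein_coding") {
  stopifnot(column %in% names(mcols(gr)))
  values <- as.character(mcols(gr)[[column]])
  gr[!is.na(values) & values == biotype]
}

# Toy stand-in for the AnnotationHub object `xx`; the third record is made up.
toy <- GRanges(
  seqnames = rep("1", 3),
  ranges   = IRanges(start = c(1735, 2379, 5000), end = c(2449, 2449, 5200)),
  strand   = rep("+", 3),
  source   = factor(c("protein_coding", "protein_coding",
                      "processed_transcript")),
  type     = factor(c("exon", "CDS", "exon"))
)

# Assumption: the biotype of interest lives in the "source" column.
coding <- subset_by_biotype(toy, column = "source")
length(coding)
```

With the real object the call would be `subset_by_biotype(xx, column = "source")`, or with `column = "gene_biotype"` for the gene-level annotation.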
Hi Nico,

thanks for the hint. I will have a look at AnnotationHub. I was looking for the transcript_biotype rather than the gene_biotype (to discriminate protein_coding isoforms from the rest like processed_transcript etc), but this should also be included in the Ensembl gtf file.

Thanks,
Kathi

On 16/11/13 22:39, Nicolas Delhomme wrote:
> Hej Kathi!
>
> In a different thread (GTF file error when using easyRNAseq), Martin mentioned that you can access Ensembl gff files through AnnotationHub. [...]
@herve-pages-1542
Last seen 13 hours ago
Seattle, WA, United States
Hi Kathi,

On 11/07/2013 05:11 AM, Kathi Zarnack wrote:
> [...]
> Also, I tried to download GENCODE annotation using GenomicFeatures, and
> got the following error:
>
> > test=makeTranscriptDbFromUCSC(genome="hg19",
> tablename="wgEncodeGencodeManualV3")
> Error in tableNames(ucscTableQuery(session, track = track)) :
> error in evaluating the argument 'object' in selecting a method for
> function 'tableNames': Error in normArgTrack(track, trackids) : Unknown
> track: Gencode Genes
>
> I tried to get the same table for hg18, but I get only one step further:
>
> test=makeTranscriptDbFromUCSC(genome="hg18",
> tablename="wgEncodeGencodeManualV3")
> Download the wgEncodeGencodeManualV3 table ... OK
> Download the wgEncodeGencodeClassesV3 table ... Error in
> normArgTable(value, x) :
> unknown table name 'wgEncodeGencodeClassesV3'

Note that the wgEncodeGencodeManualV3 table seems to be for hg18 only: there doesn't seem to be such a table for hg19.

For hg19, UCSC provides 3 GENCODE tracks: GENCODE Genes V17, GENCODE Genes V14, and GENCODE Genes V7. Each of them contains 5 tables that are compatible with makeTranscriptDbFromUCSC(). For example, for GENCODE Genes V17, those tables are:

  wgEncodeGencodeBasicV17
  wgEncodeGencodeCompV17
  wgEncodeGencodePseudoGeneV17
  wgEncodeGencode2wayConsPseudoV17
  wgEncodeGencodePolyaV17

See here for the details:

  http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeGencodeSuper

I just made some adjustments to the GenomicFeatures package so makeTranscriptDbFromUCSC() can work on those tables. Unfortunately I also needed to fix support for the wgEncodeGencode*V3 tables (for hg18) which was broken due to changes on the UCSC side.

Those updates are in GenomicFeatures 1.14.2 (release) and 1.15.4 (devel). Both should become available via biocLite() in the next 24 hours or so.

Please let us know if you run into any other problem with the GenomicFeatures package.

Thanks,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
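Put together, the hg19 call with one of the tables listed above might be sketched as follows. `gencode_table()` and `build_gencode_txdb()` are hypothetical helpers introduced only for illustration. The function name makeTranscriptDbFromUCSC() is the one used in this thread (GenomicFeatures 1.14.x); later GenomicFeatures releases renamed it makeTxDbFromUCSC(), so the call may need adapting on a current installation. It also needs network access to UCSC.

```r
# Hypothetical helper: build a UCSC GENCODE table name from flavour and version,
# following the naming pattern of the tables listed above.
gencode_table <- function(flavour = c("Basic", "Comp", "PseudoGene",
                                      "2wayConsPseudo", "Polya"),
                          version = 17) {
  flavour <- match.arg(flavour)
  paste0("wgEncodeGencode", flavour, "V", version)
}

# Hypothetical wrapper around the call discussed in the thread.
# Assumes GenomicFeatures >= 1.14.2 with makeTranscriptDbFromUCSC() available;
# on newer releases substitute makeTxDbFromUCSC().
build_gencode_txdb <- function(genome = "hg19",
                               tablename = gencode_table("Basic", 17)) {
  GenomicFeatures::makeTranscriptDbFromUCSC(genome = genome,
                                            tablename = tablename)
}

# Usage (needs network access, so it is left commented out):
# txdb <- build_gencode_txdb()
# txdb <- build_gencode_txdb(tablename = gencode_table("Comp", 17))
```

Whether a given flavour and version combination exists for a genome still has to be checked on the UCSC side; the helper only assembles the name.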
Hi Herve,

thanks for pointing me to the files and also for updating GenomicFeatures. It's a great package! I will let you know if I run into any other problems.

Best regards,
Kathi

On 17/11/13 21:43, Hervé Pagès wrote:
> Hi Kathi,
> [...]