makeTranscriptDbFromGFF Error for UCSC GTF File
Dario Strbenac ★ 1.5k
@dario-strbenac-5916
Last seen 14 hours ago
Australia
Hello,

I used:

> system.time(hg19 <- makeTranscriptDbFromGFF("/home/dario/data/Annotation/hg19.gtf", format = "gtf"))
Error in .parse_attrCol(attrCol, file, colnames) :
  Some attributes do not conform to 'tag value' format
Timing stopped at: 15.605 0.296 16.07

I downloaded the GTF file from the UCSC Table Browser. The table's name was refGene. To me, it seems that the attributes are fine:

> hg19table <- read.table("/home/dario/data/Annotation/hg19.gtf", sep = '\t', stringsAsFactors=FALSE)
> table(sapply(strsplit(hg19table[, 9], ' '), length))

     4
967118

I have R version 3.1.0 (2014-04-10) and GenomicFeatures 1.16.2.

--------------------------------------
Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia
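Counting space-separated tokens only shows that every line has four of them; it does not check quoting or the `;` separators. A stricter check could look like the minimal sketch below. It is not the test that GenomicFeatures runs internally: the regular expression is an assumption about what 'tag value' means, the sample strings are hypothetical, and the naive split on `;` would misfire on a value that itself contains a semicolon.

```r
# Hypothetical sample of GTF column 9; the second entry is deliberately malformed.
attrs <- c(
  'gene_id "NM_000014"; transcript_id "NM_000014";',
  'gene_id "NM_000015"; transcript_id NM 000015;',
  'gene_id "NR_024540"; transcript_id "NR_024540"; '
)

# TRUE when every ';'-separated field looks like: tag "value" (or tag bareValue).
# This pattern is an assumption, not the parser's own rule.
conforms <- function(attr_string) {
  fields <- trimws(strsplit(attr_string, ";", fixed = TRUE)[[1]])
  fields <- fields[nzchar(fields)]
  length(fields) > 0 &&
    all(grepl('^[A-Za-z_][A-Za-z0-9_]*\\s+("[^"]*"|[^"\\s]+)$', fields, perl = TRUE))
}

ok <- vapply(attrs, conforms, logical(1), USE.NAMES = FALSE)
print(which(!ok))   # indices of attribute strings that deserve a closer look
```

On a real file the same function could be applied to `hg19table[, 9]` to list the offending rows.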
GenomicFeatures GenomicFeatures • 2.2k views
Marc Carlson ★ 7.2k
@marc-carlson-2264
Last seen 7.7 years ago
United States
Hi Dario,

That error says that some of the attributes are formatted in a way that the parser cannot interpret. But what really puzzles me is why you want to parse this track as a GTF file at all. The UCSC hg19 track is already available as a package here:

http://www.bioconductor.org/packages/release/data/annotation/html/TxDb.Hsapiens.UCSC.hg19.knownGene.html

And if that is not actually the track you are after, then perhaps you should just use the makeTranscriptDbFromUCSC() function instead? That is the more typical tool for turning UCSC tracks into TranscriptDb objects.

In contrast, using GTF or GFF files for making TranscriptDb objects is always a little risky, because many of these files were not created with the intention of holding a transcriptome (which is the specific thing that a TranscriptDb object is meant to hold). The GTF and GFF file formats were not initially intended for the specific purpose of holding a transcriptome but for something more general.

Hope this helps,

Marc

On 07/02/2014 12:00 AM, Dario Strbenac wrote:
>> system.time(hg19 <- makeTranscriptDbFromGFF("/home/dario/data/Annotation/hg19.gtf", format = "gtf"))
> Error in .parse_attrCol(attrCol, file, colnames) :
>   Some attributes do not conform to 'tag value' format
Hi Dario, Marc,

FWIW, I get a different error. Like you, I downloaded the refGene table in GTF format using the UCSC Table Browser web interface (https://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_track=refGene). Then:

## No problem with the parser (used internally by makeTranscriptDbFromGFF):
library(rtracklayer)
hg19_refGene <- import("hg19_refGene.gtf")

## Error with makeTranscriptDbFromGFF:
> library(GenomicFeatures)
> txdb <- makeTranscriptDbFromGFF("hg19_refGene.gtf", format="gtf")
extracting transcript information
Estimating transcript ranges.
Extracting gene IDs
Processing splicing information for gtf file.
Deducing exon rank from relative coordinates provided
Warning messages:
1: In .deduceTranscriptsFromGTF(transcripts) :
  Some of your transcripts have exons on more than one chromsome. We cannot deduce the order of these exons so these transcripts have been discarded.
2: In .deduceExonRankings(exs, format = "gtf") :
  Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName
Error in unlist(mapply(.assignRankings, starts, strands)) :
  error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in (function (starts, strands) :
  Exon rank inference cannot accomodate trans-splicing.

Cheers,
H.
> sessionInfo()
R version 3.1.0 Patched (2014-06-21 r66002)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] GenomicFeatures_1.17.12 AnnotationDbi_1.27.8    Biobase_2.25.0
[4] rtracklayer_1.25.11     GenomicRanges_1.17.18   GenomeInfoDb_1.1.9
[7] IRanges_1.99.16         S4Vectors_0.0.9         BiocGenerics_0.11.2

loaded via a namespace (and not attached):
 [1] BatchJobs_1.2            BBmisc_1.7               BiocParallel_0.7.5
 [4] biomaRt_2.21.0           Biostrings_2.33.10       bitops_1.0-6
 [7] brew_1.0-6               checkmate_1.1            codetools_0.2-8
[10] DBI_0.2-7                digest_0.6.4             fail_1.2
[13] foreach_1.4.2            GenomicAlignments_1.1.14 iterators_1.0.7
[16] plyr_1.8.1               Rcpp_0.11.2              RCurl_1.95-4.1
[19] Rsamtools_1.17.27        RSQLite_0.11.4           sendmailR_1.1-2
[22] stats4_3.1.0             stringr_0.6.2            tools_3.1.0
[25] XML_3.98-1.1             XVector_0.5.6            zlibbioc_1.11.1

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
On Wed, Jul 2, 2014 at 10:16 AM, Marc Carlson <mcarlson@fhcrc.org> wrote:
> In contrast, using GTF or GFF files for making TranscriptDb objects is
> always a little risky because many of these files will not have been
> created with the intention of holding a transcriptome as data (which is the
> specific thing that a TranscriptDb object is meant to hold). This is
> because the GTF and GFF file formats were not initially intended for the
> specific purpose of holding a transcriptome but were instead intended to be
> something more general.

Actually GTF (Gene Transfer Format) files are designed specifically for representing gene models, and we have no excuse for not parsing them correctly. There have been some tweaks to attribute parsing (I thought limited to GFF3) in devel, so there may be a difference between Herve's devel result and Dario's release result. I'll try to find some time to look into this.
Hi

On 02/07/14 20:17, Michael Lawrence wrote:
> Actually GTF (Gene Transfer Format) files are designed specifically for
> representing gene models, and we have no excuse for not parsing them
> correctly. There have been some tweaks to attribute parsing (I thought
> limited to GFF3) in devel, so there may be a difference between Herve's
> devel result and Dario's release result. I'll try to find some time to
> look into this.

The problem with GTF files produced by the UCSC Table Browser is that they contain incorrect gene IDs: the gene_id attribute is always set to the same value as the transcript_id. These files therefore cannot be used to define gene models without manual correction (which would be to remove the transcript number suffix from the gene IDs).

Long ago, I asked the UCSC Genome Browser help desk why this is, and was told that it is a bug in the Table Browser which, for some reason, they cannot fix.

Hence, I usually advise against using these files.

Simon
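Whether a given file has the gene_id problem described above can be checked directly. The sketch below is only an illustration in base R: the attribute strings are hypothetical examples in the style of a Table Browser export, and `get_attr` is a helper invented here, not part of any Bioconductor package.

```r
# Hypothetical attribute strings (GTF column 9) in Table Browser style.
attrs <- c(
  'gene_id "NM_000014"; transcript_id "NM_000014";',
  'gene_id "NM_000015"; transcript_id "NM_000015";',
  'gene_id "NR_024540"; transcript_id "NR_024540";'
)

# Hypothetical helper: pull the quoted value that follows a tag; NA if absent.
get_attr <- function(x, tag) {
  pattern <- paste0('.*\\b', tag, ' "([^"]*)".*')
  ifelse(grepl(pattern, x, perl = TRUE),
         sub(pattern, "\\1", x, perl = TRUE),
         NA_character_)
}

gene_id       <- get_attr(attrs, "gene_id")
transcript_id <- get_attr(attrs, "transcript_id")

# If this is TRUE, the file carries no gene-level grouping of transcripts.
all_same <- all(gene_id == transcript_id)
print(all_same)
```

On a real file, `attrs` would be the ninth column of the table read with `read.table(..., sep = '\t')`.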
OK,

It looks like Herve may have found another problem with the code that tries to guess the exon ranking when these files do not provide it. I will look into that. But for now I think that people should instead do this (which works right now):

  library(GenomicFeatures)
  txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")

A related point is that I really don't think you want to use the GTF files generated here for making TranscriptDb objects. Simon is correct about the GTF file coming from UCSC: it's not a good one. It may also have the problem that Simon describes with the bad IDs, but it definitely has another serious problem that traces back to a deficiency in the GTF file format for this use case (which is: representing a transcriptome). The problem is that you can make a valid GTF file and not include any exon ranking information. That means that the file can tell you the ranges of all the exons but never tell you what order they should be in to make the corresponding transcript. And that's not good, as it leaves the file woefully incomplete for specifying a transcriptome.

Now for some really simple organisms that have very little splicing this might be OK. But for humans it is almost certainly not OK, which is why my code is throwing all those warnings in your output. This absence basically means that you then have to guess that the exons occur in the order in which they lie along a chromosome. Sometimes you can tell that it is not OK to make this kind of guess even with the limited data that you have, such as when exons are from different chromosomes, and these cases get thrown out under protest, with even more warnings. But in any case, my feeling is that you should probably never use the arguments that make that kind of guess when dealing with a transcriptome as complex as that of humans. So if you have a GTF file for a complex organism, then I really think you should insist that the file contains exon rankings.

This is what I was alluding to when I said that the GTF file format has a use case that is more general than that of storing a complete transcriptome. Can you use the format to store a legitimate transcriptome? Yes, of course. But can you rely on the file format to ensure that a valid file will always contain a complete transcriptome? Unfortunately, the answer is no. Some GTF files will just not have the information needed to specify a transcriptome, since the format does not require all of that information.

Hope this clarifies things,

Marc

On 07/02/2014 11:34 AM, Simon Anders wrote:
> The problem with GTF files produced by the UCSC Table Browser is that
> they contain incorrect gene IDs: The gene_id attribute is always set
> to the same value as the transcript_id, and these files hence cannot
> be used to define gene models without manual correction (which would
> be to remove the transcript number suffix from the gene IDs).
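The kind of guess discussed in this thread, ordering exons by position when no rank is stored, can be illustrated with a small base R sketch. This is not the GenomicFeatures implementation. The exon table is hypothetical, and reversing the order on the minus strand is an assumption made for the illustration; transcripts with exons on more than one chromosome are dropped, mirroring the first warning shown earlier.

```r
# Hypothetical exon table: tx2 has exons on two chromosomes.
exons <- data.frame(
  tx     = c("tx1", "tx1", "tx1", "tx2", "tx2", "tx3", "tx3"),
  chrom  = c("chr1", "chr1", "chr1", "chr2", "chr5", "chr3", "chr3"),
  strand = c("+", "+", "+", "+", "+", "-", "-"),
  start  = c(500, 100, 300, 100, 400, 100, 900),
  stringsAsFactors = FALSE
)

# Keep only transcripts whose exons all sit on one chromosome and one strand.
n_chrom  <- tapply(exons$chrom,  exons$tx, function(x) length(unique(x)))
n_strand <- tapply(exons$strand, exons$tx, function(x) length(unique(x)))
keep_tx  <- names(n_chrom)[n_chrom == 1 & n_strand == 1]
kept     <- exons[exons$tx %in% keep_tx, ]

# Guess the rank: ascending start on "+", descending start on "-" (assumption).
signed_start <- ifelse(kept$strand == "-", -kept$start, kept$start)
kept$rank <- ave(signed_start, kept$tx, FUN = rank)

print(kept[order(kept$tx, kept$rank), ])
```

The guess is only as good as the assumption that exons are transcribed in genomic order, which is exactly what fails for trans-spliced transcripts.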