rtracklayer gff import
Kathi Zarnack ▴ 110
@kathi-zarnack-4596
Last seen 9.6 years ago
Hi,

I am using the package rtracklayer to import transcript.gtf files produced by Cufflinks.

As I understand the gff3 specification, feature coordinates are given as "start and end of the feature, in 1-based integer coordinates" (also discussed on this mailing list lately), meaning that the line below from my gtf file corresponds to an exon ranging from 1310534 to 1310771.

original line from the gtf file:

    chr1 transcripts_C4 exon 1310534 1310771 78 - . Parent=CUFF.1065.1

However, upon rtracklayer import, the exon ends at 1310770 (see below). Thus, as I understand it, rtracklayer import.gff() interprets gtf as "1-based right-open" (upon export using export.gff3(), the end becomes 1310771 again). I tried importing with version="3" specified explicitly and also updated to the latest rtracklayer version, but neither helped.

Is this a bug in the rtracklayer function or am I interpreting the gff coordinates wrongly? Any comments will be appreciated.

Thanks for your help.

Best regards,
Kathi

    > library(rtracklayer)
    Loading required package: RCurl
    Loading required package: bitops
    > gff=import.gff("/nfs/research2/luscombe/kathi/data/expression_data/hnRNPC_mRNAseq/cufflinks_0.9.3/cufflinks_C4/transcripts_C4.gtf",
    + genome="hg19",asRangedData=FALSE)
    > gff[177]
    GRanges with 1 range and 11 elementMetadata values
        seqnames ranges             strand | type        source       phase
        <rle>    <iranges>          <rle>  | <character> <character>  <character>
    [1] chr1     [1310534, 1310770] -      | exon        Cufflinks_C4 NA
        conf_hi   conf_lo   cov       FPKM      frac      ID          Parent
        <numeric> <numeric> <numeric> <numeric> <numeric> <character> <character>
    [1] NA        NA        NA        NA        NA        NA          CUFF.1065.1
        score
        <numeric>
    [1] 78

    seqlengths
     chr1 chr10 chr11 chr12 chr13 chr14 ... chr7 chr8 chr9 chrM chrX chrY
     NA   NA    NA    NA    NA    NA    ... NA   NA   NA   NA   NA   NA
    > export.gff3(gff[177],"test_export.gtf")

    [zarnack@ebi-001 ~]$ more test_export.gtf
    ##gff-version 3
    ##date 2011-04-14
    chr1 Cufflinks_C4 exon 1310534 1310771 78 - NA Parent=CUFF.1065.1

    > sessionInfo()
    R version 2.12.0 (2010-10-15)
    Platform: x86_64-unknown-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
     [5] LC_MONETARY=C              LC_MESSAGES=en_GB.UTF-8
     [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] rtracklayer_1.10.6 RCurl_1.5-0 bitops_1.0-4.1

    loaded via a namespace (and not attached):
    [1] Biobase_2.10.0      Biostrings_2.18.0 BSgenome_1.18.1
    [4] GenomicRanges_1.2.1 IRanges_1.8.9     tools_2.12.0
    [7] XML_3.2-0

--
Dr. Kathi Zarnack
Luscombe Group
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge CB10 1SD, UK
tel +44 1223 494 526
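The off-by-one in the question can be checked by hand. The following is a minimal base-R sketch, not rtracklayer code: it parses the exon line from the post and compares the two readings of the end coordinate. The column labels and the variable names (`closed_width`, `right_open_width`, `right_open_last_base`) are illustrative assumptions.

```r
# Exon line from the post, written with explicit tab separators.
gff_line <- "chr1\ttranscripts_C4\texon\t1310534\t1310771\t78\t-\t.\tParent=CUFF.1065.1"

# Illustrative labels for the nine GFF columns (not an rtracklayer API).
cols <- c("seqname", "source", "type", "start", "end",
          "score", "strand", "phase", "attributes")
rec <- read.delim(text = gff_line, header = FALSE, col.names = cols,
                  stringsAsFactors = FALSE)

# Closed (end-inclusive) reading: start and end both belong to the feature.
closed_width <- rec$end - rec$start + 1L

# Right-open reading: the end coordinate itself is excluded, so the last
# covered base is end - 1, which is the 1310770 seen after import.
right_open_last_base <- rec$end - 1L
right_open_width <- rec$end - rec$start

print(c(closed = closed_width, right_open = right_open_width))
```

Under the closed reading the exon covers 238 bases; under the right-open reading it covers 237, and the last covered base is the one reported by the import in the post.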
@michael-lawrence-3846
Last seen 2.4 years ago
United States
rtracklayer currently considers GFF3 files to be right-open. The GFF3 spec states that start is always <= end, and that zero-width intervals have start == end. To me, this suggests that they are right-open. Otherwise, you need some other way to distinguish zero vs. one width intervals, which is crazy. I can see the point, though, that for most use cases zero and non-zero width intervals are not mixed, and the user usually knows which is which. But that's pretty poor design.

Originally, rtracklayer used closed intervals for GFF3, but I changed that a couple of years ago after seeing the genomeIntervals documentation and re-reading the spec.

Perhaps we can assume that, since previous versions of GFF are clearly specified to have closed intervals, GFF3 follows the same convention by default. I'll make that change. Users will need to modify the imported data structure if they want to consider zero-width intervals.

I've personally never used GFF, so this is all pretty vague to me.

Michael
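The ambiguity raised in this answer can be made concrete with a small base-R sketch. The helpers `width_closed()` and `width_right_open()` are hypothetical and written only for illustration; neither is part of rtracklayer. Under the closed reading a record with start == end is one base wide and a zero-width interval cannot be expressed by coordinates alone, while under the right-open reading the same record is zero bases wide.

```r
# Hypothetical helpers, for illustration only (not rtracklayer functions).
width_closed <- function(start, end) end - start + 1L
width_right_open <- function(start, end) end - start

# Two toy records: one with start == end, one with end == start + 1.
start <- c(100L, 100L)
end   <- c(100L, 101L)

widths <- data.frame(start, end,
                     closed = width_closed(start, end),
                     right_open = width_right_open(start, end))
print(widths)
```

The record with start == end is the one whose width depends entirely on the convention, which is why an importer has to pick one reading and leave the other case to the user.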
> rtracklayer currently considers GFF3 files to be right-open.
> The GFF3 spec states that start is always <= end, and that
> zero-width intervals have start == end.

yes but 1 width intervals also have start = end

> To me, this suggests that they are right-open.
> Otherwise, you need some other way to distinguish zero vs. one
> width intervals, which is crazy.

yes - it is crazy
On 04/15/2011 12:14 AM, Cook, Malcolm wrote:
>> To me, this suggests that they are right-open.
>> Otherwise, you need some other way to distinguish zero vs. one
>> width intervals, which is crazy.
>
> yes - it is crazy

it might be 'crazy'... but it has always been like this: GFF (and its extensions like gtf or gff3) are "end inclusive" (or right closed), see:

http://www.sanger.ac.uk/resources/software/gff/spec.html
http://genome.ucsc.edu/FAQ/FAQformat.html#format3
http://genome.ucsc.edu/FAQ/FAQformat.html#format4
http://www.sequenceontology.org/gff3.shtml

and the latest GFF3 definition explains very well how to treat zero-length features:

"For zero-length features, such as insertion sites, start equals end and the implied site is to the right of the indicated base in the direction of the landmark."

yes, as a consequence, you have to pay attention to the 'value' of the third column to figure out whether this could be a zero-length feature. But in practice, this has always been obvious to me. Also, I hardly ever work with GFF/GTF/GFF3 files that mix different kinds of features; I usually split by the third column and then treat each feature according to its meaning.

My two cents....

Regards,
Hans
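The "split by the third column" approach described above might look roughly like this in base R. This is a sketch under stated assumptions: the exon record is the line from the original post, the insertion-site record is made up for illustration, and the `zero_length_types` list is a hypothetical choice that a user would set from knowledge of their own file.

```r
# Toy input: the exon is the line from the original post; the
# insertion-site line is invented purely for illustration.
gff_text <- paste(
  "chr1\ttranscripts_C4\texon\t1310534\t1310771\t78\t-\t.\tParent=CUFF.1065.1",
  "chr1\texample\tinsertion_site\t2000\t2000\t.\t+\t.\tID=ins1",
  sep = "\n")

# Illustrative labels for the nine GFF columns.
cols <- c("seqname", "source", "type", "start", "end",
          "score", "strand", "phase", "attributes")
gff <- read.delim(text = gff_text, header = FALSE, col.names = cols,
                  stringsAsFactors = FALSE)

# Assumption: the user knows which feature types are zero-length.
zero_length_types <- c("insertion_site")

# Split records by feature type (third column), then compute widths
# according to what each type means.
by_type <- split(gff, gff$type)
widths <- lapply(by_type, function(x) {
  if (x$type[1] %in% zero_length_types) {
    rep(0L, nrow(x))          # start == end marks a site, not a base
  } else {
    x$end - x$start + 1L      # closed, end-inclusive interval
  }
})
print(widths)
```

The coordinates alone cannot tell the two cases apart, so the decision lives in the type list rather than in the arithmetic.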