rtracklayer gff import
1
0
Entering edit mode
Kathi Zarnack ▴ 110
@kathi-zarnack-4596
Last seen 10.2 years ago
Hi Michael and others, I have one more comment about import.gff() from rtracklayer: I noticed that the function behaves differently depending on whether the gtf file contains a header section. In the examples below, test1.gtf contains the header (as returned by Cufflinks), while test2.gtf doesn't. In the first case, the end positions are interpreted exclusive, and inclusive in the second (in addition, there are several warnings about the import of the header as such). Is this because they are interpreted as different versions of gff? According to the documentations that I could find, gff1, gff2 and gff3 should all use inclusive coordinates. Is there a way to allow for the header? Thanks for your help, Kathi > test1=import.gff("test1.gtf") Warning messages: 1: In matrix(unlist(strsplit(attrs, "=", fixed = TRUE)), nrow = 2) : data length [9] is not a sub-multiple or multiple of the number of rows [2] 2: In .Method(..., deparse.level = deparse.level) : number of rows of result is not a multiple of vector length (arg 2) 3: In split.default(seq_len(nrow(x)), f, drop = drop, ...) : data length is not a multiple of split variable > > test1 RangedData with 3 rows and 7 value columns across 1 space space ranges | type source phase <factor> <iranges> | <factor> <factor> <factor> 1 chr1 [564418, 570349] | aggregate_gene aggregate_genes.py NA 2 chr1 [564418, 565717] | exonic_section aggregate_genes.py NA 3 chr1 [568957, 568965] | exonic_section aggregate_genes.py NA strand aggregate_gene_id..CUFF.168931. transcripts..CUFF.168931.1. <factor> <character> <character> 1 - exonic_section_number "001" NA 2 - aggregate_gene_id "CUFF.168931" is_full_exon "True" 3 - exonic_section_number "002" is_full_exon "True" transcripts..CUFF.168931.2. <character> 1 NA 2 is_full_exon "True" 3 is_full_exon "True" > > test2=import.gff("test2.gtf") > test2 RangedData with 3 rows and 5 value columns across 1 space space ranges | type source phase <factor> <iranges> | <factor> <factor> <factor> 1 chr1 [564418, 570350] | aggregate_gene aggregate_genes.py NA 2 chr1 [564418, 565718] | exonic_section aggregate_genes.py NA 3 chr1 [568957, 568966] | exonic_section aggregate_genes.py NA strand <factor> 1 - 2 - 3 - group <character> 1 aggregate_gene_id "CUFF.168931" 2 exonic_section_number "001"; transcripts "CUFF.168931.1"; is_full_exon "True"; aggregate_gene_id "CUFF.168931" 3 exonic_section_number "002"; transcripts "CUFF.168931.2"; is_full_exon "True"; aggregate_gene_id "CUFF.168931" > > sessionInfo() R version 2.13.0 (2011-04-13) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicRanges_1.4.3 IRanges_1.10.0 rtracklayer_1.12.0 [4] RCurl_1.5-0 bitops_1.0-4.1 loaded via a namespace (and not attached): [1] Biostrings_2.20.0 BSgenome_1.20.0 tools_2.13.0 XML_3.4-0 ---------------------------------------------------------------------- -------------------------------------------------------- :::::::::::::: test1.gtf :::::::::::::: ##gff-version 3 ##date 2011-05-17 chr1 aggregate_genes.py aggregate_gene 564418 570350 . - . aggregate_gene_id "CUFF.168931" chr1 aggregate_genes.py exonic_section 564418 565718 . - . exonic_section_number "001"; transcripts "CUFF.168931.1"; is_full_exon "True"; aggreg ate_gene_id "CUFF.168931" chr1 aggregate_genes.py exonic_section 568957 568966 . - . exonic_section_number "002"; transcripts "CUFF.168931.2"; is_full_exon "True"; aggreg ate_gene_id "CUFF.168931" :::::::::::::: test2.gtf :::::::::::::: chr1 aggregate_genes.py aggregate_gene 564418 570350 . - . aggregate_gene_id "CUFF.168931" chr1 aggregate_genes.py exonic_section 564418 565718 . - . exonic_section_number "001"; transcripts "CUFF.168931.1"; is_full_exon "True"; aggreg ate_gene_id "CUFF.168931" chr1 aggregate_genes.py exonic_section 568957 568966 . - . exonic_section_number "002"; transcripts "CUFF.168931.2"; is_full_exon "True"; aggreg ate_gene_id "CUFF.168931" On 14/04/11 13:59, Michael Lawrence wrote: > rtracklayer currently considers GFF3 files to be right-open. The GFF3 > spec states that start is always <= end, and that zero-width intervals > have start == end. To me, this suggests that they are right-open. > Otherwise, you need some other way to distinguish zero vs. one width > intervals, which is crazy. > > I can see the point though that for most use-cases zero and non-zero > width intervals are not mixed, and the user usually knows which is > which. But that's pretty poor design. > > Originally, rtracklayer used closed intervals for GFF3, but I changed > that a couple of years ago after seeing the genomeIntervals > documentation and re-reading the spec. > > Perhaps we can assume that since previous versions of GFF are clearly > specified to have closed intervals, that GFF3 follows the same > convention, by default. > > I'll make that change. Users will need to modify the imported data > structure if they want to consider zero-width intervals. > > I've personally never used GFF, so this is all pretty vague to me. > > Michael > > On Thu, Apr 14, 2011 at 5:16 AM, Kathi Zarnack <zarnack@ebi.ac.uk> <mailto:zarnack@ebi.ac.uk>> wrote: > > Hi, > > I am using the package rtracklayer to import transcript.gtf files > produced by Cufflinks. > > As I understand the gff3 specification, feature coordinates are > given as "start and end of the feature, in 1-based integer > coordinates" (also discussed in this mailing list lately), meaning > that the line below from my gtf file corresponds to an exons > ranging from 1310534 to 1310771. > > original line from the gtf file: > chr1 transcripts_C4 exon 1310534 1310771 78 - > . Parent=CUFF.1065.1 > > However, upon rtracklayer import, the exon ends at 1310770 (see > below). Thus, as I understand it, rtracklayer import.gff() > interprets gtf as "1-based right-open" (upon export using > export.gff3(), it also becomes 1310771 again). I tried importing > with explicitly specifying version="3" and also updated to the > latest rtracklayer version, but neither helped. Is this a bug in > the rtracklayer function or am I interpreting the gff coordinates > wrongly? Any comments will be appreciated. > > Thanks for your help. > > Best regards, > Kathi > > > > library(rtracklayer) > Loading required package: RCurl > Loading required package: bitops > > > > gff=import.gff("/nfs/research2/luscombe/kathi/data/expression_da ta/hnRNPC_mRNAseq/cufflinks_0.9.3/cufflinks_C4/transcripts_C4.gtf", > + genome="hg19",asRangedData=FALSE) > > > gff[177] > GRanges with 1 range and 11 elementMetadata values > seqnames ranges strand | type source > phase > <rle> <iranges> <rle> | <character> <character> <character> > [1] chr1 [1310534, 1310770] - | exon Cufflinks_C4 > NA > conf_hi conf_lo cov FPKM frac ID > Parent > <numeric> <numeric> <numeric> <numeric> <numeric> <character> > <character> > [1] NA NA NA NA NA NA > CUFF.1065.1 > score > <numeric> > [1] 78 > > seqlengths > chr1 chr10 chr11 chr12 chr13 chr14 ... chr7 chr8 chr9 chrM > chrX chrY > NA NA NA NA NA NA ... NA NA NA NA > NA NA > > > export.gff3(gff[177],"test_export.gtf") > > > [zarnack@ebi-001 ~]$ more test_export.gtf > ##gff-version 3 > ##date 2011-04-14 > chr1 Cufflinks_C4 exon 1310534 1310771 78 - > NA Parent=CUFF.1065.1 > > > > sessionInfo() > R version 2.12.0 (2010-10-15) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] > LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] > LC_MONETARY=C LC_MESSAGES=en_GB.UTF-8 [7] > LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] > LC_ADDRESS=C LC_TELEPHONE=C [11] > LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > attached base packages: > [1] stats graphics grDevices utils datasets methods base > other attached packages: > [1] rtracklayer_1.10.6 RCurl_1.5-0 bitops_1.0-4.1 > loaded via a namespace (and not attached): > [1] Biobase_2.10.0 Biostrings_2.18.0 BSgenome_1.18.1 [4] > GenomicRanges_1.2.1 IRanges_1.8.9 tools_2.12.0 [7] > XML_3.2-0 > > > -- > Dr. Kathi Zarnack > Luscombe Group > European Bioinformatics Institute > Wellcome Trust Genome Campus > Hinxton, Cambridge > CB10 1SD, UK > tel +44 1223 494 526 > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org <mailto:bioconductor@r-project.org> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- Dr. Kathi Zarnack Luscombe Group European Bioinformatics Institute Wellcome Trust Genome Campus Hinxton, Cambridge CB10 1SD, UK tel +44 1223 494 526 [[alternative HTML version deleted]]
rtracklayer rtracklayer • 1.8k views
ADD COMMENT
0
Entering edit mode
@michael-lawrence-3846
Last seen 3.0 years ago
United States
Hi Kathi, GTF is GFF2, not GFF3, so your header is confusing rtracklayer. As for the different ends, the switch to using closed intervals for GFF3 happened after the last release and has not been ported. I can do now that now, but it shouldn't affect your work. Thanks, Michael On Wed, May 25, 2011 at 2:56 AM, Kathi Zarnack <zarnack@ebi.ac.uk> wrote: > Hi Michael and others, > > I have one more comment about import.gff() from rtracklayer: I noticed that > the function behaves differently depending on whether the gtf file contains > a header section. In the examples below, test1.gtf contains the header (as > returned by Cufflinks), while test2.gtf doesn't. In the first case, the end > positions are interpreted exclusive, and inclusive in the second (in > addition, there are several warnings about the import of the header as > such). Is this because they are interpreted as different versions of gff? > According to the documentations that I could find, gff1, gff2 and gff3 > should all use inclusive coordinates. Is there a way to allow for the > header? > > Thanks for your help, > Kathi > > > > test1=import.gff("test1.gtf") > Warning messages: > 1: In matrix(unlist(strsplit(attrs, "=", fixed = TRUE)), nrow = 2) : > data length [9] is not a sub-multiple or multiple of the number of rows > [2] > 2: In .Method(..., deparse.level = deparse.level) : > number of rows of result is not a multiple of vector length (arg 2) > 3: In split.default(seq_len(nrow(x)), f, drop = drop, ...) : > data length is not a multiple of split variable > > > > test1 > RangedData with 3 rows and 7 value columns across 1 space > space ranges | type source phase > <factor> <iranges> | <factor> <factor> <factor> > 1 chr1 [564418, 570349] | aggregate_gene aggregate_genes.py NA > 2 chr1 [564418, 565717] | exonic_section aggregate_genes.py NA > 3 chr1 [568957, 568965] | exonic_section aggregate_genes.py NA > strand aggregate_gene_id..CUFF.168931. transcripts..CUFF.168931.1. > <factor> <character> <character> > 1 - exonic_section_number "001" NA > 2 - aggregate_gene_id "CUFF.168931" is_full_exon "True" > 3 - exonic_section_number "002" is_full_exon "True" > transcripts..CUFF.168931.2. > <character> > 1 NA > 2 is_full_exon "True" > 3 is_full_exon "True" > > > > test2=import.gff("test2.gtf") > > test2 > RangedData with 3 rows and 5 value columns across 1 space > space ranges | type source phase > <factor> <iranges> | <factor> <factor> <factor> > 1 chr1 [564418, 570350] | aggregate_gene aggregate_genes.py NA > 2 chr1 [564418, 565718] | exonic_section aggregate_genes.py NA > 3 chr1 [568957, 568966] | exonic_section aggregate_genes.py NA > strand > <factor> > 1 - > 2 - > 3 - > > group > > <character> > 1 > aggregate_gene_id "CUFF.168931" > 2 exonic_section_number "001"; transcripts "CUFF.168931.1"; is_full_exon > "True"; aggregate_gene_id "CUFF.168931" > 3 exonic_section_number "002"; transcripts "CUFF.168931.2"; is_full_exon > "True"; aggregate_gene_id "CUFF.168931" > > > > sessionInfo() > R version 2.13.0 (2011-04-13) > > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 > [5] LC_MONETARY=C LC_MESSAGES=en_GB.UTF-8 > [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicRanges_1.4.3 IRanges_1.10.0 rtracklayer_1.12.0 > [4] RCurl_1.5-0 bitops_1.0-4.1 > > loaded via a namespace (and not attached): > [1] Biostrings_2.20.0 BSgenome_1.20.0 tools_2.13.0 XML_3.4-0 > > > -------------------------------------------------------------------- ---------------------------------------------------------- > :::::::::::::: > test1.gtf > :::::::::::::: > ##gff-version 3 > ##date 2011-05-17 > chr1 aggregate_genes.py aggregate_gene 564418 570350 . > - . aggregate_gene_id "CUFF.168931" > chr1 aggregate_genes.py exonic_section 564418 565718 . > - . exonic_section_number "001"; transcripts "CUFF.168931.1"; > is_full_exon "True"; aggreg > ate_gene_id "CUFF.168931" > chr1 aggregate_genes.py exonic_section 568957 568966 . > - . exonic_section_number "002"; transcripts "CUFF.168931.2"; > is_full_exon "True"; aggreg > ate_gene_id "CUFF.168931" > :::::::::::::: > test2.gtf > :::::::::::::: > chr1 aggregate_genes.py aggregate_gene 564418 570350 . > - . aggregate_gene_id "CUFF.168931" > chr1 aggregate_genes.py exonic_section 564418 565718 . > - . exonic_section_number "001"; transcripts "CUFF.168931.1"; > is_full_exon "True"; aggreg > ate_gene_id "CUFF.168931" > chr1 aggregate_genes.py exonic_section 568957 568966 . > - . exonic_section_number "002"; transcripts "CUFF.168931.2"; > is_full_exon "True"; aggreg > ate_gene_id "CUFF.168931" > > > > > > On 14/04/11 13:59, Michael Lawrence wrote: > > rtracklayer currently considers GFF3 files to be right-open. The GFF3 spec > states that start is always <= end, and that zero-width intervals have start > == end. To me, this suggests that they are right-open. Otherwise, you need > some other way to distinguish zero vs. one width intervals, which is crazy. > > I can see the point though that for most use-cases zero and non-zero width > intervals are not mixed, and the user usually knows which is which. But > that's pretty poor design. > > Originally, rtracklayer used closed intervals for GFF3, but I changed that > a couple of years ago after seeing the genomeIntervals documentation and > re-reading the spec. > > Perhaps we can assume that since previous versions of GFF are clearly > specified to have closed intervals, that GFF3 follows the same convention, > by default. > > I'll make that change. Users will need to modify the imported data > structure if they want to consider zero-width intervals. > > I've personally never used GFF, so this is all pretty vague to me. > > Michael > > On Thu, Apr 14, 2011 at 5:16 AM, Kathi Zarnack <zarnack@ebi.ac.uk> wrote: > >> Hi, >> >> I am using the package rtracklayer to import transcript.gtf files produced >> by Cufflinks. >> >> As I understand the gff3 specification, feature coordinates are given as >> "start and end of the feature, in 1-based integer coordinates" (also >> discussed in this mailing list lately), meaning that the line below from my >> gtf file corresponds to an exons ranging from 1310534 to 1310771. >> >> original line from the gtf file: >> chr1 transcripts_C4 exon 1310534 1310771 78 - . >> Parent=CUFF.1065.1 >> >> However, upon rtracklayer import, the exon ends at 1310770 (see below). >> Thus, as I understand it, rtracklayer import.gff() interprets gtf as >> "1-based right-open" (upon export using export.gff3(), it also becomes >> 1310771 again). I tried importing with explicitly specifying version="3" and >> also updated to the latest rtracklayer version, but neither helped. Is this >> a bug in the rtracklayer function or am I interpreting the gff coordinates >> wrongly? Any comments will be appreciated. >> >> Thanks for your help. >> >> Best regards, >> Kathi >> >> >> > library(rtracklayer) >> Loading required package: RCurl >> Loading required package: bitops >> >> > >> gff=import.gff("/nfs/research2/luscombe/kathi/data/expression_data/ hnRNPC_mRNAseq/cufflinks_0.9.3/cufflinks_C4/transcripts_C4.gtf", >> + genome="hg19",asRangedData=FALSE) >> >> > gff[177] >> GRanges with 1 range and 11 elementMetadata values >> seqnames ranges strand | type source >> phase >> <rle> <iranges> <rle> | <character> <character> >> <character> >> [1] chr1 [1310534, 1310770] - | exon Cufflinks_C4 >> NA >> conf_hi conf_lo cov FPKM frac ID >> Parent >> <numeric> <numeric> <numeric> <numeric> <numeric> <character> >> <character> >> [1] NA NA NA NA NA NA >> CUFF.1065.1 >> score >> <numeric> >> [1] 78 >> >> seqlengths >> chr1 chr10 chr11 chr12 chr13 chr14 ... chr7 chr8 chr9 chrM chrX >> chrY >> NA NA NA NA NA NA ... NA NA NA NA NA NA >> >> > export.gff3(gff[177],"test_export.gtf") >> >> >> [zarnack@ebi-001 ~]$ more test_export.gtf >> ##gff-version 3 >> ##date 2011-04-14 >> chr1 Cufflinks_C4 exon 1310534 1310771 78 - NA >> Parent=CUFF.1065.1 >> >> >> > sessionInfo() >> R version 2.12.0 (2010-10-15) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] >> LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=C >> LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C [11] >> LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> other attached packages: >> [1] rtracklayer_1.10.6 RCurl_1.5-0 bitops_1.0-4.1 >> loaded via a namespace (and not attached): >> [1] Biobase_2.10.0 Biostrings_2.18.0 BSgenome_1.18.1 [4] >> GenomicRanges_1.2.1 IRanges_1.8.9 tools_2.12.0 [7] XML_3.2-0 >> >> >> -- >> Dr. Kathi Zarnack >> Luscombe Group >> European Bioinformatics Institute >> Wellcome Trust Genome Campus >> Hinxton, Cambridge >> CB10 1SD, UK >> tel +44 1223 494 526 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > -- > Dr. Kathi Zarnack > Luscombe Group > European Bioinformatics Institute > Wellcome Trust Genome Campus > Hinxton, Cambridge > CB10 1SD, UK > tel +44 1223 494 526 > > [[alternative HTML version deleted]]
ADD COMMENT

Login before adding your answer.

Traffic: 710 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6