Entering edit mode
Kathi Zarnack
▴
110
@kathi-zarnack-4596
Last seen 10.2 years ago
Hi Michael and others,
I have one more comment about import.gff() from rtracklayer: I noticed
that the function behaves differently depending on whether the gtf
file
contains a header section. In the examples below, test1.gtf contains
the
header (as returned by Cufflinks), while test2.gtf doesn't. In the
first
case, the end positions are interpreted exclusive, and inclusive in
the
second (in addition, there are several warnings about the import of
the
header as such). Is this because they are interpreted as different
versions of gff? According to the documentations that I could find,
gff1, gff2 and gff3 should all use inclusive coordinates. Is there a
way
to allow for the header?
Thanks for your help,
Kathi
> test1=import.gff("test1.gtf")
Warning messages:
1: In matrix(unlist(strsplit(attrs, "=", fixed = TRUE)), nrow = 2) :
data length [9] is not a sub-multiple or multiple of the number of
rows [2]
2: In .Method(..., deparse.level = deparse.level) :
number of rows of result is not a multiple of vector length (arg 2)
3: In split.default(seq_len(nrow(x)), f, drop = drop, ...) :
data length is not a multiple of split variable
>
> test1
RangedData with 3 rows and 7 value columns across 1 space
space ranges | type source
phase
<factor> <iranges> | <factor> <factor> <factor>
1 chr1 [564418, 570349] | aggregate_gene aggregate_genes.py
NA
2 chr1 [564418, 565717] | exonic_section aggregate_genes.py
NA
3 chr1 [568957, 568965] | exonic_section aggregate_genes.py
NA
strand aggregate_gene_id..CUFF.168931.
transcripts..CUFF.168931.1.
<factor> <character> <character>
1 - exonic_section_number "001" NA
2 - aggregate_gene_id "CUFF.168931" is_full_exon "True"
3 - exonic_section_number "002" is_full_exon "True"
transcripts..CUFF.168931.2.
<character>
1 NA
2 is_full_exon "True"
3 is_full_exon "True"
>
> test2=import.gff("test2.gtf")
> test2
RangedData with 3 rows and 5 value columns across 1 space
space ranges | type source
phase
<factor> <iranges> | <factor> <factor> <factor>
1 chr1 [564418, 570350] | aggregate_gene aggregate_genes.py
NA
2 chr1 [564418, 565718] | exonic_section aggregate_genes.py
NA
3 chr1 [568957, 568966] | exonic_section aggregate_genes.py
NA
strand
<factor>
1 -
2 -
3 -
group
<character>
1
aggregate_gene_id "CUFF.168931"
2 exonic_section_number "001"; transcripts "CUFF.168931.1";
is_full_exon
"True"; aggregate_gene_id "CUFF.168931"
3 exonic_section_number "002"; transcripts "CUFF.168931.2";
is_full_exon
"True"; aggregate_gene_id "CUFF.168931"
>
> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] GenomicRanges_1.4.3 IRanges_1.10.0 rtracklayer_1.12.0
[4] RCurl_1.5-0 bitops_1.0-4.1
loaded via a namespace (and not attached):
[1] Biostrings_2.20.0 BSgenome_1.20.0 tools_2.13.0 XML_3.4-0
----------------------------------------------------------------------
--------------------------------------------------------
::::::::::::::
test1.gtf
::::::::::::::
##gff-version 3
##date 2011-05-17
chr1 aggregate_genes.py aggregate_gene 564418 570350 .
- . aggregate_gene_id "CUFF.168931"
chr1 aggregate_genes.py exonic_section 564418 565718 .
- . exonic_section_number "001"; transcripts "CUFF.168931.1";
is_full_exon "True"; aggreg
ate_gene_id "CUFF.168931"
chr1 aggregate_genes.py exonic_section 568957 568966 .
- . exonic_section_number "002"; transcripts "CUFF.168931.2";
is_full_exon "True"; aggreg
ate_gene_id "CUFF.168931"
::::::::::::::
test2.gtf
::::::::::::::
chr1 aggregate_genes.py aggregate_gene 564418 570350 .
- . aggregate_gene_id "CUFF.168931"
chr1 aggregate_genes.py exonic_section 564418 565718 .
- . exonic_section_number "001"; transcripts "CUFF.168931.1";
is_full_exon "True"; aggreg
ate_gene_id "CUFF.168931"
chr1 aggregate_genes.py exonic_section 568957 568966 .
- . exonic_section_number "002"; transcripts "CUFF.168931.2";
is_full_exon "True"; aggreg
ate_gene_id "CUFF.168931"
On 14/04/11 13:59, Michael Lawrence wrote:
> rtracklayer currently considers GFF3 files to be right-open. The
GFF3
> spec states that start is always <= end, and that zero-width
intervals
> have start == end. To me, this suggests that they are right-open.
> Otherwise, you need some other way to distinguish zero vs. one width
> intervals, which is crazy.
>
> I can see the point though that for most use-cases zero and non-zero
> width intervals are not mixed, and the user usually knows which is
> which. But that's pretty poor design.
>
> Originally, rtracklayer used closed intervals for GFF3, but I
changed
> that a couple of years ago after seeing the genomeIntervals
> documentation and re-reading the spec.
>
> Perhaps we can assume that since previous versions of GFF are
clearly
> specified to have closed intervals, that GFF3 follows the same
> convention, by default.
>
> I'll make that change. Users will need to modify the imported data
> structure if they want to consider zero-width intervals.
>
> I've personally never used GFF, so this is all pretty vague to me.
>
> Michael
>
> On Thu, Apr 14, 2011 at 5:16 AM, Kathi Zarnack <zarnack@ebi.ac.uk> <mailto:zarnack@ebi.ac.uk>> wrote:
>
> Hi,
>
> I am using the package rtracklayer to import transcript.gtf
files
> produced by Cufflinks.
>
> As I understand the gff3 specification, feature coordinates are
> given as "start and end of the feature, in 1-based integer
> coordinates" (also discussed in this mailing list lately),
meaning
> that the line below from my gtf file corresponds to an exons
> ranging from 1310534 to 1310771.
>
> original line from the gtf file:
> chr1 transcripts_C4 exon 1310534 1310771 78 -
> . Parent=CUFF.1065.1
>
> However, upon rtracklayer import, the exon ends at 1310770 (see
> below). Thus, as I understand it, rtracklayer import.gff()
> interprets gtf as "1-based right-open" (upon export using
> export.gff3(), it also becomes 1310771 again). I tried importing
> with explicitly specifying version="3" and also updated to the
> latest rtracklayer version, but neither helped. Is this a bug in
> the rtracklayer function or am I interpreting the gff
coordinates
> wrongly? Any comments will be appreciated.
>
> Thanks for your help.
>
> Best regards,
> Kathi
>
>
> > library(rtracklayer)
> Loading required package: RCurl
> Loading required package: bitops
>
> >
> gff=import.gff("/nfs/research2/luscombe/kathi/data/expression_da
ta/hnRNPC_mRNAseq/cufflinks_0.9.3/cufflinks_C4/transcripts_C4.gtf",
> + genome="hg19",asRangedData=FALSE)
>
> > gff[177]
> GRanges with 1 range and 11 elementMetadata values
> seqnames ranges strand | type source
> phase
> <rle> <iranges> <rle> | <character> <character> <character>
> [1] chr1 [1310534, 1310770] - | exon
Cufflinks_C4
> NA
> conf_hi conf_lo cov FPKM frac ID
> Parent
> <numeric> <numeric> <numeric> <numeric> <numeric> <character>
> <character>
> [1] NA NA NA NA NA
NA
> CUFF.1065.1
> score
> <numeric>
> [1] 78
>
> seqlengths
> chr1 chr10 chr11 chr12 chr13 chr14 ... chr7 chr8 chr9 chrM
> chrX chrY
> NA NA NA NA NA NA ... NA NA NA NA
> NA NA
>
> > export.gff3(gff[177],"test_export.gtf")
>
>
> [zarnack@ebi-001 ~]$ more test_export.gtf
> ##gff-version 3
> ##date 2011-04-14
> chr1 Cufflinks_C4 exon 1310534 1310771 78 -
> NA Parent=CUFF.1065.1
>
>
> > sessionInfo()
> R version 2.12.0 (2010-10-15)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3]
> LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5]
> LC_MONETARY=C LC_MESSAGES=en_GB.UTF-8 [7]
> LC_PAPER=en_GB.UTF-8 LC_NAME=C [9]
> LC_ADDRESS=C LC_TELEPHONE=C [11]
> LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
> [1] stats graphics grDevices utils datasets methods
base
> other attached packages:
> [1] rtracklayer_1.10.6 RCurl_1.5-0 bitops_1.0-4.1
> loaded via a namespace (and not attached):
> [1] Biobase_2.10.0 Biostrings_2.18.0 BSgenome_1.18.1
[4]
> GenomicRanges_1.2.1 IRanges_1.8.9 tools_2.12.0 [7]
> XML_3.2-0
>
>
> --
> Dr. Kathi Zarnack
> Luscombe Group
> European Bioinformatics Institute
> Wellcome Trust Genome Campus
> Hinxton, Cambridge
> CB10 1SD, UK
> tel +44 1223 494 526
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org <mailto:bioconductor@r-project.org>
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
>
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
--
Dr. Kathi Zarnack
Luscombe Group
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge
CB10 1SD, UK
tel +44 1223 494 526
[[alternative HTML version deleted]]