GFF files: how to tell if the right-open interval convention is used?
@karlerhardberkeleyedu-4569
Thanks for the answers, very helpful. I was a bit confused reading the help page for the readGff3 function, which states:

"When the GFF file follows the right-open interval convention (isRightOpen is TRUE), then GFF entries for which end base equals first base are recognized as zero-length features and loaded as inter_base intervals."

This suggests that there are gff files out there that have this right-open interval convention. I will just have to find out the source of the particular gff file I'm using (which describes a filtered gene list for maize) to determine whether it uses this convention or not. Though based on your comments, I think I should assume that it is 1-based closed.

> karl,
>
> GFF should always be 1-based closed, not 0-based right-open (unlike BED format). I think this convention goes back to the original version of GFF from Sanger up to the latest version, GFF3.
>
> So, it probably comes down to whether the source of the GFF output you are using is generating the correct coordinates, not how R/BioC is processing it. Unless the girafe method in question is allowing BED output to be read as well (I would consider that bad).
>
> chris
>
> On Mar 30, 2011, at 3:58 PM, karlerhard at berkeley.edu wrote:
>
>> Hi all,
>>
>> I'm a grad student at UC Berkeley, I'm new to the list, as well as to R programs in general, so I hope you'll forgive my simplistic questions.
>>
>> I'm working with the girafe package to generate counts table which can be input into edgeR. I've noticed that the readGff3 function is sensitive to whether the gff file being read uses this "right-open interval convention" or not. I'm just not sure how to tell if the gff file I am using follows this convention. Is there a simple way to find out?
>>
>> Any help on this would be greatly appreciated.
>>
>> best,
>>
>> karl
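One rough way to inspect a file yourself is to tally a few coordinate patterns. The sketch below is only a heuristic and is not part of girafe or of anything proposed in this thread; the function name `coordinate_hints` and the sample lines are made up for illustration. A start of 0 cannot occur in a 1-based file, and a start greater than end contradicts the GFF specs quoted in this thread, so either pattern hints at a different convention or a broken export. Rows with start equal to end are only counted, because their meaning depends on the convention in use.

```python
def coordinate_hints(gff_lines):
    """Tally coordinate patterns in GFF lines (a heuristic, not a proof)."""
    hints = {
        "features": 0,
        "start_is_zero": 0,      # impossible in a 1-based file
        "start_equals_end": 0,   # one base if closed, zero-length if right-open
        "start_after_end": 0,    # contradicts "start <= end" in the GFF specs
    }
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # not a feature line (e.g. embedded FASTA)
        try:
            start, end = int(fields[3]), int(fields[4])
        except ValueError:
            continue  # columns 4 and 5 are not integers
        hints["features"] += 1
        if start == 0:
            hints["start_is_zero"] += 1
        if start == end:
            hints["start_equals_end"] += 1
        if start > end:
            hints["start_after_end"] += 1
    return hints


# Hypothetical sample lines, for illustration only.
sample = [
    "##gff-version 3",
    "chr1\tdemo\tgene\t100\t200\t.\t+\t.\tID=g1",
    "chr1\tdemo\tSNP\t150\t150\t.\t+\t.\tID=s1",
]
print(coordinate_hints(sample))
```

A clean tally does not prove the file is 1-based closed; checking the documentation of whatever produced the file remains the more reliable route.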
edgeR girafe
@herve-pages-1542
Seattle, WA, United States
karl,

On 03/31/2011 03:42 PM, karlerhard at berkeley.edu wrote:
> Thanks for the answers, very helpful. I was a bit confused reading the help page for the readGff3 function, which states:
>
> "When the GFF file follows the right-open interval convention (isRightOpen is TRUE), then GFF entries for which end base equals first base are recognized as zero-length features and loaded as inter_base intervals."

Also in the same man page:

  Usage:
    readGff3(file, isRightOpen=TRUE)

  Arguments:
    file: The name of the gff file to read.
    isRightOpen: Although a proper GFF3 file follows the convention of right-open intervals, improper GFF files following the right-closed convention are frequently found. Set 'isRightOpen = FALSE' in this case.

But this looks incorrect to me. My understanding of the GFF3 specs at http://www.sequenceontology.org is different:

  Columns 4 & 5: "start" and "end"
    The start and end of the feature, in 1-based integer coordinates, relative to the landmark given in column 1. Start is always less than or equal to end.

    <-- omitting some stuff -->

    For zero-length features, such as insertion sites, start equals end and the implied site is to the right of the indicated base in the direction of the landmark.

Nothing is said about using a right-open interval convention. Zero-length features are treated specially, but this still does not amount to a right-open interval convention for them (the way they are represented could maybe be interpreted as a right-open interval if the implied site was to the *left* of the indicated base, but that is not the case).

> This suggests that there are gff files out there that have this right-open interval convention. I will just have to find out the source of the particular gff file I'm using (which describes a filtered gene list for maize) to determine whether it uses this convention or not. Though based on your comments, I think I should assume that it is 1-based closed.

Yes, 1-based closed on both sides.

Cheers,
H.
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
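To make the difference between the two conventions mentioned in this thread concrete, here is a minimal sketch of converting between GFF coordinates (1-based, closed) and BED coordinates (0-based, right-open). The helper names `gff_length`, `gff_to_bed` and `bed_to_gff` are hypothetical and do not come from girafe or any other package; the sketch assumes features that cover at least one base.

```python
def gff_length(start, end):
    """Number of bases in a 1-based, closed interval [start, end]."""
    return end - start + 1


def gff_to_bed(start, end):
    """1-based closed -> 0-based right-open: only the start shifts by one."""
    return start - 1, end


def bed_to_gff(start, end):
    """0-based right-open -> 1-based closed: only the start shifts by one."""
    return start + 1, end


# A feature covering bases 100 through 200 in GFF coordinates.
gff_start, gff_end = 100, 200
bed_start, bed_end = gff_to_bed(gff_start, gff_end)

print(gff_length(gff_start, gff_end))   # length under the closed convention
print(bed_end - bed_start)              # the same length in BED coordinates
print(bed_to_gff(bed_start, bed_end))   # round trip back to GFF coordinates
```

Both conventions describe the same 101 bases here. Trouble only arises when coordinates written under one convention are read under the other, which shifts every feature length by one.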
Hi,

I guess this all depends on which GFF specification you consider authoritative. Different institutions have published slightly contradictory specs.

According to the Sanger web site, the original GFF spec is due to Durbin and Haussler, and the current version (according to Sanger) is GFF2, which, at http://www.sanger.ac.uk/resources/software/gff/spec.html, specifies:

"<start>, <end>: Integers. <start> must be less than or equal to <end>. Sequence numbering starts at 1, so these numbers should be between 1 and the length of the relevant sequence, inclusive."

A feature is typically a stretch of consecutive base pairs which make up something, say, an exon, so I guess they did not have zero-length features in mind when writing this. So, if start is less than or equal to end, a single-base-pair feature would be denoted with start=end. Furthermore, it should be possible to indicate the total chromosome as a feature, and then we can fulfill the second sentence of the quote above only by using closed intervals.

The UCSC Genome Browser team also interprets it that way. At http://genome.ucsc.edu/FAQ/FAQformat.html#format3 they write "[column] 5. end - The ending position of the feature (inclusive)." and I guess "inclusive" means closed.

Only later, somebody came up with the idea of zero-length (a.k.a. inter-base) features, such as positions of deletions, and suggested GFF3. Unfortunately, zero-length features can only be represented by a half-open convention, as otherwise we cannot distinguish whether a feature with start equal to end means a single base pair or an inter-base position. Hence, to me, this GFF3 proposal at http://www.sequenceontology.org/resources/gff3.html, which Hervé quoted, seems to be ill-defined. It implies that it is half-open, without saying so explicitly, and so breaks backwards compatibility.
<rant> Every half-way competent computer scientist knows that specifying _clearly_ whether the end is included is the very first thing one does when drafting a spec for anything involving intervals, because there are ample examples of specs using either choice. It baffles me that most genomics file format specs are so unclear about such things. Is our field really that unprofessional? </rant>

Simon
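The ambiguity around start equal to end can be shown with a few lines of arithmetic. This is a sketch only: `feature_length` and its `right_open` flag are hypothetical names chosen to echo the isRightOpen argument, not girafe's actual implementation, and the sketch assumes 1-based starts under both conventions.

```python
def feature_length(start, end, right_open):
    """Bases covered by (start, end) under the stated convention.

    Assumes 1-based starts in both cases (an assumption of this sketch).
    """
    if end < start:
        raise ValueError("start must not exceed end")
    # Right-open excludes the end position; closed includes it.
    return end - start if right_open else end - start + 1


for start, end in [(150, 150), (100, 200)]:
    closed = feature_length(start, end, right_open=False)
    half_open = feature_length(start, end, right_open=True)
    print(f"{start}-{end}: closed={closed} bases, right-open={half_open} bases")
```

The same pair 150, 150 is a single base under the closed reading and a zero-length inter-base position under the right-open reading, so the file alone cannot tell a reader which was meant.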