rtracklayer and UCSC


Kasper Daniel Hansen ★ 6.5k

@kasper-daniel-hansen-2979


United States

As far as I know, UCSC uses zero-based indexing of their genomes, while R uses 1-based. What kind of conversion does rtracklayer use? I suspect none at all. It might be worthwhile to add a discussion of this somewhere in the vignette.

More specifically, I have downloaded a couple of tables from UCSC using rtracklayer, and I want to know whether I need to add 1 to the column named exonStart (after suitable splitting; it is a comma-separated character list).

Kasper

rtracklayer genomes • 2.2k views

updated 16.8 years ago by Michael Lawrence ▴ 620 • written 16.8 years ago by Kasper Daniel Hansen ★ 6.5k


Michael Lawrence ▴ 620

@michael-lawrence-2759


On Thu, May 14, 2009 at 4:29 PM, Kasper Daniel Hansen <khansen@stat.berkeley.edu> wrote:

> As far as I know UCSC uses zero-based indexing of their genomes, R uses
> 1-based. What kind of conversion is being used by rtracklayer - I suspect
> none at all?

The indexing is 1-based. rtracklayer takes care of all of this (0- vs 1-based, closed vs half-open) behind the scenes. I've found places where I've messed up before, though, so please let me know if you find inconsistencies.

> It might be worthwhile to add a discussion about this somewhere in the
> vignette?

Yes, it should be mentioned.

> More specifically, I have downloaded a couple of tables from UCSC using
> rtracklayer and I wanted to know if I need to add 1 to the column named
> exonStart (after a suitable splitting - it is a comma separated character
> list).

If you download a table (not an actual RangedData track), then the columns have not been adjusted at all. I suggest you make everything 1-based and closed if you want to use it with packages like IRanges and Biostrings.

Btw, if you had obtained the data using the track() function, which returns a RangedData, you could call blocks() on it to get the block information as a RangesList. But I just found that I forgot to add 1 in that method; fixed in svn.

Thanks,
Michael
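For a table downloaded without adjustment, the conversion to 1-based closed coordinates might look like the base-R sketch below. The coordinate values are invented for illustration, `parse_ucsc_list` is a hypothetical helper, and the trailing comma mimics how UCSC gene tables usually write these lists. It is only a sketch of the arithmetic, not what rtracklayer does internally.

```r
# Hypothetical exon columns from one row of a UCSC gene table
# (values invented for illustration).
exon_starts_raw <- "100,300,500,"   # 0-based starts, comma separated
exon_ends_raw   <- "150,340,620,"   # ends of the half-open intervals

# Hypothetical helper: split a comma-separated list into an integer vector.
# strsplit() drops the empty string after the trailing comma.
parse_ucsc_list <- function(x) {
  as.integer(strsplit(x, ",", fixed = TRUE)[[1]])
}

starts0 <- parse_ucsc_list(exon_starts_raw)
ends    <- parse_ucsc_list(exon_ends_raw)

# A 0-based half-open interval [start0, end) covers the same bases as the
# 1-based closed interval [start0 + 1, end]: add 1 to starts, leave ends alone.
starts1 <- starts0 + 1L

# The width is the same in both conventions.
widths <- ends - starts1 + 1L
stopifnot(identical(widths, ends - starts0))

exons <- data.frame(start = starts1, end = ends, width = widths)
print(exons)
```

Only the start column changes; the end column can be used as-is because the exclusive 0-based end and the inclusive 1-based end are the same number.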

written 16.8 years ago by Michael Lawrence ▴ 620


Sean Davis 21k

@sean-davis-490


United States

On Thu, May 14, 2009 at 7:29 PM, Kasper Daniel Hansen <khansen@stat.berkeley.edu> wrote:

> As far as I know UCSC uses zero-based indexing of their genomes, R uses
> 1-based. What kind of conversion is being used by rtracklayer - I suspect
> none at all? It might be worthwhile to add a discussion about this somewhere
> in the vignette?

It is even slightly more complicated than that. They use zero-based starts and 1-based ends, except for graphical display:

http://genome.ucsc.edu/FAQ/FAQtracks#tracks1

Sean

written 16.8 years ago by Sean Davis 21k


My understanding of UCSC co-ordinates is, as Sean says, zero based and one based. However, I have stopped using the words "start" and "end" with UCSC co-ordinates. I believe it would be better to use "left" and "right".

The UCSC data definitions of their annotation files, see:

http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.sql

use txStart/txEnd, cdsStart/cdsEnd, exonStarts/exonEnds. However, these co-ordinates are only start and end co-ordinates for positive strand genes. They are end and start co-ordinates for negative strand genes, assuming that start means the 5 prime end of a gene.

I think it is more accurate to say that LEFT end UCSC co-ordinates are zero based and RIGHT end UCSC co-ordinates are one based.

However, note that whenever UCSC displays co-ordinates to GUI users, they adjust left end co-ordinates back to being one based. If I remember correctly, if you use the DNA option in the UCSC browser to get DNA bases, the co-ordinates are all still one based, but as stated, if you download the annotation files, such as refGene.txt, from the above link, the left co-ordinates are zero based.

I don't know how rtracklayer handles this issue.

cheers,
Keith
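The left/right distinction can be made concrete with a small base-R sketch. The two transcripts and their values are invented; only the column names txStart and txEnd come from the refGene definition linked above. The code converts the left edge to 1-based and then picks the 5 prime end according to strand.

```r
# Hypothetical transcripts in UCSC table co-ordinates (values invented).
tx <- data.frame(
  name    = c("geneA", "geneB"),
  strand  = c("+", "-"),
  txStart = c(1000L, 5000L),  # left edge, zero based
  txEnd   = c(2000L, 7000L),  # right edge, one based
  stringsAsFactors = FALSE
)

# One-based closed left/right co-ordinates: only the left edge shifts.
tx$left  <- tx$txStart + 1L
tx$right <- tx$txEnd

# The 5 prime end is the left edge on "+" and the right edge on "-".
tx$five_prime <- ifelse(tx$strand == "+", tx$left, tx$right)

print(tx[, c("name", "strand", "left", "right", "five_prime")])
```

For the negative strand gene the 5 prime end is taken from txEnd, so no +1 adjustment is applied to it.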

written 16.8 years ago by Keith Satterley ▴ 450


On Thu, May 14, 2009 at 5:23 PM, Keith Satterley <keith@wehi.edu.au> wrote:

> I don't know how rtracklayer handles this issue.

UCSC coordinates are 0-based half-open intervals relative to the 5' end of the positive strand. rtracklayer makes them 1-based closed intervals, also relative to the 5' end of the positive strand. Placing everything into the same frame of reference makes it easier to perform e.g. overlap queries.

If you want to flip things around, see the reflect() function in IRanges. The flank() function is a convenient way to get out e.g. promoter regions taking into account the strand.
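To illustrate what "taking into account the strand" means for a promoter-like region, here is a base-R sketch of the arithmetic on 1-based closed coordinates. `upstream_region` is a hypothetical helper written for this example; it is not the implementation of flank(), and it does not clip at chromosome boundaries.

```r
# Hypothetical helper: the `width` bases immediately upstream of each feature,
# given 1-based closed left/right coordinates on the positive-strand axis.
upstream_region <- function(left, right, strand, width) {
  plus <- strand == "+"
  data.frame(
    # On "+", upstream lies to the left of the feature; on "-", to the right.
    start = ifelse(plus, left - width, right + 1L),
    end   = ifelse(plus, left - 1L,    right + width)
  )
}

# Two invented features, one per strand, with a 200-base upstream window.
promoters <- upstream_region(
  left   = c(1001L, 5001L),
  right  = c(2000L, 7000L),
  strand = c("+", "-"),
  width  = 200L
)
print(promoters)
```

Because everything stays in the positive-strand frame of reference, the resulting intervals can be compared directly with other 1-based closed ranges in overlap queries.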

written 16.8 years ago by Michael Lawrence ▴ 620
