[somehow-OT] Storing/quickly accessing "genome length" data.

Steve Lianoglou ★ 13k

Hi,

I guess a lot of us have this problem: I'm storing "genome long" integer/double vectors for each position along each chromosome.

I want to quickly access parts of these vectors in a manner quite similar/convenient/efficient to how we can quickly access the reads in a given region of a BAM file. I'm curious what you folks are using to store this type of info?

Currently I just have RData objects of Rle's or XIntegers, etc. for each strand of each chromosome. I'll load these data files, query the info over the ranges I want, then junk the (usually large) vector I just loaded. It's not the best, but it works.

In the bioinformatics world, I guess these data are best stored as bigWig files, yes? And AFAIK, there's no (convenient or otherwise) way to query bigWigs from within R/Bioc, right?

Then I wonder if storing these in hdf/netcdf files isn't actually the way to go ... and if so, why not go whole-hog and work on a bioc interface to the somehow-defined biohdf format?

Any thoughts?

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
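To make the load, query, discard workflow in the post above a little more concrete, here is a minimal sketch of how a run-length-encoded per-position vector can answer a window query without expanding the whole chromosome. It is written in Python purely for illustration; it is not R, and it is not how the IRanges `Rle` class is implemented. The class name `RunLengthVector` and its `window` method are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate, groupby


class RunLengthVector:
    """Hypothetical run-length-encoded vector (illustrative only)."""

    def __init__(self, values):
        # Collapse consecutive equal values into (value, run length) pairs.
        runs = [(v, sum(1 for _ in g)) for v, g in groupby(values)]
        self.values = [v for v, _ in runs]
        # ends[i] is the exclusive 0-based end position of run i.
        self.ends = list(accumulate(n for _, n in runs))

    def __len__(self):
        return self.ends[-1] if self.ends else 0

    def window(self, start, end):
        """Return the values at 0-based positions [start, end) as a list."""
        if not 0 <= start <= end <= len(self):
            raise IndexError("window out of bounds")
        out = []
        # Binary search for the first run whose end lies past `start`.
        i = bisect_right(self.ends, start)
        pos = start
        while pos < end:
            stop = min(self.ends[i], end)
            out.extend([self.values[i]] * (stop - pos))
            pos = stop
            i += 1
        return out


coverage = RunLengthVector([0, 0, 0, 5, 5, 2, 2, 2, 2, 0])
print(coverage.window(2, 7))  # [0, 5, 5, 2, 2]
```

Only the runs overlapping the requested window are touched, which is why this kind of encoding suits long stretches of constant values. It does not by itself avoid loading the encoded object from disk, which is the cost the post is asking about.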

Written 13.2 years ago by Steve Lianoglou • updated 13.2 years ago by Michael Lawrence

Michael Lawrence ★ 11k

On Wed, Feb 9, 2011 at 1:08 PM, Steve Lianoglou <mailinglist.honeypot@gmail.com> wrote:

> In the bioinformatics world, I guess these data are best stored as
> bigWig files, yes? And AFAIK, there's no (convenient or otherwise) way
> to query bigWigs from within R/Bioc, right?

Actually, rtracklayer can query bigWigs. It's very efficient.

> Then I wonder if storing these in hdf/netcdf files isn't actually the
> way to go ... and if so, why not go whole-hog and work on a bioc
> interface to the somehow-defined biohdf format?
>
> Any thoughts?

This is also a good idea, especially if you have data for many samples. There's a group of us here at Genentech looking to improve upon the netcdf4 support in R.

This is the first I've heard of biohdf. Sounds kind of half-baked though.

Michael

Posted 13.2 years ago by Michael Lawrence

On Wed, Feb 9, 2011 at 4:25 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:

> Actually, rtracklayer can query bigWigs. It's very efficient.

Oh, I see ... sorry I missed that. I couldn't find info on it when searching through rtracklayer's vignette for "bigwig." I missed the BigWigSelection documentation.

And ... wow, I can create a bigWig via the export.bw, nice. I'll have to play with this a bit.

> This is also a good idea, especially if you have data for many samples.

Yes, that. But I'm also thinking of one such file per genome "release" I'm working with (things like conservation, mappability, etc. for hg18, hg19, mm9, etc).

> There's a group of us here at Genentech looking to improve upon the netcdf4
> support in R.

Interesting. Is your work "out in the open", or an internal project?

> This is the first I've heard of biohdf. Sounds kind of
> half-baked though.

I also haven't found any updated information since whatever document/webpage is up from last spring (March or April(?)). I reckon it's being worked/improved on somewhere, though. Perhaps sticking with the (more) standard netcdf4 is the right way to go, anyway.

Posted 13.2 years ago by Steve Lianoglou

On Wed, Feb 9, 2011 at 1:59 PM, Steve Lianoglou <mailinglist.honeypot@gmail.com> wrote:

> > There's a group of us here at Genentech looking to improve upon the
> > netcdf4 support in R.
>
> Interesting. Is your work "out in the open", or an internal project?

It will be open, as soon as we've started :) Basically want a more efficient/convenient API on top of the ncdf4 package.

Richard Bourgon has an ExpressionSet class that holds any element in assayData that exposes a "rectangular" API. It's working now for mmapable files via the ff package. Just like DataFrame works for any vector-like object.

Pete Haverty has extended ExpressionSet to hold a RangedData for fast interval-based subsetting.

Just an overview of the different directions that we want to sort of merge together.
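The appeal of memory-mappable files mentioned above is that a window of a very long vector can be read without loading the whole file. As a rough illustration only, and in Python rather than R, the sketch below writes a per-position vector of 32-bit integers to a flat binary file and reads a slice back through `mmap`. The file layout and the helper names `write_vector` and `read_window` are assumptions made for this example; they are not the ff or ncdf4 formats or APIs.

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.calcsize("<i")  # bytes per little-endian int32 (4)


def write_vector(path, values):
    """Write integers as consecutive little-endian int32 values."""
    with open(path, "wb") as fh:
        fh.write(struct.pack(f"<{len(values)}i", *values))


def read_window(path, start, end):
    """Read 0-based positions [start, end) without loading the whole file."""
    with open(path, "rb") as fh:
        # Map the file read-only; only the sliced bytes are copied out.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            chunk = mm[start * ITEM:end * ITEM]
    return list(struct.unpack(f"<{end - start}i", chunk))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chr1.int32")  # hypothetical file name
    write_vector(path, list(range(1000)))
    print(read_window(path, 500, 505))  # [500, 501, 502, 503, 504]
```

Because every element has a fixed width, the byte offset of any position is a simple multiplication, which is what makes this kind of random access cheap. Compressed or indexed formats such as bigWig or netCDF need more bookkeeping than this sketch shows.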

Posted 13.2 years ago by Michael Lawrence

On Wed, Feb 9, 2011 at 5:21 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:

> It will be open, as soon as we've started :) Basically want a more
> efficient/convenient API on top of the ncdf4 package. Richard Bourgon has an
> ExpressionSet class that holds any element in assayData that exposes a
> "rectangular" API. It's working now for mmapable files via the ff package.
> Just like DataFrame works for any vector-like object. Pete Haverty has
> extended ExpressionSet to hold a RangedData for fast interval-based
> subsetting. Just an overview of the different directions that we want to
> sort of merge together.

I have been doing essentially the same as Pete for my work. Would he be interested in starting a package purely with the intention of developing the backend (in case he - like me - is still working on the application, but is interested in having a robust backend)? Of course, Richard's stuff sounds interesting as well.

It seems silly to spend time developing this in parallel.

Kasper

Posted 13.2 years ago by Kasper Daniel Hansen

Detailed statements of use cases (i.e., want to do this with this persistent, publicly available resource...) and performance metrics/data would be nice to have as well.

On Thu, Feb 10, 2011 at 10:00 AM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:

> Would he be interested in starting a package purely with the intention of
> developing the backend (in case he - like me - is still working on the
> application, but is interested in having a robust backend)? Of
> course, Richard's stuff sounds interesting as well.
>
> It seems silly to spend time developing this in parallel.

Posted 13.2 years ago by Vincent J. Carey, Jr.

On Thu, Feb 10, 2011 at 7:00 AM, Kasper Daniel Hansen <kasperdanielhansen@gmail.com> wrote:

> Would he be interested in starting a package purely with the intention of
> developing the backend (in case he - like me - is still working on the
> application, but is interested in having a robust backend)? Of
> course, Richard's stuff sounds interesting as well.

I think he's already got a pretty mature package. I shouldn't speak for him though, so I've cc'd him on this.

Posted 13.2 years ago by Michael Lawrence
