phastCon-scores


Johannes Waage ▴ 50

@johannes-waage-3852


Hi all,

I have a small but important challenge that I have been unable to solve. I need to aggregate all phastCons scores for the 75-100 nt around every mouse (*Mus*) exon splice site. I have tried several approaches: downloading the entire multiz30way phastCons dataset from UCSC (too big to work with smoothly), downloading via an intersect in the UCSC Table Browser and Galaxy (which limits me to 10 million data points, unfortunately), and fetching the data through rtracklayer (too slow).

Can anyone point me towards an elegant and fast way to fetch data points for many genomic intervals? With around 22k genes and an average of 8 exons per gene at 100 nt each (22,000 × 8 × 100 ≈ 17.6 million), it seems I need to be able to fetch around 20 million data points.

I need to use the data as background in comparison to select upregulated exons in an RNA-seq splicing study.

All the best,
JW, University of Copenhagen

rtracklayer spliceSites • 1.4k views

ADD COMMENT • link updated 14.6 years ago by Sean Davis 21k • written 14.6 years ago by Johannes Waage ▴ 50


Sean Davis 21k

@sean-davis-490


United States

On Tue, Dec 15, 2009 at 5:01 PM, Johannes Waage <johannes.waage at bric.dk> wrote:

> Can anyone point me towards an elegant and fast way to fetch datapoints for many genomic intervals? With around 22k genes, with an average exon count of 8 times 100 nt, it seems I need to be able to fetch around 20m data points.

Could you do this chromosome-by-chromosome by loading the per-base data one chromosome at a time from the files into an R vector and then using normal vector subsetting to get the regions of interest?

Alternatively, with a little work, you could probably also build a little index file and then use random access to get the data from the files.

Finally, there are probably some tools in the UCSC browser tool chain that you could download to deal with conservation data fairly quickly.

Sean
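The "little index file plus random access" suggestion could look roughly like the sketch below. It is written in Python with the standard library only, not R, and it rests on assumptions that are not stated in the thread: that the score files are UCSC wiggle text in `fixedStep` format with `step=1`, and that blocks appear in ascending order within each chromosome. The function names `build_index` and `fetch` are made up for illustration.

```python
import bisect
import io

def build_index(handle):
    """Scan a fixedStep wiggle stream once (opened in binary mode).

    Returns {chrom: [[start, n_values, byte_offset_of_first_value], ...]}.
    Assumes step=1 and blocks sorted by start within each chromosome.
    """
    index = {}
    current = None
    while True:
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"fixedStep"):
            fields = dict(f.split("=") for f in line.decode().split()[1:])
            current = [int(fields["start"]), 0, handle.tell()]
            index.setdefault(fields["chrom"], []).append(current)
        elif line.strip() and current is not None:
            current[1] += 1  # one score line = one base when step=1
    return index

def fetch(handle, index, chrom, start, end):
    """Scores for the 1-based inclusive interval [start, end].

    Bases not covered by any block come back as None.
    """
    blocks = index.get(chrom, [])
    starts = [b[0] for b in blocks]  # precompute once if querying heavily
    out = [None] * (end - start + 1)
    i = max(bisect.bisect_right(starts, start) - 1, 0)
    while i < len(blocks) and blocks[i][0] <= end:
        b_start, n, offset = blocks[i]
        lo, hi = max(start, b_start), min(end, b_start + n - 1)
        if lo <= hi:
            handle.seek(offset)          # jump straight to the block
            for _ in range(lo - b_start):
                handle.readline()        # skip lines before the interval
            for pos in range(lo, hi + 1):
                out[pos - start] = float(handle.readline().decode())
        i += 1
    return out

# Tiny in-memory stand-in for a real file opened with open(path, "rb").
demo = io.BytesIO(
    b"fixedStep chrom=chr1 start=101 step=1\n0.1\n0.2\n0.3\n"
    b"fixedStep chrom=chr1 start=201 step=1\n0.9\n0.8\n"
)
idx = build_index(demo)
print(fetch(demo, idx, "chr1", 102, 103))  # [0.2, 0.3]
```

The seek is only block-level: within a block the lines have variable width, so the sketch still reads forward from the block start. The index itself is plain lists and could be saved to disk (for example as JSON) so the full scan happens once per file.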

ADD COMMENT • link 14.6 years ago Sean Davis 21k


On Tue, Dec 15, 2009 at 2:21 PM, Sean Davis <seandavi@gmail.com> wrote:

> Could you do this chromosome-by-chromosome by loading the per-base data one chromosome at a time from the files into an R vector and then using normal vector subsetting to get the regions of interest?
>
> Alternatively, with a little work, you could probably also build a little index file and then use random access to get the data from the files.
>
> Finally, there are probably some tools in the UCSC browser tool chain that you could download to deal with conservation data fairly quickly.

This may be a decent use case for bigWig support in Bioconductor. The data is stored in a binary, indexed form, so it should be easy and efficient to bring subsets into memory/R.

The mappability tracks are another example. Looks like rtracklayer may be the place for this, at least initially. The mythical common IO package would be helpful though.

Michael

_______________________________________________
Bioconductor mailing list
Bioconductor@stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 14.6 years ago Michael Lawrence ★ 11k


On Tue, Dec 15, 2009 at 7:56 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:

> This may be a decent use case for bigWig support in Bioconductor. The data is stored in a binary, indexed form, so it should be easy and efficient to bring subsets into memory/R.
>
> The mappability tracks are another example. Looks like rtracklayer may be the place for this, at least initially. The mythical common IO package would be helpful though.

I agree that bigWig support would be a useful addition to the Bioconductor tool set.

Sean

ADD REPLY • link 14.6 years ago Sean Davis 21k


On Tue, Dec 15, 2009 at 5:21 PM, Sean Davis <seandavi at gmail.com> wrote:

> Could you do this chromosome-by-chromosome by loading the per-base data one chromosome at a time from the files into an R vector and then using normal vector subsetting to get the regions of interest?

OK. I looked at the files and I don't think it will work without some cleverness. The two methods below are still possible, though.

Sean

> Alternatively, with a little work, you could probably also build a little index file and then use random access to get the data from the files.
>
> Finally, there are probably some tools in the UCSC browser tool chain that you could download to deal with conservation data fairly quickly.
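The post does not say what the "cleverness" would have to deal with. One plausible obstacle, assumed here and not confirmed in the thread, is that the files are `fixedStep` wiggle text split into blocks with gaps between them, so the n-th score line is not the n-th base of the chromosome. Under that assumption, a per-chromosome vector can still be built by writing each block at its declared start and leaving uncovered bases as NaN. The sketch below is Python (standard library only), and `load_chromosome` and `window_scores` are hypothetical names.

```python
import io
import math
from array import array

def load_chromosome(handle, chrom, length):
    """Build a per-base score vector for one chromosome.

    handle: text stream of fixedStep wiggle data (an assumed format).
    length: chromosome length in bases; uncovered bases stay NaN.
    """
    scores = array("f", [math.nan]) * length  # float32, 4 bytes per base
    pos = None   # next 1-based position to fill; None = skip this block
    step = 1
    for line in handle:
        line = line.strip()
        if not line or line.startswith("track"):
            continue
        if line.startswith("fixedStep"):
            fields = dict(f.split("=") for f in line.split()[1:])
            if fields["chrom"] == chrom:
                pos = int(fields["start"])
                step = int(fields.get("step", 1))
            else:
                pos = None
        elif pos is not None:
            scores[pos - 1] = float(line)
            pos += step
    return scores

def window_scores(scores, site, flank):
    """Scores for the 1-based window [site - flank, site + flank],
    clipped to the ends of the chromosome."""
    lo = max(site - flank, 1)
    hi = min(site + flank, len(scores))
    return scores[lo - 1:hi]

# Toy data: two blocks on a 12-base "chromosome" with a gap between them.
demo = io.StringIO(
    "fixedStep chrom=chr1 start=3 step=1\n0.5\n0.25\n"
    "fixedStep chrom=chr1 start=8 step=1\n0.75\n1.0\n"
)
vec = load_chromosome(demo, "chr1", 12)
print(list(window_scores(vec, site=4, flank=1)))  # [0.5, 0.25, nan]
```

Once the vector exists, every splice-site window on that chromosome is a plain slice, which matches the "normal vector subsetting" idea. The cost is memory: at 4 bytes per base a large chromosome needs several hundred megabytes, so the chromosomes would be processed one at a time.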

ADD REPLY • link 14.6 years ago Sean Davis 21k
