phastCon-scores
@johannes-waage-3852
Last seen 9.7 years ago
Hi all,

I have a small but important challenge that I have been unable to solve. I need to aggregate all phastCons scores for the 75-100 nt around every Mus exon splice site. I've tried several approaches: downloading the entire multiz30way phastCons dataset from UCSC (too big to work with smoothly), intersecting through the UCSC Table Browser and Galaxy (which limits me to 10 million data points, unfortunately), and fetching the data through rtracklayer (too slow).

Can anyone point me towards an elegant and fast way to fetch data points for many genomic intervals? With around 22k genes and an average of 8 exons per gene at 100 nt each, it seems I need to be able to fetch around 20m data points.

I need to use the data as background in comparison to select upregulated exons in an RNA-seq splice study.

All the best,
JW, University of Copenhagen
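As a quick sanity check on the size estimate, here is the arithmetic in Python, taking the approximate figures from the post at face value. The variable names are only illustrative.

```python
genes = 22_000        # approximate gene count given in the post
exons_per_gene = 8    # average exon count given in the post
window_nt = 100       # nucleotides fetched per exon

total_points = genes * exons_per_gene * window_nt
print(f"{total_points:,} data points")  # 17,600,000 data points
```

That is in line with the "around 20m" figure and well above the 10 million point limit mentioned for the Table Browser and Galaxy route.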
rtracklayer spliceSites • 1.3k views
@sean-davis-490
Last seen 4 months ago
United States
On Tue, Dec 15, 2009 at 5:01 PM, Johannes Waage wrote:
> Can anyone point me towards an elegant and fast way to fetch datapoints
> for many genomic intervals? With around 22k genes, with an average exon
> count of 8 times 100 nt, it seems I need to be able to fetch around 20m
> data points.

Could you do this chromosome by chromosome, loading the per-base data for one chromosome at a time from the files into an R vector and then using normal vector subsetting to get the regions of interest?

Alternatively, with a little work, you could probably also build a little index file and then use random access to get the data from the files.

Finally, there are probably some tools in the UCSC browser tool chain that you could download to deal with conservation data fairly quickly.

Sean
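A minimal sketch of the chromosome-by-chromosome idea, written in Python rather than R purely for illustration. It assumes the per-chromosome phastCons download is a fixedStep wiggle text file with step=1, sorted by position, and possibly split into several blocks with gaps between them. The function names `read_fixedstep_blocks` and `scores_for_interval` are hypothetical. Because of the possible gaps, it keeps one list of scores per block instead of a single flat vector.

```python
import bisect
import io


def read_fixedstep_blocks(handle):
    """Parse fixedStep wiggle text for one chromosome.

    Returns (starts, values): starts[i] is the 1-based position of the first
    score in block i, and values[i] is that block's list of scores.
    Assumes step=1 and blocks already sorted by position.
    """
    starts, values = [], []
    for line in handle:
        line = line.strip()
        if not line or line.startswith(("track", "#")):
            continue
        if line.startswith("fixedStep"):
            fields = dict(f.split("=", 1) for f in line.split()[1:])
            if int(fields.get("step", 1)) != 1:
                raise ValueError("this sketch only handles step=1")
            starts.append(int(fields["start"]))
            values.append([])
        elif values:
            values[-1].append(float(line))
    return starts, values


def scores_for_interval(starts, values, start, end):
    """Scores for the 1-based inclusive range [start, end]; None where no data."""
    out = [None] * (end - start + 1)
    # Begin with the last block that starts at or before `start`.
    i = max(bisect.bisect_right(starts, start) - 1, 0)
    while i < len(starts) and starts[i] <= end:
        lo = max(start, starts[i])
        hi = min(end, starts[i] + len(values[i]) - 1)
        for pos in range(lo, hi + 1):
            out[pos - start] = values[i][pos - starts[i]]
        i += 1
    return out


# Tiny made-up example with a gap between two blocks.
demo = io.StringIO(
    "fixedStep chrom=chr1 start=101 step=1\n0.1\n0.2\n0.3\n"
    "fixedStep chrom=chr1 start=201 step=1\n0.9\n0.8\n"
)
starts, values = read_fixedstep_blocks(demo)
window = scores_for_interval(starts, values, 102, 104)
print(window)  # [0.2, 0.3, None]
```

With one chromosome loaded this way, each splice-site window becomes a single lookup, and positions that fall in a gap come back as `None` rather than shifting the coordinates.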
On Tue, Dec 15, 2009 at 2:21 PM, Sean Davis wrote:
> Finally, there are probably some tools in the UCSC browser tool chain
> that you could download to deal with conservation data fairly quickly.

This may be a decent use case for bigWig support in Bioconductor. The data are stored in a binary, indexed form, so it should be easy and efficient to bring subsets into memory in R. The mappability tracks are another example. rtracklayer looks like the place for this, at least initially, though the mythical common I/O package would be helpful.

Michael
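To illustrate why an indexed bigWig file suits this problem, here is a hedged sketch in Python using the third-party pyBigWig package, not the Bioconductor support discussed in this thread. It assumes a phastCons bigWig file is available locally; the file name in the usage comment and the helper names `open_bigwig` and `mean_scores` are hypothetical. pyBigWig coordinates are 0-based and half-open, and `values()` returns NaN for bases without data.

```python
import math


def open_bigwig(path):
    """Open a bigWig file with pyBigWig (assumed to be installed)."""
    import pyBigWig  # imported lazily so mean_scores() can be used without it
    return pyBigWig.open(path)


def mean_scores(bw, intervals):
    """Mean score for each (chrom, start, end) interval, 0-based half-open.

    `bw` is any object with a pyBigWig-style values(chrom, start, end) method.
    Bases without data (NaN) are ignored; an interval with no data gives None.
    """
    means = []
    for chrom, start, end in intervals:
        covered = [v for v in bw.values(chrom, start, end) if not math.isnan(v)]
        means.append(sum(covered) / len(covered) if covered else None)
    return means


# Hypothetical usage; the file name is an assumption, not from the thread:
# bw = open_bigwig("phastCons30way.bw")
# print(mean_scores(bw, [("chr1", 3000305, 3000405)]))
```

Only the requested ranges are read from disk, so fetching many short windows does not require loading a whole chromosome.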
On Tue, Dec 15, 2009 at 7:56 PM, Michael Lawrence wrote:
> This may be a decent use case for bigWig support in Bioconductor. The data
> is stored in a binary, indexed form, so it should be easy and efficient to
> bring subsets into memory/R.
>
> The mappability tracks are another example. Looks like rtracklayer may be
> the place for this, at least initially. The mythical common IO package
> would be helpful though.

I agree that bigWig support would be a useful addition to the Bioconductor tool set.

Sean
On Tue, Dec 15, 2009 at 5:21 PM, Sean Davis wrote:
> Could you do this chromosome-by-chromosome by loading the per-base
> data one chromosome at a time from the files into an R vector and then
> using normal vector subsetting to get the regions of interest?

OK. I looked at the files and I don't think it will work without some cleverness. The two methods below are still possible, though.

Sean

> Alternatively, with a little work, you could probably also build a
> little index file and then use random access to get the data from the
> files.
>
> Finally, there are probably some tools in the UCSC browser tool chain
> that you could download to deal with conservation data fairly quickly.
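One way the index-file idea quoted above might look, sketched in Python rather than R. It assumes an uncompressed fixedStep wiggle file with step=1 that contains only block headers and one score per line; `build_index` and `fetch` are hypothetical names. The index records, for each block, its start position, the byte offset of its first score, and the number of scores, so a query can seek straight to the right block instead of reading the whole file.

```python
import bisect
import os
import tempfile


def build_index(path):
    """Scan the file once; return [start, byte_offset, n_scores] per block."""
    index = []
    with open(path, "rb") as fh:
        while True:
            line = fh.readline()
            if not line:
                break
            text = line.decode().strip()
            if text.startswith("fixedStep"):
                fields = dict(f.split("=", 1) for f in text.split()[1:])
                # fh.tell() now points at the first score of this block.
                index.append([int(fields["start"]), fh.tell(), 0])
            elif index and text:
                index[-1][2] += 1
    return index


def fetch(path, index, start, end):
    """Scores for 1-based inclusive [start, end]; None unless one block covers it."""
    block_starts = [entry[0] for entry in index]
    i = bisect.bisect_right(block_starts, start) - 1
    if i < 0:
        return None
    b_start, offset, count = index[i]
    if end > b_start + count - 1:
        return None  # the range runs into a gap
    with open(path, "rb") as fh:
        fh.seek(offset)
        for _ in range(start - b_start):
            fh.readline()  # skip scores before the window
        return [float(fh.readline().decode()) for _ in range(end - start + 1)]


# Tiny made-up file to exercise the two functions.
with tempfile.NamedTemporaryFile("w", suffix=".wig", delete=False) as tmp:
    tmp.write("fixedStep chrom=chr1 start=101 step=1\n0.1\n0.2\n0.3\n")
    tmp.write("fixedStep chrom=chr1 start=201 step=1\n0.9\n0.8\n")
demo_index = build_index(tmp.name)
demo_scores = fetch(tmp.name, demo_index, 102, 103)
os.remove(tmp.name)
print(demo_scores)  # [0.2, 0.3]
```

The index is small compared with the data, so it could be saved alongside each file and reused. Skipping within a block is still a linear scan here; fixed-width score lines would allow a direct seek, but that is not assumed.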