rtracklayer problem on large number of GRanges
0
0
Entering edit mode
Janet Young ▴ 740
@janet-young-2360
Last seen 4.5 years ago
Fred Hutchinson Cancer Research Center,…
thanks - that'd be helpful Janet On Jul 31, 2013, at 12:47 PM, Michael Lawrence wrote: > Yea, I was planning on adding a warning. > > > On Wed, Jul 31, 2013 at 11:49 AM, Janet Young <jayoung@fhcrc.org> wrote: > Makes sense - thanks. I was also playing with iterating through my queries 1000 at a time, but I'm sure the downloaded bigWig would be more efficient (and more palatable to the UCSC people). > > From a user point of view, it would be helpful if ucscTableQuery could issue a warning if we try to query with >1000 ranges, rather than silently returning a partial result - does that make sense to you? I spent a little time down a rabbit hole with that one. > > Janet > > > > On Jul 31, 2013, at 11:39 AM, Michael Lawrence wrote: > >> Right, the table browser is the limitation. It only allows 1000 query regions, and limits the output for "whole genome" as well to something like 100,000. >> >> If you've got a lot of data to query, it's best to just download the bigWig or other file and query it locally. These scores should be available via the FTP as bigWig. >> >> >> >> >> On Tue, Jul 30, 2013 at 1:06 PM, Janet Young <jayoung@fhcrc.org> wrote: >> I'm guessing this issue might come from the UCSC end: using their online Table Browser, there is a query limit of 1000 regions. >> >> Janet >> >> On Jul 30, 2013, at 12:15 PM, Janet Young wrote: >> >> > Hi again, >> > >> > I think I've found another issue with rtracklyer, where if I run a query with range specified as a large number of small GRanges, it seems to only be getting results for the first 1000 ranges. Again, hopefully the code below will explain all. >> > >> > thanks, >> > >> > Janet >> > >> > ------------------------------------------------------------------- >> > >> > Dr. Janet Young >> > >> > Malik lab >> > http://research.fhcrc.org/malik/en.html >> > >> > Fred Hutchinson Cancer Research Center >> > 1100 Fairview Avenue N., A2-025, >> > P.O. Box 19024, Seattle, WA 98109-1024, USA. >> > >> > tel: (206) 667 4512 >> > email: jayoung ...at... fhcrc.org >> > >> > ------------------------------------------------------------------- >> > >> > >> > ##### works: >> > >> > library(rtracklayer) >> > library(GenomicRanges) >> > >> > session <- browserSession("UCSC") >> > genome(session) <- "hg19" >> > >> > >> > #### make some sample ranges - a large number of small ranges: >> > numRanges <- 5000 >> > rangeWidths <- 50 >> > myRanges <- GRanges( seqnames=rep("chr1",numRanges), >> > ranges=IRanges(start=1:numRanges*rangeWidths*2,width=rangeWidths) ) >> > >> > #### run a query >> > query <- ucscTableQuery (session, "cons46way", range=myRanges) >> > tableName(query) <- "phyloP46wayPrimates" >> > scores <- track(query) >> > ### seems to work with no apparent error >> > >> > ##### but the number of scores returned is much smaller than the number of positions queried: >> > length(scores) >> > # [1] 41749 >> > sum(width(myRanges)) >> > # [1] 250000 >> > >> > >> > #### use findOverlaps to see which of the ranges had scores - only 837 of the 5000 regions had scores. It's ranges 109-1000 that have scores. Not a problem that ranges 1-108 had missing scores (they're probably in an assembly gap, or don't have aligned primate seqs), but at least some of ranges 1001-5000 should have scores. >> > >> > myOverlaps <- findOverlaps(myRanges, scores) >> > length(unique(queryHits(myOverlaps))) >> > # [1] 837 >> > range(queryHits(myOverlaps)) >> > # [1] 109 1000 >> > >> > ### a separate query shows that at least ranges 1001-1010 indeed should have scores (haven't checked the rest): >> > query2 <- ucscTableQuery (session, "cons46way", range=myRanges[1001:1010]) >> > tableName(query2) <- "phyloP46wayPrimates" >> > scores2 <- track(query2) >> > >> > ### every position queried has a score: >> > sum(width(myRanges[1001:1010])) >> > # [1] 500 >> > length(scores2) >> > # [1] 500 >> > >> > myOverlaps2 <- findOverlaps(myRanges, scores2) >> > range(queryHits(myOverlaps2)) >> > # [1] 1001 1010 >> > >> > >> > >> > ################## >> > >> > >> > sessionInfo() >> > >> > R version 3.0.1 Patched (2013-07-29 r63455) >> > Platform: x86_64-unknown-linux-gnu (64-bit) >> > >> > locale: >> > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> > [9] LC_ADDRESS=C LC_TELEPHONE=C >> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> > >> > attached base packages: >> > [1] parallel stats graphics grDevices utils datasets methods >> > [8] base >> > >> > other attached packages: >> > [1] rtracklayer_1.21.9 GenomicRanges_1.13.35 XVector_0.1.0 >> > [4] IRanges_1.19.19 BiocGenerics_0.7.3 >> > >> > loaded via a namespace (and not attached): >> > [1] Biostrings_2.29.14 bitops_1.0-5 BSgenome_1.29.1 RCurl_1.95-4.1 >> > [5] Rsamtools_1.13.26 stats4_3.0.1 tools_3.0.1 XML_3.98-1.1 >> > [9] zlibbioc_1.7.0 >> > >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]]
Cancer Cancer • 668 views
ADD COMMENT

Login before adding your answer.

Traffic: 985 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6