Problem with nearest() in GenomicRanges
1
0
Entering edit mode
Chris Whelan ▴ 60
@chris-whelan-4779
Last seen 9.6 years ago
Hi all, I'm having an issue with the nearest() method in GenomicRanges. I've created GRanges from two data frames, from data for a species genome assembly with lots of scaffolds: > xs <- with(x.data, GRanges(seqnames=Scaffold, ranges=IRanges(start=Start, end=Stop), strand=Strand)) > xs GRanges with 989 ranges and 0 elementMetadata cols: seqnames ranges strand <rle> <iranges> <rle> [1] GL397480 [ 1477346, 1477813] - [2] GL397446 [ 869823, 870295] + ... ... ... ... [988] GL397464 [1138154, 1138427] - [989] GL397464 [1138154, 1138427] - --- seqlengths: ADFV01135143 ADFV01136404 GL397262 ... GL397832 GL397836 NA NA NA ... NA NA > cpg.islands <- with(cpg.island.data, GRanges(seqnames=V1, ranges=IRanges(start=V4, end=V5), strand='*')) > cpg.islands GRanges with 88968 ranges and 0 elementMetadata cols: seqnames ranges strand <rle> <iranges> <rle> [1] GL397261 [ 4409, 5062] * [2] GL397261 [ 66765, 67010] * ... ... ... ... [88967] ADFV01162706 [1834, 2052] * [88968] ADFV01162751 [ 226, 884] * --- seqlengths: ADFV01134425 ADFV01134436 ADFV01134442 ... GL405215 GL405216 NA NA NA ... NA NA And I'd like to use nearest to find the closest island to each x: > nearest(xs, cpg.islands) Error in Rle(values = callGeneric(runValue(e1)[which1], runValue(e2)[which2]), : error in evaluating the argument 'values' in selecting a method for function 'Rle': Error in Ops.factor(runValue(e1)[which1], runValue(e2)[which2]) : level sets of factors are different > traceback() 12: Rle(values = callGeneric(runValue(e1)[which1], runValue(e2)[which2]), lengths = diffWithInitialZero(ends), check = FALSE) 11: seqnames(x) != seqnames(y) 10: seqnames(x) != seqnames(y) 9: `seqselect<-`(`*tmp*`, seqnames(x) != seqnames(y), value = NA) 8: `seqselect<-`(`*tmp*`, seqnames(x) != seqnames(y), value = NA) 7: .local(x, y, ...) 6: distance(x[midx], subject[p[midx]]) 5: distance(x[midx], subject[p[midx]]) 4: .nearest(x, subject, ignore.strand, ...) 3: .local(x, subject, ...) 2: nearest(xs, cpg.islands) 1: nearest(xs, cpg.islands) This same data works fine with other methods like findOverlaps(): > findOverlaps(xs, cpg.islands) Hits of length 225 queryLength: 989 subjectLength: 88968 queryHits subjectHits <integer> <integer> 1 2 75585 2 6 36857 ... ... ... 224 963 84872 225 964 84872 Any idea what I might be doing wrong? Here's my sessionInfo(): > sessionInfo() R version 2.15.1 (2012-06-22) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=C LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=C LC_MESSAGES=C [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] locfit_1.5-8 GenomicFeatures_1.8.2 AnnotationDbi_1.18.1 [4] Biobase_2.16.0 multicore_0.1-7 rtracklayer_1.16.3 [7] GenomicRanges_1.8.9 IRanges_1.14.4 BiocGenerics_0.2.0 [10] reshape2_1.2.1 ggplot2_0.9.1 BiocInstaller_1.4.7 loaded via a namespace (and not attached): [1] BSgenome_1.24.0 Biostrings_2.24.1 DBI_0.2-5 MASS_7.3-19 [5] RColorBrewer_1.0-5 RCurl_1.91-1 RSQLite_0.11.1 Rsamtools_1.8.5 [9] XML_3.9-4 biomaRt_2.12.0 bitops_1.0-4.1 colorspace_1.1-1 [13] compiler_2.15.1 dichromat_1.2-4 digest_0.5.2 grid_2.15.1 [17] labeling_0.1 lattice_0.20-6 memoise_0.1 munsell_0.3 [21] plyr_1.7.1 proto_0.3-9.2 scales_0.2.1 stats4_2.15.1 [25] stringr_0.6.1 tools_2.15.1 zlibbioc_1.2.0 Thanks in advance, Chris [[alternative HTML version deleted]]
GenomicRanges GenomicRanges • 1.2k views
ADD COMMENT
0
Entering edit mode
@valerie-obenchain-4275
Last seen 2.2 years ago
United States
Hi Chris, It's difficult to know what's going on without a reproducible example. Can you attach a small subset of your GRanges that still triggers the error? If not, is there a place I can get the data for testing? Valerie On 07/28/12 12:03, Chris Whelan wrote: > Hi all, > > I'm having an issue with the nearest() method in GenomicRanges. I've > created GRanges from two data frames, from data for a species genome > assembly with lots of scaffolds: > >> xs<- with(x.data, GRanges(seqnames=Scaffold, ranges=IRanges(start=Start, end=Stop), strand=Strand)) >> xs > GRanges with 989 ranges and 0 elementMetadata cols: > seqnames ranges strand > <rle> <iranges> <rle> > [1] GL397480 [ 1477346, 1477813] - > [2] GL397446 [ 869823, 870295] + > ... ... ... ... > [988] GL397464 [1138154, 1138427] - > [989] GL397464 [1138154, 1138427] - > --- > seqlengths: > ADFV01135143 ADFV01136404 GL397262 ... GL397832 GL397836 > NA NA NA ... NA NA > >> cpg.islands<- with(cpg.island.data, GRanges(seqnames=V1, ranges=IRanges(start=V4, end=V5), strand='*')) >> cpg.islands > GRanges with 88968 ranges and 0 elementMetadata cols: > seqnames ranges strand > <rle> <iranges> <rle> > [1] GL397261 [ 4409, 5062] * > [2] GL397261 [ 66765, 67010] * > ... ... ... ... > [88967] ADFV01162706 [1834, 2052] * > [88968] ADFV01162751 [ 226, 884] * > --- > seqlengths: > ADFV01134425 ADFV01134436 ADFV01134442 ... GL405215 GL405216 > NA NA NA ... NA NA > > And I'd like to use nearest to find the closest island to each x: > >> nearest(xs, cpg.islands) > Error in Rle(values = callGeneric(runValue(e1)[which1], > runValue(e2)[which2]), : > error in evaluating the argument 'values' in selecting a method for > function 'Rle': Error in Ops.factor(runValue(e1)[which1], > runValue(e2)[which2]) : > level sets of factors are different >> traceback() > 12: Rle(values = callGeneric(runValue(e1)[which1], runValue(e2)[which2]), > lengths = diffWithInitialZero(ends), check = FALSE) > 11: seqnames(x) != seqnames(y) > 10: seqnames(x) != seqnames(y) > 9: `seqselect<-`(`*tmp*`, seqnames(x) != seqnames(y), value = NA) > 8: `seqselect<-`(`*tmp*`, seqnames(x) != seqnames(y), value = NA) > 7: .local(x, y, ...) > 6: distance(x[midx], subject[p[midx]]) > 5: distance(x[midx], subject[p[midx]]) > 4: .nearest(x, subject, ignore.strand, ...) > 3: .local(x, subject, ...) > 2: nearest(xs, cpg.islands) > 1: nearest(xs, cpg.islands) > > This same data works fine with other methods like findOverlaps(): > >> findOverlaps(xs, cpg.islands) > Hits of length 225 > queryLength: 989 > subjectLength: 88968 > queryHits subjectHits > <integer> <integer> > 1 2 75585 > 2 6 36857 > ... ... ... > 224 963 84872 > 225 964 84872 > > Any idea what I might be doing wrong? Here's my sessionInfo(): > >> sessionInfo() > R version 2.15.1 (2012-06-22) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=C LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C > [5] LC_MONETARY=C LC_MESSAGES=C > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] locfit_1.5-8 GenomicFeatures_1.8.2 AnnotationDbi_1.18.1 > [4] Biobase_2.16.0 multicore_0.1-7 rtracklayer_1.16.3 > [7] GenomicRanges_1.8.9 IRanges_1.14.4 BiocGenerics_0.2.0 > [10] reshape2_1.2.1 ggplot2_0.9.1 BiocInstaller_1.4.7 > > loaded via a namespace (and not attached): > [1] BSgenome_1.24.0 Biostrings_2.24.1 DBI_0.2-5 MASS_7.3-19 > [5] RColorBrewer_1.0-5 RCurl_1.91-1 RSQLite_0.11.1 Rsamtools_1.8.5 > [9] XML_3.9-4 biomaRt_2.12.0 bitops_1.0-4.1 colorspace_1.1-1 > [13] compiler_2.15.1 dichromat_1.2-4 digest_0.5.2 grid_2.15.1 > [17] labeling_0.1 lattice_0.20-6 memoise_0.1 munsell_0.3 > [21] plyr_1.7.1 proto_0.3-9.2 scales_0.2.1 stats4_2.15.1 > [25] stringr_0.6.1 tools_2.15.1 zlibbioc_1.2.0 > > Thanks in advance, > > Chris > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
Hi Valerie, I get this behavior using only a small subset of my data - attached. I am loading it like this: x.data <- read.table("x_data.txt", header=T, sep="\t") cpg.island.data <- read.table('cpg_islands_head.gff') In doing some more testing, it seems like I get this error if not all of the seqnames in the query GRanges are present in the subject GRanges. Is that data not supported? I tried setting the levels with: seqlevels(xs) <- seqlevels(cpg.islands) But then I just get back a vector of NA's from nearest(). Thanks, Chris On Jul 30, 2012, at 7:08 PM, Valerie Obenchain wrote: > Hi Chris, > > It's difficult to know what's going on without a reproducible example. > Can you attach a small subset of your GRanges that still triggers the > error? If not, is there a place I can get the data for testing? > > Valerie > > On 07/28/12 12:03, Chris Whelan wrote: >> Hi all, >> >> I'm having an issue with the nearest() method in GenomicRanges. I've >> created GRanges from two data frames, from data for a species genome >> assembly with lots of scaffolds: >> >>> xs<- with(x.data, GRanges(seqnames=Scaffold, ranges=IRanges(start=Start, end=Stop), strand=Strand)) >>> xs >> GRanges with 989 ranges and 0 elementMetadata cols: >> seqnames ranges strand >> <rle> <iranges> <rle> >> [1] GL397480 [ 1477346, 1477813] - >> [2] GL397446 [ 869823, 870295] + >> ... ... ... ... >> [988] GL397464 [1138154, 1138427] - >> [989] GL397464 [1138154, 1138427] - >> --- >> seqlengths: >> ADFV01135143 ADFV01136404 GL397262 ... GL397832 GL397836 >> NA NA NA ... NA NA >> >>> cpg.islands<- with(cpg.island.data, GRanges(seqnames=V1, ranges=IRanges(start=V4, end=V5), strand='*')) >>> cpg.islands >> GRanges with 88968 ranges and 0 elementMetadata cols: >> seqnames ranges strand >> <rle> <iranges> <rle> >> [1] GL397261 [ 4409, 5062] * >> [2] GL397261 [ 66765, 67010] * >> ... ... ... ... >> [88967] ADFV01162706 [1834, 2052] * >> [88968] ADFV01162751 [ 226, 884] * >> --- >> seqlengths: >> ADFV01134425 ADFV01134436 ADFV01134442 ... GL405215 GL405216 >> NA NA NA ... NA NA >> >> And I'd like to use nearest to find the closest island to each x: >> >>> nearest(xs, cpg.islands) >> Error in Rle(values = callGeneric(runValue(e1)[which1], >> runValue(e2)[which2]), : >> error in evaluating the argument 'values' in selecting a method for >> function 'Rle': Error in Ops.factor(runValue(e1)[which1], >> runValue(e2)[which2]) : >> level sets of factors are different >>> traceback() >> 12: Rle(values = callGeneric(runValue(e1)[which1], runValue(e2)[which2]), >> lengths = diffWithInitialZero(ends), check = FALSE) >> 11: seqnames(x) != seqnames(y) >> 10: seqnames(x) != seqnames(y) >> 9: `seqselect<-`(`*tmp*`, seqnames(x) != seqnames(y), value = NA) >> 8: `seqselect<-`(`*tmp*`, seqnames(x) != seqnames(y), value = NA) >> 7: .local(x, y, ...) >> 6: distance(x[midx], subject[p[midx]]) >> 5: distance(x[midx], subject[p[midx]]) >> 4: .nearest(x, subject, ignore.strand, ...) >> 3: .local(x, subject, ...) >> 2: nearest(xs, cpg.islands) >> 1: nearest(xs, cpg.islands) >> >> This same data works fine with other methods like findOverlaps(): >> >>> findOverlaps(xs, cpg.islands) >> Hits of length 225 >> queryLength: 989 >> subjectLength: 88968 >> queryHits subjectHits >> <integer> <integer> >> 1 2 75585 >> 2 6 36857 >> ... ... ... >> 224 963 84872 >> 225 964 84872 >> >> Any idea what I might be doing wrong? Here's my sessionInfo(): >> >>> sessionInfo() >> R version 2.15.1 (2012-06-22) >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=C LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C >> [5] LC_MONETARY=C LC_MESSAGES=C >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] locfit_1.5-8 GenomicFeatures_1.8.2 AnnotationDbi_1.18.1 >> [4] Biobase_2.16.0 multicore_0.1-7 rtracklayer_1.16.3 >> [7] GenomicRanges_1.8.9 IRanges_1.14.4 BiocGenerics_0.2.0 >> [10] reshape2_1.2.1 ggplot2_0.9.1 BiocInstaller_1.4.7 >> >> loaded via a namespace (and not attached): >> [1] BSgenome_1.24.0 Biostrings_2.24.1 DBI_0.2-5 MASS_7.3-19 >> [5] RColorBrewer_1.0-5 RCurl_1.91-1 RSQLite_0.11.1 Rsamtools_1.8.5 >> [9] XML_3.9-4 biomaRt_2.12.0 bitops_1.0-4.1 colorspace_1.1-1 >> [13] compiler_2.15.1 dichromat_1.2-4 digest_0.5.2 grid_2.15.1 >> [17] labeling_0.1 lattice_0.20-6 memoise_0.1 munsell_0.3 >> [21] plyr_1.7.1 proto_0.3-9.2 scales_0.2.1 stats4_2.15.1 >> [25] stringr_0.6.1 tools_2.15.1 zlibbioc_1.2.0 >> >> Thanks in advance, >> >> Chris >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLY
0
Entering edit mode
Thanks for sending the test data. This is now fixed in GenomicRanges release 1.8.10 and was fixed in devel a few months ago. The 1.8.10 version should be available through biocLite() ~9am PST Wednesday. The problem was that distance(), which is called from within nearest(), was not handling the case of no matching seqnames. > x.data <- read.table("x.csv", header=TRUE, sep="\t") > xs <- with(x.data, GRanges(seqnames=Scaffold, + ranges=IRanges(start=Start,end=Stop), strand=Strand)) > > cpg.island.data <- read.table('cpg_islands_head.gff') > cpg.islands <- with(cpg.island.data, GRanges(seqnames=V1, + ranges=IRanges(start=V4, end=V5), strand='*')) In these data, none of the seqnames in xs match those in cpg.islands, > runValue(seqnames(xs)) %in% runValue(seqnames(cpg.islands)) [1] FALSE FALSE FALSE Because we have no common seqnames we get a vector of NAs as the result. > nearest(xs, cpg.islands) [1] NA NA NA NA NA NA NA NA NA If there were matching seqnames you would see results for those values. For example, if we set the seqnames in cpg.islands to match one seqname in xs, > seqlevels(cpg.islands) <- "GL397446" > nearest(xs, cpg.islands) [1] NA 10 10 10 NA NA NA NA NA Valerie On 07/30/2012 08:17 PM, Chris Whelan wrote: > Hi Valerie, > > I get this behavior using only a small subset of my data - attached. I > am loading it like this: > > x.data <- read.table("x_data.txt", header=T, sep="\t") > cpg.island.data <- read.table('cpg_islands_head.gff') > > In doing some more testing, it seems like I get this error if not all > of the seqnames in the query GRanges are present in the subject > GRanges. Is that data not supported? > > I tried setting the levels with: > > seqlevels(xs) <- seqlevels(cpg.islands) > > But then I just get back a vector of NA's from nearest(). > > Thanks, > > Chris > > > > > On Jul 30, 2012, at 7:08 PM, Valerie Obenchain wrote: > > > Hi Chris, > > > > It's difficult to know what's going on without a reproducible example. > > Can you attach a small subset of your GRanges that still triggers the > > error? If not, is there a place I can get the data for testing? > > > > Valerie > > > > On 07/28/12 12:03, Chris Whelan wrote: > >> Hi all, > >> > >> I'm having an issue with the nearest() method in GenomicRanges. I've > >> created GRanges from two data frames, from data for a species genome > >> assembly with lots of scaffolds: > >> > >>> xs<- with(x.data, GRanges(seqnames=Scaffold, > ranges=IRanges(start=Start, end=Stop), strand=Strand)) > >>> xs > >> GRanges with 989 ranges and 0 elementMetadata cols: > >> seqnames ranges strand > >> <rle> <iranges> <rle> > >> [1] GL397480 [ 1477346, 1477813] - > >> [2] GL397446 [ 869823, 870295] + > >> ... ... ... ... > >> [988] GL397464 [1138154, 1138427] - > >> [989] GL397464 [1138154, 1138427] - > >> --- > >> seqlengths: > >> ADFV01135143 ADFV01136404 GL397262 ... GL397832 GL397836 > >> NA NA NA ... NA NA > >> > >>> cpg.islands<- with(cpg.island.data, GRanges(seqnames=V1, > ranges=IRanges(start=V4, end=V5), strand='*')) > >>> cpg.islands > >> GRanges with 88968 ranges and 0 elementMetadata cols: > >> seqnames ranges strand > >> <rle> <iranges> <rle> > >> [1] GL397261 [ 4409, 5062] * > >> [2] GL397261 [ 66765, 67010] * > >> ... ... ... ... > >> [88967] ADFV01162706 [1834, 2052] * > >> [88968] ADFV01162751 [ 226, 884] * > >> --- > >> seqlengths: > >> ADFV01134425 ADFV01134436 ADFV01134442 ... GL405215 GL405216 > >> NA NA NA ... NA NA > >> > >> And I'd like to use nearest to find the closest island to each x: > >> > >>> nearest(xs, cpg.islands) > >> Error in Rle(values = callGeneric(runValue(e1)[which1], > >> runValue(e2)[which2]), : > >> error in evaluating the argument 'values' in selecting a method for > >> function 'Rle': Error in Ops.factor(runValue(e1)[which1], > >> runValue(e2)[which2]) : > >> level sets of factors are different > >>> traceback() > >> 12: Rle(values = callGeneric(runValue(e1)[which1], > runValue(e2)[which2]), > >> lengths = diffWithInitialZero(ends), check = FALSE) > >> 11: seqnames(x) != seqnames(y) > >> 10: seqnames(x) != seqnames(y) > >> 9: `seqselect<-`(`*tmp*`, seqnames(x) != seqnames(y), value = NA) > >> 8: `seqselect<-`(`*tmp*`, seqnames(x) != seqnames(y), value = NA) > >> 7: .local(x, y, ...) > >> 6: distance(x[midx], subject[p[midx]]) > >> 5: distance(x[midx], subject[p[midx]]) > >> 4: .nearest(x, subject, ignore.strand, ...) > >> 3: .local(x, subject, ...) > >> 2: nearest(xs, cpg.islands) > >> 1: nearest(xs, cpg.islands) > >> > >> This same data works fine with other methods like findOverlaps(): > >> > >>> findOverlaps(xs, cpg.islands) > >> Hits of length 225 > >> queryLength: 989 > >> subjectLength: 88968 > >> queryHits subjectHits > >> <integer> <integer> > >> 1 2 75585 > >> 2 6 36857 > >> ... ... ... > >> 224 963 84872 > >> 225 964 84872 > >> > >> Any idea what I might be doing wrong? Here's my sessionInfo(): > >> > >>> sessionInfo() > >> R version 2.15.1 (2012-06-22) > >> Platform: x86_64-pc-linux-gnu (64-bit) > >> > >> locale: > >> [1] LC_CTYPE=C LC_NUMERIC=C > >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C > >> [5] LC_MONETARY=C LC_MESSAGES=C > >> [7] LC_PAPER=C LC_NAME=C > >> [9] LC_ADDRESS=C LC_TELEPHONE=C > >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > >> > >> attached base packages: > >> [1] stats graphics grDevices utils datasets methods base > >> > >> other attached packages: > >> [1] locfit_1.5-8 GenomicFeatures_1.8.2 AnnotationDbi_1.18.1 > >> [4] Biobase_2.16.0 multicore_0.1-7 rtracklayer_1.16.3 > >> [7] GenomicRanges_1.8.9 IRanges_1.14.4 BiocGenerics_0.2.0 > >> [10] reshape2_1.2.1 ggplot2_0.9.1 BiocInstaller_1.4.7 > >> > >> loaded via a namespace (and not attached): > >> [1] BSgenome_1.24.0 Biostrings_2.24.1 DBI_0.2-5 > MASS_7.3-19 > >> [5] RColorBrewer_1.0-5 RCurl_1.91-1 RSQLite_0.11.1 > Rsamtools_1.8.5 > >> [9] XML_3.9-4 biomaRt_2.12.0 bitops_1.0-4.1 > colorspace_1.1-1 > >> [13] compiler_2.15.1 dichromat_1.2-4 digest_0.5.2 > grid_2.15.1 > >> [17] labeling_0.1 lattice_0.20-6 memoise_0.1 > munsell_0.3 > >> [21] plyr_1.7.1 proto_0.3-9.2 scales_0.2.1 > stats4_2.15.1 > >> [25] stringr_0.6.1 tools_2.15.1 zlibbioc_1.2.0 > >> > >> Thanks in advance, > >> > >> Chris > >> > >> [[alternative HTML version deleted]] > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor@r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 880 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6