Exclude overlapping intervals

0

Entering edit mode

Hermann Norpois ▴ 110

@hermann-norpois-5483

Last seen 9.6 years ago

Hello, having a GRanges object I was looking for a function to exclude all overlapping intervals. So I get exclusively intervals that do not overlap. But I did not find a proper function. Has anybody an idea? Thanks Hermann [[alternative HTML version deleted]]

• 4.3k views

ADD COMMENT • link updated 11.4 years ago by Michael Lawrence ★ 11k • written 11.4 years ago by Hermann Norpois ▴ 110

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.4 years ago

United States

Something like: gr[!gr %in% excluded] Another option is setdiff() but note that will generate a new set of sorted, non-overlapping, non-adjacent ("normal") ranges. Michael On Thu, Dec 13, 2012 at 9:01 AM, Hermann Norpois <hnorpois@googlemail.com>wrote: > Hello, > > having a GRanges object I was looking for a function to exclude all > overlapping intervals. So I get exclusively intervals that do not overlap. > But I did not find a proper function. Has anybody an idea? > Thanks > Hermann > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 11.4 years ago Michael Lawrence ★ 11k

0

Entering edit mode

As both methods did not work ... 1) > gr[!gr %in% excluded] gr[!gr %in% excluded] Fehler in gr[!gr %in% excluded] : Fehler bei der Auswertung des Argumentes 'i' bei der Methodenauswahl für Funktion '[': Fehler in gr %in% excluded : Fehler bei der Auswertung des Argumentes 'table' bei der Methodenauswahl für Funktion '%in%': Fehler: Objekt 'excluded' nicht gefunden > setdiff (gr) Fehler in as.vector(x) : Keine Methode um diese S4 Klasse in einen Vektor zu verwandeln ... I am posting my data: > gr GRanges with 5 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] chr1 [ 4, 10] * [2] chr1 [ 12, 19] * [3] chr1 [ 45, 80] * [4] chr1 [ 55, 100] * [5] chr1 [105, 200] * --- seqlengths: chr1 NA > dput (gr) new("GRanges" , seqnames = new("Rle" , values = structure(1L, .Label = "chr1", class = "factor") , lengths = 5L , elementMetadata = NULL , metadata = list() ) , ranges = new("IRanges" , start = c(4L, 12L, 45L, 55L, 105L) , width = c(7L, 8L, 36L, 46L, 96L) , NAMES = NULL , elementType = "integer" , elementMetadata = NULL , metadata = list() ) , strand = new("Rle" , values = structure(3L, .Label = c("+", "-", "*"), class = "factor") , lengths = 5L , elementMetadata = NULL , metadata = list() ) , elementMetadata = new("DataFrame" , rownames = NULL , nrows = 5L , listData = structure(list(), .Names = character(0)) , elementType = "ANY" , elementMetadata = NULL , metadata = list() ) , seqinfo = new("Seqinfo" , seqnames = "chr1" , seqlengths = NA_integer_ , is_circular = NA , genome = NA_character_ ) , metadata = list() ) Thanks Hermann 2012/12/13 Michael Lawrence <lawrence.michael@gene.com> > Something like: > > gr[!gr %in% excluded] > > Another option is setdiff() but note that will generate a new set of > sorted, non-overlapping, non-adjacent ("normal") ranges. > > Michael > > > On Thu, Dec 13, 2012 at 9:01 AM, Hermann Norpois <hnorpois@googlemail.com>wrote: > >> Hello, >> >> having a GRanges object I was looking for a function to exclude all >> overlapping intervals. So I get exclusively intervals that do not overlap. >> But I did not find a proper function. Has anybody an idea? >> Thanks >> Hermann >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]]

ADD REPLY • link 11.4 years ago Hermann Norpois ▴ 110

0

Entering edit mode

Hi Hermann, On 12/13/2012 09:33 AM, Hermann Norpois wrote: > As both methods did not work ... > > 1) > gr[!gr %in% excluded] gr[!gr %in% excluded] > Fehler in gr[!gr %in% excluded] : > Fehler bei der Auswertung des Argumentes 'i' bei der Methodenauswahl > f?r Funktion '[': Fehler in gr %in% excluded : > Fehler bei der Auswertung des Argumentes 'table' bei der Methodenauswahl > f?r Funktion '%in%': Fehler: Objekt 'excluded' nicht gefunden My limited knowledge of Goethe's tongue tells me that the 'excluded' object was not found. Of course you need to provide a GRanges object, call it 'excluded' or 'gr2', that contains the ranges that you want to remove from 'gr' (based on overlaps). Using 'gr' itself in place of 'gr2' to remove ranges in 'gr' that overlap with other ranges in 'gr' would not produce what you want though, because that would exclude everything: > gr[!gr %in% gr] GRanges with 0 ranges and 2 metadata columns: seqnames ranges strand | score GC <rle> <iranges> <rle> | <integer> <numeric> --- seqlengths: chr1 chr2 chr3 1000 2000 1500 Try this instead: gr[countOverlaps(gr, gr) <= 1L] Cheers, H. > >> setdiff (gr) > Fehler in as.vector(x) : > Keine Methode um diese S4 Klasse in einen Vektor zu verwandeln > > ... I am posting my data: > >> gr > GRanges with 5 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [ 4, 10] * > [2] chr1 [ 12, 19] * > [3] chr1 [ 45, 80] * > [4] chr1 [ 55, 100] * > [5] chr1 [105, 200] * > --- > seqlengths: > chr1 > NA >> dput (gr) > new("GRanges" > , seqnames = new("Rle" > , values = structure(1L, .Label = "chr1", class = "factor") > , lengths = 5L > , elementMetadata = NULL > , metadata = list() > ) > , ranges = new("IRanges" > , start = c(4L, 12L, 45L, 55L, 105L) > , width = c(7L, 8L, 36L, 46L, 96L) > , NAMES = NULL > , elementType = "integer" > , elementMetadata = NULL > , metadata = list() > ) > , strand = new("Rle" > , values = structure(3L, .Label = c("+", "-", "*"), class = "factor") > , lengths = 5L > , elementMetadata = NULL > , metadata = list() > ) > , elementMetadata = new("DataFrame" > , rownames = NULL > , nrows = 5L > , listData = structure(list(), .Names = character(0)) > , elementType = "ANY" > , elementMetadata = NULL > , metadata = list() > ) > , seqinfo = new("Seqinfo" > , seqnames = "chr1" > , seqlengths = NA_integer_ > , is_circular = NA > , genome = NA_character_ > ) > , metadata = list() > ) > > Thanks > Hermann > > 2012/12/13 Michael Lawrence <lawrence.michael at="" gene.com=""> > >> Something like: >> >> gr[!gr %in% excluded] >> >> Another option is setdiff() but note that will generate a new set of >> sorted, non-overlapping, non-adjacent ("normal") ranges. >> >> Michael >> >> >> On Thu, Dec 13, 2012 at 9:01 AM, Hermann Norpois <hnorpois at="" googlemail.com="">wrote: >> >>> Hello, >>> >>> having a GRanges object I was looking for a function to exclude all >>> overlapping intervals. So I get exclusively intervals that do not overlap. >>> But I did not find a proper function. Has anybody an idea? >>> Thanks >>> Hermann >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > > [[alternative HTML version deleted]] > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 11.4 years ago Hervé Pagès 16k

0

Entering edit mode

Hi, yall, I'm thinking Hermann is looking where the coverage of gr is 1, like this: library(functional) lapply(coverage(gr)==1,Compose(as.vector,whichAsIRanges)) $chr1 NormalIRanges of length 5 start end width [1] 4 10 7 [2] 12 19 8 [3] 45 54 10 [4] 81 100 20 [5] 105 200 96 ... which answers the question but the data is improperly shaped. We can also think of this as where gr does not overlap itself, and the result is properly shaped, viz: > grd<-disjoin(gr) > grd[1==countOverlaps(grd,gr)] GRanges with 5 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] chr1 [ 4, 10] * [2] chr1 [ 12, 19] * [3] chr1 [ 45, 54] * [4] chr1 [ 81, 100] * [5] chr1 [105, 200] * --- seqlengths: chr1 NA So, which way is 'best' for performance? Best, ~Malcolm > -----Original Message----- > From: bioconductor-bounces at r-project.org [mailto:bioconductor- bounces at r-project.org] On Behalf Of Hervé Pagès > Sent: Thursday, December 13, 2012 12:03 PM > To: Hermann Norpois > Cc: Michael Lawrence; bioconductor at r-project.org > Subject: Re: [BioC] Exclude overlapping intervals > > Hi Hermann, > > On 12/13/2012 09:33 AM, Hermann Norpois wrote: > > As both methods did not work ... > > > > 1) > gr[!gr %in% excluded] gr[!gr %in% excluded] > > Fehler in gr[!gr %in% excluded] : > > Fehler bei der Auswertung des Argumentes 'i' bei der Methodenauswahl > > f?r Funktion '[': Fehler in gr %in% excluded : > > Fehler bei der Auswertung des Argumentes 'table' bei der Methodenauswahl > > f?r Funktion '%in%': Fehler: Objekt 'excluded' nicht gefunden > > My limited knowledge of Goethe's tongue tells me that the 'excluded' > object was not found. Of course you need to provide a GRanges object, > call it 'excluded' or 'gr2', that contains the ranges that you want to > remove from 'gr' (based on overlaps). > > Using 'gr' itself in place of 'gr2' to remove ranges in 'gr' that > overlap with other ranges in 'gr' would not produce what you want > though, because that would exclude everything: > > > gr[!gr %in% gr] > GRanges with 0 ranges and 2 metadata columns: > seqnames ranges strand | score GC > <rle> <iranges> <rle> | <integer> <numeric> > --- > seqlengths: > chr1 chr2 chr3 > 1000 2000 1500 > > Try this instead: > > gr[countOverlaps(gr, gr) <= 1L] > > Cheers, > H. > > > > >> setdiff (gr) > > Fehler in as.vector(x) : > > Keine Methode um diese S4 Klasse in einen Vektor zu verwandeln > > > > ... I am posting my data: > > > >> gr > > GRanges with 5 ranges and 0 metadata columns: > > seqnames ranges strand > > <rle> <iranges> <rle> > > [1] chr1 [ 4, 10] * > > [2] chr1 [ 12, 19] * > > [3] chr1 [ 45, 80] * > > [4] chr1 [ 55, 100] * > > [5] chr1 [105, 200] * > > --- > > seqlengths: > > chr1 > > NA > >> dput (gr) > > new("GRanges" > > , seqnames = new("Rle" > > , values = structure(1L, .Label = "chr1", class = "factor") > > , lengths = 5L > > , elementMetadata = NULL > > , metadata = list() > > ) > > , ranges = new("IRanges" > > , start = c(4L, 12L, 45L, 55L, 105L) > > , width = c(7L, 8L, 36L, 46L, 96L) > > , NAMES = NULL > > , elementType = "integer" > > , elementMetadata = NULL > > , metadata = list() > > ) > > , strand = new("Rle" > > , values = structure(3L, .Label = c("+", "-", "*"), class = "factor") > > , lengths = 5L > > , elementMetadata = NULL > > , metadata = list() > > ) > > , elementMetadata = new("DataFrame" > > , rownames = NULL > > , nrows = 5L > > , listData = structure(list(), .Names = character(0)) > > , elementType = "ANY" > > , elementMetadata = NULL > > , metadata = list() > > ) > > , seqinfo = new("Seqinfo" > > , seqnames = "chr1" > > , seqlengths = NA_integer_ > > , is_circular = NA > > , genome = NA_character_ > > ) > > , metadata = list() > > ) > > > > Thanks > > Hermann > > > > 2012/12/13 Michael Lawrence <lawrence.michael at="" gene.com=""> > > > >> Something like: > >> > >> gr[!gr %in% excluded] > >> > >> Another option is setdiff() but note that will generate a new set of > >> sorted, non-overlapping, non-adjacent ("normal") ranges. > >> > >> Michael > >> > >> > >> On Thu, Dec 13, 2012 at 9:01 AM, Hermann Norpois <hnorpois at="" googlemail.com="">wrote: > >> > >>> Hello, > >>> > >>> having a GRanges object I was looking for a function to exclude all > >>> overlapping intervals. So I get exclusively intervals that do not overlap. > >>> But I did not find a proper function. Has anybody an idea? > >>> Thanks > >>> Hermann > >>> > >>> [[alternative HTML version deleted]] > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor at r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: > >>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>> > >> > >> > > > > [[alternative HTML version deleted]] > > > > > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 11.4 years ago Malcolm Cook ★ 1.6k

0

Entering edit mode

On 12/13/2012 12:00 PM, Cook, Malcolm wrote: > Hi, yall, > > I'm thinking Hermann is looking where the coverage of gr is 1, like this: > > library(functional) > lapply(coverage(gr)==1,Compose(as.vector,whichAsIRanges)) > $chr1 > NormalIRanges of length 5 > start end width > [1] 4 10 7 > [2] 12 19 8 > [3] 45 54 10 > [4] 81 100 20 > [5] 105 200 96 > > > ... which answers the question but the data is improperly shaped. > > We can also think of this as where gr does not overlap itself, and the result is properly shaped, viz: >> grd<-disjoin(gr) >> grd[1==countOverlaps(grd,gr)] > GRanges with 5 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [ 4, 10] * > [2] chr1 [ 12, 19] * > [3] chr1 [ 45, 54] * > [4] chr1 [ 81, 100] * > [5] chr1 [105, 200] * > --- > seqlengths: > chr1 > NA > > So, which way is 'best' for performance? Note that the 2 solutions are not equivalent if the ranges are stranded: the former will ignore the strand whereas the latter will consider it. Even if the ranges are not stranded (like in your example), you still need to reduce() the result of the latter in order to get exactly the same set of ranges as with the former. 2 alternatives to solution 1 with result properly shaped: ** Sol 1b: irl <- IRangesList(lapply(coverage(gr) == 1L, as, "IRanges")) gr1b <- as(irl, "GRanges") ** Sol 1c (using slice): gr1c <- as(slice(coverage(gr), lower=1L, upper=1L), "GRanges") Sol 1b seems to be the fastest. H. > > Best, > > ~Malcolm > > >> -----Original Message----- >> From: bioconductor-bounces at r-project.org [mailto:bioconductor- bounces at r-project.org] On Behalf Of Hervé Pagès >> Sent: Thursday, December 13, 2012 12:03 PM >> To: Hermann Norpois >> Cc: Michael Lawrence; bioconductor at r-project.org >> Subject: Re: [BioC] Exclude overlapping intervals >> >> Hi Hermann, >> >> On 12/13/2012 09:33 AM, Hermann Norpois wrote: >>> As both methods did not work ... >>> >>> 1) > gr[!gr %in% excluded] gr[!gr %in% excluded] >>> Fehler in gr[!gr %in% excluded] : >>> Fehler bei der Auswertung des Argumentes 'i' bei der Methodenauswahl >>> f?r Funktion '[': Fehler in gr %in% excluded : >>> Fehler bei der Auswertung des Argumentes 'table' bei der Methodenauswahl >>> f?r Funktion '%in%': Fehler: Objekt 'excluded' nicht gefunden >> >> My limited knowledge of Goethe's tongue tells me that the 'excluded' >> object was not found. Of course you need to provide a GRanges object, >> call it 'excluded' or 'gr2', that contains the ranges that you want to >> remove from 'gr' (based on overlaps). >> >> Using 'gr' itself in place of 'gr2' to remove ranges in 'gr' that >> overlap with other ranges in 'gr' would not produce what you want >> though, because that would exclude everything: >> >> > gr[!gr %in% gr] >> GRanges with 0 ranges and 2 metadata columns: >> seqnames ranges strand | score GC >> <rle> <iranges> <rle> | <integer> <numeric> >> --- >> seqlengths: >> chr1 chr2 chr3 >> 1000 2000 1500 >> >> Try this instead: >> >> gr[countOverlaps(gr, gr) <= 1L] >> >> Cheers, >> H. >> >>> >>>> setdiff (gr) >>> Fehler in as.vector(x) : >>> Keine Methode um diese S4 Klasse in einen Vektor zu verwandeln >>> >>> ... I am posting my data: >>> >>>> gr >>> GRanges with 5 ranges and 0 metadata columns: >>> seqnames ranges strand >>> <rle> <iranges> <rle> >>> [1] chr1 [ 4, 10] * >>> [2] chr1 [ 12, 19] * >>> [3] chr1 [ 45, 80] * >>> [4] chr1 [ 55, 100] * >>> [5] chr1 [105, 200] * >>> --- >>> seqlengths: >>> chr1 >>> NA >>>> dput (gr) >>> new("GRanges" >>> , seqnames = new("Rle" >>> , values = structure(1L, .Label = "chr1", class = "factor") >>> , lengths = 5L >>> , elementMetadata = NULL >>> , metadata = list() >>> ) >>> , ranges = new("IRanges" >>> , start = c(4L, 12L, 45L, 55L, 105L) >>> , width = c(7L, 8L, 36L, 46L, 96L) >>> , NAMES = NULL >>> , elementType = "integer" >>> , elementMetadata = NULL >>> , metadata = list() >>> ) >>> , strand = new("Rle" >>> , values = structure(3L, .Label = c("+", "-", "*"), class = "factor") >>> , lengths = 5L >>> , elementMetadata = NULL >>> , metadata = list() >>> ) >>> , elementMetadata = new("DataFrame" >>> , rownames = NULL >>> , nrows = 5L >>> , listData = structure(list(), .Names = character(0)) >>> , elementType = "ANY" >>> , elementMetadata = NULL >>> , metadata = list() >>> ) >>> , seqinfo = new("Seqinfo" >>> , seqnames = "chr1" >>> , seqlengths = NA_integer_ >>> , is_circular = NA >>> , genome = NA_character_ >>> ) >>> , metadata = list() >>> ) >>> >>> Thanks >>> Hermann >>> >>> 2012/12/13 Michael Lawrence <lawrence.michael at="" gene.com=""> >>> >>>> Something like: >>>> >>>> gr[!gr %in% excluded] >>>> >>>> Another option is setdiff() but note that will generate a new set of >>>> sorted, non-overlapping, non-adjacent ("normal") ranges. >>>> >>>> Michael >>>> >>>> >>>> On Thu, Dec 13, 2012 at 9:01 AM, Hermann Norpois <hnorpois at="" googlemail.com="">wrote: >>>> >>>>> Hello, >>>>> >>>>> having a GRanges object I was looking for a function to exclude all >>>>> overlapping intervals. So I get exclusively intervals that do not overlap. >>>>> But I did not find a proper function. Has anybody an idea? >>>>> Thanks >>>>> Hermann >>>>> >>>>> [[alternative HTML version deleted]] >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>> >>>> >>> >>> [[alternative HTML version deleted]] >>> >>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 11.4 years ago Hervé Pagès 16k

0

Entering edit mode

Hermann, did you get the solution you needed out of this? Herve, please check out my notes below. I think you will be amused/intersted.... .On 12/13/2012 12:00 PM, Cook, Malcolm wrote: .> Hi, yall, .> .> I'm thinking Hermann is looking where the coverage of gr is 1, like this: .> .> library(functional) .> lapply(coverage(gr)==1,Compose(as.vector,whichAsIRanges)) .> $chr1 .> NormalIRanges of length 5 .> start end width .> [1] 4 10 7 .> [2] 12 19 8 .> [3] 45 54 10 .> [4] 81 100 20 .> [5] 105 200 96 .> .> .> ... which answers the question but the data is improperly shaped. .> .> We can also think of this as where gr does not overlap itself, and the result is properly shaped, viz: .>> grd<-disjoin(gr) .>> grd[1==countOverlaps(grd,gr)] .> GRanges with 5 ranges and 0 metadata columns: .> seqnames ranges strand .> <rle> <iranges> <rle> .> [1] chr1 [ 4, 10] * .> [2] chr1 [ 12, 19] * .> [3] chr1 [ 45, 54] * .> [4] chr1 [ 81, 100] * .> [5] chr1 [105, 200] * .> --- .> seqlengths: .> chr1 .> NA .> .> So, which way is 'best' for performance? . .Note that the 2 solutions are not equivalent if the ranges are .stranded: the former will ignore the strand whereas the latter .will consider it. Even if the ranges are not stranded (like in .your example), you still need to reduce() the result of the latter .in order to get exactly the same set of ranges as with the former. . .2 alternatives to solution 1 with result properly shaped: . .** Sol 1b: . . irl <- IRangesList(lapply(coverage(gr) == 1L, as, "IRanges")) . gr1b <- as(irl, "GRanges") . .** Sol 1c (using slice): . . gr1c <- as(slice(coverage(gr), lower=1L, upper=1L), "GRanges") . .Sol 1b seems to be the fastest. . Herve, thanks for these observations re: stranded and the need to reduce the results of my method 1. In the most general case I agree, you are right, and the strand preserving method should be: grd<-disjoin(gr) reduce(grd[1==countOverlaps(grd,gr)]) Trying to appreciate and simplify your soln1, I rewrote it as a one- liner, thusly: as(as(coverage(gr) == 1L,'IRangesList'),'GRanges') which prompts me to abstract as follows: > setAs('RleList','GRanges', function(from) { as(as(from,'IRangesList'),'GRanges') }) allowing to express the problem most perspicuously like this: > as(coverage(gr) == 1L,'GRanges') GRanges with 5 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] chr1 [ 4, 10] * [2] chr1 [ 12, 19] * [3] chr1 [ 45, 54] * [4] chr1 [ 81, 100] * [5] chr1 [105, 200] * --- seqlengths: chr1 NA > Is this in the spirit? Is my SetAs making assumptions it should not? Opinions anyone? Oh! HEY! During this exploration, I discover to my great astonishment that `==` is not symetric at least as regards to name preservation !!!!! > names(1==coverage(gr)) NULL > names(coverage(gr)==1) [1] "chr1" Is there an explanation for this antisymetry? Is this to be expected? Should it be remedied? Is this another thread? I am SURE that I am not the only one to have gotten bitten by this in the past. In fact, I think that I have run up in the past and have not realized that switch the order of the operands 'fixes' the problem. As a result of this issue, for instance, my elegant (?) abstraction won't work when the comparison is reversed and will generate CRYPTIC error messages: as(1L==coverage(gr),'GRanges') Error in function (classes, fdef, mtable) (from #3) : unable to find an inherited method for function "Rle", for signature "NULL", "missing" Prompting me to change my setAs thusly: setAs('RleList','GRanges', function(from) { if (is.null(names(from))) stop( '\nHey, the parameter `from` is un-named, thus the segnames of any hoped-for `GRanges` are un-assignable!\n' ,'Possible workaround: reverse the order of a comparison which produced `from` (if any)!\n' ,'Example: re-write "1L==coverage(gr)" as "coverage(gr)==1\n' ,'Best: fix the up-stream bug that `==` is not symetric w.r.t. names' ) as(as(from,'IRangesList'),'GRanges') }) Or, or, or? ~Malcolm .H. . .> .> Best, .> .> ~Malcolm .> .> .>> -----Original Message----- .>> From: bioconductor-bounces at r-project.org [mailto:bioconductor- bounces at r-project.org] On Behalf Of Hervé Pagès .>> Sent: Thursday, December 13, 2012 12:03 PM .>> To: Hermann Norpois .>> Cc: Michael Lawrence; bioconductor at r-project.org .>> Subject: Re: [BioC] Exclude overlapping intervals .>> .>> Hi Hermann, .>> .>> On 12/13/2012 09:33 AM, Hermann Norpois wrote: .>>> As both methods did not work ... .>>> .>>> 1) > gr[!gr %in% excluded] gr[!gr %in% excluded] .>>> Fehler in gr[!gr %in% excluded] : .>>> Fehler bei der Auswertung des Argumentes 'i' bei der Methodenauswahl .>>> f?r Funktion '[': Fehler in gr %in% excluded : .>>> Fehler bei der Auswertung des Argumentes 'table' bei der Methodenauswahl .>>> f?r Funktion '%in%': Fehler: Objekt 'excluded' nicht gefunden .>> .>> My limited knowledge of Goethe's tongue tells me that the 'excluded' .>> object was not found. Of course you need to provide a GRanges object, .>> call it 'excluded' or 'gr2', that contains the ranges that you want to .>> remove from 'gr' (based on overlaps). .>> .>> Using 'gr' itself in place of 'gr2' to remove ranges in 'gr' that .>> overlap with other ranges in 'gr' would not produce what you want .>> though, because that would exclude everything: .>> .>> > gr[!gr %in% gr] .>> GRanges with 0 ranges and 2 metadata columns: .>> seqnames ranges strand | score GC .>> <rle> <iranges> <rle> | <integer> <numeric> .>> --- .>> seqlengths: .>> chr1 chr2 chr3 .>> 1000 2000 1500 .>> .>> Try this instead: .>> .>> gr[countOverlaps(gr, gr) <= 1L] .>> .>> Cheers, .>> H. .>> .>>> .>>>> setdiff (gr) .>>> Fehler in as.vector(x) : .>>> Keine Methode um diese S4 Klasse in einen Vektor zu verwandeln .>>> .>>> ... I am posting my data: .>>> .>>>> gr .>>> GRanges with 5 ranges and 0 metadata columns: .>>> seqnames ranges strand .>>> <rle> <iranges> <rle> .>>> [1] chr1 [ 4, 10] * .>>> [2] chr1 [ 12, 19] * .>>> [3] chr1 [ 45, 80] * .>>> [4] chr1 [ 55, 100] * .>>> [5] chr1 [105, 200] * .>>> --- .>>> seqlengths: .>>> chr1 .>>> NA .>>>> dput (gr) .>>> new("GRanges" .>>> , seqnames = new("Rle" .>>> , values = structure(1L, .Label = "chr1", class = "factor") .>>> , lengths = 5L .>>> , elementMetadata = NULL .>>> , metadata = list() .>>> ) .>>> , ranges = new("IRanges" .>>> , start = c(4L, 12L, 45L, 55L, 105L) .>>> , width = c(7L, 8L, 36L, 46L, 96L) .>>> , NAMES = NULL .>>> , elementType = "integer" .>>> , elementMetadata = NULL .>>> , metadata = list() .>>> ) .>>> , strand = new("Rle" .>>> , values = structure(3L, .Label = c("+", "-", "*"), class = "factor") .>>> , lengths = 5L .>>> , elementMetadata = NULL .>>> , metadata = list() .>>> ) .>>> , elementMetadata = new("DataFrame" .>>> , rownames = NULL .>>> , nrows = 5L .>>> , listData = structure(list(), .Names = character(0)) .>>> , elementType = "ANY" .>>> , elementMetadata = NULL .>>> , metadata = list() .>>> ) .>>> , seqinfo = new("Seqinfo" .>>> , seqnames = "chr1" .>>> , seqlengths = NA_integer_ .>>> , is_circular = NA .>>> , genome = NA_character_ .>>> ) .>>> , metadata = list() .>>> ) .>>> .>>> Thanks .>>> Hermann .>>> .>>> 2012/12/13 Michael Lawrence <lawrence.michael at="" gene.com=""> .>>> .>>>> Something like: .>>>> .>>>> gr[!gr %in% excluded] .>>>> .>>>> Another option is setdiff() but note that will generate a new set of .>>>> sorted, non-overlapping, non-adjacent ("normal") ranges. .>>>> .>>>> Michael .>>>> .>>>> .>>>> On Thu, Dec 13, 2012 at 9:01 AM, Hermann Norpois <hnorpois at="" googlemail.com="">wrote: .>>>> .>>>>> Hello, .>>>>> .>>>>> having a GRanges object I was looking for a function to exclude all .>>>>> overlapping intervals. So I get exclusively intervals that do not overlap. .>>>>> But I did not find a proper function. Has anybody an idea? .>>>>> Thanks .>>>>> Hermann .>>>>> .>>>>> [[alternative HTML version deleted]] .>>>>> .>>>>> _______________________________________________ .>>>>> Bioconductor mailing list .>>>>> Bioconductor at r-project.org .>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor .>>>>> Search the archives: .>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor .>>>>> .>>>> .>>>> .>>> .>>> [[alternative HTML version deleted]] .>>> .>>> .>>> .>>> _______________________________________________ .>>> Bioconductor mailing list .>>> Bioconductor at r-project.org .>>> https://stat.ethz.ch/mailman/listinfo/bioconductor .>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor .>>> .>> .>> -- .>> Hervé Pagès .>> .>> Program in Computational Biology .>> Division of Public Health Sciences .>> Fred Hutchinson Cancer Research Center .>> 1100 Fairview Ave. N, M1-B514 .>> P.O. Box 19024 .>> Seattle, WA 98109-1024 .>> .>> E-mail: hpages at fhcrc.org .>> Phone: (206) 667-5791 .>> Fax: (206) 667-1319 .>> .>> _______________________________________________ .>> Bioconductor mailing list .>> Bioconductor at r-project.org .>> https://stat.ethz.ch/mailman/listinfo/bioconductor .>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor . .-- .Hervé Pagès . .Program in Computational Biology .Division of Public Health Sciences .Fred Hutchinson Cancer Research Center .1100 Fairview Ave. N, M1-B514 .P.O. Box 19024 .Seattle, WA 98109-1024 . .E-mail: hpages at fhcrc.org .Phone: (206) 667-5791 .Fax: (206) 667-1319

ADD REPLY • link 11.4 years ago Malcolm Cook ★ 1.6k

0

Entering edit mode

Hi Malcolm, On 12/14/2012 11:24 AM, Cook, Malcolm wrote: > Hermann, did you get the solution you needed out of this? > > Herve, please check out my notes below. I think you will be amused/intersted.... > > .On 12/13/2012 12:00 PM, Cook, Malcolm wrote: > .> Hi, yall, > .> > .> I'm thinking Hermann is looking where the coverage of gr is 1, like this: > .> > .> library(functional) > .> lapply(coverage(gr)==1,Compose(as.vector,whichAsIRanges)) > .> $chr1 > .> NormalIRanges of length 5 > .> start end width > .> [1] 4 10 7 > .> [2] 12 19 8 > .> [3] 45 54 10 > .> [4] 81 100 20 > .> [5] 105 200 96 > .> > .> > .> ... which answers the question but the data is improperly shaped. > .> > .> We can also think of this as where gr does not overlap itself, and the result is properly shaped, viz: > .>> grd<-disjoin(gr) > .>> grd[1==countOverlaps(grd,gr)] > .> GRanges with 5 ranges and 0 metadata columns: > .> seqnames ranges strand > .> <rle> <iranges> <rle> > .> [1] chr1 [ 4, 10] * > .> [2] chr1 [ 12, 19] * > .> [3] chr1 [ 45, 54] * > .> [4] chr1 [ 81, 100] * > .> [5] chr1 [105, 200] * > .> --- > .> seqlengths: > .> chr1 > .> NA > .> > .> So, which way is 'best' for performance? > . > .Note that the 2 solutions are not equivalent if the ranges are > .stranded: the former will ignore the strand whereas the latter > .will consider it. Even if the ranges are not stranded (like in > .your example), you still need to reduce() the result of the latter > .in order to get exactly the same set of ranges as with the former. > . > .2 alternatives to solution 1 with result properly shaped: > . > .** Sol 1b: > . > . irl <- IRangesList(lapply(coverage(gr) == 1L, as, "IRanges")) > . gr1b <- as(irl, "GRanges") > . > .** Sol 1c (using slice): > . > . gr1c <- as(slice(coverage(gr), lower=1L, upper=1L), "GRanges") > . > .Sol 1b seems to be the fastest. > . > > Herve, thanks for these observations re: stranded and the need to reduce the results of my method 1. > In the most general case I agree, you are right, and the strand preserving method should be: > > grd<-disjoin(gr) > reduce(grd[1==countOverlaps(grd,gr)]) > > Trying to appreciate and simplify your soln1, I rewrote it as a one- liner, thusly: > > as(as(coverage(gr) == 1L,'IRangesList'),'GRanges') > > which prompts me to abstract as follows: > >> setAs('RleList','GRanges', > function(from) { > as(as(from,'IRangesList'),'GRanges') > }) There is already a coercion method for going from RleList to GRanges and it does something different: > rlelist SimpleRleList of length 3 $chr1 logical-Rle of length 1000 with 2 runs Lengths: 4 996 Values : TRUE FALSE $chr2 logical-Rle of length 2000 with 3 runs Lengths: 1 1 1998 Values : FALSE TRUE FALSE $chr3 logical-Rle of length 1500 with 3 runs Lengths: 6 1 1493 Values : FALSE TRUE FALSE > as(rlelist, "GRanges") GRanges with 8 ranges and 1 metadata column: seqnames ranges strand | score <rle> <iranges> <rle> | <logical> [1] chr1 [1, 4] * | TRUE [2] chr1 [5, 1000] * | FALSE [3] chr2 [1, 1] * | FALSE [4] chr2 [2, 2] * | TRUE [5] chr2 [3, 2000] * | FALSE [6] chr3 [1, 6] * | FALSE [7] chr3 [7, 7] * | TRUE [8] chr3 [8, 1500] * | FALSE --- seqlengths: chr1 chr2 chr3 1000 2000 1500 Yours is probably more useful on *logical*-RleList's but the current one is more general because it works on any RleList. One of the challenges of implementing a good coercion system between the many classes we have is to pick up the "most natural" semantic for a given coercion. However, what seems to be the "most natural" is necessarily biased by the context and by people's use-cases and personal backgrounds. This is a situation where use-case driven semantic is not necessarily helping. What is more helpful for choosing a semantic is to look at properties like: (a) Consistency with coercion between ordinary vectors. (b) Are two-step coercions equivalent to the corresponding 1-step coercion? That is, is 'as(as(x, "B"), "C")' identical to 'as(x, "C")' for any intermediate representation "B"? This might be hard to achieve though. For example, right now, doing logical-RleList -> RangedData -> GRanges doesn't produce the same as doing logical-RleList -> IRangesList -> GRanges. (c) No-loss/reversible coercion, that is, if 'class(x)' is "A", and "B" is a "higher-level" class than "A", coercing 'x' to "B" should try to preserve all the information in 'x', and coercing back to "A" should give back 'x' (maybe modulo some reordering). I don't think the coercion system we have in the IRanges/GenomicRanges infrastructure with its 150+ coercion methods currently satisfies (a), (b), (c), but I'm sure we could improve it. I'll modify the current coercion from RleList to GRanges to treat logical-RleList in the way you propose if nobody objects. > > allowing to express the problem most perspicuously like this: > >> as(coverage(gr) == 1L,'GRanges') > GRanges with 5 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [ 4, 10] * > [2] chr1 [ 12, 19] * > [3] chr1 [ 45, 54] * > [4] chr1 [ 81, 100] * > [5] chr1 [105, 200] * > --- > seqlengths: > chr1 > NA >> > > Is this in the spirit? Is my SetAs making assumptions it should not? > > Opinions anyone? > > > Oh! HEY! During this exploration, I discover to my great astonishment that `==` is not symetric at least as regards to name preservation !!!!! Unfortunately there is no way it can be symmetric with regards to name preservation. Now what we should try to have is more consistency with name preservation on ordinary vectors where the rules *seem* to be (I couldn't find them documented in the man page for ==): (1) The longest operand wins. (2) In case of tie (i.e. same length), the operand with names wins. (3) In case of tie (i.e. both have names), the left operand wins. Note that there is no reason those rules would be only for ==, they could apply to any vectorized binary operator that does recycling of the shortest operand (I only checked they apply to all the basic arithmetic and comparison operators in base R). > >> names(1==coverage(gr)) > NULL >> names(coverage(gr)==1) > [1] "chr1" > > Is there an explanation for this antisymetry? Is this to be expected? Should it be remedied? Is this another thread? Clearly we are not doing (1), (2), (3) but we probably should. I'll put this on my list. > > I am SURE that I am not the only one to have gotten bitten by this in the past. In fact, I think that I have run up in the past and have not realized that switch the order of the operands 'fixes' the problem. > > As a result of this issue, for instance, my elegant (?) abstraction won't work when the comparison is reversed and will generate CRYPTIC error messages: > > as(1L==coverage(gr),'GRanges') > Error in function (classes, fdef, mtable) (from #3) : > unable to find an inherited method for function "Rle", for signature "NULL", "missing" > > Prompting me to change my setAs thusly: > > setAs('RleList','GRanges', > function(from) { > if (is.null(names(from))) stop( > '\nHey, the parameter `from` is un-named, thus the segnames of any hoped-for `GRanges` are un-assignable!\n' > ,'Possible workaround: reverse the order of a comparison which produced `from` (if any)!\n' > ,'Example: re-write "1L==coverage(gr)" as "coverage(gr)==1\n' > ,'Best: fix the up-stream bug that `==` is not symetric w.r.t. names' > ) > as(as(from,'IRangesList'),'GRanges') > }) > > Or, or, or? If the RleList is not named what we usually do is create automatic chromosome names 1, 2, 3, etc...: > as(unname(rlelist), "GRanges") GRanges with 8 ranges and 1 metadata column: seqnames ranges strand | score <rle> <iranges> <rle> | <logical> [1] 1 [1, 4] * | TRUE [2] 1 [5, 1000] * | FALSE [3] 2 [1, 1] * | FALSE [4] 2 [2, 2] * | TRUE [5] 2 [3, 2000] * | FALSE [6] 3 [1, 6] * | FALSE [7] 3 [7, 7] * | TRUE [8] 3 [8, 1500] * | FALSE --- seqlengths: 1 2 3 1000 2000 1500 Thanks for all the valuable feedback! H. > > ~Malcolm > > > > .H. > . > .> > .> Best, > .> > .> ~Malcolm > .> > .> > .>> -----Original Message----- > .>> From: bioconductor-bounces at r-project.org [mailto :bioconductor-bounces at r-project.org] On Behalf Of Hervé Pagès > .>> Sent: Thursday, December 13, 2012 12:03 PM > .>> To: Hermann Norpois > .>> Cc: Michael Lawrence; bioconductor at r-project.org > .>> Subject: Re: [BioC] Exclude overlapping intervals > .>> > .>> Hi Hermann, > .>> > .>> On 12/13/2012 09:33 AM, Hermann Norpois wrote: > .>>> As both methods did not work ... > .>>> > .>>> 1) > gr[!gr %in% excluded] gr[!gr %in% excluded] > .>>> Fehler in gr[!gr %in% excluded] : > .>>> Fehler bei der Auswertung des Argumentes 'i' bei der Methodenauswahl > .>>> f?r Funktion '[': Fehler in gr %in% excluded : > .>>> Fehler bei der Auswertung des Argumentes 'table' bei der Methodenauswahl > .>>> f?r Funktion '%in%': Fehler: Objekt 'excluded' nicht gefunden > .>> > .>> My limited knowledge of Goethe's tongue tells me that the 'excluded' > .>> object was not found. Of course you need to provide a GRanges object, > .>> call it 'excluded' or 'gr2', that contains the ranges that you want to > .>> remove from 'gr' (based on overlaps). > .>> > .>> Using 'gr' itself in place of 'gr2' to remove ranges in 'gr' that > .>> overlap with other ranges in 'gr' would not produce what you want > .>> though, because that would exclude everything: > .>> > .>> > gr[!gr %in% gr] > .>> GRanges with 0 ranges and 2 metadata columns: > .>> seqnames ranges strand | score GC > .>> <rle> <iranges> <rle> | <integer> <numeric> > .>> --- > .>> seqlengths: > .>> chr1 chr2 chr3 > .>> 1000 2000 1500 > .>> > .>> Try this instead: > .>> > .>> gr[countOverlaps(gr, gr) <= 1L] > .>> > .>> Cheers, > .>> H. > .>> > .>>> > .>>>> setdiff (gr) > .>>> Fehler in as.vector(x) : > .>>> Keine Methode um diese S4 Klasse in einen Vektor zu verwandeln > .>>> > .>>> ... I am posting my data: > .>>> > .>>>> gr > .>>> GRanges with 5 ranges and 0 metadata columns: > .>>> seqnames ranges strand > .>>> <rle> <iranges> <rle> > .>>> [1] chr1 [ 4, 10] * > .>>> [2] chr1 [ 12, 19] * > .>>> [3] chr1 [ 45, 80] * > .>>> [4] chr1 [ 55, 100] * > .>>> [5] chr1 [105, 200] * > .>>> --- > .>>> seqlengths: > .>>> chr1 > .>>> NA > .>>>> dput (gr) > .>>> new("GRanges" > .>>> , seqnames = new("Rle" > .>>> , values = structure(1L, .Label = "chr1", class = "factor") > .>>> , lengths = 5L > .>>> , elementMetadata = NULL > .>>> , metadata = list() > .>>> ) > .>>> , ranges = new("IRanges" > .>>> , start = c(4L, 12L, 45L, 55L, 105L) > .>>> , width = c(7L, 8L, 36L, 46L, 96L) > .>>> , NAMES = NULL > .>>> , elementType = "integer" > .>>> , elementMetadata = NULL > .>>> , metadata = list() > .>>> ) > .>>> , strand = new("Rle" > .>>> , values = structure(3L, .Label = c("+", "-", "*"), class = "factor") > .>>> , lengths = 5L > .>>> , elementMetadata = NULL > .>>> , metadata = list() > .>>> ) > .>>> , elementMetadata = new("DataFrame" > .>>> , rownames = NULL > .>>> , nrows = 5L > .>>> , listData = structure(list(), .Names = character(0)) > .>>> , elementType = "ANY" > .>>> , elementMetadata = NULL > .>>> , metadata = list() > .>>> ) > .>>> , seqinfo = new("Seqinfo" > .>>> , seqnames = "chr1" > .>>> , seqlengths = NA_integer_ > .>>> , is_circular = NA > .>>> , genome = NA_character_ > .>>> ) > .>>> , metadata = list() > .>>> ) > .>>> > .>>> Thanks > .>>> Hermann > .>>> > .>>> 2012/12/13 Michael Lawrence <lawrence.michael at="" gene.com=""> > .>>> > .>>>> Something like: > .>>>> > .>>>> gr[!gr %in% excluded] > .>>>> > .>>>> Another option is setdiff() but note that will generate a new set of > .>>>> sorted, non-overlapping, non-adjacent ("normal") ranges. > .>>>> > .>>>> Michael > .>>>> > .>>>> > .>>>> On Thu, Dec 13, 2012 at 9:01 AM, Hermann Norpois <hnorpois at="" googlemail.com="">wrote: > .>>>> > .>>>>> Hello, > .>>>>> > .>>>>> having a GRanges object I was looking for a function to exclude all > .>>>>> overlapping intervals. So I get exclusively intervals that do not overlap. > .>>>>> But I did not find a proper function. Has anybody an idea? > .>>>>> Thanks > .>>>>> Hermann > .>>>>> > .>>>>> [[alternative HTML version deleted]] > .>>>>> > .>>>>> _______________________________________________ > .>>>>> Bioconductor mailing list > .>>>>> Bioconductor at r-project.org > .>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > .>>>>> Search the archives: > .>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > .>>>>> > .>>>> > .>>>> > .>>> > .>>> [[alternative HTML version deleted]] > .>>> > .>>> > .>>> > .>>> _______________________________________________ > .>>> Bioconductor mailing list > .>>> Bioconductor at r-project.org > .>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > .>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > .>>> > .>> > .>> -- > .>> Hervé Pagès > .>> > .>> Program in Computational Biology > .>> Division of Public Health Sciences > .>> Fred Hutchinson Cancer Research Center > .>> 1100 Fairview Ave. N, M1-B514 > .>> P.O. Box 19024 > .>> Seattle, WA 98109-1024 > .>> > .>> E-mail: hpages at fhcrc.org > .>> Phone: (206) 667-5791 > .>> Fax: (206) 667-1319 > .>> > .>> _______________________________________________ > .>> Bioconductor mailing list > .>> Bioconductor at r-project.org > .>> https://stat.ethz.ch/mailman/listinfo/bioconductor > .>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > . > .-- > .Hervé Pagès > . > .Program in Computational Biology > .Division of Public Health Sciences > .Fred Hutchinson Cancer Research Center > .1100 Fairview Ave. N, M1-B514 > .P.O. Box 19024 > .Seattle, WA 98109-1024 > . > .E-mail: hpages at fhcrc.org > .Phone: (206) 667-5791 > .Fax: (206) 667-1319 > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 11.4 years ago Hervé Pagès 16k

0

Entering edit mode

Herve, .On 12/14/2012 11:24 AM, Cook, Malcolm wrote: .> Hermann, did you get the solution you needed out of this? .> .> Herve, please check out my notes below. I think you will be amused/intersted.... I SHOULD have included 'instructive/helpful' as well, for, as always you have been. .> .On 12/13/2012 12:00 PM, Cook, Malcolm wrote: .> .> Hi, yall, .> .> .> .> I'm thinking Hermann is looking where the coverage of gr is 1, like this: .> .> .> .> library(functional) .> .> lapply(coverage(gr)==1,Compose(as.vector,whichAsIRanges)) .> .> $chr1 .> .> NormalIRanges of length 5 .> .> start end width .> .> [1] 4 10 7 .> .> [2] 12 19 8 .> .> [3] 45 54 10 .> .> [4] 81 100 20 .> .> [5] 105 200 96 .> .> .> .> .> .> ... which answers the question but the data is improperly shaped. .> .> .> .> We can also think of this as where gr does not overlap itself, and the result is properly shaped, viz: .> .>> grd<-disjoin(gr) .> .>> grd[1==countOverlaps(grd,gr)] .> .> GRanges with 5 ranges and 0 metadata columns: .> .> seqnames ranges strand .> .> <rle> <iranges> <rle> .> .> [1] chr1 [ 4, 10] * .> .> [2] chr1 [ 12, 19] * .> .> [3] chr1 [ 45, 54] * .> .> [4] chr1 [ 81, 100] * .> .> [5] chr1 [105, 200] * .> .> --- .> .> seqlengths: .> .> chr1 .> .> NA .> .> .> .> So, which way is 'best' for performance? .> . .> .Note that the 2 solutions are not equivalent if the ranges are .> .stranded: the former will ignore the strand whereas the latter .> .will consider it. Even if the ranges are not stranded (like in .> .your example), you still need to reduce() the result of the latter .> .in order to get exactly the same set of ranges as with the former. .> . .> .2 alternatives to solution 1 with result properly shaped: .> . .> .** Sol 1b: .> . .> . irl <- IRangesList(lapply(coverage(gr) == 1L, as, "IRanges")) .> . gr1b <- as(irl, "GRanges") .> . .> .** Sol 1c (using slice): .> . .> . gr1c <- as(slice(coverage(gr), lower=1L, upper=1L), "GRanges") .> . .> .Sol 1b seems to be the fastest. .> . .> .> Herve, thanks for these observations re: stranded and the need to reduce the results of my method 1. .> In the most general case I agree, you are right, and the strand preserving method should be: .> .> grd<-disjoin(gr) .> reduce(grd[1==countOverlaps(grd,gr)]) .> .> Trying to appreciate and simplify your soln1, I rewrote it as a one-liner, thusly: .> .> as(as(coverage(gr) == 1L,'IRangesList'),'GRanges') .> .> which prompts me to abstract as follows: .> .>> setAs('RleList','GRanges', .> function(from) { .> as(as(from,'IRangesList'),'GRanges') .> }) . .There is already a coercion method for going from RleList to GRanges .and it does something different: . . > rlelist . SimpleRleList of length 3 . $chr1 . logical-Rle of length 1000 with 2 runs . Lengths: 4 996 . Values : TRUE FALSE . . $chr2 . logical-Rle of length 2000 with 3 runs . Lengths: 1 1 1998 . Values : FALSE TRUE FALSE . . $chr3 . logical-Rle of length 1500 with 3 runs . Lengths: 6 1 1493 . Values : FALSE TRUE FALSE . . > as(rlelist, "GRanges") . GRanges with 8 ranges and 1 metadata column: . seqnames ranges strand | score . <rle> <iranges> <rle> | <logical> . [1] chr1 [1, 4] * | TRUE . [2] chr1 [5, 1000] * | FALSE . [3] chr2 [1, 1] * | FALSE . [4] chr2 [2, 2] * | TRUE . [5] chr2 [3, 2000] * | FALSE . [6] chr3 [1, 6] * | FALSE . [7] chr3 [7, 7] * | TRUE . [8] chr3 [8, 1500] * | FALSE . --- . seqlengths: . chr1 chr2 chr3 . 1000 2000 1500 . .Yours is probably more useful on *logical*-RleList's but the current one .is more general because it works on any RleList. . .One of the challenges of implementing a good coercion system between .the many classes we have is to pick up the "most natural" semantic .for a given coercion. However, what seems to be the "most natural" .is necessarily biased by the context and by people's use-cases and .personal backgrounds. This is a situation where use-case driven .semantic is not necessarily helping. What is more helpful for .choosing a semantic is to look at properties like: . . (a) Consistency with coercion between ordinary vectors. . . (b) Are two-step coercions equivalent to the corresponding 1-step . coercion? That is, . . is 'as(as(x, "B"), "C")' identical to 'as(x, "C")' . . for any intermediate representation "B"? . . This might be hard to achieve though. For example, right . now, doing logical-RleList -> RangedData -> GRanges doesn't produce . the same as doing logical-RleList -> IRangesList -> GRanges. . . (c) No-loss/reversible coercion, that is, if 'class(x)' is "A", . and "B" is a "higher-level" class than "A", coercing 'x' to . "B" should try to preserve all the information in 'x', and . coercing back to "A" should give back 'x' (maybe modulo some . reordering). . .I don't think the coercion system we have in the IRanges/GenomicRanges .infrastructure with its 150+ coercion methods currently satisfies (a), .(b), (c), but I'm sure we could improve it. . I agree with your assessment but find it hard to evaluate options without listing out all the coercions and their effect. .I'll modify the current coercion from RleList to GRanges to treat .logical-RleList in the way you propose if nobody objects. I'd give it a little while to stew. Hard to know whom you'd effect by such a change. In any event, I'd suggest you RFC (request for comments) on the proposed change to the list under separate cover/subject. I expect any lurkers who might have an opinion have already given up on this thread. One benefit of the current approach is it preserves seqlengths, viz: seqlengths(gr)<-205 > gr GRanges with 5 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] chr1 [ 4, 10] * [2] chr1 [ 12, 19] * [3] chr1 [ 45, 80] * [4] chr1 [ 55, 100] * [5] chr1 [105, 200] * --- seqlengths: chr1 205 > seqlengths(as(coverage(gr)==1,'GRanges')) chr1 205 I'm not sure how to use (a)(b)(c) to argue one way of the other..... On a somewhat related note, I've seen some discussion that argues for syntaxs allowing subsetting a GRange/Vector based on values of its metadata attributes (i.e. 'subset' parameter gets evaled in the environment of the DataFrame). Such would allow writing, for instance: subset(as(coverage(gr)==1,'GRanges'),score) or subset(as(coverage(gr),'GRanges'),score==1) both which 'scan' pretty well to my coding eye. How about yours? . .> allowing to express the problem most perspicuously like this: .> .>> as(coverage(gr) == 1L,'GRanges') .> GRanges with 5 ranges and 0 metadata columns: .> seqnames ranges strand .> <rle> <iranges> <rle> .> [1] chr1 [ 4, 10] * .> [2] chr1 [ 12, 19] * .> [3] chr1 [ 45, 54] * .> [4] chr1 [ 81, 100] * .> [5] chr1 [105, 200] * .> --- .> seqlengths: .> chr1 .> NA .>> .> .> Is this in the spirit? Is my SetAs making assumptions it should not? .> .> Opinions anyone? .> .> .> Oh! HEY! During this exploration, I discover to my great astonishment that `==` is not symetric at least as regards to name .preservation !!!!! . .Unfortunately there is no way it can be symmetric with regards to .name preservation. Of course, because names as an attribute are not even compared when establishing `==`ity. . .Now what we should try to have is more consistency with name .preservation on ordinary vectors where the rules *seem* to be .(I couldn't find them documented in the man page for ==): . . (1) The longest operand wins. . . (2) In case of tie (i.e. same length), the operand with names wins. . . (3) In case of tie (i.e. both have names), the left operand wins. A sensible summary of what I agree would provide a more 'natural' semantics. Is this in fact documented somewhere? . .Note that there is no reason those rules would be only for ==, they .could apply to any vectorized binary operator that does recycling of .the shortest operand (I only checked they apply to all the basic .arithmetic and comparison operators in base R). . .> .>> names(1==coverage(gr)) .> NULL .>> names(coverage(gr)==1) .> [1] "chr1" .> .> Is there an explanation for this antisymetry? Is this to be expected? Should it be remedied? Is this another thread? . .Clearly we are not doing (1), (2), (3) but we probably should. I'll put .this on my list. . GREAT! Fewer fellow travelers will stumble in the future.... To my sensibility, this is backward compatible change not benefitting from RFC. Just my two bits (01), ~Malcolm .> I am SURE that I am not the only one to have gotten bitten by this in the past. In fact, I think that I have run up in the past and have .not realized that switch the order of the operands 'fixes' the problem. .> .> As a result of this issue, for instance, my elegant (?) abstraction won't work when the comparison is reversed and will generate .CRYPTIC error messages: .> .> as(1L==coverage(gr),'GRanges') .> Error in function (classes, fdef, mtable) (from #3) : .> unable to find an inherited method for function "Rle", for signature "NULL", "missing" .> .> Prompting me to change my setAs thusly: .> .> setAs('RleList','GRanges', .> function(from) { .> if (is.null(names(from))) stop( .> '\nHey, the parameter `from` is un-named, thus the segnames of any hoped-for `GRanges` are un-assignable!\n' .> ,'Possible workaround: reverse the order of a comparison which produced `from` (if any)!\n' .> ,'Example: re-write "1L==coverage(gr)" as "coverage(gr)==1\n' .> ,'Best: fix the up-stream bug that `==` is not symetric w.r.t. names' .> ) .> as(as(from,'IRangesList'),'GRanges') .> }) .> .> Or, or, or? . .If the RleList is not named what we usually do is create automatic .chromosome names 1, 2, 3, etc...: . . > as(unname(rlelist), "GRanges") . GRanges with 8 ranges and 1 metadata column: . seqnames ranges strand | score . <rle> <iranges> <rle> | <logical> . [1] 1 [1, 4] * | TRUE . [2] 1 [5, 1000] * | FALSE . [3] 2 [1, 1] * | FALSE . [4] 2 [2, 2] * | TRUE . [5] 2 [3, 2000] * | FALSE . [6] 3 [1, 6] * | FALSE . [7] 3 [7, 7] * | TRUE . [8] 3 [8, 1500] * | FALSE . --- . seqlengths: . 1 2 3 . 1000 2000 1500 . .Thanks for all the valuable feedback! . .H. . .> .> ~Malcolm .> .> .> .> .H. .> . .> .> .> .> Best, .> .> .> .> ~Malcolm .> .> .> .> .> .>> -----Original Message----- .> .>> From: bioconductor-bounces at r-project.org [mailto :bioconductor-bounces at r-project.org] On Behalf Of Hervé Pagès .> .>> Sent: Thursday, December 13, 2012 12:03 PM .> .>> To: Hermann Norpois .> .>> Cc: Michael Lawrence; bioconductor at r-project.org .> .>> Subject: Re: [BioC] Exclude overlapping intervals .> .>> .> .>> Hi Hermann, .> .>> .> .>> On 12/13/2012 09:33 AM, Hermann Norpois wrote: .> .>>> As both methods did not work ... .> .>>> .> .>>> 1) > gr[!gr %in% excluded] gr[!gr %in% excluded] .> .>>> Fehler in gr[!gr %in% excluded] : .> .>>> Fehler bei der Auswertung des Argumentes 'i' bei der Methodenauswahl .> .>>> f?r Funktion '[': Fehler in gr %in% excluded : .> .>>> Fehler bei der Auswertung des Argumentes 'table' bei der Methodenauswahl .> .>>> f?r Funktion '%in%': Fehler: Objekt 'excluded' nicht gefunden .> .>> .> .>> My limited knowledge of Goethe's tongue tells me that the 'excluded' .> .>> object was not found. Of course you need to provide a GRanges object, .> .>> call it 'excluded' or 'gr2', that contains the ranges that you want to .> .>> remove from 'gr' (based on overlaps). .> .>> .> .>> Using 'gr' itself in place of 'gr2' to remove ranges in 'gr' that .> .>> overlap with other ranges in 'gr' would not produce what you want .> .>> though, because that would exclude everything: .> .>> .> .>> > gr[!gr %in% gr] .> .>> GRanges with 0 ranges and 2 metadata columns: .> .>> seqnames ranges strand | score GC .> .>> <rle> <iranges> <rle> | <integer> <numeric> .> .>> --- .> .>> seqlengths: .> .>> chr1 chr2 chr3 .> .>> 1000 2000 1500 .> .>> .> .>> Try this instead: .> .>> .> .>> gr[countOverlaps(gr, gr) <= 1L] .> .>> .> .>> Cheers, .> .>> H. .> .>> .> .>>> .> .>>>> setdiff (gr) .> .>>> Fehler in as.vector(x) : .> .>>> Keine Methode um diese S4 Klasse in einen Vektor zu verwandeln .> .>>> .> .>>> ... I am posting my data: .> .>>> .> .>>>> gr .> .>>> GRanges with 5 ranges and 0 metadata columns: .> .>>> seqnames ranges strand .> .>>> <rle> <iranges> <rle> .> .>>> [1] chr1 [ 4, 10] * .> .>>> [2] chr1 [ 12, 19] * .> .>>> [3] chr1 [ 45, 80] * .> .>>> [4] chr1 [ 55, 100] * .> .>>> [5] chr1 [105, 200] * .> .>>> --- .> .>>> seqlengths: .> .>>> chr1 .> .>>> NA .> .>>>> dput (gr) .> .>>> new("GRanges" .> .>>> , seqnames = new("Rle" .> .>>> , values = structure(1L, .Label = "chr1", class = "factor") .> .>>> , lengths = 5L .> .>>> , elementMetadata = NULL .> .>>> , metadata = list() .> .>>> ) .> .>>> , ranges = new("IRanges" .> .>>> , start = c(4L, 12L, 45L, 55L, 105L) .> .>>> , width = c(7L, 8L, 36L, 46L, 96L) .> .>>> , NAMES = NULL .> .>>> , elementType = "integer" .> .>>> , elementMetadata = NULL .> .>>> , metadata = list() .> .>>> ) .> .>>> , strand = new("Rle" .> .>>> , values = structure(3L, .Label = c("+", "-", "*"), class = "factor") .> .>>> , lengths = 5L .> .>>> , elementMetadata = NULL .> .>>> , metadata = list() .> .>>> ) .> .>>> , elementMetadata = new("DataFrame" .> .>>> , rownames = NULL .> .>>> , nrows = 5L .> .>>> , listData = structure(list(), .Names = character(0)) .> .>>> , elementType = "ANY" .> .>>> , elementMetadata = NULL .> .>>> , metadata = list() .> .>>> ) .> .>>> , seqinfo = new("Seqinfo" .> .>>> , seqnames = "chr1" .> .>>> , seqlengths = NA_integer_ .> .>>> , is_circular = NA .> .>>> , genome = NA_character_ .> .>>> ) .> .>>> , metadata = list() .> .>>> ) .> .>>> .> .>>> Thanks .> .>>> Hermann .> .>>> .> .>>> 2012/12/13 Michael Lawrence <lawrence.michael at="" gene.com=""> .> .>>> .> .>>>> Something like: .> .>>>> .> .>>>> gr[!gr %in% excluded] .> .>>>> .> .>>>> Another option is setdiff() but note that will generate a new set of .> .>>>> sorted, non-overlapping, non-adjacent ("normal") ranges. .> .>>>> .> .>>>> Michael .> .>>>> .> .>>>> .> .>>>> On Thu, Dec 13, 2012 at 9:01 AM, Hermann Norpois <hnorpois at="" googlemail.com="">wrote: .> .>>>> .> .>>>>> Hello, .> .>>>>> .> .>>>>> having a GRanges object I was looking for a function to exclude all .> .>>>>> overlapping intervals. So I get exclusively intervals that do not overlap. .> .>>>>> But I did not find a proper function. Has anybody an idea? .> .>>>>> Thanks .> .>>>>> Hermann .> .>>>>> .> .>>>>> [[alternative HTML version deleted]] .> .>>>>> .> .>>>>> _______________________________________________ .> .>>>>> Bioconductor mailing list .> .>>>>> Bioconductor at r-project.org .> .>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor .> .>>>>> Search the archives: .> .>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor .> .>>>>> .> .>>>> .> .>>>> .> .>>> .> .>>> [[alternative HTML version deleted]] .> .>>> .> .>>> .> .>>> .> .>>> _______________________________________________ .> .>>> Bioconductor mailing list .> .>>> Bioconductor at r-project.org .> .>>> https://stat.ethz.ch/mailman/listinfo/bioconductor .> .>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor .> .>>> .> .>> .> .>> -- .> .>> Hervé Pagès .> .>> .> .>> Program in Computational Biology .> .>> Division of Public Health Sciences .> .>> Fred Hutchinson Cancer Research Center .> .>> 1100 Fairview Ave. N, M1-B514 .> .>> P.O. Box 19024 .> .>> Seattle, WA 98109-1024 .> .>> .> .>> E-mail: hpages at fhcrc.org .> .>> Phone: (206) 667-5791 .> .>> Fax: (206) 667-1319 .> .>> .> .>> _______________________________________________ .> .>> Bioconductor mailing list .> .>> Bioconductor at r-project.org .> .>> https://stat.ethz.ch/mailman/listinfo/bioconductor .> .>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor .> . .> .-- .> .Hervé Pagès .> . .> .Program in Computational Biology .> .Division of Public Health Sciences .> .Fred Hutchinson Cancer Research Center .> .1100 Fairview Ave. N, M1-B514 .> .P.O. Box 19024 .> .Seattle, WA 98109-1024 .> . .> .E-mail: hpages at fhcrc.org .> .Phone: (206) 667-5791 .> .Fax: (206) 667-1319 .> . .-- .Hervé Pagès . .Program in Computational Biology .Division of Public Health Sciences .Fred Hutchinson Cancer Research Center .1100 Fairview Ave. N, M1-B514 .P.O. Box 19024 .Seattle, WA 98109-1024 . .E-mail: hpages at fhcrc.org .Phone: (206) 667-5791 .Fax: (206) 667-1319

ADD REPLY • link 11.4 years ago Malcolm Cook ★ 1.6k

0

Entering edit mode

This fits my purpose best. > gr[countOverlaps(gr, gr) <= 1L] GRanges with 3 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] chr1 [ 4, 10] * [2] chr1 [ 12, 19] * [3] chr1 [105, 200] * --- seqlengths: chr1 NA Could you recommend some further help for countOverlaps. Just because I did not understand the syntax. Thanks Hermann [[alternative HTML version deleted]]

ADD REPLY • link 11.4 years ago Hermann Norpois ▴ 110

Login before adding your answer.