suppressing reduce() function when applying set operations to GRanges

1

Entering edit mode

Bill Gibb ▴ 20

@bill-gibb-5755

Last seen 9.7 years ago

Hello, I noticed that when applying set operations to GRanges objects, the returned range is reduced (by default), e.g. r1 <- GRanges(seqnames=c(1,1,1), ranges=IRanges(start=c(1,3,5), end=c(1,3,5)), strand='*') length(r1) r2 <- GRanges(seqnames=c(1,1,1), ranges=IRanges(start=c(1,3,4), end=c(1,3,4)), strand='*') length(r2) r3 <- union(r1,r2) length(r3) sum(width(ranges(r3))) Both r1 and r2 have length 3, whereas r3 has length 2, due to the implicit reduce applied to the result of union(). I can see that reducing the ranges would normally be desired when applying set operations, however there are occasions when one might want to keep a list of singleton ranges (e.g. when sub-sampling at the genomic coordinate level). Is there a way to suppress the reduce() operation when applying set operations to GRanges? Something like union(r1, r2, reduce.ranges=FALSE) would be nice. I also tried: resize(r3,width=1) however it appears to simply truncate multi-base sequences. Thank you. Bill Gibb Genomic Health, Inc. Redwood City, CA ______________________________________________________________________ The contents of this electronic message, including any attachments, are intended only for the use of the individual or entity to which they are addressed and may contain confidential information. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this message or any attachment is strictly prohibited. If you have received this transmission in error, please send an e-mail to postmaster at genomichealth.com and delete this message, along with any attachments, from your computer.

• 1.4k views

ADD COMMENT • link updated 11.2 years ago by Hervé Pagès 16k • written 11.2 years ago by Bill Gibb ▴ 20

1

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 1 day ago

Seattle, WA, United States

Hi Bill, On 02/10/2013 03:00 PM, Bill Gibb wrote: > Hello, > > I noticed that when applying set operations to GRanges objects, the returned range is reduced (by default), e.g. > > r1 <- GRanges(seqnames=c(1,1,1), ranges=IRanges(start=c(1,3,5), end=c(1,3,5)), strand='*') > length(r1) > r2 <- GRanges(seqnames=c(1,1,1), ranges=IRanges(start=c(1,3,4), end=c(1,3,4)), strand='*') > length(r2) > r3 <- union(r1,r2) > length(r3) > sum(width(ranges(r3))) > > Both r1 and r2 have length 3, whereas r3 has length 2, due to the implicit reduce applied to the result of union(). I can see that reducing the ranges would normally be desired when applying set operations, however there are occasions when one might want to keep a list of singleton ranges (e.g. when sub-sampling at the genomic coordinate level). Is there a way to suppress the reduce() operation when applying set operations to GRanges? You could do: > unique(c(r1, r2)) GRanges with 4 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] 1 [1, 1] * [2] 1 [3, 3] * [3] 1 [5, 5] * [4] 1 [4, 4] * --- seqlengths: 1 NA which is another form of union() that is certainly more in the spirit of what base::union() does on ordinary vectors. However union(x, y) on GRanges objects goes one step further by reducing the above result. Yes we could add the 'reduce.ranges' arg to control this, considering that it's not the 1st time that users seem to be confused by this. Cheers, H. > Something like union(r1, r2, reduce.ranges=FALSE) would be nice. > > I also tried: > > resize(r3,width=1) > > however it appears to simply truncate multi-base sequences. > > Thank you. > > Bill Gibb > Genomic Health, Inc. > Redwood City, CA > > ______________________________________________________________________ > The contents of this electronic message, including any attachments, are intended only for the use of the individual or entity to which they are addressed and may contain confidential information. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this message or any attachment is strictly prohibited. If you have received this transmission in error, please send an e-mail to postmaster at genomichealth.com and delete this message, along with any attachments, from your computer. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 11.2 years ago Hervé Pagès 16k

0

Entering edit mode

Dear Herve,

I have a similar problem, but this time with setdiff: I have two GRanges objects, of which one is the genome in 100k bins (bin.grange):

> bin.grange
GRanges object with 30946 ranges and 0 metadata columns:
          seqnames               ranges strand
             <Rle>            <IRanges>  <Rle>
      [1]        1     [     0, 100000]      *
      [2]        1     [100000, 200000]      *
      [3]        1     [200000, 300000]      *
      [4]        1     [300000, 400000]      *
      [5]        1     [400000, 500000]      *
      ...      ...                  ...    ...
  [30942]        Y [58800000, 58900000]      *
  [30943]        Y [58900000, 59000000]      *
  [30944]        Y [59000000, 59100000]      *
  [30945]        Y [59100000, 59200000]      *
  [30946]        Y [59200000, 59300000]      *
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

The other GRange is a GRanges object (called peak.grange) that I would like to remove from bin.grange:

> peak.grange
GRanges object with 77526 ranges and 0 metadata columns:
          seqnames               ranges strand
             <Rle>            <IRanges>  <Rle>
      [1]        1     [721223, 721768]      *
      [2]        1     [808497, 809975]      *
      [3]        1     [879120, 879638]      *
      [4]        1     [888891, 889692]      *
      [5]        1     [891102, 891624]      *
      ...      ...                  ...    ...
  [77522]        Y [58979198, 58979974]      *
  [77523]        Y [58980148, 58981354]      *
  [77524]        Y [58981384, 58982127]      *
  [77525]        Y [58982821, 58983574]      *
  [77526]        Y [59001388, 59001995]      *
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths

Similar to the above example, a setdiff() between the two GRanges will automatically evoke the reduce method on the result:

> setdiff(bin.grange, peak.grange)
GRanges object with 77529 ranges and 0 metadata columns:
          seqnames               ranges strand
             <Rle>            <IRanges>  <Rle>
      [1]        1     [     1, 721222]      *
      [2]        1     [721769, 808496]      *
      [3]        1     [809976, 879119]      *
      [4]        1     [879639, 888890]      *
      [5]        1     [889693, 891101]      *
      ...      ...                  ...    ...
  [77525]        Y [58979975, 58980147]      *
  [77526]        Y [58981355, 58981383]      *
  [77527]        Y [58982128, 58982820]      *
  [77528]        Y [58983575, 59001387]      *
  [77529]        Y [59001996, 59300000]      *
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths

I would, however, like to maintain the 100k binned nature of the result. Is there a way to avoid evoking the reduce method?

Thank you very much,

Thomas Kuilman

--------------------------------------------------
Thomas Kuilman, PhD
Department of Molecular Oncology
Netherlands Cancer Institute
1066 CX Amsterdam
The Netherlands

ADD REPLY • link 9.5 years ago • updated 5.6 years ago t.kuilman ▴ 170

0

Entering edit mode

Just after writing this post I found out how to get around this problem and I thought I should share this piece of code (see below). The trick I used is to use split() to split odd and even indices over two separate GRanges objects in a GRangesList. The example code below does setdiff(bin.grange, peak.grange) without reducing the result:

  ## Split into GRangesList object based on index: odd go in one, even in another element
  bin.grange <- split(bin.grange, rep_len(c(1, 2), length.out = length(bin.grange)))
  ## Apply the setdiff on the independent GRanges objects in the GRangesList
  bin.grange <- lapply(bin.grange, function(x) {
    setdiff(x, peak.grange)
  })
  ## Merge into one GRanges object again
  bin.grange <- c(bin.grange[[1]], bin.grange[[2]])
  ## Order ranges
  bin.grange <- bin.grange[order(bin.grange)]

Regards,

Thomas

ADD REPLY • link 9.5 years ago • updated 5.6 years ago t.kuilman ▴ 170

Login before adding your answer.