Question

IRanges::findOverlapPairs cannot take inputs if they are too large

0

Entering edit mode

Yixin • 0

@33a90a56

Last seen 12 months ago

United States

Hi,

I have a set of bins in a GrangesList, and a set of regions I would like to mask. My goal is to get a list of vectors, where each vector indicates the fractions of the bins not overlapped with the masked regions, so I can use these vectors for further scaling other things.

I've implemented a small function to achieve my goal. Although it did work on small testing data, it won't work in real and much larger data. Below is the testing data and the function I wrote to illustrate my goal.

The set of bins in a GrangesList, bin size is set as 200bp here.

grp_bin1 <-
    GenomicRanges::GRanges(
        seqnames = S4Vectors::Rle(c("1"), c(4)),
        ranges = IRanges::IRanges(start = seq(from = 1, by = 200, length.out = 4),
                                  end = seq(from = 200, by = 200, length.out = 4)),
        strand = S4Vectors::Rle(BiocGenerics::strand(c("+")), c(4)))

grp_bin2 <-
    GenomicRanges::GRanges(
        seqnames = S4Vectors::Rle(c("2"), c(4)),
        ranges = IRanges::IRanges(start = seq(from = 1, by = 200, length.out = 4),
                                  end = seq(from = 200, by = 200, length.out = 4)),
        strand = S4Vectors::Rle(BiocGenerics::strand(c("-")), c(4)))

grp_bins <-
    GenomicRanges::GRangesList("bin1" = grp_bin1, "bin2" = grp_bin2)

> grp_bins
GRangesList object of length 2:
$bin1
GRanges object with 4 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1     1-200      +
  [2]        1   201-400      +
  [3]        1   401-600      +
  [4]        1   601-800      +
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

$bin2
GRanges object with 4 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        2     1-200      -
  [2]        2   201-400      -
  [3]        2   401-600      -
  [4]        2   601-800      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

The set of regions I would like to mask

add_mask1 <-
    GenomicRanges::GRanges(
        seqnames = S4Vectors::Rle(c("1", "2"), c(2, 1)),
        ranges = IRanges::IRanges(start = c(51, 251, 151),
                                  end = c(200, 400, 250)),
        strand = S4Vectors::Rle(BiocGenerics::strand(c("*")), c(3)))

> add_mask1
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1    51-200      *
  [2]        1   251-400      *
  [3]        2   151-250      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Then function I wrote and the expected output

scale_additional_masks <- function(bins, add_mask, bin_size) {
    # Find overlap
    ovr <- IRanges::findOverlapPairs(add_mask, bins)
    # Compute the percentages of the bins overlap with the masks
    ovr_intersect <-
        IRanges::pintersect(ovr, drop.nohit.ranges = FALSE)
    ovr_val <- GenomicRanges::width(ovr_intersect) / bin_size
    # Convert the overlap fractions into vectors
    ovr_val_ls <- base::split(ovr_val, names(ovr_val))
    add_scale <- lapply(ovr_val_ls, function(x) {
        1 - colSums(as.matrix(x))
    })
    return(add_scale)
}

> scale_additional_masks(bins = grp_bins, add_mask = add_mask1, bin_size = 200)
$bin1
[1] 0.25 0.25 1.00 1.00

$bin2
[1] 0.75 0.75 1.00 1.00

Everthing works fine until I move on to my real data, which has ~18K groups of bins, and ~7M non-overlapped regions need to be masked.

Browse[2]> bins
GRangesList object of length 18101:
$`10000_-`
GRanges object with 250 ranges and 0 metadata columns:
        seqnames        ranges strand
           <Rle>     <IRanges>  <Rle>
    [1]       11 644233-644482      -
    [2]       11 644483-644732      -
    [3]       11 644733-644982      -
    [4]       11 644983-645232      -
    [5]       11 645233-645482      -
    ...      ...           ...    ...
  [246]       11 705483-705732      -
  [247]       11 705733-705982      -
  [248]       11 705983-706232      -
  [249]       11 706233-706482      -
  [250]       11 706483-706732      -
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

...
<18100 more elements>

Browse[2]> add_mask
GRanges object with 7238684 ranges and 0 metadata columns:
            seqnames              ranges strand
               <Rle>           <IRanges>  <Rle>
        [1]        1            1-382433      *
        [2]        1       382512-382513      *
        [3]        1       382529-382530      *
        [4]        1       382631-611597      *
        [5]        1       611655-611656      *
        ...      ...                 ...    ...
  [7238680]        X 156010147-156010167      *
  [7238681]        X           156010554      *
  [7238682]        X 156011799-156011812      *
  [7238683]        X 156012646-156012649      *
  [7238684]        X 156013711-156040895      *
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

When I run and debug the function, it pops an error at this line.

ovr <- IRanges::findOverlapPairs(add_mask, bins)

 Error in METHOD(x, i) : 
  Subsetting operation on CompressedGRangesList object 'x' produces a result that is too big to be
  represented as a CompressedList object. Please try to coerce 'x' to a SimpleList object first (with 'as(x,
  "SimpleList")').

I cannot convert my mask regions to a SimpleList as suggested by the error

> as(add_mask1,
+   "SimpleList")
Error in getListElement(x, i, ...) : 
  GRanges objects don't support [[, as.list(), lapply(), or unlist() at the moment

I also tried to truncated the mask regions, and it seems working ok when the number of rows is around 3e6, but no more than 4e6. When it runs at this scale, it takes more than 100 GB RAM so I have to run it on a server rather than my local computer.

I'm wondering if there is any solutions could help me when data come to this scale? We have high memory node on server (1.5TB) so it might be ok to use high memory for a short time, but I guess it's also my implementation too naive so it make things so costly.

Any suggestions are highly appreciated!

Best,

Yixin

IRanges GenomicRanges • 1.4k views

ADD COMMENT • link 3.0 years ago Yixin • 0

score 1 · Answer 1 · 2021-04-29

1

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.4 years ago

United States

You're generating a lot (billions) of overlaps. A more practical approach is to compute the positions of the genome that is not covered by any masks, and then compute the fraction of each bin consisting of those positions.

Something like:

all_bins <- stack(grp_bins, "grp")
cov <- coverage(add_mask1)
views <- Views(cov == 0L, all_bins)
viewSums(views) / width(views)

But please be aware that it's critical to specify the seqinfo representing the full genome (or whatever your universe is) on the masks before calling coverage().

ADD COMMENT • link 3.0 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thanks a lot Michael! Your solution is fantastic and much much faster and memory efficient than mine! I play with the code you provide and I'm very close to get a solution for my real data. The only issue I have for now is:

I notice the vector output by your method is based on chromosomes, I was trying to split it into my original groups but haven't figured out a proper way to do it. What I mean is as followed, in addition to the two groups of bins I have at the beginning, now I have two more on chr1 to make it more like real data.

grp_bin3 <-
    GenomicRanges::GRanges(
        seqnames = S4Vectors::Rle(c("1"), c(4)),
        ranges = IRanges::IRanges(start = seq(from = 1, by = 200, length.out = 4),
                                  end = seq(from = 200, by = 200, length.out = 4)),
        strand = S4Vectors::Rle(BiocGenerics::strand(c("-")), c(4)))

grp_bin4 <-
    GenomicRanges::GRanges(
        seqnames = S4Vectors::Rle(c("1"), c(4)),
        ranges = IRanges::IRanges(start = seq(from = 301, by = 200, length.out = 4),
                                  end = seq(from = 500, by = 200, length.out = 4)),
        strand = S4Vectors::Rle(BiocGenerics::strand(c("+")), c(4)))

grp_bins <-
    GenomicRanges::GRangesList("bin1" = grp_bin1, "bin2" = grp_bin2,
                               "bin3" = grp_bin3, "bin4" = grp_bin4)

My naive function will produce list of vector like,

> scale_additional_masks(bins = grp_bins, add_mask = add_mask1, bin_size = 200)
$bin1
[1] 0.25 0.25 1.00 1.00

$bin2
[1] 0.75 0.75 1.00 1.00

$bin3
[1] 0.25 0.25 1.00 1.00

$bin4
[1] 0.5 1.0 1.0 1.0

If I understand correctly, to ensure your method works, I need to make sure the seqlengths are the same for my bins and masks, and cover my Granges entirely (in reality I should download the genome version matched seqinfo and apply it to both the bins and masks)? Here I just set them manually.

seqlengths(grp_bins) <- c(1200, 1200)
seqinfo(add_mask1) <- seqinfo(grp_bins)

And then,

scale_additional_masks_2 <- function(bins, add_mask) {
    all_bins <- stack(bins, "grp")
    cov <- coverage(add_mask)
    views <- Views(cov == 0L, all_bins)
    add_scale <- viewSums(views) / width(views)
    # this doesn't work since add_scale is based on chromosomes
    # names(add_scale) <- names(bins)
    return(as(add_scale, "list"))
}

> scale_additional_masks_2(bins = grp_bins, add_mask = add_mask1)
$`1`
 [1] 0.25 0.25 1.00 1.00 0.25 0.25 1.00 1.00 0.50 1.00 1.00 1.00

$`2`
[1] 0.75 0.75 1.00 1.00

Notice that the bin1, bin3 and bin4 are now all in 1. Do you have any suggestions about how I should do the split so the vectots can be converted back to the form as my naive function?

Thanks again for your prompt reply!

ADD REPLY • link 3.0 years ago Yixin • 0

0

Entering edit mode

It would generally be easiest to flatten the list of window groups and instead use a factor to represent the grouping, as done in the initial answer via stack(). If you ensure that the data are sorted in chromosome order via sort(), you can simply unlist() and relist() the result via the grouping factor, like:

all_bins <- sort(stack(grp_bins, "grp"))
cov <- coverage(add_mask1)
views <- Views(cov == 0L, all_bins)
ans <- viewSums(views) / width(views)
split(unlist(ans, use.names=FALSE), all_bins$grp)

ADD REPLY • link 3.0 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Your solution is amazing! unlist() then split() really do the trick. I didn't know split() can relist the results perfectly via the grouping factor, and get stuck in trying to use relist().

Highly appreciate your help and indeed learn a lot about how to deal with GRanges objects!

ADD REPLY • link 3.0 years ago Yixin • 0