Question

Creating all possible k combinations of n objects with a constraint for large n

2

Entering edit mode

mrk.carty ▴ 30

@mrkcarty-7442

Last seen 8.5 years ago

United States

Hi all,

If I have an ordered vector of genomic positions on chr1

> head(positions)
[1] 11161 12412 12462 12687 12830 13316
> length(positions)
[1] 576356

The distances between consecutive positions is represented by

> d <- positions[-1] - positions[-length(positions)]
> head(d)
[1] 1251   50  225  143  486  105

I want to enumerate all possible pairs of genomic locations such that the linear distance between the two points is <= D. I want an alternative to combn function from base package. What is fastest and least memory intensity way to do this in R to do this?

sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 data.table_1.9.4                 
 [3] BSgenome_1.34.1                   rtracklayer_1.26.2               
 [5] GenomicAlignments_1.2.2           Rsamtools_1.18.3                 
 [7] Biostrings_2.34.1                 XVector_0.6.0                    
 [9] GenomicRanges_1.18.4              GenomeInfoDb_1.2.4               
[11] IRanges_2.0.1                     S4Vectors_0.4.0                  
[13] Biobase_2.26.0                    BiocGenerics_0.12.1              
[15] Matrix_1.1-5                     

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2    BatchJobs_1.5      BBmisc_1.9         BiocParallel_1.0.3
 [5] bitops_1.0-6       brew_1.0-6         checkmate_1.5.1    chron_2.3-45      
 [9] codetools_0.2-10   DBI_0.3.1          digest_0.6.8       fail_1.2          
[13] foreach_1.4.2      grid_3.1.2         iterators_1.0.7    lattice_0.20-30   
[17] plyr_1.8.1         Rcpp_0.11.4        RCurl_1.95-4.5     reshape2_1.4.1    
[21] RSQLite_1.0.0      sendmailR_1.2-1    stringr_0.6.2      tools_3.1.2       
[25] XML_3.98-1.1       zlibbioc_1.12.0   
>

base • 903 views

ADD COMMENT • link 9.1 years ago mrk.carty ▴ 30

score 2 · Answer 1 · 2015-03-08

Use the GenomicRanges functionality. Create a toy example:

> library(GenomicRanges)
> locs = GRanges(seqnames=rep('chr1',50000),ranges=IRanges(start=sample(1:2.5e8,50000),width=1))
> locs
GRanges object with 50000 ranges and 0 metadata columns:
          seqnames                 ranges strand
             <Rle>              <IRanges>  <Rle>
      [1]     chr1 [233308084, 233308084]      *
      [2]     chr1 [195830003, 195830003]      *
      [3]     chr1 [ 70145590,  70145590]      *
      [4]     chr1 [130363241, 130363241]      *
      [5]     chr1 [173211123, 173211123]      *
      ...      ...                    ...    ...
  [49996]     chr1 [ 78981164,  78981164]      *
  [49997]     chr1 [175125447, 175125447]      *
  [49998]     chr1 [210573156, 210573156]      *
  [49999]     chr1 [109916498, 109916498]      *
  [50000]     chr1 [111139737, 111139737]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Now, choose your distance, D and find all locations that overlap a region defined by D upstream and downstream.

> D=500
> ovl = findOverlaps(locs,resize(locs,width = D*2,fix = 'center'))
> ovl
Hits object with 60259 hits and 0 metadata columns:
          queryHits subjectHits
          <integer>   <integer>
      [1]         1           1
      [2]         1       36954
      [3]         2           2
      [4]         3           3
      [5]         4       24922
      ...       ...         ...
  [60255]     49996       49996
  [60256]     49997       49997
  [60257]     49998       49998
  [60258]     49999       49999
  [60259]     50000       50000
  -------
  queryLength: 50000
  subjectLength: 50000

This contains all the pairs of locations that are within D of each other. You may want to remove all the "identity" hits where the query and subject are the same. With 50k regions, this runs in less than 2 seconds on my laptop (including generating the toy example).

score 0 · Answer 2 · 2015-03-09

0

Entering edit mode

mrk.carty ▴ 30

@mrkcarty-7442

Last seen 8.5 years ago

United States

Hi Sean,

Thanks a lot! This solution has speed up pipeline.

Best,

Mark

ADD COMMENT • link 9.1 years ago mrk.carty ▴ 30