Question: looping through a GRangesList object
4
gravatar for Simon Anders
3.8 years ago by
Simon Anders3.6k
Zentrum für Molekularbiologie, Universität Heidelberg
Simon Anders3.6k wrote:

Hi

Given a GRangesList object grl, running

lapply( grl, length )

is very slow. Why is that? What should I do instead?

 

Background and example: I want to get, for all human genes, the range that their transcripts cover. So I did

> library( GenomicRanges )
> library( TxDb.Hsapiens.UCSC.hg19.knownGene )
> transcripts <- transcriptsBy( TxDb.Hsapiens.UCSC.hg19.knownGene )
> genes <- reduce( transcripts )

I was happy to notice that reduce seems to do what I want:

> genes
GRangesList object of length 23459:
$1 
GRanges object with 1 range and 0 metadata columns:
      seqnames               ranges strand
         <Rle>            <IRanges>  <Rle>
  [1]    chr19 [58858172, 58874214]      -

$10 
GRanges object with 1 range and 0 metadata columns:
      seqnames               ranges strand
  [1]     chr8 [18248755, 18258723]      +

$100 
GRanges object with 1 range and 0 metadata columns:
      seqnames               ranges strand
  [1]    chr20 [43248163, 43280376]      -

...
<23456 more elements>
-------
seqinfo: 93 sequences (1 circular) from hg19 genome

However, I want a GRanges object, not a GRangesList with each list element containing only a single range. Hence, I used unlist:

> unlist( genes )
GRanges object with 25004 ranges and 0 metadata columns:
        seqnames                 ranges strand
           <Rle>              <IRanges>  <Rle>
      1    chr19 [ 58858172,  58874214]      -
     10     chr8 [ 18248755,  18258723]      +
    100    chr20 [ 43248163,  43280376]      -
   1000    chr18 [ 25530930,  25757445]      -
  10000     chr1 [243651535, 244006886]      -
    ...      ...                    ...    ...
   9991     chr9 [114979995, 115095944]      -
   9992    chr21 [ 35736323,  35743440]      +
   9993    chr22 [ 19023795,  19109967]      -
   9994     chr6 [ 90539619,  90584155]      +
   9997    chr22 [ 50961997,  50964905]      -
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

However, the  GenomicRange returned by unlist contains more rows than genes had list elements. Some of the list elements must have contained more than one range. I would like to do

> lengths <- lapply( genes, length )
> which( lengths > 1 )

However, the lapply taskes ages. Why?

 

Cheers

   Simon

 



> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=fi_FI.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[2] GenomicFeatures_1.22.8                 
[3] AnnotationDbi_1.32.3                   
[4] Biobase_2.30.0                         
[5] GenomicRanges_1.22.3                   
[6] GenomeInfoDb_1.6.1                     
[7] IRanges_2.4.6                          
[8] S4Vectors_0.8.7                        
[9] BiocGenerics_0.16.1                    

loaded via a namespace (and not attached):
 [1] XVector_0.10.0             zlibbioc_1.16.0           
 [3] GenomicAlignments_1.6.3    BiocParallel_1.4.3        
 [5] tools_3.2.2                SummarizedExperiment_1.0.2
 [7] DBI_0.3.1                  lambda.r_1.1.7            
 [9] futile.logger_1.4.1        rtracklayer_1.30.1        
[11] futile.options_1.0.0       bitops_1.0-6              
[13] RCurl_1.95-4.7             biomaRt_2.26.1            
[15] RSQLite_1.0.0              Biostrings_2.38.3         
[17] Rsamtools_1.22.0           XML_3.98-1.3              
genomicranges grangeslist • 2.1k views
ADD COMMENTlink modified 3.8 years ago by Julian Gehring1.3k • written 3.8 years ago by Simon Anders3.6k

BTW, I know by now why I had more than one range in some of the object returned by reduce: that was simply because the original object contained alignments to alternate MHC versions and the like. For example:

> reduce( transcripts )[ "3115" ]
GRangesList object of length 1:
$3115 
GRanges object with 8 ranges and 0 metadata columns:
            seqnames               ranges strand
               <Rle>            <IRanges>  <Rle>
  [1]           chr6 [33043703, 33057473]      +
  [2]  chr6_apd_hap1 [ 4330504,  4344264]      +
  [3]  chr6_cox_hap2 [ 4487883,  4501576]      +
  [4]  chr6_dbb_hap3 [ 4325042,  4338804]      +
  [5] chr6_mann_hap4 [ 4501243,  4554476]      +
  [6]  chr6_mcf_hap5 [ 4380558,  4394318]      +
  [7]  chr6_qbl_hap6 [ 4276154,  4289926]      +
  [8] chr6_ssto_hap7 [ 4524207,  4537977]      +

But I still would like to know why lapply is so slow, and how to better inspect the elements in a GRangesList object.

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by Simon Anders3.6k

You could try elementLengths(grl) for this case.

ADD REPLYlink written 3.8 years ago by Sean Davis21k
Answer: looping through a GRangesList object
2
gravatar for Julian Gehring
3.8 years ago by
Julian Gehring1.3k
Julian Gehring1.3k wrote:

Hi Simon,

A way faster way for

lengths <- lapply( genes, length ) ## runtime: ~115s

would be a dedicated function like elementLengths:

lengths <- elementLengths( genes ) ## runtime: practically none

This is of course not a general solution for *applying arbitrary functions efficiently over each element, but should at least remove the bottleneck.

ADD COMMENTlink written 3.8 years ago by Julian Gehring1.3k
3

Use lengths() instead of elementLengths(). lengths() is a base R function that works on list() as well as *List, and is generally useful. I'm sure others will chime in with details, but lapply() is slow because it creates a new GRanges for each list element. Generally the pattern for efficient operations on GRangesList, when no existing functionality exists, is gr = unlist(grl); answer = f(gr); relist(answer, grl).

ADD REPLYlink written 3.8 years ago by Martin Morgan ♦♦ 24k

Just throwing this out there as a challenge: it would be cool to experiment with the data.table strategy of memcpy()ing slots from one GRanges to one that is re-used for each iteration. This requires modifying the lengths of vectors to truncate them.   It could be made to work for specific high-level vector types, like Rle. If an unfamiliar column type is encountered, fall back to the existing loop. Should start with CompressedSplitDataFrameList. Even better would be changing R so that the DATAPTR is not a fixed offset from the SEXP header. Then, no memcpy() required.

ADD REPLYlink written 3.8 years ago by Michael Lawrence11k

Wow, after so many years of using R, I still hear about new functions in base. Had never heard of lengths.

The unlist/relist idiom is useful, thanks. This should be mention in the GenomicRanges introduction vignette (in Section 3.5, "Looping over GRangesList objects").

ADD REPLYlink written 3.8 years ago by Simon Anders3.6k

That function was added in only the last year or so.

ADD REPLYlink written 3.8 years ago by Michael Lawrence11k

Hi Simon,

Yes, this should definitely be mentioned there. I'll add it.

lapply( grl, length ) is slow because it does [[ in the loop and [[ is slow on GRangesList objects and on CompressedList derivatives in general. For various reasons. One of them is the complexity of the implementation of the [[ method for CompressedList objects. Another one is method dispatch (there are many levels of dispatch involved when doing [[ on a CompressedList object). There is probably room for some optimization but I don't expect the gain to be dramatic. Maybe by a factor 2 or 3... at best. So that wouldn't really solve the problem. I've learned to live with the idea that lapply() on CompressedList derivatives is almost never an option :-/

Cheers,

H.

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by Hervé Pagès ♦♦ 14k

I need to correct myself about lengths(), which returns the equivalent of sapply(x, length), versus elementLengths(), which returns (from it's help page) the equivalent of sapply(x, NROW). So for instance l = list(data.frame(x=1, y=1)); lengths(l) returns 2 (since length(data.frame(x=1, y=1)) == 2) whereas elementLengths(l) returns 1 (the data.frame has a single row). This does not make a difference for Simon's case, but does in general.

ADD REPLYlink written 3.8 years ago by Martin Morgan ♦♦ 24k

Thanks!    

ADD REPLYlink written 3.8 years ago by Simon Anders3.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 331 users visited in the last hour