Easy way to sort GRangesList by seqname/start position?
1
3
Entering edit mode
Johannes Rainer ★ 2.1k
@johannes-rainer-6987
Last seen 8 weeks ago
Italy

dear all!

I was wondering whether there is an easy way to sort a GRangesList by seqname and chomosomal start position. My GRangesList looks like:

GRangesList object of length 196786:
$ENST00000000233
GRanges object with 6 ranges and 2 metadata columns:
      seqnames                 ranges strand |         exon_id exon_rank
         <Rle>              <IRanges>  <Rle> |     <character> <integer>
  [1]        7 [127228399, 127228619]      + | ENSE00001872691         1
  [2]        7 [127229137, 127229217]      + | ENSE00003494180         2
  [3]        7 [127229539, 127229648]      + | ENSE00003504066         3
  [4]        7 [127230120, 127230191]      + | ENSE00003678978         4
  [5]        7 [127231017, 127231142]      + | ENSE00003676786         5
  [6]        7 [127231267, 127231759]      + | ENSE00000882271         6

$ENST00000000412
GRanges object with 7 ranges and 2 metadata columns:
      seqnames             ranges strand |         exon_id exon_rank
  [1]       12 [9102084, 9102551]      - | ENSE00002286327         1
  [2]       12 [9098825, 9099001]      - | ENSE00003523177         2
  [3]       12 [9098014, 9098180]      - | ENSE00003631241         3
  [4]       12 [9096397, 9096506]      - | ENSE00003492441         4
  [5]       12 [9096001, 9096131]      - | ENSE00000717490         5
  [6]       12 [9095012, 9095138]      - | ENSE00003610229         6
  [7]       12 [9092961, 9094536]      - | ENSE00002254457         7

$ENST00000000442
GRanges object with 7 ranges and 2 metadata columns:
      seqnames               ranges strand |         exon_id exon_rank
  [1]       11 [64073050, 64073208]      + | ENSE00001884684         1
  [2]       11 [64074640, 64074976]      + | ENSE00001195360         2
  [3]       11 [64081423, 64081539]      + | ENSE00003589560         3
  [4]       11 [64081711, 64081839]      + | ENSE00003471121         4
  [5]       11 [64082213, 64082383]      + | ENSE00001195335         5
  [6]       11 [64082473, 64082742]      + | ENSE00000727249         6
  [7]       11 [64083179, 64084210]      + | ENSE00001271942         7

...
<196783 more elements>
-------
seqinfo: 24 sequences from GRCh37 genome


, so it's sorted by transcript id, but I would like to re-order the list by seqname and start positions. Is there an easy and fast way to do that (didn't find it in the help pages...)?

thanks in advance for any help or hints!

 

cheers, jo

grangeslist sort • 2.7k views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 3 days ago
Seattle, WA, United States

Hi Johannes,

There is no easy way because generally speaking a GRangesList is for storing groups of genomic ranges and there is no guarantee that all the ranges in a given group are on the same seqname. Even if it is the case that they are on the same seqname (which is a reasonable assumption if the GRangesList contains exon or cds ranges grouped by transcript), what does it mean exactly to sort by start position given that there is more than 1 start position per GRangesList element? Like here:

> grl <- split(GRanges("chr1", IRanges(c(8, 3:1), 9)), c("b", "a", "a", "b"))
> grl
GRangesList object of length 2:
$a 
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    [3, 9]      *
  [2]     chr1    [2, 9]      *

$b 
GRanges object with 2 ranges and 0 metadata columns:
      seqnames ranges strand
  [1]     chr1 [8, 9]      *
  [2]     chr1 [1, 9]      *

-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

Do you want the sorting to put b before a because the smallest start in b (1) is smaller than the smallest start in a (2)? Or do you want the sorting to put b after a because its first start (8) is greater than the first start in a (3)?

The following code assumes you want the former:

library(GenomicRanges)
sortGRangesListBySeqnameAndSmallestStart <- function(grl)
{
    stopifnot(is(grl, "GRangesList"))
    list_elt_seqnames <- as.character(runValue(seqnames(grl)))
    list_elt_seqnames <- factor(list_elt_seqnames, levels=seqlevels(grl))
    list_elt_smallest_start <- min(start(grl))
    oo <- order(as.integer(list_elt_seqnames), list_elt_smallest_start)
    grl[oo]
}

Then:

> sortGRangesListBySeqnameAndSmallestStart(grl)
GRangesList object of length 2:
$b 
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    [8, 9]      *
  [2]     chr1    [1, 9]      *

$a 
GRanges object with 2 ranges and 0 metadata columns:
      seqnames ranges strand
  [1]     chr1 [3, 9]      *
  [2]     chr1 [2, 9]      *

-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

Note that:

  1. The sortGRangesListBySeqnameAndSmallestStart() function only sorts the top-level elements (a.k.a. list elements). It doesn't try to sort the ranges within each top-level element.
  2. The function will fail if the ranges within a given top-level element are not all on the same seqname.
  3. Empty top-level elements will be put last (and the function will emit some ugly warnings when this happens).

Hope this helps,
H.

ADD COMMENT
0
Entering edit mode

dear Herve

thanks for your answer and the function! That's a very good starting point; I just wanted to make sure that I didn't miss on some already implemented function somewhere. Thanks!

jo

ADD REPLY

Login before adding your answer.

Traffic: 782 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6