Question

Easy way to sort GRangesList by seqname/start position?

3

Entering edit mode

Johannes Rainer ★ 2.1k

@johannes-rainer-6987

Last seen 16 months ago

Italy

dear all!

I was wondering whether there is an easy way to sort a GRangesList by seqname and chomosomal start position. My GRangesList looks like:

GRangesList object of length 196786:
$ENST00000000233
GRanges object with 6 ranges and 2 metadata columns:
      seqnames                 ranges strand |         exon_id exon_rank
         <Rle>              <IRanges>  <Rle> |     <character> <integer>
  [1]        7 [127228399, 127228619]      + | ENSE00001872691         1
  [2]        7 [127229137, 127229217]      + | ENSE00003494180         2
  [3]        7 [127229539, 127229648]      + | ENSE00003504066         3
  [4]        7 [127230120, 127230191]      + | ENSE00003678978         4
  [5]        7 [127231017, 127231142]      + | ENSE00003676786         5
  [6]        7 [127231267, 127231759]      + | ENSE00000882271         6

$ENST00000000412
GRanges object with 7 ranges and 2 metadata columns:
      seqnames             ranges strand |         exon_id exon_rank
  [1]       12 [9102084, 9102551]      - | ENSE00002286327         1
  [2]       12 [9098825, 9099001]      - | ENSE00003523177         2
  [3]       12 [9098014, 9098180]      - | ENSE00003631241         3
  [4]       12 [9096397, 9096506]      - | ENSE00003492441         4
  [5]       12 [9096001, 9096131]      - | ENSE00000717490         5
  [6]       12 [9095012, 9095138]      - | ENSE00003610229         6
  [7]       12 [9092961, 9094536]      - | ENSE00002254457         7

$ENST00000000442
GRanges object with 7 ranges and 2 metadata columns:
      seqnames               ranges strand |         exon_id exon_rank
  [1]       11 [64073050, 64073208]      + | ENSE00001884684         1
  [2]       11 [64074640, 64074976]      + | ENSE00001195360         2
  [3]       11 [64081423, 64081539]      + | ENSE00003589560         3
  [4]       11 [64081711, 64081839]      + | ENSE00003471121         4
  [5]       11 [64082213, 64082383]      + | ENSE00001195335         5
  [6]       11 [64082473, 64082742]      + | ENSE00000727249         6
  [7]       11 [64083179, 64084210]      + | ENSE00001271942         7

...
<196783 more elements>
-------
seqinfo: 24 sequences from GRCh37 genome

, so it's sorted by transcript id, but I would like to re-order the list by seqname and start positions. Is there an easy and fast way to do that (didn't find it in the help pages...)?

thanks in advance for any help or hints!

cheers, jo

grangeslist sort • 3.0k views

ADD COMMENT • link updated 11.1 years ago by Hervé Pagès 16k • written 11.1 years ago by Johannes Rainer ★ 2.1k

score 1 · Accepted Answer · 2015-01-21

Hi Johannes,

There is no easy way because generally speaking a GRangesList is for storing groups of genomic ranges and there is no guarantee that all the ranges in a given group are on the same seqname. Even if it is the case that they are on the same seqname (which is a reasonable assumption if the GRangesList contains exon or cds ranges grouped by transcript), what does it mean exactly to sort by start position given that there is more than 1 start position per GRangesList element? Like here:

> grl <- split(GRanges("chr1", IRanges(c(8, 3:1), 9)), c("b", "a", "a", "b"))
> grl
GRangesList object of length 2:
$a 
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    [3, 9]      *
  [2]     chr1    [2, 9]      *

$b 
GRanges object with 2 ranges and 0 metadata columns:
      seqnames ranges strand
  [1]     chr1 [8, 9]      *
  [2]     chr1 [1, 9]      *

-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

Do you want the sorting to put b before a because the smallest start in b (1) is smaller than the smallest start in a (2)? Or do you want the sorting to put b after a because its first start (8) is greater than the first start in a (3)?

The following code assumes you want the former:

library(GenomicRanges)
sortGRangesListBySeqnameAndSmallestStart <- function(grl)
{
    stopifnot(is(grl, "GRangesList"))
    list_elt_seqnames <- as.character(runValue(seqnames(grl)))
    list_elt_seqnames <- factor(list_elt_seqnames, levels=seqlevels(grl))
    list_elt_smallest_start <- min(start(grl))
    oo <- order(as.integer(list_elt_seqnames), list_elt_smallest_start)
    grl[oo]
}

Then:

> sortGRangesListBySeqnameAndSmallestStart(grl)
GRangesList object of length 2:
$b 
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    [8, 9]      *
  [2]     chr1    [1, 9]      *

$a 
GRanges object with 2 ranges and 0 metadata columns:
      seqnames ranges strand
  [1]     chr1 [3, 9]      *
  [2]     chr1 [2, 9]      *

-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

Note that:

The sortGRangesListBySeqnameAndSmallestStart() function only sorts the top-level elements (a.k.a. list elements). It doesn't try to sort the ranges within each top-level element.
The function will fail if the ranges within a given top-level element are not all on the same seqname.
Empty top-level elements will be put last (and the function will emit some ugly warnings when this happens).

Hope this helps,
H.