Search
Question: Easy way to sort GRangesList by seqname/start position?
3
gravatar for Johannes Rainer
2.8 years ago by
Johannes Rainer1.1k
Italy
Johannes Rainer1.1k wrote:

dear all!

I was wondering whether there is an easy way to sort a GRangesList by seqname and chomosomal start position. My GRangesList looks like:

GRangesList object of length 196786:
$ENST00000000233
GRanges object with 6 ranges and 2 metadata columns:
      seqnames                 ranges strand |         exon_id exon_rank
         <Rle>              <IRanges>  <Rle> |     <character> <integer>
  [1]        7 [127228399, 127228619]      + | ENSE00001872691         1
  [2]        7 [127229137, 127229217]      + | ENSE00003494180         2
  [3]        7 [127229539, 127229648]      + | ENSE00003504066         3
  [4]        7 [127230120, 127230191]      + | ENSE00003678978         4
  [5]        7 [127231017, 127231142]      + | ENSE00003676786         5
  [6]        7 [127231267, 127231759]      + | ENSE00000882271         6

$ENST00000000412
GRanges object with 7 ranges and 2 metadata columns:
      seqnames             ranges strand |         exon_id exon_rank
  [1]       12 [9102084, 9102551]      - | ENSE00002286327         1
  [2]       12 [9098825, 9099001]      - | ENSE00003523177         2
  [3]       12 [9098014, 9098180]      - | ENSE00003631241         3
  [4]       12 [9096397, 9096506]      - | ENSE00003492441         4
  [5]       12 [9096001, 9096131]      - | ENSE00000717490         5
  [6]       12 [9095012, 9095138]      - | ENSE00003610229         6
  [7]       12 [9092961, 9094536]      - | ENSE00002254457         7

$ENST00000000442
GRanges object with 7 ranges and 2 metadata columns:
      seqnames               ranges strand |         exon_id exon_rank
  [1]       11 [64073050, 64073208]      + | ENSE00001884684         1
  [2]       11 [64074640, 64074976]      + | ENSE00001195360         2
  [3]       11 [64081423, 64081539]      + | ENSE00003589560         3
  [4]       11 [64081711, 64081839]      + | ENSE00003471121         4
  [5]       11 [64082213, 64082383]      + | ENSE00001195335         5
  [6]       11 [64082473, 64082742]      + | ENSE00000727249         6
  [7]       11 [64083179, 64084210]      + | ENSE00001271942         7

...
<196783 more elements>
-------
seqinfo: 24 sequences from GRCh37 genome


, so it's sorted by transcript id, but I would like to re-order the list by seqname and start positions. Is there an easy and fast way to do that (didn't find it in the help pages...)?

thanks in advance for any help or hints!

 

cheers, jo

ADD COMMENTlink modified 2.8 years ago by Hervé Pagès ♦♦ 13k • written 2.8 years ago by Johannes Rainer1.1k
1
gravatar for Hervé Pagès
2.8 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Johannes,

There is no easy way because generally speaking a GRangesList is for storing groups of genomic ranges and there is no guarantee that all the ranges in a given group are on the same seqname. Even if it is the case that they are on the same seqname (which is a reasonable assumption if the GRangesList contains exon or cds ranges grouped by transcript), what does it mean exactly to sort by start position given that there is more than 1 start position per GRangesList element? Like here:

> grl <- split(GRanges("chr1", IRanges(c(8, 3:1), 9)), c("b", "a", "a", "b"))
> grl
GRangesList object of length 2:
$a 
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    [3, 9]      *
  [2]     chr1    [2, 9]      *

$b 
GRanges object with 2 ranges and 0 metadata columns:
      seqnames ranges strand
  [1]     chr1 [8, 9]      *
  [2]     chr1 [1, 9]      *

-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

Do you want the sorting to put b before a because the smallest start in b (1) is smaller than the smallest start in a (2)? Or do you want the sorting to put b after a because its first start (8) is greater than the first start in a (3)?

The following code assumes you want the former:

library(GenomicRanges)
sortGRangesListBySeqnameAndSmallestStart <- function(grl)
{
    stopifnot(is(grl, "GRangesList"))
    list_elt_seqnames <- as.character(runValue(seqnames(grl)))
    list_elt_seqnames <- factor(list_elt_seqnames, levels=seqlevels(grl))
    list_elt_smallest_start <- min(start(grl))
    oo <- order(as.integer(list_elt_seqnames), list_elt_smallest_start)
    grl[oo]
}

Then:

> sortGRangesListBySeqnameAndSmallestStart(grl)
GRangesList object of length 2:
$b 
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    [8, 9]      *
  [2]     chr1    [1, 9]      *

$a 
GRanges object with 2 ranges and 0 metadata columns:
      seqnames ranges strand
  [1]     chr1 [3, 9]      *
  [2]     chr1 [2, 9]      *

-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

Note that:

  1. The sortGRangesListBySeqnameAndSmallestStart() function only sorts the top-level elements (a.k.a. list elements). It doesn't try to sort the ranges within each top-level element.
  2. The function will fail if the ranges within a given top-level element are not all on the same seqname.
  3. Empty top-level elements will be put last (and the function will emit some ugly warnings when this happens).

Hope this helps,
H.

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by Hervé Pagès ♦♦ 13k

dear Herve

thanks for your answer and the function! That's a very good starting point; I just wanted to make sure that I didn't miss on some already implemented function somewhere. Thanks!

jo

ADD REPLYlink written 2.8 years ago by Johannes Rainer1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 146 users visited in the last hour