Question

summarize scores of GRanges into bins

1

Entering edit mode

Assa Yeroslaviz ★ 1.5k

@assa-yeroslaviz-1597

Last seen 14 days ago

Germany

As a third part of my subsetByOverlaps with Multimapping, I would to further manipulate my GRangesList object. I have created a GRangesList objaect after subsetting my intensities to only the regions of the genes in the gtf file.

This gave me a GRangesList object looks like that:

GRangesList object of length 7126:
$15S_rRNA 
GRanges object with 1649 ranges and 1 metadata column:
         seqnames       ranges strand |            score
            <Rle>    <IRanges>  <Rle> |        <numeric>
     [1]       MT [6546, 6546]      * | 48.0121515363538
     [2]       MT [6547, 6547]      * | 48.0456043815372
     [3]       MT [6548, 6548]      * | 48.0865514607159
     [4]       MT [6549, 6549]      * | 48.1089269541849
     [5]       MT [6550, 6550]      * | 48.1350452590915
     ...      ...          ...    ... .              ...
  [1645]       MT [8190, 8190]      * | 145.573432997664
  [1646]       MT [8191, 8191]      * | 145.573432997664
  [1647]       MT [8192, 8192]      * | 145.573432997664
  [1648]       MT [8193, 8193]      * | 145.573432997664
  [1649]       MT [8194, 8194]      * | 118.114351592435
...
<7125 more elements>
-------
seqinfo: 17 sequences from an unspecified genome

The first 10 elements of the list can be found here (it is too big to be added here).

I would like to know, if there a way to summarize the scores into bins.The GRanges in the list are each of different sizes.

>unlist(lapply(gr.list[1:10], length))
15S_rRNA 21S_rRNA     HRA1     ICR1     LSR1     NME1     PWR1    Q0010    Q0017    Q0032 
    1649     4439      564     3199     1175      340      941      387      162      291

I would like to know if it's possible to summarize all elements in each object of the list separately into e.g. 100 bins with the average score value in the metadata column.

The number of bins should be fixed and identical to all objects in the GRangesList, independent of how many ranges were in the original GRanges. In other word I would like to create for each gene (=GRanges object) in the list a new GRanges object of 100 ranges. The score of each of the ranges would be calculated from the summarized scores in the original GRanges object.

I would appreciate any help in doing so.

Thanks in advance Assa

genomicranges grangeslist bin findoverlaps • 4.8k views

ADD COMMENT • link updated 6.7 years ago by Jenny HT Lee • 0 • written 7.9 years ago by Assa Yeroslaviz ★ 1.5k

score 6 · Answer 1 · 2017-09-09

6

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.6 years ago

United States

I think you're aiming to generate 100 windows over the span of each gene. First, find the gene spans, then unlist to GRanges, and generate the tiles:

tiles <- tile(unlist(range(gr.list)), n=100)

We now need to figure out which positions overlap which tile. It is simplest to first simplify the data to a GRanges where each position occurs only once.

gr.unique <- unlist(gr.list)[!duplicated(unlist(gr.list)]

Note that using unique() above would drop the score column, so we use duplicated() instead. Now, find the positions overlapping each tile:

tiles.gr <- unlist(tiles)
hits <- findOverlaps(tiles.gr, gr.unique)

We have to unlist the tiles since we are interested in overlap per tile. The Hits object implies a grouping on the positions by tile, so we can use it to aggregate the scores:

agg <- aggregate(gr.unique, hits, score=mean(score))

The result contains the aggregated scores, and a column with references back to the positions that were summarized. You could then, for example, attach the scores to the tiles and convert the tiles back to a list by gene:

tiles.gr$score <- agg$score
relist(tiles.gr, tiles)

ADD COMMENT • link 7.9 years ago Michael Lawrence ★ 11k

0

Entering edit mode

+1 more. This is amazing, how you can fit anything in R. Thanks a lot for the explanation.

ADD REPLY • link 7.8 years ago Assa Yeroslaviz ★ 1.5k

0

Entering edit mode

Thanks for the detailed solution, Michael.

I have similar data as Assa, except that my "scores" are "frequency".

I get an error when trying to attach the scores back to each tile, because my agg have lesser rows than tiles.gr:

Error in`[[&lt;-`(`*tmp*`, name, value = c(2, 2, 12, 3, 72, 150, 513, 1, : 4671 elements in value to replace 6396 elements

Some of the tiles have zero frequency as my SNPs are scattered across the chromosome:

> t.gr.list
GRangesList object of length 19:
$chrA01 
GRanges object with 16018 ranges and 1 metadata column:
          seqnames            ranges strand |     score
             <Rle>         <IRanges>  <Rle> | <numeric>
      [1]   chrA01       82924-82925      * |         1
      [2]   chrA01       83837-83838      * |         1
      [3]   chrA01     897554-897555      * |         1
      [4]   chrA01     897875-897876      * |         1
      [5]   chrA01   1657291-1657292      * |         1

Is there a way to include the zero scores? Or maybe drop the zero tiles?

Thanks

Jenny

ADD REPLY • link 6.7 years ago Jenny HT Lee • 0

0

Entering edit mode

Maybe something like

tiles.gr$score[countQueryHits(hits) > 0L] <- agg$score