Question

Aggregate read counts by non-uniform overlapping sliding windows

0

Entering edit mode

s1437643 ▴ 20

@s1437643-9524

Last seen 4.6 years ago

I have a matrix of read counts which represent the number of reads aligned in bins along the genome. There are millions of bins and they are not all the same length. For each bin I would like to count the number of reads within 5,000 bp either side of the bin's centre. Reads should only be counted from bins which are entirely within the 5,000 bp range.

So far, I've been able to take my bins (queryRanges) and create two vectors which represent the first (firstHits) and last bin (lastHits) within the total 10,000 bp search space (subjectRanges):

## Show query ranges
> queryRanges
GRanges object with 3913116 ranges and 0 metadata columns:
            seqnames               ranges strand
               <Rle>            <IRanges>  <Rle>
        [1]     chr1   [3000193, 3000814]      *
        [2]     chr1   [3001121, 3001796]      *
        [3]     chr1   [3003603, 3004014]      *
        [4]     chr1   [3004127, 3004339]      *
        [5]     chr1   [3004340, 3005438]      *
        ...      ...                  ...    ...
  [3913112]     chrY [90818058, 90821170]      *
  [3913113]     chrY [90823995, 90828869]      *
  [3913114]     chrY [90828938, 90829083]      *
  [3913115]     chrY [90829136, 90829941]      *
  [3913116]     chrY [90831793, 90832280]      *
  -------
  seqinfo: 21 sequences from mm10 genome

## Create window size subject ranges
> windowSize <- 10000
> subjectRanges <- suppressWarnings(trim(resize(queryRanges, windowSize, fix="center")))

## Show subjectRanges
> subjectRanges
GRanges object with 3913116 ranges and 0 metadata columns:
            seqnames               ranges strand
               <Rle>            <IRanges>  <Rle>
        [1]     chr1   [2995504, 3005503]      *
        [2]     chr1   [2996459, 3006458]      *
        [3]     chr1   [2998809, 3008808]      *
        [4]     chr1   [2999233, 3009232]      *
        [5]     chr1   [2999889, 3009888]      *
        ...      ...                  ...    ...
  [3913112]     chrY [90814614, 90824613]      *
  [3913113]     chrY [90821432, 90831431]      *
  [3913114]     chrY [90824011, 90834010]      *
  [3913115]     chrY [90824539, 90834538]      *
  [3913116]     chrY [90827037, 90837036]      *
  -------
  seqinfo: 21 sequences from mm10 genome

## Get first query within subject ranges
> firstHits <- findOverlaps(queryRanges, subjectRanges, type="within", select="first")

## Replace NA hits with index of query bin
> indexNA <- which(is.na(firstHits))
> firstHits[indexNA] <- indexNA

## Get last query within subject ranges
> lastHits <- findOverlaps(queryRanges, subjectRanges, type="within", select="last")

## Replace NA hits with index of query bin
> indexNA <- which(is.na(lastHits))
> lastHits[indexNA] <- indexNA

# Show index of first query bin within each subject bin
> head(firstHits, n=25)
[1]  1  1  1  1  1  2  3  3  3  3  4  4  5  5  6  6  7  7  9 10 11 12 15 15 16

# Show index of last query bin within each subject bin
> head(lastHits, n=25)
[1]  5  6 10 12 14 15 18 18 20 20 21 22 22 23 23 26 26 27 27 27 27 27 28 29 30

From here I could take each column of my count matrix and use these indexes to subset the column, sum the counts, and create a new matrix. This however seems to be a very slow process, and memory inefficient. Is there a better alternative to this approach? Are they any facets of GRanges transformations/structures which I could take advantage of? I could also just use the 10,000bp bins to count the reads from the BAM files using Rsubread's featureCounts, but I wondered since I already have the information if I could get my result faster...

granges genomicranges featurecounts • 1.3k views

ADD COMMENT • link updated 7.3 years ago by Michael Lawrence ★ 11k • written 7.3 years ago by s1437643 ▴ 20

2

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.4 years ago

United States

I guess the simple way would be to just generate all of the hits and aggregate the matrix by the subject hits:

hits <- findOverlaps(queryRanges, subjectRanges, type="within")
rowsum(countMatrix[queryHits(hits),], subjectHits(hits))

If that is too slow, we can work on it. If this was just a vector instead of a matrix, there would be faster ways.

ADD COMMENT • link 7.3 years ago Michael Lawrence ★ 11k

1

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 12 minutes ago

WEHI, Melbourne, Australia

In terms of speed, counting everything fresh using featureCounts will be much faster than than anything else you might do. It's a very fast function.

As Aaron has indicated, it is also easier to get it right that way: to be sure that you're counting the correct reads.

ADD COMMENT • link 7.3 years ago Gordon Smyth 50k

score 3 · Accepted Answer · 2017-01-15

I would just count everything fresh, using featureCounts or a similar function. Your current approach is sort-of-but-not-quite-right, because you're ignoring bins (and reads in those bins) that are only partially overlapped by the expanded 10kbp interval. Your bin regions in queryRanges don't seem to be contiguous either, so some of the reads will fall through the gaps and be missed. If you want to expand the bins, then you'll need to re-assign reads to the resized intervals to get correct counts. Remember; accuracy first, efficiency later. I'm not even sure your approach is more efficient, from either an analyst or developer perspective - if nothing else, in the time taken to answer this question, you could have gotten new counts already.