Aggregate read counts by non-uniform overlapping sliding windows
s1437643 ▴ 20

I have a matrix of read counts representing the number of reads aligned to bins along the genome. There are millions of bins and they are not all the same length. For each bin, I would like to count the number of reads within 5,000 bp on either side of the bin's centre. Reads should only be counted from bins that lie entirely within that 10,000 bp window.

So far, I've been able to take my bins (queryRanges) and create two vectors giving the first (firstHits) and last (lastHits) bin within the total 10,000 bp search space (subjectRanges):

## Show query ranges
> queryRanges
GRanges object with 3913116 ranges and 0 metadata columns:
            seqnames               ranges strand
               <Rle>            <IRanges>  <Rle>
        [1]     chr1   [3000193, 3000814]      *
        [2]     chr1   [3001121, 3001796]      *
        [3]     chr1   [3003603, 3004014]      *
        [4]     chr1   [3004127, 3004339]      *
        [5]     chr1   [3004340, 3005438]      *
        ...      ...                  ...    ...
  [3913112]     chrY [90818058, 90821170]      *
  [3913113]     chrY [90823995, 90828869]      *
  [3913114]     chrY [90828938, 90829083]      *
  [3913115]     chrY [90829136, 90829941]      *
  [3913116]     chrY [90831793, 90832280]      *
  -------
  seqinfo: 21 sequences from mm10 genome

## Create window size subject ranges
> windowSize <- 10000
> subjectRanges <- suppressWarnings(trim(resize(queryRanges, windowSize, fix="center")))

## Show subjectRanges
> subjectRanges
GRanges object with 3913116 ranges and 0 metadata columns:
            seqnames               ranges strand
               <Rle>            <IRanges>  <Rle>
        [1]     chr1   [2995504, 3005503]      *
        [2]     chr1   [2996459, 3006458]      *
        [3]     chr1   [2998809, 3008808]      *
        [4]     chr1   [2999233, 3009232]      *
        [5]     chr1   [2999889, 3009888]      *
        ...      ...                  ...    ...
  [3913112]     chrY [90814614, 90824613]      *
  [3913113]     chrY [90821432, 90831431]      *
  [3913114]     chrY [90824011, 90834010]      *
  [3913115]     chrY [90824539, 90834538]      *
  [3913116]     chrY [90827037, 90837036]      *
  -------
  seqinfo: 21 sequences from mm10 genome

## Get first query within subject ranges
> firstHits <- findOverlaps(queryRanges, subjectRanges, type="within", select="first")

## Replace NA hits with index of query bin
> indexNA <- which(is.na(firstHits))
> firstHits[indexNA] <- indexNA

## Get last query within subject ranges
> lastHits <- findOverlaps(queryRanges, subjectRanges, type="within", select="last")

## Replace NA hits with index of query bin
> indexNA <- which(is.na(lastHits))
> lastHits[indexNA] <- indexNA

## Show index of first query bin within each subject bin
> head(firstHits, n=25)
[1]  1  1  1  1  1  2  3  3  3  3  4  4  5  5  6  6  7  7  9 10 11 12 15 15 16

## Show index of last query bin within each subject bin
> head(lastHits, n=25)
[1]  5  6 10 12 14 15 18 18 20 20 21 22 22 23 23 26 26 27 27 27 27 27 28 29 30

From here I could take each column of my count matrix and use these indexes to subset the column, sum the counts, and build a new matrix. This, however, seems to be a very slow and memory-inefficient process. Is there a better alternative to this approach? Are there any facets of GRanges transformations/structures which I could take advantage of? I could also just use the 10,000 bp bins to count the reads from the BAM files using Rsubread's featureCounts, but since I already have the information I wondered if I could get my result faster...
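For reference, this is roughly the per-window aggregation I had in mind (just a sketch; countMatrix stands for my matrix of read counts, with one row per bin in queryRanges):

## Sketch of the slow approach: for window i, sum the counts of all bins
## from firstHits[i] to lastHits[i] (countMatrix has one row per bin)
sumWindow <- function(i) {
  colSums(countMatrix[firstHits[i]:lastHits[i], , drop = FALSE])
}
windowCounts <- t(vapply(seq_along(firstHits), sumWindow,
                         numeric(ncol(countMatrix))))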

granges genomicranges featurecounts
Aaron Lun ★ 28k

I would just count everything fresh, using featureCounts or a similar function. Your current approach is sort-of-but-not-quite-right, because you're ignoring bins (and reads in those bins) that are only partially overlapped by the expanded 10 kbp interval. Your bin regions in queryRanges don't seem to be contiguous either, so some of the reads will fall through the gaps and be missed. If you want to expand the bins, then you'll need to re-assign reads to the resized intervals to get correct counts. Remember: accuracy first, efficiency later. I'm not even sure your approach is more efficient, from either an analyst or developer perspective - if nothing else, in the time taken to answer this question, you could have gotten new counts already.
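Something along these lines would do it (just a sketch; the BAM file names are placeholders, and you should check the featureCounts settings against your data):

library(Rsubread)

## SAF-style annotation built from the resized 10 kb windows
saf <- data.frame(GeneID = as.character(seq_along(subjectRanges)),
                  Chr    = as.character(seqnames(subjectRanges)),
                  Start  = start(subjectRanges),
                  End    = end(subjectRanges),
                  Strand = "+")   # strand is ignored unless you ask for stranded counting

## allowMultiOverlap lets a read be counted in every window it overlaps,
## which is what you want when the windows themselves overlap
fc <- featureCounts(files = c("sample1.bam", "sample2.bam"),   # placeholder BAMs
                    annot.ext = saf,
                    allowMultiOverlap = TRUE)
windowCounts <- fc$counts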


Thanks Aaron, I'll stick with featureCounts to generate the window counts.

@michael-lawrence-3846

I guess the simple way would be to just generate all of the hits and aggregate the matrix by the subject hits:

## All (bin, window) pairs where the bin lies entirely within the window
hits <- findOverlaps(queryRanges, subjectRanges, type="within")
## For each window, sum the count rows of the bins it fully contains
rowsum(countMatrix[queryHits(hits),], subjectHits(hits))

If that is too slow, we can work on it. If this were just a vector instead of a matrix, there would be faster ways.
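Note that rowsum() only returns rows for the windows that fully contain at least one bin, with rownames giving the subject index, so if you want one row per window you could pad the result out, something like:

agg <- rowsum(countMatrix[queryHits(hits),], subjectHits(hits))
## One row per window; windows with no fully-contained bins stay at zero
windowCounts <- matrix(0, nrow = length(subjectRanges), ncol = ncol(countMatrix))
windowCounts[as.integer(rownames(agg)), ] <- agg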

@gordon-smyth

In terms of speed, counting everything fresh using featureCounts will be much faster than anything else you might do. It's a very fast function.

As Aaron has indicated, it is also easier to get it right that way: to be sure that you're counting the correct reads.
