Question: Aggregate read counts by non-uniform overlapping sliding windows
0
gravatar for s1437643
2.8 years ago by
s14376430
s14376430 wrote:

I have a matrix of read counts which represent the number of reads aligned in bins along the genome. There are millions of bins and they are not all the same length. For each bin I would like to count the number of reads within 5,000 bp either side of the bin's centre. Reads should only be counted from bins which are entirely within the 5,000 bp range.

So far, I've been able to take my bins (queryRanges) and create two vectors which represent the first (firstHits) and last bin (lastHits) within the total 10,000 bp search space (subjectRanges): 

## Show query ranges
> queryRanges
GRanges object with 3913116 ranges and 0 metadata columns:
            seqnames               ranges strand
               <Rle>            <IRanges>  <Rle>
        [1]     chr1   [3000193, 3000814]      *
        [2]     chr1   [3001121, 3001796]      *
        [3]     chr1   [3003603, 3004014]      *
        [4]     chr1   [3004127, 3004339]      *
        [5]     chr1   [3004340, 3005438]      *
        ...      ...                  ...    ...
  [3913112]     chrY [90818058, 90821170]      *
  [3913113]     chrY [90823995, 90828869]      *
  [3913114]     chrY [90828938, 90829083]      *
  [3913115]     chrY [90829136, 90829941]      *
  [3913116]     chrY [90831793, 90832280]      *
  -------
  seqinfo: 21 sequences from mm10 genome

## Create window size subject ranges
> windowSize <- 10000
> subjectRanges <- suppressWarnings(trim(resize(queryRanges, windowSize, fix="center")))

## Show subjectRanges
> subjectRanges
GRanges object with 3913116 ranges and 0 metadata columns:
            seqnames               ranges strand
               <Rle>            <IRanges>  <Rle>
        [1]     chr1   [2995504, 3005503]      *
        [2]     chr1   [2996459, 3006458]      *
        [3]     chr1   [2998809, 3008808]      *
        [4]     chr1   [2999233, 3009232]      *
        [5]     chr1   [2999889, 3009888]      *
        ...      ...                  ...    ...
  [3913112]     chrY [90814614, 90824613]      *
  [3913113]     chrY [90821432, 90831431]      *
  [3913114]     chrY [90824011, 90834010]      *
  [3913115]     chrY [90824539, 90834538]      *
  [3913116]     chrY [90827037, 90837036]      *
  -------
  seqinfo: 21 sequences from mm10 genome

## Get first query within subject ranges
> firstHits <- findOverlaps(queryRanges, subjectRanges, type="within", select="first")

## Replace NA hits with index of query bin
> indexNA <- which(is.na(firstHits))
> firstHits[indexNA] <- indexNA

## Get last query within subject ranges
> lastHits <- findOverlaps(queryRanges, subjectRanges, type="within", select="last")

## Replace NA hits with index of query bin
> indexNA <- which(is.na(lastHits))
> lastHits[indexNA] <- indexNA

# Show index of first query bin within each subject bin
> head(firstHits, n=25)
[1]  1  1  1  1  1  2  3  3  3  3  4  4  5  5  6  6  7  7  9 10 11 12 15 15 16

# Show index of last query bin within each subject bin
> head(lastHits, n=25)
[1]  5  6 10 12 14 15 18 18 20 20 21 22 22 23 23 26 26 27 27 27 27 27 28 29 30

​From here I could take each column of my count matrix and use these indexes to subset the column, sum the counts, and create a new matrix. This however seems to be a very slow process, and memory inefficient. Is there a better alternative to this approach? Are they any facets of GRanges transformations/structures which I could take advantage of? I could also just use the 10,000bp bins to count the reads from the BAM files using Rsubread's featureCounts, but I wondered since I already have the information if I could get my result faster...

ADD COMMENTlink modified 2.8 years ago by Michael Lawrence11k • written 2.8 years ago by s14376430
Answer: Aggregate read counts by non-uniform overlapping sliding windows
3
gravatar for Aaron Lun
2.8 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

I would just count everything fresh, using featureCounts or a similar function. Your current approach is sort-of-but-not-quite-right, because you're ignoring bins (and reads in those bins) that are only partially overlapped by the expanded 10kbp interval. Your bin regions in queryRanges don't seem to be contiguous either, so some of the reads will fall through the gaps and be missed. If you want to expand the bins, then you'll need to re-assign reads to the resized intervals to get correct counts. Remember; accuracy first, efficiency later. I'm not even sure your approach is more efficient, from either an analyst or developer perspective - if nothing else, in the time taken to answer this question, you could have gotten new counts already.

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by Aaron Lun25k

Thanks Aaron, I'll stick with featureCounts to generate the window counts.

ADD REPLYlink written 2.8 years ago by s14376430
Answer: Aggregate read counts by non-uniform overlapping sliding windows
2
gravatar for Michael Lawrence
2.8 years ago by
United States
Michael Lawrence11k wrote:

I guess the simple way would be to just generate all of the hits and aggregate the matrix by the subject hits:

hits <- findOverlaps(queryRanges, subjectRanges, type="within")
rowsum(countMatrix[queryHits(hits),], subjectHits(hits))

If that is too slow, we can work on it. If this was just a vector instead of a matrix, there would be faster ways.

ADD COMMENTlink written 2.8 years ago by Michael Lawrence11k
Answer: Aggregate read counts by non-uniform overlapping sliding windows
1
gravatar for Gordon Smyth
2.8 years ago by
Gordon Smyth38k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth38k wrote:

In terms of speed, counting everything fresh using featureCounts will be much faster than than anything else you might do. It's a very fast function.

As Aaron has indicated, it is also easier to get it right that way: to be sure that you're counting the correct reads.

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by Gordon Smyth38k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 182 users visited in the last hour