How to randomly sample bam files obtaining a subset of information (ex: binning locations)
1
1
Entering edit mode
@victoriaccorey-11035
Last seen 8.5 years ago

I am attempting to randomly sample a bam file the most efficient way possible.  I understand this can be done using the GenomicAlignments::reduceByYield, and I have done so:

individual_subsampling <- function(filename, sample_size) {
  bf <- BamFile(filename, yieldSize=1000000)
  yield <- function(x) {
    readGAlignments(x)
  }
  map <- identity
  gr <- GRanges(reduceByYield(bf, yield, map, REDUCEsampler(sample_size,FALSE)))
  return(gr)
}

My end goal, however, it to only obtain countOverlap() information for ~1/20th of the sample.  AKA, if the bam file consists of 34 million reads, my goal is to randomly sample 2 million reads and count where they overlap with a supplied GRange object.  While I can run the reduceByYield as shown above, it's pulling out way more information that I need and in turn takes a decent amount of time.   I would much rather supply the countOverlaps() as the map function, however this results in being unable to apply the REDUCEsampler option.  The code below does work, however it has to count all of the reads:

counting_overlaps <- function(filename) {
  bf <- BamFile(filename, yieldSize = 1000000)
  yield <- function(x) {
    readGAlignments(x)
  }
  map <- function(x) {
    gr_overlapRegions <- overlap_regions_generator("GR", "ALL") ##This is just defining my bins
    gr <- GRanges(x)
    countOverlaps(gr_overlapRegions, gr)
  }
  reduce = `+`
  final_data <- reduceByYield(bf, yield, map, reduce)
  return(final_data)
}

To put this in perspective, the individual_subsampling applied to a bam file with 34 million reads and sample_size = 1*10^6 took an elapsed time of 93 seconds.  Applied to the same original bam file, counting_overlaps() took 71 seconds.  I would love to apply something like counting_overlaps() therefore, but subsample simultaneously to make the process more efficient.  Is there any way to do this?

Any help or pointers would be fantastic.  Thanks!

genomicfiles rsamtools • 1.7k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 4 months ago
United States

You could do a full count (e.g., via summarizeOverlaps()), and then sample from the counts -- for some vector of counts x, you'd like to draw multinomial deviates rmultinom(1, 0.2 * sum(x), x / sum(x)). This will be relatively efficient -- the bam file gets input but immediately reduced to a count vector, which is small.

 

ADD COMMENT

Login before adding your answer.

Traffic: 310 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6