Iterating over BAM file by genomic chunk
1
1
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

While you can iterate over an open BamFile, extracting a certain number of reads at each iteration using the yieldSize argument, it appears you cannot iterate over genomic regions, using the which argument

> bf <- BamFile("tmp_sorted.bam")
> gr2 <- as(GRanges(rep("chr1", 3), IRanges(c(3000000,3005000, 3010000), c(3005000, 3010000, 3015000))), "GRangesList")
> open(bf)
> lapply(1:3, function(x) readGAlignments(bf, param = ScanBamParam(which = gr2[[x]])))
[[1]]
GAlignments object with 98 alignments and 0 metadata columns:
       seqnames strand       cigar    qwidth     start       end     width
          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
   [1]     chr1      -      13S88M       101   3004617   3004704        88
   [2]     chr1      +      88M13S       101   3004617   3004704        88
   [3]     chr1      +        101M       101   3003681   3003781       101
   [4]     chr1      -        101M       101   3003791   3003891       101
   [5]     chr1      +      80M21S       101   3001469   3001548        80
   ...      ...    ...         ...       ...       ...       ...       ...
  [94]     chr2      -        101M       101   4949942   4950042       101
  [95]     chr1      +        101M       101   3004952   3005052       101
  [96]     chr1      -        101M       101   3005023   3005123       101
  [97]     chr1      +        101M       101   3001837   3001937       101
  [98]     chr1      +        101M       101   3001614   3001714       101
           njunc
       <integer>
   [1]         0
   [2]         0
   [3]         0
   [4]         0
   [5]         0
   ...       ...
  [94]         0
  [95]         0
  [96]         0
  [97]         0
  [98]         0
  -------
  seqinfo: 66 sequences from an unspecified genome

[[2]]
GAlignments object with 0 alignments and 0 metadata columns:
   seqnames strand       cigar    qwidth     start       end     width
      <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
       njunc
   <integer>
  -------
  seqinfo: 66 sequences from an unspecified genome

[[3]]
GAlignments object with 0 alignments and 0 metadata columns:
   seqnames strand       cigar    qwidth     start       end     width
      <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
       njunc
   <integer>
  -------
  seqinfo: 66 sequences from an unspecified genome

> close(bf)

But you can repeatedly open the BamFile and get chunks

> lapply(1:3, function(x) readGAlignments(bf, param = ScanBamParam(which = gr2[[x]])))
[[1]]
GAlignments object with 98 alignments and 0 metadata columns:
       seqnames strand       cigar    qwidth     start       end     width
          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
   [1]     chr1      -      13S88M       101   3004617   3004704        88
   [2]     chr1      +      88M13S       101   3004617   3004704        88
   [3]     chr1      +        101M       101   3003681   3003781       101
   [4]     chr1      -        101M       101   3003791   3003891       101
   [5]     chr1      +      80M21S       101   3001469   3001548        80
   ...      ...    ...         ...       ...       ...       ...       ...
  [94]     chr2      -        101M       101   4949942   4950042       101
  [95]     chr1      +        101M       101   3004952   3005052       101
  [96]     chr1      -        101M       101   3005023   3005123       101
  [97]     chr1      +        101M       101   3001837   3001937       101
  [98]     chr1      +        101M       101   3001614   3001714       101
           njunc
       <integer>
   [1]         0
   [2]         0
   [3]         0
   [4]         0
   [5]         0
   ...       ...
  [94]         0
  [95]         0
  [96]         0
  [97]         0
  [98]         0
  -------
  seqinfo: 66 sequences from an unspecified genome

[[2]]
GAlignments object with 112 alignments and 0 metadata columns:
        seqnames strand       cigar    qwidth     start       end     width
           <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
    [1]     chr1      +        101M       101   3009689   3009789       101
    [2]     chr1      -        101M       101   3009692   3009792       101
    [3]     chr1      +      40M61S       101   3006254   3006293        40
    [4]     chr1      -      61S40M       101   3006254   3006293        40
    [5]     chr1      -      36S65M       101   3005668   3005732        65
    ...      ...    ...         ...       ...       ...       ...       ...
  [108]     chr1      -      1S100M       101   3007853   3007952       100
  [109]     chr1      +        101M       101   3006226   3006326       101
  [110]     chr1      -        101M       101   3009131   3009231       101
  [111]     chr1      -        101M       101   3006865   3006965       101
  [112]     chr1      +        101M       101   3005263   3005363       101
            njunc
        <integer>
    [1]         0
    [2]         0
    [3]         0
    [4]         0
    [5]         0
    ...       ...
  [108]         0
  [109]         0
  [110]         0
  [111]         0
  [112]         0
  -------
  seqinfo: 66 sequences from an unspecified genome

[[3]]
GAlignments object with 89 alignments and 0 metadata columns:
       seqnames strand       cigar    qwidth     start       end     width
          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
   [1]     chr1      +      61M40S       101   3013078   3013138        61
   [2]     chr1      -    35S61M5S       101   3013078   3013138        61
   [3]     chr1      +       94M7S       101   3014248   3014341        94
   [4]     chr1      -       6S95M       101   3014247   3014341        95
   [5]     chr1      +        101M       101   3010502   3010602       101
   ...      ...    ...         ...       ...       ...       ...       ...
  [85]     chr1      +    51M2I48M       101   3014701   3014799        99
  [86]     chr7      +  20M5I68M8S       101 124099832 124099919        88
  [87]     chr1      +      82M19S       101   3010739   3010820        82
  [88]     chr1      -        101M       101   3012152   3012252       101
  [89]     chr1      -        101M       101   3010107   3010207       101
           njunc
       <integer>
   [1]         0
   [2]         0
   [3]         0
   [4]         0
   [5]         0
   ...       ...
  [85]         0
  [86]         0
  [87]         0
  [88]         0
  [89]         0
  -------
  seqinfo: 66 sequences from an unspecified genome

Is there nothing to be gained by reading in chunks vs repeatedly opening the BamFile (as there presumably is when using the yieldSize argument to BamFile)?

GenomicAlignments • 1.2k views
ADD COMMENT
1
Entering edit mode
@martin-morgan-1513
Last seen 5 months ago
United States

Using yieldSize=1L with a ScanBamParam() having a which= field iterates over ranges

fl = system.file(package="Rsamtools", "extdata", "ex1.bam")
bam = BamFile(fl, yieldSize = 1)
param = ScanBamParam(
    which = GRanges(c("seq2:1-1000", "seq1:1-1000"))
)

open(bam)
repeat {
    aln = readGAlignments(bam, param = param)
    if (length(aln) == 0L)
        break
    print(table(seqnames(aln)))
}
close(bam)

resulting in output

seq1 seq2
   0 1214

seq1 seq2
 907    0

This pattern might be expressed at a higher level using GenomicFiles

yield = function(x) readGAlignments(x, param = param)
map = function(x) table(seqnames(x))
reduce = list
reduceByYield(BamFile(fl, yieldSize = 1), yield, map, reduce)

It's not easily possible to yield more than one range at a time, or to mix range and yield size. When there are many (10,000's) of ranges, it's often faster to iterate over the entire bam file and filter each chunk; the cost of finding ranges can be expensive, especially with deep coverage.

ADD COMMENT

Login before adding your answer.

Traffic: 538 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6