Answer: Coverage of reads per base position from BAM files
7.3 years ago by
On 07/17/2012 12:56 PM, Maintainer wrote:
> I am trying to obtain per reference base position coverage for a BAM
file. The BAM file has been generated by Samtools and contains
alignment info for 10 million reads to a 7kb reference plasmid genome.
What is the best way to go about this? I have explored the
"readBamGappedAlignments" and "readAligned" functions but I am not
sure of the way forward from there.
> Also should I be using a BAM file or the sorted BAM file with its
corresponding index file? I am currently using the later.
For simple depth of coverage follow Michaels' advice wrt
readGappedAlignments (where 10 million reads might fit into memory
trouble, so no need to sort / index / query the BAM file though that
If by 'per reference base position coverage' you mean the number of A,
C, G, T above each position, you'll want Rsamtools::applyPileups. See
?applyPileups for a worked example; this isn't as easy as it should
Normally, you'll want to use a sorted, indexed BAM file (?sortBam,
?indexBam) so that you can query just the regions you're interested
One oddity of the BAM 'index' is that it indexes intervals on the
reference genome, rather than say dividing the reads into bins of
numbers. The smallest interval for indexing is 2^14 = 16kb, so in your
case all your 10 million reads fall in a single bin and any search is
linear(!) -- the first read at the first position is found fast, the
last read at the last position only after looking at 9,999,999 other
reads. If you want to process the whole file then iterating through it
(via PileupParam yieldSize) works; random access will be terrible.
Very recent samtools can generate more flexible indexes, but I think
resolution (with respect to the reference genome) is still relatively
course and I have not tested files with these indexes in Rsamtools.
> Thanks much.
> -- output of sessionInfo():
> R version 2.14.2 (2012-02-29)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>  LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>  LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>  LC_PAPER=C LC_NAME=C
>  LC_ADDRESS=C LC_TELEPHONE=C
>  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
>  stats graphics grDevices utils datasets methods base
> other attached packages:
>  ShortRead_1.12.4 latticeExtra_0.6-19 RColorBrewer_1.0-5
>  lattice_0.20-6 Rsamtools_1.6.3 Biostrings_2.22.0
>  GenomicRanges_1.6.7 IRanges_1.12.6
> loaded via a namespace (and not attached):
>  Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0
>  hwriter_1.3 RCurl_1.91-1 rtracklayer_1.14.4
>  XML_3.9-4 zlibbioc_1.0.1
> Sent via the guest posting facility at bioconductor.org.
> devteam-bioc mailing list
> To unsubscribe from this mailing list send a blank email to
> devteam-bioc-leave at lists.fhcrc.org
> You can also unsubscribe or change your personal options at
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793