Coverage of reads per base position from BAM files
2
0
Entering edit mode
Guest User ★ 13k
@guest-user-4897
Last seen 10.3 years ago
I am trying to obtain per reference base position coverage for a BAM file. The BAM file has been generated by Samtools and contains alignment info for 10 million reads to a 7kb reference plasmid genome. What is the best way to go about this? I have explored the "readBamGappedAlignments" and "readAligned" functions but I am not sure of the way forward from there. Also should I be using a BAM file or the sorted BAM file with its corresponding index file? I am currently using the later. Thanks much. -- output of sessionInfo(): R version 2.14.2 (2012-02-29) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.12.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 [4] lattice_0.20-6 Rsamtools_1.6.3 Biostrings_2.22.0 [7] GenomicRanges_1.6.7 IRanges_1.12.6 loaded via a namespace (and not attached): [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0 grid_2.14.2 [5] hwriter_1.3 RCurl_1.91-1 rtracklayer_1.14.4 tools_2.14.2 [9] XML_3.9-4 zlibbioc_1.0.1 -- Sent via the guest posting facility at bioconductor.org.
Coverage GO Coverage GO • 2.7k views
ADD COMMENT
0
Entering edit mode
@michael-lawrence-3846
Last seen 3.1 years ago
United States
You can readGappedAlignments to load the file as a GappedAlignments. Then call coverage() on that. Michael On Tue, Jul 17, 2012 at 12:56 PM, Rithun Mukherjee [guest] < guest@bioconductor.org> wrote: > > I am trying to obtain per reference base position coverage for a BAM file. > The BAM file has been generated by Samtools and contains alignment info for > 10 million reads to a 7kb reference plasmid genome. What is the best way to > go about this? I have explored the "readBamGappedAlignments" and > "readAligned" functions but I am not sure of the way forward from there. > > Also should I be using a BAM file or the sorted BAM file with its > corresponding index file? I am currently using the later. > > Thanks much. > > > -- output of sessionInfo(): > > R version 2.14.2 (2012-02-29) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] ShortRead_1.12.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 > [4] lattice_0.20-6 Rsamtools_1.6.3 Biostrings_2.22.0 > [7] GenomicRanges_1.6.7 IRanges_1.12.6 > > loaded via a namespace (and not attached): > [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0 grid_2.14.2 > [5] hwriter_1.3 RCurl_1.91-1 rtracklayer_1.14.4 tools_2.14.2 > [9] XML_3.9-4 zlibbioc_1.0.1 > > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
Thanks very much, it worked! Regards, Rithun ----- Original Message ----- From: "Michael Lawrence" <lawrence.michael@gene.com> To: "Rithun Mukherjee [guest]" <guest at="" bioconductor.org=""> Cc: bioconductor at r-project.org, rmukherj at fhcrc.org, "Rsamtools Maintainer" <maintainer at="" bioconductor.org=""> Sent: Tuesday, July 17, 2012 2:11:33 PM Subject: Re: [BioC] Coverage of reads per base position from BAM files You can readGappedAlignments to load the file as a GappedAlignments. Then call coverage() on that. Michael On Tue, Jul 17, 2012 at 12:56 PM, Rithun Mukherjee [guest] < guest at bioconductor.org > wrote: I am trying to obtain per reference base position coverage for a BAM file. The BAM file has been generated by Samtools and contains alignment info for 10 million reads to a 7kb reference plasmid genome. What is the best way to go about this? I have explored the "readBamGappedAlignments" and "readAligned" functions but I am not sure of the way forward from there. Also should I be using a BAM file or the sorted BAM file with its corresponding index file? I am currently using the later. Thanks much. -- output of sessionInfo(): R version 2.14.2 (2012-02-29) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.12.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 [4] lattice_0.20-6 Rsamtools_1.6.3 Biostrings_2.22.0 [7] GenomicRanges_1.6.7 IRanges_1.12.6 loaded via a namespace (and not attached): [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0 grid_2.14.2 [5] hwriter_1.3 RCurl_1.91-1 rtracklayer_1.14.4 tools_2.14.2 [9] XML_3.9-4 zlibbioc_1.0.1 -- Sent via the guest posting facility at bioconductor.org . _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Rithun Mukherjee Postdoctoral Research Associate Computational Biology Program Fred Hutchinson Cancer Research Center rmukherj at fhcrc.org 206.667.7661
ADD REPLY
0
Entering edit mode
@martin-morgan-1513
Last seen 5 months ago
United States
On 07/17/2012 12:56 PM, Maintainer wrote: > > I am trying to obtain per reference base position coverage for a BAM file. The BAM file has been generated by Samtools and contains alignment info for 10 million reads to a 7kb reference plasmid genome. What is the best way to go about this? I have explored the "readBamGappedAlignments" and "readAligned" functions but I am not sure of the way forward from there. > > Also should I be using a BAM file or the sorted BAM file with its corresponding index file? I am currently using the later. For simple depth of coverage follow Michaels' advice wrt readGappedAlignments (where 10 million reads might fit into memory w/out trouble, so no need to sort / index / query the BAM file though that doesn't hurt). If by 'per reference base position coverage' you mean the number of A, C, G, T above each position, you'll want Rsamtools::applyPileups. See ?applyPileups for a worked example; this isn't as easy as it should be. Normally, you'll want to use a sorted, indexed BAM file (?sortBam, ?indexBam) so that you can query just the regions you're interested in. One oddity of the BAM 'index' is that it indexes intervals on the reference genome, rather than say dividing the reads into bins of equal numbers. The smallest interval for indexing is 2^14 = 16kb, so in your case all your 10 million reads fall in a single bin and any search is linear(!) -- the first read at the first position is found fast, the last read at the last position only after looking at 9,999,999 other reads. If you want to process the whole file then iterating through it (via PileupParam yieldSize) works; random access will be terrible. Very recent samtools can generate more flexible indexes, but I think the resolution (with respect to the reference genome) is still relatively course and I have not tested files with these indexes in Rsamtools. Martin > Thanks much. > > > -- output of sessionInfo(): > > R version 2.14.2 (2012-02-29) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] ShortRead_1.12.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 > [4] lattice_0.20-6 Rsamtools_1.6.3 Biostrings_2.22.0 > [7] GenomicRanges_1.6.7 IRanges_1.12.6 > > loaded via a namespace (and not attached): > [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0 grid_2.14.2 > [5] hwriter_1.3 RCurl_1.91-1 rtracklayer_1.14.4 tools_2.14.2 > [9] XML_3.9-4 zlibbioc_1.0.1 > > > -- > Sent via the guest posting facility at bioconductor.org. > > ____________________________________________________________________ ____ > devteam-bioc mailing list > To unsubscribe from this mailing list send a blank email to > devteam-bioc-leave at lists.fhcrc.org > You can also unsubscribe or change your personal options at > https://lists.fhcrc.org/mailman/listinfo/devteam-bioc > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENT

Login before adding your answer.

Traffic: 554 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6