Question

pileup coverage from BAM file

1

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 11 months ago

United States

Hi Chris -- On 02/28/2013 09:30 AM, Chris Cabanski wrote: > Hi, > > I have a bam file and I am trying to calculate the coverage at each exon > position (base) within a gene. First, I find the exon boundaries for the > gene of interest: > > library(org.Hs.eg.db) > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > > genesym <- c("CDKN2A") > geneid <- select(org.Hs.eg.db, keys=genesym, > keytype="SYMBOL",cols="ENTREZID") > > txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene > ex <- exonsBy(txdb, "gene") > ex1 <- ex[[geneid$ENTREZID[1]]] >> ex1 > GRanges with 6 ranges and 2 metadata columns: > seqnames ranges strand | exon_id exon_name > <rle> <iranges> <rle> | <integer> <character> > [1] chr9 [21967751, 21968241] - | 127318 <na> > [2] chr9 [21968574, 21968770] - | 127319 <na> > [3] chr9 [21970901, 21971207] - | 127320 <na> > [4] chr9 [21974403, 21974826] - | 127321 <na> > [5] chr9 [21974677, 21975132] - | 127322 <na> > [6] chr9 [21994138, 21994490] - | 127323 <na> > --- > seqlengths: > chr1 chr2 ... chrUn_gl000249 > 249250621 243199373 ... 38502 > > Next, I would like to find the coverage at each of these positions and > this is where I am getting stuck. I have tried to follow > example(applyPileups) but I get an error and am not sure that I am > calculating coverage at the base-level. depending on what you're interested in, you might find coverage(bamFile, param=ScanBamParam(which=ex1)) to be what you're looking for -- this is an 'Rle' (run length encoding) that describes how many reads overlap each position, without providing information on what the nucleotide or quality in the read was. When looking a coverage on a large number of regions, it is sometimes more efficient to calculate the coverage of the whole bam file (this takes some but not huge amounts of memory, say 4Gb) and then take a 'Views' on the Rle that reflect your regions of interest. The GenomicRanges and IRanges vignettes http://bioconductor.org/packages/2.11/bioc/html/GenomicRanges.html http://bioconductor.org/packages/2.11/bioc/html/IRanges.html are good resources for insight into working with these objects. > > library(Rsamtools) > > # create character list of BAM files (only use first file for now) > fls <- scan("SortedBams_tumor.txt",what='character') > > calcInfo <- function(x) { > ## information at each pile-up position > info <- apply(x[["seq"]], 2, function(y) { > y <- y[c("A", "C", "G", "T"),] probably what's happening here is that y[,] is subsetting a matrix, but there is only one (or no) columns, so it is being coerced to a vector, and then... > y <- y + 1L > # continuity > cvg <- colSums(y) the equivalent of... > colSums(1:5) Error in colSums(1:5) : 'x' must be an array of at least two dimensions so the immediate solution is likely y <- y[c("A", "C", "G", "T"), , drop=FALSE] and... > p <- y / cvg[col(y)] > h <- -colSums(p * log(p)) > ifelse(cvg == 4L, NA, h) > }) > list(seqnames=x[["seqnames"]], pos=x[["pos"]], info=info) > } > > which <- ex1 > param <- PileupParam(which=which, what=c("seq","qual"), > maxDepth=1000L, yieldBy="position", yieldAll=TRUE) > fl <- PileupFiles(fls[1],param=param) > res <- applyPileups(fl,calcInfo,param=param) >> Error: applyPileups: 'x' must be an array of at least two dimensions > > Is this correct route that I should be taking, or should I be going about > this differently? The function gmapR::bam_tally is an alternative to applyPileups. Hope that helps, Martin > > Thanks in advance, > Chris > >> sessionInfo() > R version 2.15.2 (2012-10-26) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Rsamtools_1.10.2 > [2] Biostrings_2.26.3 > [3] TxDb.Hsapiens.UCSC.hg19.knownGene_2.8.0 > [4] GenomicFeatures_1.10.1 > [5] GenomicRanges_1.10.6 > [6] IRanges_1.16.4 > [7] org.Hs.eg.db_2.8.0 > [8] RSQLite_0.9-4 > [9] DBI_0.2-5 > [10] AnnotationDbi_1.20.3 > [11] Biobase_2.18.0 > [12] BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] BSgenome_1.26.1 RCurl_1.5-0 XML_3.2-0 biomaRt_2.14.0 > [5] bitops_1.0-4.1 parallel_2.15.2 rtracklayer_1.18.2 stats4_2.15.2 > [9] tools_2.15.2 zlibbioc_1.4.0 > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

Coverage Cancer IRanges GenomicRanges Coverage Cancer IRanges GenomicRanges • 2.2k views

ADD COMMENT • link 12.8 years ago Martin Morgan 25k