Entering edit mode
Hi Chris --
On 02/28/2013 09:30 AM, Chris Cabanski wrote:
> Hi,
>
> I have a bam file and I am trying to calculate the coverage at each
exon
> position (base) within a gene. First, I find the exon boundaries
for the
> gene of interest:
>
> library(org.Hs.eg.db)
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>
> genesym <- c("CDKN2A")
> geneid <- select(org.Hs.eg.db, keys=genesym,
> keytype="SYMBOL",cols="ENTREZID")
>
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> ex <- exonsBy(txdb, "gene")
> ex1 <- ex[[geneid$ENTREZID[1]]]
>> ex1
> GRanges with 6 ranges and 2 metadata columns:
> seqnames ranges strand | exon_id exon_name
> <rle> <iranges> <rle> | <integer> <character>
> [1] chr9 [21967751, 21968241] - | 127318 <na>
> [2] chr9 [21968574, 21968770] - | 127319 <na>
> [3] chr9 [21970901, 21971207] - | 127320 <na>
> [4] chr9 [21974403, 21974826] - | 127321 <na>
> [5] chr9 [21974677, 21975132] - | 127322 <na>
> [6] chr9 [21994138, 21994490] - | 127323 <na>
> ---
> seqlengths:
> chr1 chr2 ...
chrUn_gl000249
> 249250621 243199373 ...
38502
>
> Next, I would like to find the coverage at each of these positions
and
> this is where I am getting stuck. I have tried to follow
> example(applyPileups) but I get an error and am not sure that I am
> calculating coverage at the base-level.
depending on what you're interested in, you might find
coverage(bamFile, param=ScanBamParam(which=ex1))
to be what you're looking for -- this is an 'Rle' (run length
encoding) that
describes how many reads overlap each position, without providing
information on
what the nucleotide or quality in the read was.
When looking a coverage on a large number of regions, it is sometimes
more
efficient to calculate the coverage of the whole bam file (this takes
some but
not huge amounts of memory, say 4Gb) and then take a 'Views' on the
Rle that
reflect your regions of interest.
The GenomicRanges and IRanges vignettes
http://bioconductor.org/packages/2.11/bioc/html/GenomicRanges.html
http://bioconductor.org/packages/2.11/bioc/html/IRanges.html
are good resources for insight into working with these objects.
>
> library(Rsamtools)
>
> # create character list of BAM files (only use first file for now)
> fls <- scan("SortedBams_tumor.txt",what='character')
>
> calcInfo <- function(x) {
> ## information at each pile-up position
> info <- apply(x[["seq"]], 2, function(y) {
> y <- y[c("A", "C", "G", "T"),]
probably what's happening here is that y[,] is subsetting a matrix,
but there is
only one (or no) columns, so it is being coerced to a vector, and
then...
> y <- y + 1L
> # continuity
> cvg <- colSums(y)
the equivalent of...
> colSums(1:5)
Error in colSums(1:5) : 'x' must be an array of at least two
dimensions
so the immediate solution is likely
y <- y[c("A", "C", "G", "T"), , drop=FALSE]
and...
> p <- y / cvg[col(y)]
> h <- -colSums(p * log(p))
> ifelse(cvg == 4L, NA, h)
> })
> list(seqnames=x[["seqnames"]], pos=x[["pos"]], info=info)
> }
>
> which <- ex1
> param <- PileupParam(which=which, what=c("seq","qual"),
> maxDepth=1000L, yieldBy="position", yieldAll=TRUE)
> fl <- PileupFiles(fls[1],param=param)
> res <- applyPileups(fl,calcInfo,param=param)
>> Error: applyPileups: 'x' must be an array of at least two
dimensions
>
> Is this correct route that I should be taking, or should I be going
about
> this differently?
The function gmapR::bam_tally is an alternative to applyPileups.
Hope that helps, Martin
>
> Thanks in advance,
> Chris
>
>> sessionInfo()
> R version 2.15.2 (2012-10-26)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Rsamtools_1.10.2
> [2] Biostrings_2.26.3
> [3] TxDb.Hsapiens.UCSC.hg19.knownGene_2.8.0
> [4] GenomicFeatures_1.10.1
> [5] GenomicRanges_1.10.6
> [6] IRanges_1.16.4
> [7] org.Hs.eg.db_2.8.0
> [8] RSQLite_0.9-4
> [9] DBI_0.2-5
> [10] AnnotationDbi_1.20.3
> [11] Biobase_2.18.0
> [12] BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
> [1] BSgenome_1.26.1 RCurl_1.5-0 XML_3.2-0
biomaRt_2.14.0
> [5] bitops_1.0-4.1 parallel_2.15.2 rtracklayer_1.18.2
stats4_2.15.2
> [9] tools_2.15.2 zlibbioc_1.4.0
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793