Entering edit mode
Hi Kim,
On 10/18/2013 11:04 PM, Kim Siegmund wrote:
> Dear Herve,
>
> I work with Tim Triche Jr. at USC and he told me you were the best
person to contact about a particular programming question I have. I
would like to estimate the SNP variant frequencies in a BAM file at a
list of specific locations. Specifically, I have a file with a list of
somatic SNP variant locations, e.g.
> chr1 31184771
> chr1 40229367
> chr1 48699354
> ?. (~500 rows),
> and a BAM file of whole exon sequencing reads at ~ 40X coverage. I'd
like to know the base calls in the BAM file at the specific locations
from the file of somatic variant positions. Is there an easy way to
extract this information from the BAM file? If yes, can I also get
the quality score for each base?
>
> I am familiar with GenomicRanges & ReadGappedAlignment, but perhaps
at only an intermediate level. I was told this is easy in samtools
(is it the mpileup command?) but cannot find any examples showing how
this might be done in Bioconductor. Although it seems the package
VariantTools must have this capability, I did not see from the
vignette how I might do this targeted data summarization.
Maybe you could ask Michael about VariantTools.
Here below is a pure Biostrings + GenomicRanges solution. It uses
the pileLettersAt() function, which isn't in any of those packages
yet, but will be included in a package to be added soon to Bioc-devel.
Here is how to use pileLettersAt():
Input:
- A BAM file:
bamfile <- BamFile(system.file("extdata", "ex1.bam",
package="Rsamtools"))
seqinfo(bamfile) # to see the seqlevels and seqlengths
stackStringsFromBam(bamfile, param="seq1:1-21") # a quick look
at
# the reads
- A GRanges object containing single nucleotide positions:
snpos <- GRanges(Rle(c("seq1", "seq2"), c(7, 2)),
IRanges(c(1:5, 21, 1575, 1513:1514), width=1))
Some preliminary massage on 'snpos':
seqinfo(snpos) <- merge(seqinfo(snpos), seqinfo(bamfile))
seqlevels(snpos) <- seqlevelsInUse(snpos)
Load the BAM file in a GAlignments object. We load only the reads
aligned to the sequences in 'seqlevels(snpos)' and we filter out
reads not passing quality controls (flag bit 0x200) and PCR or
optical duplicates (flag bit 0x400). See ?ScanBamParam and the SAM
Spec for more information.
which <- as(seqinfo(snpos), "GRanges")
flag <- scanBamFlag(isNotPassingQualityControls=FALSE,
isDuplicate=FALSE)
what <- c("seq", "qual")
param <- ScanBamParam(flag=flag, what=c("seq", "qual"),
which=which)
gal <- readGAlignmentsFromBam(bamfile, param=param)
seqlevels(gal) <- seqlevels(snpos)
Extract the read sequences (a.k.a. query sequences) and quality
strings. Both are reported with respect to the + strand.
qseq <- mcols(gal)$seq
qual <- mcols(gal)$qual
nucl_piles <- pileLettersAt(qseq, seqnames(gal), start(gal),
cigar(gal), snpos)
qual_piles <- pileLettersAt(qual, seqnames(gal), start(gal),
cigar(gal), snpos)
mcols(snpos)$nucl_piles <- nucl_piles
mcols(snpos)$qual_piles <- qual_piles
Then:
> snpos
GRanges with 9 ranges and 2 metadata columns:
seqnames ranges strand | nucl_piles
<rle> <iranges> <rle> | <dnastringset>
[1] seq1 [ 1, 1] * | C
[2] seq1 [ 2, 2] * | A
[3] seq1 [ 3, 3] * | CC
[4] seq1 [ 4, 4] * | TT
[5] seq1 [ 5, 5] * | AAA
[6] seq1 [ 21, 21] * | TTTTTTTTT
[7] seq1 [1575, 1575] * |
[8] seq2 [1513, 1513] * | GGGGGGGGGGGGGGGGTTGG
[9] seq2 [1514, 1514] * | TTTTTTTTTTTTTTTTTTTTTTTTTAT
qual_piles
<bstringset>
[1] <
[2] <
[3] <<
[4] <<
[5] <<<
[6] <<<-<;<<=
[7]
[8] <<<<;6:<=4<;16<<'+:.
[9] <<<;;=<<=<<<<<<:;'84;;<<;+1
---
seqlengths:
seq1 seq2
1575 1584
Finally, to summarize A/C/G/T frequencies at each position:
> alphabetFrequency(nucl_piles, baseOnly=TRUE)
A C G T other
[1,] 0 1 0 0 0
[2,] 1 0 0 0 0
[3,] 0 2 0 0 0
[4,] 0 0 0 2 0
[5,] 3 0 0 0 0
[6,] 0 0 0 9 0
[7,] 0 0 0 0 0
[8,] 0 0 18 2 0
[9,] 1 0 0 26 0
See below for pileLettersAt's code.
Cheers,
H.
----------------------------------------------------------------------
-
### Conceptually equivalent to:
### sapply(x, function(xx) paste(xx, collapse=""))
collapse_XStringSetList <- function(x)
{
unlisted_x <- unlist(x, use.names=FALSE)
unlisted_ans <- unlist(unlisted_x, use.names=FALSE)
ans_width <- sum(relist(width(unlisted_x), x))
extract_at <- as(PartitioningByEnd(cumsum(ans_width)), "IRanges")
extractAt(unlisted_ans, extract_at)
}
### pileLettersOnSingleRefAt() is the workhorse behind
pileLettersAt().
### 'x', 'pos', 'cigar': 3 parallel vectors describing N strings
aligned
### to the same reference sequence. 'x' must be an XStringSet
(typically
### DNAStringSet) object containing the unaligned strings (a.k.a. the
### query sequences) reported with respect to the + strand. 'pos' must
### be an integer vector where 'pos[i]' is the 1-based position on the
### reference sequence of the first aligned letter in 'x[[i]]'.
'cigar'
### must be a character vector containing the extended CIGAR strings.
### 'at': must be an integer vector containing single nucleotide
### positions on the reference sequence.
### Returns an XStringSet (typically DNAStringSet) object parallel to
### 'at' (i.e. with 1 string per single nucleotide position).
pileLettersOnSingleRefAt <- function(x, pos, cigar, at)
{
stopifnot(is(x, "XStringSet"))
N <- length(x) # nb of alignments
stopifnot(is.integer(pos) && length(pos) == N)
stopifnot(is.character(cigar) && length(cigar) == N)
stopifnot(is.integer(at))
ops <- c("M", "=", "X")
ranges_on_ref <- cigarRangesAlongReferenceSpace(cigar, pos=pos,
ops=ops)
ranges_on_query <- cigarRangesAlongQuerySpace(cigar, ops=ops)
## 'ranges_on_ref' and 'ranges_on_query' are IRangesList objects
parallel
## to 'x', 'pos', and 'cigar'. In addition, the 2 IRangesList
objects
## have the same "shape" (i.e. same elementLengths()), so, after
## unlisting, the 2 unlisted objects are parallel IRanges
objects.
unlisted_ranges_on_ref <- unlist(ranges_on_ref, use.names=FALSE)
unlisted_ranges_on_query <- unlist(ranges_on_query,
use.names=FALSE)
## 2 integer vectors parallel to IRanges objects
'unlisted_ranges_on_ref'
## and 'unlisted_ranges_on_query' above.
range_group <- togroup(ranges_on_ref)
query2ref_shift <- start(unlisted_ranges_on_ref) -
start(unlisted_ranges_on_query)
hits <- findOverlaps(at, unlisted_ranges_on_ref)
hits_at_in_x <- at[queryHits(hits)] -
query2ref_shift[subjectHits(hits)]
hits_group <- range_group[subjectHits(hits)]
unlisted_piles <- subseq(x[hits_group], start=hits_at_in_x,
width=1L)
piles_skeleton <- PartitioningByEnd(queryHits(hits),
NG=length(at),
names=names(at))
piles <- relist(unlisted_piles, piles_skeleton)
collapse_XStringSetList(piles)
}
### 'x', 'seqnames', 'pos', 'cigar': 4 parallel vectors describing N
### aligned strings. 'x', 'pos', and 'cigar' as above. 'seqnames' must
### be a factor-Rle where 'seqnames[i]' is the name of the reference
### sequence of the i-th alignment.
### 'at': must be a GRanges object containing single nucleotide
### positions. 'seqlevels(at)' must be identical to
'levels(seqnames)'.
### Returns an XStringSet (typically DNAStringSet) object parallel to
### 'at' (i.e. with 1 string per single nucleotide position).
pileLettersAt <- function(x, seqnames, pos, cigar, at)
{
stopifnot(is(x, "XStringSet"))
N <- length(x) # nb of alignments
stopifnot(is(seqnames, "Rle"))
stopifnot(is.factor(runValue(seqnames)))
stopifnot(length(seqnames) == N)
stopifnot(is.integer(pos) && length(pos) == N)
stopifnot(is.character(cigar) && length(cigar) == N)
stopifnot(is(at, "GRanges"))
stopifnot(all(width(at) == 1L))
stopifnot(identical(seqlevels(at), levels(seqnames)))
## We process 1 chromosome at a time. So we start by splitting
## 'x', 'pos', 'cigar', and 'start(at)' by chromosome. The 4
## resulting list-like objects have 1 list element per chromosome
## in 'seqlevels(at)' (or in 'levels(seqnames)', which is
identical
## to 'seqlevels(at)').
x_by_chrom <- split(x, seqnames) # XStringSetList
pos_by_chrom <- split(pos, seqnames) # IntegerList
cigar_by_chrom <- split(cigar, seqnames) # CharacterList
at_by_chrom <- split(start(at), seqnames(at)) # IntegerList
## Unsplit index.
split_idx <- unlist(split(seq_along(at), seqnames(at)),
use.names=FALSE)
unsplit_idx <- integer(length(at))
unsplit_idx[split_idx] <- seq_along(at)
do.call("c", lapply(seq_along(seqlevels(at)),
function(i)
pileLettersOnSingleRefAt(
x_by_chrom[[i]],
pos_by_chrom[[i]],
cigar_by_chrom[[i]],
at_by_chrom[[i]])))[unsplit_idx]
}
>
> Thank you in advance for any advice.
>
> Kim Siegmund
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319