how to obtain base counts from BAM file at specific genomic coordinates
2
0
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States
Hi Kim, On 10/18/2013 11:04 PM, Kim Siegmund wrote: > Dear Herve, > > I work with Tim Triche Jr. at USC and he told me you were the best person to contact about a particular programming question I have. I would like to estimate the SNP variant frequencies in a BAM file at a list of specific locations. Specifically, I have a file with a list of somatic SNP variant locations, e.g. > chr1 31184771 > chr1 40229367 > chr1 48699354 > ?. (~500 rows), > and a BAM file of whole exon sequencing reads at ~ 40X coverage. I'd like to know the base calls in the BAM file at the specific locations from the file of somatic variant positions. Is there an easy way to extract this information from the BAM file? If yes, can I also get the quality score for each base? > > I am familiar with GenomicRanges & ReadGappedAlignment, but perhaps at only an intermediate level. I was told this is easy in samtools (is it the mpileup command?) but cannot find any examples showing how this might be done in Bioconductor. Although it seems the package VariantTools must have this capability, I did not see from the vignette how I might do this targeted data summarization. Maybe you could ask Michael about VariantTools. Here below is a pure Biostrings + GenomicRanges solution. It uses the pileLettersAt() function, which isn't in any of those packages yet, but will be included in a package to be added soon to Bioc-devel. Here is how to use pileLettersAt(): Input: - A BAM file: bamfile <- BamFile(system.file("extdata", "ex1.bam", package="Rsamtools")) seqinfo(bamfile) # to see the seqlevels and seqlengths stackStringsFromBam(bamfile, param="seq1:1-21") # a quick look at # the reads - A GRanges object containing single nucleotide positions: snpos <- GRanges(Rle(c("seq1", "seq2"), c(7, 2)), IRanges(c(1:5, 21, 1575, 1513:1514), width=1)) Some preliminary massage on 'snpos': seqinfo(snpos) <- merge(seqinfo(snpos), seqinfo(bamfile)) seqlevels(snpos) <- seqlevelsInUse(snpos) Load the BAM file in a GAlignments object. We load only the reads aligned to the sequences in 'seqlevels(snpos)' and we filter out reads not passing quality controls (flag bit 0x200) and PCR or optical duplicates (flag bit 0x400). See ?ScanBamParam and the SAM Spec for more information. which <- as(seqinfo(snpos), "GRanges") flag <- scanBamFlag(isNotPassingQualityControls=FALSE, isDuplicate=FALSE) what <- c("seq", "qual") param <- ScanBamParam(flag=flag, what=c("seq", "qual"), which=which) gal <- readGAlignmentsFromBam(bamfile, param=param) seqlevels(gal) <- seqlevels(snpos) Extract the read sequences (a.k.a. query sequences) and quality strings. Both are reported with respect to the + strand. qseq <- mcols(gal)$seq qual <- mcols(gal)$qual nucl_piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal), snpos) qual_piles <- pileLettersAt(qual, seqnames(gal), start(gal), cigar(gal), snpos) mcols(snpos)$nucl_piles <- nucl_piles mcols(snpos)$qual_piles <- qual_piles Then: > snpos GRanges with 9 ranges and 2 metadata columns: seqnames ranges strand | nucl_piles <rle> <iranges> <rle> | <dnastringset> [1] seq1 [ 1, 1] * | C [2] seq1 [ 2, 2] * | A [3] seq1 [ 3, 3] * | CC [4] seq1 [ 4, 4] * | TT [5] seq1 [ 5, 5] * | AAA [6] seq1 [ 21, 21] * | TTTTTTTTT [7] seq1 [1575, 1575] * | [8] seq2 [1513, 1513] * | GGGGGGGGGGGGGGGGTTGG [9] seq2 [1514, 1514] * | TTTTTTTTTTTTTTTTTTTTTTTTTAT qual_piles <bstringset> [1] < [2] < [3] << [4] << [5] <<< [6] <<<-<;<<= [7] [8] <<<<;6:<=4<;16<<'+:. [9] <<<;;=<<=<<<<<<:;'84;;<<;+1 --- seqlengths: seq1 seq2 1575 1584 Finally, to summarize A/C/G/T frequencies at each position: > alphabetFrequency(nucl_piles, baseOnly=TRUE) A C G T other [1,] 0 1 0 0 0 [2,] 1 0 0 0 0 [3,] 0 2 0 0 0 [4,] 0 0 0 2 0 [5,] 3 0 0 0 0 [6,] 0 0 0 9 0 [7,] 0 0 0 0 0 [8,] 0 0 18 2 0 [9,] 1 0 0 26 0 See below for pileLettersAt's code. Cheers, H. ---------------------------------------------------------------------- - ### Conceptually equivalent to: ### sapply(x, function(xx) paste(xx, collapse="")) collapse_XStringSetList <- function(x) { unlisted_x <- unlist(x, use.names=FALSE) unlisted_ans <- unlist(unlisted_x, use.names=FALSE) ans_width <- sum(relist(width(unlisted_x), x)) extract_at <- as(PartitioningByEnd(cumsum(ans_width)), "IRanges") extractAt(unlisted_ans, extract_at) } ### pileLettersOnSingleRefAt() is the workhorse behind pileLettersAt(). ### 'x', 'pos', 'cigar': 3 parallel vectors describing N strings aligned ### to the same reference sequence. 'x' must be an XStringSet (typically ### DNAStringSet) object containing the unaligned strings (a.k.a. the ### query sequences) reported with respect to the + strand. 'pos' must ### be an integer vector where 'pos[i]' is the 1-based position on the ### reference sequence of the first aligned letter in 'x[[i]]'. 'cigar' ### must be a character vector containing the extended CIGAR strings. ### 'at': must be an integer vector containing single nucleotide ### positions on the reference sequence. ### Returns an XStringSet (typically DNAStringSet) object parallel to ### 'at' (i.e. with 1 string per single nucleotide position). pileLettersOnSingleRefAt <- function(x, pos, cigar, at) { stopifnot(is(x, "XStringSet")) N <- length(x) # nb of alignments stopifnot(is.integer(pos) && length(pos) == N) stopifnot(is.character(cigar) && length(cigar) == N) stopifnot(is.integer(at)) ops <- c("M", "=", "X") ranges_on_ref <- cigarRangesAlongReferenceSpace(cigar, pos=pos, ops=ops) ranges_on_query <- cigarRangesAlongQuerySpace(cigar, ops=ops) ## 'ranges_on_ref' and 'ranges_on_query' are IRangesList objects parallel ## to 'x', 'pos', and 'cigar'. In addition, the 2 IRangesList objects ## have the same "shape" (i.e. same elementLengths()), so, after ## unlisting, the 2 unlisted objects are parallel IRanges objects. unlisted_ranges_on_ref <- unlist(ranges_on_ref, use.names=FALSE) unlisted_ranges_on_query <- unlist(ranges_on_query, use.names=FALSE) ## 2 integer vectors parallel to IRanges objects 'unlisted_ranges_on_ref' ## and 'unlisted_ranges_on_query' above. range_group <- togroup(ranges_on_ref) query2ref_shift <- start(unlisted_ranges_on_ref) - start(unlisted_ranges_on_query) hits <- findOverlaps(at, unlisted_ranges_on_ref) hits_at_in_x <- at[queryHits(hits)] - query2ref_shift[subjectHits(hits)] hits_group <- range_group[subjectHits(hits)] unlisted_piles <- subseq(x[hits_group], start=hits_at_in_x, width=1L) piles_skeleton <- PartitioningByEnd(queryHits(hits), NG=length(at), names=names(at)) piles <- relist(unlisted_piles, piles_skeleton) collapse_XStringSetList(piles) } ### 'x', 'seqnames', 'pos', 'cigar': 4 parallel vectors describing N ### aligned strings. 'x', 'pos', and 'cigar' as above. 'seqnames' must ### be a factor-Rle where 'seqnames[i]' is the name of the reference ### sequence of the i-th alignment. ### 'at': must be a GRanges object containing single nucleotide ### positions. 'seqlevels(at)' must be identical to 'levels(seqnames)'. ### Returns an XStringSet (typically DNAStringSet) object parallel to ### 'at' (i.e. with 1 string per single nucleotide position). pileLettersAt <- function(x, seqnames, pos, cigar, at) { stopifnot(is(x, "XStringSet")) N <- length(x) # nb of alignments stopifnot(is(seqnames, "Rle")) stopifnot(is.factor(runValue(seqnames))) stopifnot(length(seqnames) == N) stopifnot(is.integer(pos) && length(pos) == N) stopifnot(is.character(cigar) && length(cigar) == N) stopifnot(is(at, "GRanges")) stopifnot(all(width(at) == 1L)) stopifnot(identical(seqlevels(at), levels(seqnames))) ## We process 1 chromosome at a time. So we start by splitting ## 'x', 'pos', 'cigar', and 'start(at)' by chromosome. The 4 ## resulting list-like objects have 1 list element per chromosome ## in 'seqlevels(at)' (or in 'levels(seqnames)', which is identical ## to 'seqlevels(at)'). x_by_chrom <- split(x, seqnames) # XStringSetList pos_by_chrom <- split(pos, seqnames) # IntegerList cigar_by_chrom <- split(cigar, seqnames) # CharacterList at_by_chrom <- split(start(at), seqnames(at)) # IntegerList ## Unsplit index. split_idx <- unlist(split(seq_along(at), seqnames(at)), use.names=FALSE) unsplit_idx <- integer(length(at)) unsplit_idx[split_idx] <- seq_along(at) do.call("c", lapply(seq_along(seqlevels(at)), function(i) pileLettersOnSingleRefAt( x_by_chrom[[i]], pos_by_chrom[[i]], cigar_by_chrom[[i]], at_by_chrom[[i]])))[unsplit_idx] } > > Thank you in advance for any advice. > > Kim Siegmund > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
SNP Sequencing Coverage Alignment Cancer Biostrings PROcess IRanges GenomicRanges SNP • 6.8k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 4 months ago
United States
On 10/22/2013 02:58 AM, Hervé Pagès wrote: > Hi Kim, > > On 10/18/2013 11:04 PM, Kim Siegmund wrote: >> Dear Herve, >> >> I work with Tim Triche Jr. at USC and he told me you were the best person to >> contact about a particular programming question I have. I would like to >> estimate the SNP variant frequencies in a BAM file at a list of specific >> locations. Specifically, I have a file with a list of somatic SNP variant >> locations, e.g. >> chr1 31184771 >> chr1 40229367 >> chr1 48699354 >> ?. (~500 rows), If you're interested in nucleotide counts, then ?applyPileups will provide this. Martin >> and a BAM file of whole exon sequencing reads at ~ 40X coverage. I'd like to >> know the base calls in the BAM file at the specific locations from the file of >> somatic variant positions. Is there an easy way to extract this information >> from the BAM file? If yes, can I also get the quality score for each base? >> >> I am familiar with GenomicRanges & ReadGappedAlignment, but perhaps at only an >> intermediate level. I was told this is easy in samtools (is it the mpileup >> command?) but cannot find any examples showing how this might be done in >> Bioconductor. Although it seems the package VariantTools must have this >> capability, I did not see from the vignette how I might do this targeted data >> summarization. > > Maybe you could ask Michael about VariantTools. > > Here below is a pure Biostrings + GenomicRanges solution. It uses > the pileLettersAt() function, which isn't in any of those packages > yet, but will be included in a package to be added soon to Bioc- devel. > > Here is how to use pileLettersAt(): > > Input: > > - A BAM file: > > bamfile <- BamFile(system.file("extdata", "ex1.bam", package="Rsamtools")) > > seqinfo(bamfile) # to see the seqlevels and seqlengths > > stackStringsFromBam(bamfile, param="seq1:1-21") # a quick look at > # the reads > > - A GRanges object containing single nucleotide positions: > > snpos <- GRanges(Rle(c("seq1", "seq2"), c(7, 2)), > IRanges(c(1:5, 21, 1575, 1513:1514), width=1)) > > Some preliminary massage on 'snpos': > > seqinfo(snpos) <- merge(seqinfo(snpos), seqinfo(bamfile)) > seqlevels(snpos) <- seqlevelsInUse(snpos) > > Load the BAM file in a GAlignments object. We load only the reads > aligned to the sequences in 'seqlevels(snpos)' and we filter out > reads not passing quality controls (flag bit 0x200) and PCR or > optical duplicates (flag bit 0x400). See ?ScanBamParam and the SAM > Spec for more information. > > which <- as(seqinfo(snpos), "GRanges") > flag <- scanBamFlag(isNotPassingQualityControls=FALSE, > isDuplicate=FALSE) > what <- c("seq", "qual") > param <- ScanBamParam(flag=flag, what=c("seq", "qual"), which=which) > gal <- readGAlignmentsFromBam(bamfile, param=param) > seqlevels(gal) <- seqlevels(snpos) > > Extract the read sequences (a.k.a. query sequences) and quality > strings. Both are reported with respect to the + strand. > > qseq <- mcols(gal)$seq > qual <- mcols(gal)$qual > > nucl_piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal), snpos) > qual_piles <- pileLettersAt(qual, seqnames(gal), start(gal), cigar(gal), snpos) > mcols(snpos)$nucl_piles <- nucl_piles > mcols(snpos)$qual_piles <- qual_piles > > Then: > > > snpos > GRanges with 9 ranges and 2 metadata columns: > seqnames ranges strand | nucl_piles > <rle> <iranges> <rle> | <dnastringset> > [1] seq1 [ 1, 1] * | C > [2] seq1 [ 2, 2] * | A > [3] seq1 [ 3, 3] * | CC > [4] seq1 [ 4, 4] * | TT > [5] seq1 [ 5, 5] * | AAA > [6] seq1 [ 21, 21] * | TTTTTTTTT > [7] seq1 [1575, 1575] * | > [8] seq2 [1513, 1513] * | GGGGGGGGGGGGGGGGTTGG > [9] seq2 [1514, 1514] * | TTTTTTTTTTTTTTTTTTTTTTTTTAT > qual_piles > <bstringset> > [1] < > [2] < > [3] << > [4] << > [5] <<< > [6] <<<-<;<<= > [7] > [8] <<<<;6:<=4<;16<<'+:. > [9] <<<;;=<<=<<<<<<:;'84;;<<;+1 > --- > seqlengths: > seq1 seq2 > 1575 1584 > > Finally, to summarize A/C/G/T frequencies at each position: > > > alphabetFrequency(nucl_piles, baseOnly=TRUE) > A C G T other > [1,] 0 1 0 0 0 > [2,] 1 0 0 0 0 > [3,] 0 2 0 0 0 > [4,] 0 0 0 2 0 > [5,] 3 0 0 0 0 > [6,] 0 0 0 9 0 > [7,] 0 0 0 0 0 > [8,] 0 0 18 2 0 > [9,] 1 0 0 26 0 > > See below for pileLettersAt's code. > > Cheers, > H. > > > -------------------------------------------------------------------- --- > > ### Conceptually equivalent to: > ### sapply(x, function(xx) paste(xx, collapse="")) > > collapse_XStringSetList <- function(x) > { > unlisted_x <- unlist(x, use.names=FALSE) > unlisted_ans <- unlist(unlisted_x, use.names=FALSE) > ans_width <- sum(relist(width(unlisted_x), x)) > extract_at <- as(PartitioningByEnd(cumsum(ans_width)), "IRanges") > extractAt(unlisted_ans, extract_at) > } > > ### pileLettersOnSingleRefAt() is the workhorse behind pileLettersAt(). > ### 'x', 'pos', 'cigar': 3 parallel vectors describing N strings aligned > ### to the same reference sequence. 'x' must be an XStringSet (typically > ### DNAStringSet) object containing the unaligned strings (a.k.a. the > ### query sequences) reported with respect to the + strand. 'pos' must > ### be an integer vector where 'pos[i]' is the 1-based position on the > ### reference sequence of the first aligned letter in 'x[[i]]'. 'cigar' > ### must be a character vector containing the extended CIGAR strings. > ### 'at': must be an integer vector containing single nucleotide > ### positions on the reference sequence. > ### Returns an XStringSet (typically DNAStringSet) object parallel to > ### 'at' (i.e. with 1 string per single nucleotide position). > > pileLettersOnSingleRefAt <- function(x, pos, cigar, at) > { > stopifnot(is(x, "XStringSet")) > N <- length(x) # nb of alignments > stopifnot(is.integer(pos) && length(pos) == N) > stopifnot(is.character(cigar) && length(cigar) == N) > stopifnot(is.integer(at)) > > ops <- c("M", "=", "X") > ranges_on_ref <- cigarRangesAlongReferenceSpace(cigar, pos=pos, ops=ops) > ranges_on_query <- cigarRangesAlongQuerySpace(cigar, ops=ops) > > ## 'ranges_on_ref' and 'ranges_on_query' are IRangesList objects parallel > ## to 'x', 'pos', and 'cigar'. In addition, the 2 IRangesList objects > ## have the same "shape" (i.e. same elementLengths()), so, after > ## unlisting, the 2 unlisted objects are parallel IRanges objects. > unlisted_ranges_on_ref <- unlist(ranges_on_ref, use.names=FALSE) > unlisted_ranges_on_query <- unlist(ranges_on_query, use.names=FALSE) > > ## 2 integer vectors parallel to IRanges objects 'unlisted_ranges_on_ref' > ## and 'unlisted_ranges_on_query' above. > range_group <- togroup(ranges_on_ref) > query2ref_shift <- start(unlisted_ranges_on_ref) - > start(unlisted_ranges_on_query) > > hits <- findOverlaps(at, unlisted_ranges_on_ref) > hits_at_in_x <- at[queryHits(hits)] - query2ref_shift[subjectHits(hits)] > hits_group <- range_group[subjectHits(hits)] > unlisted_piles <- subseq(x[hits_group], start=hits_at_in_x, width=1L) > piles_skeleton <- PartitioningByEnd(queryHits(hits), NG=length(at), > names=names(at)) > piles <- relist(unlisted_piles, piles_skeleton) > collapse_XStringSetList(piles) > } > > ### 'x', 'seqnames', 'pos', 'cigar': 4 parallel vectors describing N > ### aligned strings. 'x', 'pos', and 'cigar' as above. 'seqnames' must > ### be a factor-Rle where 'seqnames[i]' is the name of the reference > ### sequence of the i-th alignment. > ### 'at': must be a GRanges object containing single nucleotide > ### positions. 'seqlevels(at)' must be identical to 'levels(seqnames)'. > ### Returns an XStringSet (typically DNAStringSet) object parallel to > ### 'at' (i.e. with 1 string per single nucleotide position). > > pileLettersAt <- function(x, seqnames, pos, cigar, at) > { > stopifnot(is(x, "XStringSet")) > N <- length(x) # nb of alignments > stopifnot(is(seqnames, "Rle")) > stopifnot(is.factor(runValue(seqnames))) > stopifnot(length(seqnames) == N) > stopifnot(is.integer(pos) && length(pos) == N) > stopifnot(is.character(cigar) && length(cigar) == N) > stopifnot(is(at, "GRanges")) > stopifnot(all(width(at) == 1L)) > stopifnot(identical(seqlevels(at), levels(seqnames))) > > ## We process 1 chromosome at a time. So we start by splitting > ## 'x', 'pos', 'cigar', and 'start(at)' by chromosome. The 4 > ## resulting list-like objects have 1 list element per chromosome > ## in 'seqlevels(at)' (or in 'levels(seqnames)', which is identical > ## to 'seqlevels(at)'). > x_by_chrom <- split(x, seqnames) # XStringSetList > pos_by_chrom <- split(pos, seqnames) # IntegerList > cigar_by_chrom <- split(cigar, seqnames) # CharacterList > at_by_chrom <- split(start(at), seqnames(at)) # IntegerList > > ## Unsplit index. > split_idx <- unlist(split(seq_along(at), seqnames(at)), > use.names=FALSE) > unsplit_idx <- integer(length(at)) > unsplit_idx[split_idx] <- seq_along(at) > > do.call("c", lapply(seq_along(seqlevels(at)), > function(i) > pileLettersOnSingleRefAt( > x_by_chrom[[i]], > pos_by_chrom[[i]], > cigar_by_chrom[[i]], > at_by_chrom[[i]])))[unsplit_idx] > } > > >> >> Thank you in advance for any advice. >> >> Kim Siegmund >> > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENT
0
Entering edit mode
@michael-lawrence-3846
Last seen 3.0 years ago
United States
On Tue, Oct 22, 2013 at 2:58 AM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Kim, > > On 10/18/2013 11:04 PM, Kim Siegmund wrote: > >> Dear Herve, >> >> I work with Tim Triche Jr. at USC and he told me you were the best person >> to contact about a particular programming question I have. I would like to >> estimate the SNP variant frequencies in a BAM file at a list of specific >> locations. Specifically, I have a file with a list of somatic SNP variant >> locations, e.g. >> chr1 31184771 >> chr1 40229367 >> chr1 48699354 >> . (~500 rows), >> and a BAM file of whole exon sequencing reads at ~ 40X coverage. I'd like >> to know the base calls in the BAM file at the specific locations from the >> file of somatic variant positions. Is there an easy way to extract this >> information from the BAM file? If yes, can I also get the quality score >> for each base? >> >> I am familiar with GenomicRanges & ReadGappedAlignment, but perhaps at >> only an intermediate level. I was told this is easy in samtools (is it the >> mpileup command?) but cannot find any examples showing how this might be >> done in Bioconductor. Although it seems the package VariantTools must have >> this capability, I did not see from the vignette how I might do this >> targeted data summarization. >> > > Maybe you could ask Michael about VariantTools. > > For using VariantTools, you'll need a genome indexed for GMAP/GSNAP. It can be created from e.g. a BSgenome object: genome <- GmapGenome(BSgenome.Hsapiens.UCSC.hg19, create=TRUE) That will probably require more than 8 GB of RAM, so be careful, but it will only generate the index once (it is cached on disk). Then: param <- TallyVariantsParam(genome, which = your_snp_locations) vr <- tallyVariants(bam, param) The returned value is a VRanges, which is an extension of GRanges for working with variants. Here below is a pure Biostrings + GenomicRanges solution. It uses > the pileLettersAt() function, which isn't in any of those packages > yet, but will be included in a package to be added soon to Bioc- devel. > > Here is how to use pileLettersAt(): > > Input: > > - A BAM file: > > bamfile <- BamFile(system.file("extdata", "ex1.bam", > package="Rsamtools")) > > seqinfo(bamfile) # to see the seqlevels and seqlengths > > stackStringsFromBam(bamfile, param="seq1:1-21") # a quick look at > # the reads > > - A GRanges object containing single nucleotide positions: > > snpos <- GRanges(Rle(c("seq1", "seq2"), c(7, 2)), > IRanges(c(1:5, 21, 1575, 1513:1514), width=1)) > > Some preliminary massage on 'snpos': > > seqinfo(snpos) <- merge(seqinfo(snpos), seqinfo(bamfile)) > seqlevels(snpos) <- seqlevelsInUse(snpos) > > Load the BAM file in a GAlignments object. We load only the reads > aligned to the sequences in 'seqlevels(snpos)' and we filter out > reads not passing quality controls (flag bit 0x200) and PCR or > optical duplicates (flag bit 0x400). See ?ScanBamParam and the SAM > Spec for more information. > > which <- as(seqinfo(snpos), "GRanges") > flag <- scanBamFlag(**isNotPassingQualityControls=**FALSE, > isDuplicate=FALSE) > what <- c("seq", "qual") > param <- ScanBamParam(flag=flag, what=c("seq", "qual"), which=which) > gal <- readGAlignmentsFromBam(**bamfile, param=param) > seqlevels(gal) <- seqlevels(snpos) > > Extract the read sequences (a.k.a. query sequences) and quality > strings. Both are reported with respect to the + strand. > > qseq <- mcols(gal)$seq > qual <- mcols(gal)$qual > > nucl_piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal), > snpos) > qual_piles <- pileLettersAt(qual, seqnames(gal), start(gal), cigar(gal), > snpos) > mcols(snpos)$nucl_piles <- nucl_piles > mcols(snpos)$qual_piles <- qual_piles > > Then: > > > snpos > GRanges with 9 ranges and 2 metadata columns: > seqnames ranges strand | nucl_piles > <rle> <iranges> <rle> | <dnastringset> > [1] seq1 [ 1, 1] * | C > [2] seq1 [ 2, 2] * | A > [3] seq1 [ 3, 3] * | CC > [4] seq1 [ 4, 4] * | TT > [5] seq1 [ 5, 5] * | AAA > [6] seq1 [ 21, 21] * | TTTTTTTTT > [7] seq1 [1575, 1575] * | > [8] seq2 [1513, 1513] * | GGGGGGGGGGGGGGGGTTGG > [9] seq2 [1514, 1514] * | TTTTTTTTTTTTTTTTTTTTTTTTTAT > qual_piles > <bstringset> > [1] < > [2] < > [3] << > [4] << > [5] <<< > [6] <<<-<;<<= > [7] > [8] <<<<;6:<=4<;16<<'+:. > [9] <<<;;=<<=<<<<<<:;'84;;<<;+1 > --- > seqlengths: > seq1 seq2 > 1575 1584 > > Finally, to summarize A/C/G/T frequencies at each position: > > > alphabetFrequency(nucl_piles, baseOnly=TRUE) > A C G T other > [1,] 0 1 0 0 0 > [2,] 1 0 0 0 0 > [3,] 0 2 0 0 0 > [4,] 0 0 0 2 0 > [5,] 3 0 0 0 0 > [6,] 0 0 0 9 0 > [7,] 0 0 0 0 0 > [8,] 0 0 18 2 0 > [9,] 1 0 0 26 0 > > See below for pileLettersAt's code. > > Cheers, > H. > > > ------------------------------**------------------------------** > ----------- > > ### Conceptually equivalent to: > ### sapply(x, function(xx) paste(xx, collapse="")) > > collapse_XStringSetList <- function(x) > { > unlisted_x <- unlist(x, use.names=FALSE) > unlisted_ans <- unlist(unlisted_x, use.names=FALSE) > ans_width <- sum(relist(width(unlisted_x), x)) > extract_at <- as(PartitioningByEnd(cumsum(**ans_width)), "IRanges") > extractAt(unlisted_ans, extract_at) > } > > ### pileLettersOnSingleRefAt() is the workhorse behind pileLettersAt(). > ### 'x', 'pos', 'cigar': 3 parallel vectors describing N strings aligned > ### to the same reference sequence. 'x' must be an XStringSet (typically > ### DNAStringSet) object containing the unaligned strings (a.k.a. the > ### query sequences) reported with respect to the + strand. 'pos' must > ### be an integer vector where 'pos[i]' is the 1-based position on the > ### reference sequence of the first aligned letter in 'x[[i]]'. 'cigar' > ### must be a character vector containing the extended CIGAR strings. > ### 'at': must be an integer vector containing single nucleotide > ### positions on the reference sequence. > ### Returns an XStringSet (typically DNAStringSet) object parallel to > ### 'at' (i.e. with 1 string per single nucleotide position). > > pileLettersOnSingleRefAt <- function(x, pos, cigar, at) > { > stopifnot(is(x, "XStringSet")) > N <- length(x) # nb of alignments > stopifnot(is.integer(pos) && length(pos) == N) > stopifnot(is.character(cigar) && length(cigar) == N) > stopifnot(is.integer(at)) > > ops <- c("M", "=", "X") > ranges_on_ref <- cigarRangesAlongReferenceSpace**(cigar, pos=pos, > ops=ops) > ranges_on_query <- cigarRangesAlongQuerySpace(**cigar, ops=ops) > > ## 'ranges_on_ref' and 'ranges_on_query' are IRangesList objects > parallel > ## to 'x', 'pos', and 'cigar'. In addition, the 2 IRangesList objects > ## have the same "shape" (i.e. same elementLengths()), so, after > ## unlisting, the 2 unlisted objects are parallel IRanges objects. > unlisted_ranges_on_ref <- unlist(ranges_on_ref, use.names=FALSE) > unlisted_ranges_on_query <- unlist(ranges_on_query, use.names=FALSE) > > ## 2 integer vectors parallel to IRanges objects > 'unlisted_ranges_on_ref' > ## and 'unlisted_ranges_on_query' above. > range_group <- togroup(ranges_on_ref) > query2ref_shift <- start(unlisted_ranges_on_ref) - > start(unlisted_ranges_on_**query) > > hits <- findOverlaps(at, unlisted_ranges_on_ref) > hits_at_in_x <- at[queryHits(hits)] - query2ref_shift[subjectHits(** > hits)] > hits_group <- range_group[subjectHits(hits)] > unlisted_piles <- subseq(x[hits_group], start=hits_at_in_x, width=1L) > piles_skeleton <- PartitioningByEnd(queryHits(**hits), NG=length(at), > names=names(at)) > piles <- relist(unlisted_piles, piles_skeleton) > collapse_XStringSetList(piles) > } > > ### 'x', 'seqnames', 'pos', 'cigar': 4 parallel vectors describing N > ### aligned strings. 'x', 'pos', and 'cigar' as above. 'seqnames' must > ### be a factor-Rle where 'seqnames[i]' is the name of the reference > ### sequence of the i-th alignment. > ### 'at': must be a GRanges object containing single nucleotide > ### positions. 'seqlevels(at)' must be identical to 'levels(seqnames)'. > ### Returns an XStringSet (typically DNAStringSet) object parallel to > ### 'at' (i.e. with 1 string per single nucleotide position). > > pileLettersAt <- function(x, seqnames, pos, cigar, at) > { > stopifnot(is(x, "XStringSet")) > N <- length(x) # nb of alignments > stopifnot(is(seqnames, "Rle")) > stopifnot(is.factor(runValue(**seqnames))) > stopifnot(length(seqnames) == N) > stopifnot(is.integer(pos) && length(pos) == N) > stopifnot(is.character(cigar) && length(cigar) == N) > stopifnot(is(at, "GRanges")) > stopifnot(all(width(at) == 1L)) > stopifnot(identical(seqlevels(**at), levels(seqnames))) > > ## We process 1 chromosome at a time. So we start by splitting > ## 'x', 'pos', 'cigar', and 'start(at)' by chromosome. The 4 > ## resulting list-like objects have 1 list element per chromosome > ## in 'seqlevels(at)' (or in 'levels(seqnames)', which is identical > ## to 'seqlevels(at)'). > x_by_chrom <- split(x, seqnames) # XStringSetList > pos_by_chrom <- split(pos, seqnames) # IntegerList > cigar_by_chrom <- split(cigar, seqnames) # CharacterList > at_by_chrom <- split(start(at), seqnames(at)) # IntegerList > > ## Unsplit index. > split_idx <- unlist(split(seq_along(at), seqnames(at)), > use.names=FALSE) > unsplit_idx <- integer(length(at)) > unsplit_idx[split_idx] <- seq_along(at) > > do.call("c", lapply(seq_along(seqlevels(at)**), > function(i) > pileLettersOnSingleRefAt( > x_by_chrom[[i]], > pos_by_chrom[[i]], > cigar_by_chrom[[i]], > at_by_chrom[[i]])))[unsplit_**idx] > } > > > >> Thank you in advance for any advice. >> >> Kim Siegmund >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]
ADD COMMENT

Login before adding your answer.

Traffic: 602 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6