Dear BioConductors,
I'm looking for a way to "cut" out a "slice" of aligned sequences from a BAM file given start and end positions in the reference sequence.
I have looked at the GenomicAlignments package and GenomicRanges because I would expect those to be able to do the trick.
I can construct a ScanBamParam call that picks out the reads overlapping my region of interest, but the sequence I get out in the $seq metadata column is the full sequence of the read - not only the sequence inside my range of interest.
Here's a small example of what I do trying to get the aligned "slice" between positions 19363800 and 19363805 on chromosome 14, but as it shows, I get the full reads that overlap the range - not the slice:
library(GenomicAlignments) library(RNAseqData.HNRNPC.bam.chr14) fl <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1] bfl <- BamFile(fl) aln <- readGAlignments(bfl, param=ScanBamParam(what="seq", which=GRanges("chr14", IRanges(19363800,19363805 )))) > aln GAlignments object with 2 alignments and 1 metadata column: seqnames strand cigar qwidth start end width <Rle> <Rle> <character> <integer> <integer> <integer> <integer> [1] chr14 + 72M 72 19363738 19363809 72 [2] chr14 - 72M 72 19363755 19363826 72 njunc | <integer> | [1] 0 | [2] 0 | seq <DNAStringSet> [1] CCCATATGTACATCAGGCCCCAGGTATACACTGGACTCCAGGTGGACACCAGCACTCAGTTGGATACACACA [2] CCCCAGGTATACACTGGACTCCAGGTGGACACCAGCACTCAGTTGGATACACACACTCAAGGTGGACACCAG ------- seqinfo: 93 sequences from an unspecified genome
What I want to get out is only the "slice" between 19363800 and 19363805, ie
GATACA GATACA
I'm sure there must be a way to do just this, but I have not been able to find it.
I realize there are complications with reads having insertions in the specified region, but I'll be happy for a solution that works for simpleCigar=TRUE.
My motivation for wanting this is to count occurrences of individual gene-variants (over the region of interest) and count codon- and amino acid frequencies per position in the region of interest.
Thank you all for some great packages.
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS
locale:
[1] LC_CTYPE=da_DK.UTF-8 LC_NUMERIC=C
[3] LC_TIME=da_DK.UTF-8 LC_COLLATE=da_DK.UTF-8
[5] LC_MONETARY=da_DK.UTF-8 LC_MESSAGES=da_DK.UTF-8
[7] LC_PAPER=da_DK.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=da_DK.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] RNAseqData.HNRNPC.bam.chr14_0.6.0 GenomicAlignments_1.4.1
[3] Rsamtools_1.20.4 Biostrings_2.36.1
[5] XVector_0.8.0 GenomicRanges_1.20.5
[7] GenomeInfoDb_1.4.1 IRanges_2.2.5
[9] S4Vectors_0.6.1 BiocGenerics_0.14.0
loaded via a namespace (and not attached):
[1] bitops_1.0-6 futile.options_1.0.0 zlibbioc_1.14.0
[4] futile.logger_1.4.1 lambda.r_1.1.7 BiocParallel_1.2.6
[7] tools_3.2.0 compiler_3.2.0
Hi Hervé,
This does exactly what I need.
Thank you very much!
All the best,
Thomas