Question

readGappedAlignments and param argument

0

Entering edit mode

Elena Grassi ▴ 10

@elena-grassi-6138

Last seen 8.4 years ago

Italy, Rozzano, Humanitas University

Hello, I've stumbled across a strange behaviour when indexing bams and then reading them with ReadGappedAlignments with a param holding a which section. Given this bam: data at tungsteno:/rogue_bis/data/bioinfotree/prj/roar/dataset/0.2/roa r_test_2$ samtools view test.bam D1JP1ACXX130107:2:2201:7003:63880 16 chr1 243675625 255 92M * 0 0 * * > library(GenomicRanges) > bam <- "/scarlet/data/bioinfotree/prj/roar/dataset/0.2/roar_test_2/test.bam" > ga <- readGappedAlignments(bam) > ga GappedAlignments with 1 alignment and 0 metadata columns: seqnames strand cigar qwidth start end width ngap <rle> <rle> <character> <integer> <integer> <integer> <integer> <integer> [1] chr1 - 92M 92 243675625 243675716 92 0 --- seqlengths: chr1 chr2 ... chrUn_gl000226 chr18_gl000207_random 249250621 243199373 ... 15008 4262 Everything is fine here. But if index it and load it with a param: > gene_id <- c("A", "B") > features <- GRanges( + seqnames = Rle(c("chr1", "chr1")), + strand = strand(c("-", "-")), + ranges = IRanges( + start=c(243675625, 243675627), + width=c(2, 100)), + DataFrame(gene_id) + ) > indexBam(bam) /tmp/RtmpVbmBLu/filead32d37188.bam "/tmp/RtmpVbmBLu/filead32d37188.bam.bai" > ga_s_p <- readGappedAlignments(bam, param=ScanBamParam(which=features)) > ga_s_p GappedAlignments with 2 alignments and 0 metadata columns: seqnames strand cigar qwidth start end width ngap <rle> <rle> <character> <integer> <integer> <integer> <integer> <integer> [1] chr1 - 92M 92 243675625 243675716 92 0 [2] chr1 - 92M 92 243675625 243675716 92 0 --- seqlengths: chr1 chr2 ... chrUn_gl000226 chr18_gl000207_random 249250621 243199373 ... 15008 4262 > The single read is duplicated...I'm not an expert with these libraries so maybe this is my fault (if I remove the second entry in features everything works, but from the ScanBamParam man about which I was not expecting this behaviour). Sorry for the convoluted example but I've stumbled on this problem while comparing counts obtained with summarizeOverlaps in a script of mine. Here's my session info: > sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] rtracklayer_1.20.2 Rsamtools_1.12.3 Biostrings_2.28.0 [4] GenomicRanges_1.12.4 IRanges_1.18.1 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] bitops_1.0-5 BSgenome_1.28.0 RCurl_1.95-4.1 stats4_3.0.1 [5] tools_3.0.1 XML_3.98-1.1 zlibbioc_1.6.0 Thanks, E.G. -- $ pom

Alignment Alignment • 1.3k views

ADD COMMENT • link updated 12.4 years ago by Martin Morgan 25k • written 12.4 years ago by Elena Grassi ▴ 10

score 0 · Answer 1 · 2013-09-09

Hi Elena -- On 09/09/2013 03:57 AM, Elena Grassi wrote: > Hello, > > I've stumbled across a strange behaviour when indexing bams and then > reading them with ReadGappedAlignments with a param > holding a which section. > Given this bam: > data at tungsteno:/rogue_bis/data/bioinfotree/prj/roar/dataset/0.2/r oar_test_2$ > samtools view test.bam > D1JP1ACXX130107:2:2201:7003:63880 16 chr1 243675625 > 255 92M * 0 0 * * > >> library(GenomicRanges) >> bam <- "/scarlet/data/bioinfotree/prj/roar/dataset/0.2/roar_test_2/test.bam" >> ga <- readGappedAlignments(bam) >> ga > GappedAlignments with 1 alignment and 0 metadata columns: > seqnames strand cigar qwidth start end > width ngap > <rle> <rle> <character> <integer> <integer> <integer> > <integer> <integer> > [1] chr1 - 92M 92 243675625 243675716 > 92 0 > --- > seqlengths: > chr1 chr2 ... > chrUn_gl000226 chr18_gl000207_random > 249250621 243199373 ... > 15008 4262 > > Everything is fine here. But if index it and load it with a param: >> gene_id <- c("A", "B") >> features <- GRanges( > + seqnames = Rle(c("chr1", "chr1")), > + strand = strand(c("-", "-")), > + ranges = IRanges( > + start=c(243675625, 243675627), > + width=c(2, 100)), > + DataFrame(gene_id) > + ) >> indexBam(bam) > /tmp/RtmpVbmBLu/filead32d37188.bam > "/tmp/RtmpVbmBLu/filead32d37188.bam.bai" >> ga_s_p <- readGappedAlignments(bam, param=ScanBamParam(which=features)) >> ga_s_p > GappedAlignments with 2 alignments and 0 metadata columns: > seqnames strand cigar qwidth start end > width ngap > <rle> <rle> <character> <integer> <integer> <integer> > <integer> <integer> > [1] chr1 - 92M 92 243675625 243675716 > 92 0 > [2] chr1 - 92M 92 243675625 243675716 > 92 0 > --- > seqlengths: > chr1 chr2 ... > chrUn_gl000226 chr18_gl000207_random > 249250621 243199373 ... > 15008 4262 >> > > The single read is duplicated...I'm not an expert with these libraries Your GRanges query is analogous to asking for (something like, maybe I didn't get the indexing correct) samtools view <input.bam> chr1:243675625-243675622 chr1:243675627-243675727 which performs two independent queries -- a read overlapping both regions will be returned twice. You could reduce(features) to make ranges non- overlapping, although it's still possible to have a single read overlapping two distinct regions hence returned twice. Or you could use findOverlaps() and manipulate the returned object to implement your scheme for selecting reads, or identify the appropriate mode in summarizeOverlaps for doing the counting. Or perhaps you can provide more of an explanation for the problem that led you to this? Hope that helps. Martin > so maybe this is my fault (if I remove > the second entry in features everything works, but from the > ScanBamParam man about which I was not expecting > this behaviour). > Sorry for the convoluted example but I've stumbled on this problem > while comparing counts obtained > with summarizeOverlaps in a script of mine. > > Here's my session info: >> sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] rtracklayer_1.20.2 Rsamtools_1.12.3 Biostrings_2.28.0 > [4] GenomicRanges_1.12.4 IRanges_1.18.1 BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] bitops_1.0-5 BSgenome_1.28.0 RCurl_1.95-4.1 stats4_3.0.1 > [5] tools_3.0.1 XML_3.98-1.1 zlibbioc_1.6.0 > > > Thanks, > E.G. > > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793