Hello,
I've stumbled across a strange behaviour when indexing bams and then
reading them with ReadGappedAlignments with a param
holding a which section.
Given this bam:
data at tungsteno:/rogue_bis/data/bioinfotree/prj/roar/dataset/0.2/roa
r_test_2$
samtools view test.bam
D1JP1ACXX130107:2:2201:7003:63880 16 chr1 243675625
255 92M * 0 0 * *
> library(GenomicRanges)
> bam <-
"/scarlet/data/bioinfotree/prj/roar/dataset/0.2/roar_test_2/test.bam"
> ga <- readGappedAlignments(bam)
> ga
GappedAlignments with 1 alignment and 0 metadata columns:
seqnames strand cigar qwidth start end
width ngap
<rle> <rle> <character> <integer> <integer> <integer>
<integer> <integer>
[1] chr1 - 92M 92 243675625 243675716
92 0
---
seqlengths:
chr1 chr2 ...
chrUn_gl000226 chr18_gl000207_random
249250621 243199373 ...
15008 4262
Everything is fine here. But if index it and load it with a param:
> gene_id <- c("A", "B")
> features <- GRanges(
+ seqnames = Rle(c("chr1", "chr1")),
+ strand = strand(c("-", "-")),
+ ranges = IRanges(
+ start=c(243675625, 243675627),
+ width=c(2, 100)),
+ DataFrame(gene_id)
+ )
> indexBam(bam)
/tmp/RtmpVbmBLu/filead32d37188.bam
"/tmp/RtmpVbmBLu/filead32d37188.bam.bai"
> ga_s_p <- readGappedAlignments(bam,
param=ScanBamParam(which=features))
> ga_s_p
GappedAlignments with 2 alignments and 0 metadata columns:
seqnames strand cigar qwidth start end
width ngap
<rle> <rle> <character> <integer> <integer> <integer>
<integer> <integer>
[1] chr1 - 92M 92 243675625 243675716
92 0
[2] chr1 - 92M 92 243675625 243675716
92 0
---
seqlengths:
chr1 chr2 ...
chrUn_gl000226 chr18_gl000207_random
249250621 243199373 ...
15008 4262
>
The single read is duplicated...I'm not an expert with these libraries
so maybe this is my fault (if I remove
the second entry in features everything works, but from the
ScanBamParam man about which I was not expecting
this behaviour).
Sorry for the convoluted example but I've stumbled on this problem
while comparing counts obtained
with summarizeOverlaps in a script of mine.
Here's my session info:
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets
methods
[8] base
other attached packages:
[1] rtracklayer_1.20.2 Rsamtools_1.12.3 Biostrings_2.28.0
[4] GenomicRanges_1.12.4 IRanges_1.18.1 BiocGenerics_0.6.0
loaded via a namespace (and not attached):
[1] bitops_1.0-5 BSgenome_1.28.0 RCurl_1.95-4.1 stats4_3.0.1
[5] tools_3.0.1 XML_3.98-1.1 zlibbioc_1.6.0
Thanks,
E.G.
--
$ pom
Hi Elena --
On 09/09/2013 03:57 AM, Elena Grassi wrote:
> Hello,
>
> I've stumbled across a strange behaviour when indexing bams and then
> reading them with ReadGappedAlignments with a param
> holding a which section.
> Given this bam:
> data at tungsteno:/rogue_bis/data/bioinfotree/prj/roar/dataset/0.2/r
oar_test_2$
> samtools view test.bam
> D1JP1ACXX130107:2:2201:7003:63880 16 chr1 243675625
> 255 92M * 0 0 * *
>
>> library(GenomicRanges)
>> bam <-
"/scarlet/data/bioinfotree/prj/roar/dataset/0.2/roar_test_2/test.bam"
>> ga <- readGappedAlignments(bam)
>> ga
> GappedAlignments with 1 alignment and 0 metadata columns:
> seqnames strand cigar qwidth start end
> width ngap
> <rle> <rle> <character> <integer> <integer> <integer>
> <integer> <integer>
> [1] chr1 - 92M 92 243675625 243675716
> 92 0
> ---
> seqlengths:
> chr1 chr2 ...
> chrUn_gl000226 chr18_gl000207_random
> 249250621 243199373 ...
> 15008 4262
>
> Everything is fine here. But if index it and load it with a param:
>> gene_id <- c("A", "B")
>> features <- GRanges(
> + seqnames = Rle(c("chr1", "chr1")),
> + strand = strand(c("-", "-")),
> + ranges = IRanges(
> + start=c(243675625, 243675627),
> + width=c(2, 100)),
> + DataFrame(gene_id)
> + )
>> indexBam(bam)
> /tmp/RtmpVbmBLu/filead32d37188.bam
> "/tmp/RtmpVbmBLu/filead32d37188.bam.bai"
>> ga_s_p <- readGappedAlignments(bam,
param=ScanBamParam(which=features))
>> ga_s_p
> GappedAlignments with 2 alignments and 0 metadata columns:
> seqnames strand cigar qwidth start end
> width ngap
> <rle> <rle> <character> <integer> <integer> <integer>
> <integer> <integer>
> [1] chr1 - 92M 92 243675625 243675716
> 92 0
> [2] chr1 - 92M 92 243675625 243675716
> 92 0
> ---
> seqlengths:
> chr1 chr2 ...
> chrUn_gl000226 chr18_gl000207_random
> 249250621 243199373 ...
> 15008 4262
>>
>
> The single read is duplicated...I'm not an expert with these
libraries
Your GRanges query is analogous to asking for (something like, maybe I
didn't
get the indexing correct)
samtools view <input.bam> chr1:243675625-243675622
chr1:243675627-243675727
which performs two independent queries -- a read overlapping both
regions will
be returned twice. You could reduce(features) to make ranges non-
overlapping,
although it's still possible to have a single read overlapping two
distinct
regions hence returned twice. Or you could use findOverlaps() and
manipulate the
returned object to implement your scheme for selecting reads, or
identify the
appropriate mode in summarizeOverlaps for doing the counting. Or
perhaps you can
provide more of an explanation for the problem that led you to this?
Hope that helps.
Martin
> so maybe this is my fault (if I remove
> the second entry in features everything works, but from the
> ScanBamParam man about which I was not expecting
> this behaviour).
> Sorry for the convoluted example but I've stumbled on this problem
> while comparing counts obtained
> with summarizeOverlaps in a script of mine.
>
> Here's my session info:
>> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets
methods
> [8] base
>
> other attached packages:
> [1] rtracklayer_1.20.2 Rsamtools_1.12.3 Biostrings_2.28.0
> [4] GenomicRanges_1.12.4 IRanges_1.18.1 BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-5 BSgenome_1.28.0 RCurl_1.95-4.1 stats4_3.0.1
> [5] tools_3.0.1 XML_3.98-1.1 zlibbioc_1.6.0
>
>
> Thanks,
> E.G.
>
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793