nearly exact overlaps between GAlignmentPairs and GenomicRanges

0

Entering edit mode

Georg Otto ▴ 120

@georg-otto-6333

Last seen 7.1 years ago

United Kingdom

Dear all, I have a problem finding nerly exact overlaps between read pairs and genomic regions. I try this by using GAlignmentPairs and GenomicRanges: > library(Rsamtools) ex1_file <- system.file("extdata", "ex1.bam", > package="Rsamtools") galp <- readGAlignmentPairs(ex1_file, use.names=TRUE) > galp[3:4] GAlignmentPairs with 2 alignment pairs and 0 metadata columns: seqnames strand : ranges -- ranges <rle> <rle> : <iranges> -- <iranges> EAS54_65:3:321:311:983 seq1 + : [51, 85] -- [228, 262] B7_591:5:42:540:501 seq1 + : [60, 95] -- [224, 259] --- seqlengths: seq1 seq2 1575 1584 > ranges <- GRanges(c("seq1", "seq1"), IRanges(c(50, 57), end = c(263, 260))) > ranges GRanges with 2 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] seq1 [50, 263] * [2] seq1 [57, 260] * --- seqlengths: seq1 NA > summary <-summarizeOverlaps(ranges, galp[3:4], mode = "IntersectionStrict", + inter.feature = FALSE, fragments = FALSE) Warning message: In if (all(strand(reads) == "*")) ignore.strand <- TRUE : the condition has length > 1 and only the first element will be used > assays(summary)$counts reads [1,] 2 [2,] 1 I am not entirely sure what the warning means, but my point is a different one: I use the "IntersectStrict" mode because I only want to count read pairs that are completely within the interval. This is what I get, but I want to use a more strict selection for "almost equal", i.e. intervals that are the same size or only slightly larger (say 3 bp) than the range of the read pairs. So ranges[1] should count galp[3], but not galp[4] Any hint will be appreciated.... Best wishes, Georg > sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] Rsamtools_1.14.3 Biostrings_2.30.1 GenomicRanges_1.14.4 [4] XVector_0.2.0 IRanges_1.20.7 BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] bitops_1.0-6 compiler_3.0.1 stats4_3.0.1 tools_3.0.1 zlibbioc_1.8.0

Alignment Alignment • 2.3k views

ADD COMMENT • link 11.4 years ago Georg Otto ▴ 120

0

Entering edit mode

Michael Lawrence ▴ 60

@michael-lawrence-6705

Last seen 8.7 years ago

United States

The summarizeOverlaps function supports passing a function as the "mode" argument, which needs to take the two sets of ranges as input and return a vector of counts. So the problem reduces to counting the "within" overlaps between a GRanges and a GAlignmentPairs, where the range of interest should only be slightly larger than the paired read alignment. The closest built-in overlap mode to this is probably type="equal" and maxgap=2L. But it's not quite right, because that mode is symmetric in its tolerance, i.e., the read could be slightly larger than the range of interest. The other problem is that for some reason GAlignmentPairs does not yet support type="equal", but that is actually solvable by reducing the GAlignmentPairs to a GRanges via granges(galp). We can do that in this case, since the ranges of interest are gapless ranges. But to get exactly what you want, we need to use type="within" with findOverlaps() and then filter the hits for cases where the overhang is below some tolerance (3bp): reads <- granges(galp) # simplifies things a bit hits <- findOverlaps(reads, ranges, type="within") overhang <- width(ranges[subjectHits(hits)]) - width(reads[queryHits(hits)]) hits <- hits[overhang < 3L] Now we need the count vector: counts <- countSubjectHits(hits) Then it's just a simple exercise to insert that code into a function that conforms to the expectations of the "mode" parameter. Hope that helps, Michael On Thu, Aug 21, 2014 at 12:59 PM, Georg Otto <georg.otto at="" imm.ox.ac.uk=""> wrote: > > Dear all, > > I have a problem finding nerly exact overlaps between read pairs and > genomic regions. I try this by using GAlignmentPairs and GenomicRanges: > > > library(Rsamtools) ex1_file <- system.file("extdata", "ex1.bam", > > package="Rsamtools") galp <- readGAlignmentPairs(ex1_file, > use.names=TRUE) > > galp[3:4] > > GAlignmentPairs with 2 alignment pairs and 0 metadata columns: > seqnames strand : ranges -- ranges > <rle> <rle> : <iranges> -- <iranges> > EAS54_65:3:321:311:983 seq1 + : [51, 85] -- [228, 262] > B7_591:5:42:540:501 seq1 + : [60, 95] -- [224, 259] > --- > seqlengths: > seq1 seq2 1575 1584 > > > > ranges <- GRanges(c("seq1", "seq1"), IRanges(c(50, 57), end = c(263, > 260))) > > ranges > GRanges with 2 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] seq1 [50, 263] * > [2] seq1 [57, 260] * > --- > seqlengths: > seq1 > NA > > summary <-summarizeOverlaps(ranges, galp[3:4], mode = > "IntersectionStrict", > + inter.feature = FALSE, fragments = FALSE) > Warning message: > In if (all(strand(reads) == "*")) ignore.strand <- TRUE : > the condition has length > 1 and only the first element will be used > > assays(summary)$counts > reads > [1,] 2 > [2,] 1 > > > I am not entirely sure what the warning means, but my point is a > different one: I use the "IntersectStrict" mode because I only want to > count read > pairs that are completely within the interval. This is what I > get, but I want to use a more strict selection for "almost equal", > i.e. intervals that are the same size or only slightly larger (say 3 bp) > than the range of the read pairs. So ranges[1] should count galp[3], but > not galp[4] > > Any hint will be appreciated.... > > Best wishes, > > Georg > > > sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 > [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] Rsamtools_1.14.3 Biostrings_2.30.1 GenomicRanges_1.14.4 > [4] XVector_0.2.0 IRanges_1.20.7 BiocGenerics_0.8.0 > > loaded via a namespace (and not attached): > [1] bitops_1.0-6 compiler_3.0.1 stats4_3.0.1 tools_3.0.1 > zlibbioc_1.8.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 11.4 years ago Michael Lawrence ▴ 60

0

Entering edit mode

Hi Georg, I noticed your R and Bioconductor packages are out of date - you may want to update. Once you update you'll see that summarizeOverlaps() has moved from GenomicRanges to GenomicAlignments package. I've fixed the strand-related warning you see below in GenomicAlignments release (1.0.6) and devel (1.1.26). Let us know if you have trouble wrapping Michael's suggestion into a mode function. Valerie On 08/21/2014 09:48 PM, Michael Lawrence wrote: > The summarizeOverlaps function supports passing a function as the "mode" > argument, which needs to take the two sets of ranges as input and return a > vector of counts. So the problem reduces to counting the "within" overlaps > between a GRanges and a GAlignmentPairs, where the range of interest should > only be slightly larger than the paired read alignment. > > The closest built-in overlap mode to this is probably type="equal" and > maxgap=2L. But it's not quite right, because that mode is symmetric in its > tolerance, i.e., the read could be slightly larger than the range of > interest. The other problem is that for some reason GAlignmentPairs does > not yet support type="equal", but that is actually solvable by reducing the > GAlignmentPairs to a GRanges via granges(galp). We can do that in this > case, since the ranges of interest are gapless ranges. > > But to get exactly what you want, we need to use type="within" with > findOverlaps() and then filter the hits for cases where the overhang is > below some tolerance (3bp): > > reads <- granges(galp) # simplifies things a bit > hits <- findOverlaps(reads, ranges, type="within") > overhang <- width(ranges[subjectHits(hits)]) - width(reads[queryHits(hits)]) > hits <- hits[overhang < 3L] > > Now we need the count vector: > > counts <- countSubjectHits(hits) > > Then it's just a simple exercise to insert that code into a function that > conforms to the expectations of the "mode" parameter. > > Hope that helps, > Michael > > On Thu, Aug 21, 2014 at 12:59 PM, Georg Otto <georg.otto at="" imm.ox.ac.uk=""> > wrote: > >> >> Dear all, >> >> I have a problem finding nerly exact overlaps between read pairs and >> genomic regions. I try this by using GAlignmentPairs and GenomicRanges: >> >>> library(Rsamtools) ex1_file <- system.file("extdata", "ex1.bam", >>> package="Rsamtools") galp <- readGAlignmentPairs(ex1_file, >> use.names=TRUE) >>> galp[3:4] >> >> GAlignmentPairs with 2 alignment pairs and 0 metadata columns: >> seqnames strand : ranges -- ranges >> <rle> <rle> : <iranges> -- <iranges> >> EAS54_65:3:321:311:983 seq1 + : [51, 85] -- [228, 262] >> B7_591:5:42:540:501 seq1 + : [60, 95] -- [224, 259] >> --- >> seqlengths: >> seq1 seq2 1575 1584 >> >> >>> ranges <- GRanges(c("seq1", "seq1"), IRanges(c(50, 57), end = c(263, >> 260))) >>> ranges >> GRanges with 2 ranges and 0 metadata columns: >> seqnames ranges strand >> <rle> <iranges> <rle> >> [1] seq1 [50, 263] * >> [2] seq1 [57, 260] * >> --- >> seqlengths: >> seq1 >> NA >>> summary <-summarizeOverlaps(ranges, galp[3:4], mode = >> "IntersectionStrict", >> + inter.feature = FALSE, fragments = FALSE) >> Warning message: >> In if (all(strand(reads) == "*")) ignore.strand <- TRUE : >> the condition has length > 1 and only the first element will be used >>> assays(summary)$counts >> reads >> [1,] 2 >> [2,] 1 >> >> >> I am not entirely sure what the warning means, but my point is a >> different one: I use the "IntersectStrict" mode because I only want to >> count read >> pairs that are completely within the interval. This is what I >> get, but I want to use a more strict selection for "almost equal", >> i.e. intervals that are the same size or only slightly larger (say 3 bp) >> than the range of the read pairs. So ranges[1] should count galp[3], but >> not galp[4] >> >> Any hint will be appreciated.... >> >> Best wishes, >> >> Georg >> >>> sessionInfo() >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> [8] base >> >> other attached packages: >> [1] Rsamtools_1.14.3 Biostrings_2.30.1 GenomicRanges_1.14.4 >> [4] XVector_0.2.0 IRanges_1.20.7 BiocGenerics_0.8.0 >> >> loaded via a namespace (and not attached): >> [1] bitops_1.0-6 compiler_3.0.1 stats4_3.0.1 tools_3.0.1 >> zlibbioc_1.8.0 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, Seattle, WA 98109 Email: vobencha at fhcrc.org Phone: (206) 667-3158

ADD REPLY • link 11.4 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Georg Otto ▴ 120

@georg-otto-6333

Last seen 7.1 years ago

United Kingdom

Michael Lawrence <michafla at="" gene.com=""> writes: > But to get exactly what you want, we need to use type="within" with > findOverlaps() and then filter the hits for cases where the overhang is > below some tolerance (3bp): > > reads <- granges(galp) # simplifies things a bit > hits <- findOverlaps(reads, ranges, type="within") > overhang <- width(ranges[subjectHits(hits)]) - width(reads[queryHits(hits)]) > hits <- hits[overhang < 3L] > > Now we need the count vector: > > counts <- countSubjectHits(hits) > > Then it's just a simple exercise to insert that code into a function that > conforms to the expectations of the "mode" parameter. > Thanks a lot for your help. I have problems, however, to wrap this into a mode function. Here is what I have done: > ex1_file <- system.file("extdata", "ex1.bam", package="Rsamtools") > galp <- readGAlignmentPairs(ex1_file, use.names=TRUE) > galp[3:4] GAlignmentPairs with 2 alignment pairs and 0 metadata columns: seqnames strand : ranges -- ranges <rle> <rle> : <iranges> -- <iranges> EAS54_65:3:321:311:983 seq1 + : [51, 85] -- [228, 262] B7_591:5:42:540:501 seq1 + : [60, 95] -- [224, 259] --- seqlengths: seq1 seq2 1575 1584 > ranges <- GRanges(c("seq1", "seq1"), IRanges(c(50, 57), end = c(263, 260))) > ranges GRanges with 2 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] seq1 [50, 263] * [2] seq1 [57, 260] * --- seqlengths: seq1 NA > summary <-summarizeOverlaps(ranges, galp[3:4], mode = "IntersectionStrict", + inter.feature = FALSE, fragments = FALSE) Warning message: In if (all(strand(reads) == "*")) ignore.strand <- TRUE : the condition has length > 1 and only the first element will be used > assays(summary)$counts reads [1,] 2 [2,] 1 However, I want the first feature only to be matched by galp[3] so I use this function: > intersectFun <-function (features, reads, ignore.strand = FALSE, inter.feature = TRUE) { + + reads <- granges(reads) + hits <- findOverlaps(reads, features, type="within") + overhang <- width(ranges[subjectHits(hits)]) - width(reads[queryHits(hits)]) + hits <- hits[overhang < 3L] + counts <- countSubjectHits(hits) + return(counts) + } > summary <-summarizeOverlaps(ranges, galp, mode = "intersectFun", + inter.feature = FALSE, fragments = FALSE) Error in (function (classes, fdef, mtable) (from #3) : unable to find an inherited method for function ?granges? for signature ?"GRangesList"? In addition: Warning message: In if (all(strand(reads) == "*")) ignore.strand <- TRUE : the condition has length > 1 and only the first element will be used Apparently, the summarizeOverlaps converts the input GAlignmentPairs object into a GRangesList, which then can not be used by the mode function. Sorry, I do not have a deep understanding of these data structures. Could you help me with a workaround? Best wishes, Georg

ADD COMMENT • link 11.4 years ago Georg Otto ▴ 120

0

Entering edit mode

Since we're going for the bounds of the paired alignments, use reads <- unlist(range(reads), use.names=FALSE) This assumes that you do not have cases where members of the same pair map to different chromosomes. On Fri, Aug 22, 2014 at 10:40 AM, Georg Otto <georg.otto at="" imm.ox.ac.uk=""> wrote: > > > Michael Lawrence <michafla at="" gene.com=""> writes: > > > But to get exactly what you want, we need to use type="within" with > > findOverlaps() and then filter the hits for cases where the overhang is > > below some tolerance (3bp): > > > > reads <- granges(galp) # simplifies things a bit > > hits <- findOverlaps(reads, ranges, type="within") > > overhang <- width(ranges[subjectHits(hits)]) - > width(reads[queryHits(hits)]) > > hits <- hits[overhang < 3L] > > > > Now we need the count vector: > > > > counts <- countSubjectHits(hits) > > > > Then it's just a simple exercise to insert that code into a function that > > conforms to the expectations of the "mode" parameter. > > > > > Thanks a lot for your help. I have problems, however, to wrap this into > a mode function. Here is what I have done: > > > > ex1_file <- system.file("extdata", "ex1.bam", package="Rsamtools") > > galp <- readGAlignmentPairs(ex1_file, use.names=TRUE) > > galp[3:4] > GAlignmentPairs with 2 alignment pairs and 0 metadata columns: > seqnames strand : ranges -- ranges > <rle> <rle> : <iranges> -- <iranges> > EAS54_65:3:321:311:983 seq1 + : [51, 85] -- [228, 262] > B7_591:5:42:540:501 seq1 + : [60, 95] -- [224, 259] > --- > seqlengths: > seq1 seq2 > 1575 1584 > > > ranges <- GRanges(c("seq1", "seq1"), IRanges(c(50, 57), end = c(263, > 260))) > > ranges > GRanges with 2 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] seq1 [50, 263] * > [2] seq1 [57, 260] * > --- > seqlengths: > seq1 > NA > > > summary <-summarizeOverlaps(ranges, galp[3:4], mode = > "IntersectionStrict", > + inter.feature = FALSE, fragments = FALSE) > > Warning message: > In if (all(strand(reads) == "*")) ignore.strand <- TRUE : > the condition has length > 1 and only the first element will be used > > > assays(summary)$counts > reads > [1,] 2 > [2,] 1 > > > > However, I want the first feature only to be matched by galp[3] so I use > this function: > > > > intersectFun <-function (features, reads, ignore.strand = FALSE, > inter.feature = TRUE) { > + > + reads <- granges(reads) > + hits <- findOverlaps(reads, features, type="within") > + overhang <- width(ranges[subjectHits(hits)]) - > width(reads[queryHits(hits)]) > + hits <- hits[overhang < 3L] > + counts <- countSubjectHits(hits) > + return(counts) > + } > > > > summary <-summarizeOverlaps(ranges, galp, mode = "intersectFun", > + inter.feature = FALSE, fragments = FALSE) > > Error in (function (classes, fdef, mtable) (from #3) : > unable to find an inherited method for function ?granges? for signature > ?"GRangesList"? > In addition: Warning message: > In if (all(strand(reads) == "*")) ignore.strand <- TRUE : > the condition has length > 1 and only the first element will be used > > > Apparently, the summarizeOverlaps converts the input GAlignmentPairs > object into a GRangesList, which then can not be used by the mode > function. Sorry, I do not have a deep understanding of these data > structures. Could you help me with a workaround? > > Best wishes, > > Georg > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 11.4 years ago Michael Lawrence ▴ 60

Login before adding your answer.