summarizeOverlaps using GRanges or bed file as reads?
2
0
Entering edit mode
@ryan-c-thompson-5618
Last seen 21 months ago
Scripps Research, La Jolla, CA
Hello, I would like to manipulate the start and end positions of my reads before calling summarizeOverlaps. One way to do this is to convert my reads to a GRanges and then use flank, narrow, etc. to properly position the read ends where I want them. However, I don't see a method for summarizeOverlaps that accepts a GRanges object or bed file or similar for the reads. Is there such a method, and if not, would it be possible to add it? The specific application I have in mind is single-end ChIP-Seq reads, where we have a good idea of what the fragment length is and would like to extend the reads to this length. Alternately, it may be preferable to count only the 5-prime ends of the read, and this could be done by narrowing to 1 bp length. -Ryan Thompson
convert convert • 2.0k views
0
Entering edit mode
@herve-pages-1542
Last seen 4 hours ago
Seattle, WA, United States
0
Entering edit mode
@martin-morgan-1513
Last seen 6 days ago
United States
On 04/14/2014 06:08 PM, Ryan C. Thompson wrote: > Hello, > > I would like to manipulate the start and end positions of my reads before > calling summarizeOverlaps. One way to do this is to convert my reads to a > GRanges and then use flank, narrow, etc. to properly position the read ends > where I want them. However, I don't see a method for summarizeOverlaps that > accepts a GRanges object or bed file or similar for the reads. Is there such a > method, and if not, would it be possible to add it? > > The specific application I have in mind is single-end ChIP-Seq reads, where we > have a good idea of what the fragment length is and would like to extend the > reads to this length. Alternately, it may be preferable to count only the > 5-prime ends of the read, and this could be done by narrowing to 1 bp length. if bam is 'bed file or similar' then... summarizeOverlaps can take an arbitrary function as it's 'mode' argument, as documented on ?summarizeOverlaps ## The 'mode' argument can take a custom count function whose ## arguments are the same as those in the current counting modes ## (i.e., Union, IntersectionNotEmpty, IntersectionStrict). ## In this example records are filtered by map quality before counting. mapq_filter <- function(features, reads, ignore.strand, inter.feature) { require(GenomicAlignments) # needed for parallel evaluation Union(features, reads[mcols(reads)$mapq >= 20], ignore.strand=ignore.strand, inter.feature=inter.feature) } genes <- GRanges("seq1", IRanges(seq(1, 1500, by=200), width=100)) param <- ScanBamParam(what="mapq") fl <- system.file("extdata", "ex1.bam", package="Rsamtools") se <- summarizeOverlaps(genes, fl, mode=mapq_filter, param=param) assays(se)$counts ## The count function can be completely custom (i.e., not use the ## pre-defined count functions at all). Requirements are that ## the input arguments match the pre-defined modes and the output ## is a vector of counts the same length as 'features'. my_count <- function(features, reads, ignore.strand, inter.feature) { ## perform filtering, or subsetting etc. require(GenomicAlignments) # needed for parallel evaluation countOverlaps(features, reads) } the 'reads' argument is a GAlignments / GAlignmentsList (for single / paired reads) so you could do whatever you'd like to them, so long as your function returns a vector of counts equal to length(features) ryans_count = function(features, reads, ...) { reads = as(reads, "GRanges") ## equivalently (??) granges(reads) width(reads) = 147 countOverlaps(features, reads) } I think you're also interested in width of overlaps, so you could implement that in ryans_count, too, e.g., via http://stackoverflow.com/questions/14685802/width-of-the-overlapped- segment-in-genomicranges-package or https://stat.ethz.ch/pipermail/bioconductor/2014-January/056891.html and other parts of that thread, include the late continuation https://stat.ethz.ch/pipermail/bioconductor/2014-March/058620.html (probably need to think through carefully what these are doing). summarizeOverlaps will iterate through a bam file in chunks, so moderating memory use (your function will be called several times). If you have several bam files and a linux / Mac, load the BiocParallel package and it'll distribute one bam file to each processor. If memory becomes a problem use BamFileList(your_files, yieldSize=500000) or similar to reduce the size of each chunk (the default yield size is I think 1,000,000). A newer approach is to build a tool of your own using the bed / wig reader functions in rtracklayer with GenomicRanges::tileGenome to create large-ish chunks of file to process, followed by, e.g., GenomicFiles::reduceByFile to MAP from the file to count in each tile, and REDUCE to summarize counts within a file; the vignette outlines this approach for several case studies (though not for counting reads overlapping ranges -- I'm sure a worked example using, e.g., http://bioconductor.org/packages/release/data/experiment/html/RNAseqDa ta.HNRNPC.bam.chr14.html would be appreciated). Martin > > -Ryan Thompson > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
0
Entering edit mode
Methods for 'reads' as 'GRanges' and 'GRangesList' have been added to GenomicAlignments 1.1.1. Valerie On 04/14/2014 11:35 PM, Martin Morgan wrote: > On 04/14/2014 06:08 PM, Ryan C. Thompson wrote: >> Hello, >> >> I would like to manipulate the start and end positions of my reads before >> calling summarizeOverlaps. One way to do this is to convert my reads to a >> GRanges and then use flank, narrow, etc. to properly position the read >> ends >> where I want them. However, I don't see a method for summarizeOverlaps >> that >> accepts a GRanges object or bed file or similar for the reads. Is >> there such a >> method, and if not, would it be possible to add it? >> >> The specific application I have in mind is single-end ChIP-Seq reads, >> where we >> have a good idea of what the fragment length is and would like to >> extend the >> reads to this length. Alternately, it may be preferable to count only the >> 5-prime ends of the read, and this could be done by narrowing to 1 bp >> length. > > > if bam is 'bed file or similar' then... summarizeOverlaps can take an > arbitrary function as it's 'mode' argument, as documented on > ?summarizeOverlaps > > ## The 'mode' argument can take a custom count function whose > ## arguments are the same as those in the current counting modes > ## (i.e., Union, IntersectionNotEmpty, IntersectionStrict). > ## In this example records are filtered by map quality before > counting. > > mapq_filter <- function(features, reads, ignore.strand, > inter.feature) { > require(GenomicAlignments) # needed for parallel evaluation > Union(features, reads[mcols(reads)$mapq >= 20], > ignore.strand=ignore.strand, > inter.feature=inter.feature) > } > > genes <- GRanges("seq1", IRanges(seq(1, 1500, by=200), width=100)) > param <- ScanBamParam(what="mapq") > fl <- system.file("extdata", "ex1.bam", package="Rsamtools") > se <- summarizeOverlaps(genes, fl, mode=mapq_filter, param=param) > assays(se)$counts > > ## The count function can be completely custom (i.e., not use the > ## pre-defined count functions at all). Requirements are that > ## the input arguments match the pre-defined modes and the output > ## is a vector of counts the same length as 'features'. > > my_count <- function(features, reads, ignore.strand, > inter.feature) { > ## perform filtering, or subsetting etc. > require(GenomicAlignments) # needed for parallel evaluation > countOverlaps(features, reads) > } > > the 'reads' argument is a GAlignments / GAlignmentsList (for single / > paired reads) so you could do whatever you'd like to them, so long as > your function returns a vector of counts equal to length(features) > > ryans_count = function(features, reads, ...) { > reads = as(reads, "GRanges") ## equivalently (??) granges(reads) > width(reads) = 147 > countOverlaps(features, reads) > } > > I think you're also interested in width of overlaps, so you could > implement that in ryans_count, too, e.g., via > > > http://stackoverflow.com/questions/14685802/width-of-the-overlapped- segment-in-genomicranges-package > > > or > > https://stat.ethz.ch/pipermail/bioconductor/2014-January/056891.html > > and other parts of that thread, include the late continuation > > https://stat.ethz.ch/pipermail/bioconductor/2014-March/058620.html > > (probably need to think through carefully what these are doing). > > summarizeOverlaps will iterate through a bam file in chunks, so > moderating memory use (your function will be called several times). > > If you have several bam files and a linux / Mac, load the BiocParallel > package and it'll distribute one bam file to each processor. If memory > becomes a problem use BamFileList(your_files, yieldSize=500000) or > similar to reduce the size of each chunk (the default yield size is I > think 1,000,000). > > > A newer approach is to build a tool of your own using the bed / wig > reader functions in rtracklayer with GenomicRanges::tileGenome to create > large-ish chunks of file to process, followed by, e.g., > GenomicFiles::reduceByFile to MAP from the file to count in each tile, > and REDUCE to summarize counts within a file; the vignette outlines this > approach for several case studies (though not for counting reads > overlapping ranges -- I'm sure a worked example using, e.g., > > > http://bioconductor.org/packages/release/data/experiment/html/RNAseq Data.HNRNPC.bam.chr14.html > > > would be appreciated). > > > Martin > >> >> -Ryan Thompson >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >
0
Entering edit mode
Thanks! This is exactly what I wanted. On Tue Apr 15 09:24:27 2014, Valerie Obenchain wrote: > Methods for 'reads' as 'GRanges' and 'GRangesList' have been added to > GenomicAlignments 1.1.1. > > Valerie > > On 04/14/2014 11:35 PM, Martin Morgan wrote: >> On 04/14/2014 06:08 PM, Ryan C. Thompson wrote: >>> Hello, >>> >>> I would like to manipulate the start and end positions of my reads >>> before >>> calling summarizeOverlaps. One way to do this is to convert my reads >>> to a >>> GRanges and then use flank, narrow, etc. to properly position the read >>> ends >>> where I want them. However, I don't see a method for summarizeOverlaps >>> that >>> accepts a GRanges object or bed file or similar for the reads. Is >>> there such a >>> method, and if not, would it be possible to add it? >>> >>> The specific application I have in mind is single-end ChIP-Seq reads, >>> where we >>> have a good idea of what the fragment length is and would like to >>> extend the >>> reads to this length. Alternately, it may be preferable to count >>> only the >>> 5-prime ends of the read, and this could be done by narrowing to 1 bp >>> length. >> >> >> if bam is 'bed file or similar' then... summarizeOverlaps can take an >> arbitrary function as it's 'mode' argument, as documented on >> ?summarizeOverlaps >> >> ## The 'mode' argument can take a custom count function whose >> ## arguments are the same as those in the current counting modes >> ## (i.e., Union, IntersectionNotEmpty, IntersectionStrict). >> ## In this example records are filtered by map quality before >> counting. >> >> mapq_filter <- function(features, reads, ignore.strand, >> inter.feature) { >> require(GenomicAlignments) # needed for parallel evaluation >> Union(features, reads[mcols(reads)$mapq >= 20], >> ignore.strand=ignore.strand, >> inter.feature=inter.feature) >> } >> >> genes <- GRanges("seq1", IRanges(seq(1, 1500, by=200), width=100)) >> param <- ScanBamParam(what="mapq") >> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") >> se <- summarizeOverlaps(genes, fl, mode=mapq_filter, param=param) >> assays(se)$counts >> >> ## The count function can be completely custom (i.e., not use the >> ## pre-defined count functions at all). Requirements are that >> ## the input arguments match the pre-defined modes and the output >> ## is a vector of counts the same length as 'features'. >> >> my_count <- function(features, reads, ignore.strand, >> inter.feature) { >> ## perform filtering, or subsetting etc. >> require(GenomicAlignments) # needed for parallel evaluation >> countOverlaps(features, reads) >> } >> >> the 'reads' argument is a GAlignments / GAlignmentsList (for single / >> paired reads) so you could do whatever you'd like to them, so long as >> your function returns a vector of counts equal to length(features) >> >> ryans_count = function(features, reads, ...) { >> reads = as(reads, "GRanges") ## equivalently (??) >> granges(reads) >> width(reads) = 147 >> countOverlaps(features, reads) >> } >> >> I think you're also interested in width of overlaps, so you could >> implement that in ryans_count, too, e.g., via >> >> >> http://stackoverflow.com/questions/14685802/width-of-the- overlapped-segment-in-genomicranges-package >> >> >> >> or >> >> https://stat.ethz.ch/pipermail/bioconductor/2014-January/056891.html >> >> and other parts of that thread, include the late continuation >> >> https://stat.ethz.ch/pipermail/bioconductor/2014-March/058620.html >> >> (probably need to think through carefully what these are doing). >> >> summarizeOverlaps will iterate through a bam file in chunks, so >> moderating memory use (your function will be called several times). >> >> If you have several bam files and a linux / Mac, load the BiocParallel >> package and it'll distribute one bam file to each processor. If memory >> becomes a problem use BamFileList(your_files, yieldSize=500000) or >> similar to reduce the size of each chunk (the default yield size is I >> think 1,000,000). >> >> >> A newer approach is to build a tool of your own using the bed / wig >> reader functions in rtracklayer with GenomicRanges::tileGenome to create >> large-ish chunks of file to process, followed by, e.g., >> GenomicFiles::reduceByFile to MAP from the file to count in each tile, >> and REDUCE to summarize counts within a file; the vignette outlines this >> approach for several case studies (though not for counting reads >> overlapping ranges -- I'm sure a worked example using, e.g., >> >> >> http://bioconductor.org/packages/release/data/experiment/html/RNAse qData.HNRNPC.bam.chr14.html >> >> >> >> would be appreciated). >> >> >> Martin >> >>> >>> -Ryan Thompson >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >
0
Entering edit mode
Ok, so using this method, I could write a function that takes the reads, coerces them to GRanges, and then manipulates the start and end points as I like, and then dispatches to the standard counting functions. Then I can call summarizeOverlaps on my bam files directly. That's pretty nice. On 4/14/14, 11:35 PM, Martin Morgan wrote: > On 04/14/2014 06:08 PM, Ryan C. Thompson wrote: >> Hello, >> >> I would like to manipulate the start and end positions of my reads >> before >> calling summarizeOverlaps. One way to do this is to convert my reads >> to a >> GRanges and then use flank, narrow, etc. to properly position the >> read ends >> where I want them. However, I don't see a method for >> summarizeOverlaps that >> accepts a GRanges object or bed file or similar for the reads. Is >> there such a >> method, and if not, would it be possible to add it? >> >> The specific application I have in mind is single-end ChIP-Seq reads, >> where we >> have a good idea of what the fragment length is and would like to >> extend the >> reads to this length. Alternately, it may be preferable to count only >> the >> 5-prime ends of the read, and this could be done by narrowing to 1 bp >> length. > > > if bam is 'bed file or similar' then... summarizeOverlaps can take an > arbitrary function as it's 'mode' argument, as documented on > ?summarizeOverlaps > > ## The 'mode' argument can take a custom count function whose > ## arguments are the same as those in the current counting modes > ## (i.e., Union, IntersectionNotEmpty, IntersectionStrict). > ## In this example records are filtered by map quality before > counting. > > mapq_filter <- function(features, reads, ignore.strand, > inter.feature) { > require(GenomicAlignments) # needed for parallel evaluation > Union(features, reads[mcols(reads)$mapq >= 20], > ignore.strand=ignore.strand, > inter.feature=inter.feature) > } > > genes <- GRanges("seq1", IRanges(seq(1, 1500, by=200), width=100)) > param <- ScanBamParam(what="mapq") > fl <- system.file("extdata", "ex1.bam", package="Rsamtools") > se <- summarizeOverlaps(genes, fl, mode=mapq_filter, param=param) > assays(se)$counts > > ## The count function can be completely custom (i.e., not use the > ## pre-defined count functions at all). Requirements are that > ## the input arguments match the pre-defined modes and the output > ## is a vector of counts the same length as 'features'. > > my_count <- function(features, reads, ignore.strand, > inter.feature) { > ## perform filtering, or subsetting etc. > require(GenomicAlignments) # needed for parallel evaluation > countOverlaps(features, reads) > } > > the 'reads' argument is a GAlignments / GAlignmentsList (for single / > paired reads) so you could do whatever you'd like to them, so long as > your function returns a vector of counts equal to length(features) > > ryans_count = function(features, reads, ...) { > reads = as(reads, "GRanges") ## equivalently (??) > granges(reads) > width(reads) = 147 > countOverlaps(features, reads) > } > > I think you're also interested in width of overlaps, so you could > implement that in ryans_count, too, e.g., via > > > http://stackoverflow.com/questions/14685802/width-of-the-overlapped- segment-in-genomicranges-package > > > or > > https://stat.ethz.ch/pipermail/bioconductor/2014-January/056891.html > > and other parts of that thread, include the late continuation > > https://stat.ethz.ch/pipermail/bioconductor/2014-March/058620.html > > (probably need to think through carefully what these are doing). > > summarizeOverlaps will iterate through a bam file in chunks, so > moderating memory use (your function will be called several times). > > If you have several bam files and a linux / Mac, load the BiocParallel > package and it'll distribute one bam file to each processor. If memory > becomes a problem use BamFileList(your_files, yieldSize=500000) or > similar to reduce the size of each chunk (the default yield size is I > think 1,000,000). > > > A newer approach is to build a tool of your own using the bed / wig > reader functions in rtracklayer with GenomicRanges::tileGenome to > create large-ish chunks of file to process, followed by, e.g., > GenomicFiles::reduceByFile to MAP from the file to count in each tile, > and REDUCE to summarize counts within a file; the vignette outlines > this approach for several case studies (though not for counting reads > overlapping ranges -- I'm sure a worked example using, e.g., > > > http://bioconductor.org/packages/release/data/experiment/html/RNAseq Data.HNRNPC.bam.chr14.html > > > would be appreciated). > > > Martin > >> >> -Ryan Thompson >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >