Search
Question: summarizeOverlaps mode ignoring inter feature overlaps
0
gravatar for Thomas Girke
4.6 years ago by
Thomas Girke1.6k
United States
Thomas Girke1.6k wrote:
Dear Valerie, Is there currently any way to run summarizeOverlaps in a feature- overlap unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently, one can switch back to countOverlaps when feature overlap unawareness is the more appropriate counting mode for a biological question, but then double counting of reads mapping to multiple-range features is not accounted for. It would be really nice to have such a feature-overlap unaware option directly in summarizeOverlaps. Another question relates to the memory usage of summarizeOverlaps. Has this been optimized yet? On a typical bam file with ~50-100 million reads the memory usage of summarizeOverlaps is often around 10-20GB. To use the function on a desktop computer or in large-scale RNA-Seq projects on a commodity compute cluster, it would be desirable if every counting instance would consume not more than 5GB of RAM. Thanks in advance for your help and suggestions, Thomas
ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Thomas Girke1.6k
0
gravatar for Ryan C. Thompson
4.6 years ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson6.1k wrote:
Memory usage is addressed by using the yieldSize argument to the BamFile(List) function to read only a small number of reads (or read pairs) at a time. On Mon 08 Apr 2013 05:52:21 PM PDT, Thomas Girke wrote: > Dear Valerie, > > Is there currently any way to run summarizeOverlaps in a feature- overlap > unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently, > one can switch back to countOverlaps when feature overlap unawareness is > the more appropriate counting mode for a biological question, but then > double counting of reads mapping to multiple-range features is not > accounted for. It would be really nice to have such a feature- overlap > unaware option directly in summarizeOverlaps. > > Another question relates to the memory usage of summarizeOverlaps. Has > this been optimized yet? On a typical bam file with ~50-100 million > reads the memory usage of summarizeOverlaps is often around 10-20GB. To > use the function on a desktop computer or in large-scale RNA-Seq > projects on a commodity compute cluster, it would be desirable if every > counting instance would consume not more than 5GB of RAM. > > Thanks in advance for your help and suggestions, > > Thomas > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENTlink written 4.6 years ago by Ryan C. Thompson6.1k
0
gravatar for Valerie Obenchain
4.6 years ago by
Valerie Obenchain ♦♦ 6.4k
United States
Valerie Obenchain ♦♦ 6.4k wrote:
Hi Thomas, On 04/08/2013 05:52 PM, Thomas Girke wrote: > Dear Valerie, > > Is there currently any way to run summarizeOverlaps in a feature- overlap > unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently, > one can switch back to countOverlaps when feature overlap unawareness is > the more appropriate counting mode for a biological question, but then > double counting of reads mapping to multiple-range features is not > accounted for. It would be really nice to have such a feature- overlap > unaware option directly in summarizeOverlaps. No, we don't currently have an option to ignore feature-overlap. It sounds like you want to count with countOverlaps() but still want the double counting to be resolved. This is essentially what the other modes are doing so I must be missing something. In this example 2 reads hit feature A, 1 read hits feature B. With something like ignorefeature0L=TRUE, what results would you expect to see? Maybe you have another, more descriptive example? reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) features <- GRanges("chr1", IRanges(c(1, 20), width=10, names=c("A", "B"))) > countOverlaps(features, reads) [1] 2 1 > > Another question relates to the memory usage of summarizeOverlaps. Has > this been optimized yet? On a typical bam file with ~50-100 million > reads the memory usage of summarizeOverlaps is often around 10-20GB. To > use the function on a desktop computer or in large-scale RNA-Seq > projects on a commodity compute cluster, it would be desirable if every > counting instance would consume not more than 5GB of RAM. Have you tried the BamFileList-method? There is an example at the bottom of the ?BamFileList man page using summarizeOverlaps(). As Ryan mentioned, the key is to set the 'yieldSize' parameter when creating the BamFile. This method also makes use of mclapply(). Valerie > > Thanks in advance for your help and suggestions, > > Thomas > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD COMMENTlink written 4.6 years ago by Valerie Obenchain ♦♦ 6.4k
0
gravatar for Thomas Girke
4.6 years ago by
Thomas Girke1.6k
United States
Thomas Girke1.6k wrote:
Hi Valerie, Perhaps let's call this more explicitly an ignore_inter_feature_overlap option that we often need for cases like this: Read1 ---- F1 ---------------- F2 ----------- where we would like to have at least the option to assign Read1 to both F1 and F2: F1: 1 F2: 1 However, summarizeOverlapse doesn't count Read1 at all in all of its currently available modes that I am aware of. This lack of an ignore_inter_feature_overlap option is frequently a problem when working with poorly annotated genomes (high uncertainty about hidden/incorrect feature overlaps) or various RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read assignments than not counting at all as shown above. ## Here is a code example illustrating the same case: library(GenomicRanges); library(Rsamtools) rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)), pos = as.integer(c(500)), cigar = rep("100M", 1), strand = strand(rep("+", 1))) gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1") gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2") gr <- c(gr1, gr2) ## All of the three current modes in summarizeOverlaps return a count of zero ## because of its inter_feature_overlap awareness: assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts [,1] [1,] 0 [2,] 0 ## However, countOverlaps handles this correctly, but is not the best choice when ## counting multiple range features. countOverlaps(gr, rd) [1] 1 1 Thomas > sessionInfo() R version 3.0.0 (2013-04-03) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] Rsamtools_1.12.0 Biostrings_2.28.0 GenomicRanges_1.12.1 [4] IRanges_1.18.0 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote: > Hi Thomas, > > On 04/08/2013 05:52 PM, Thomas Girke wrote: > > Dear Valerie, > > > > Is there currently any way to run summarizeOverlaps in a feature- overlap > > unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently, > > one can switch back to countOverlaps when feature overlap unawareness is > > the more appropriate counting mode for a biological question, but then > > double counting of reads mapping to multiple-range features is not > > accounted for. It would be really nice to have such a feature- overlap > > unaware option directly in summarizeOverlaps. > > No, we don't currently have an option to ignore feature-overlap. It > sounds like you want to count with countOverlaps() but still want the > double counting to be resolved. This is essentially what the other modes > are doing so I must be missing something. > > In this example 2 reads hit feature A, 1 read hits feature B. With > something like ignorefeature0L=TRUE, what results would you expect to > see? Maybe you have another, more descriptive example? > > reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) > features <- GRanges("chr1", IRanges(c(1, 20), width=10, > names=c("A", "B"))) > > > countOverlaps(features, reads) > [1] 2 1 > > > > > > Another question relates to the memory usage of summarizeOverlaps. Has > > this been optimized yet? On a typical bam file with ~50-100 million > > reads the memory usage of summarizeOverlaps is often around 10-20GB. To > > use the function on a desktop computer or in large-scale RNA-Seq > > projects on a commodity compute cluster, it would be desirable if every > > counting instance would consume not more than 5GB of RAM. > > Have you tried the BamFileList-method? There is an example at the bottom > of the ?BamFileList man page using summarizeOverlaps(). As Ryan > mentioned, the key is to set the 'yieldSize' parameter when creating the > BamFile. This method also makes use of mclapply(). > > Valerie > > > > > Thanks in advance for your help and suggestions, > > > > Thomas > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >
ADD COMMENTlink written 4.6 years ago by Thomas Girke1.6k
Thanks for the example. You're right, none of the modes will count a read falling completely within >1 feature. You can supply your own function as a 'mode'. Two requirements must be met: (1) The function must have 3 arguments that correspond to 'features', 'reads' and 'ignore.strand', in that order. (2) The function must return a vector of counts the same length as 'features' Here is an example using countOverlaps() which I think gets at the counting you want. counter <- function(x, y, ignore.strand) countOverlaps(y, x, ignore.strand=ignore.strand) > assays(summarizeOverlaps(gr, rd, mode=counter))$counts [,1] [1,] 1 [2,] 1 Valerie On 04/09/2013 09:37 AM, Thomas Girke wrote: > Hi Valerie, > > Perhaps let's call this more explicitly an ignore_inter_feature_overlap option > that we often need for cases like this: > > Read1 ---- > F1 ---------------- > F2 ----------- > > where we would like to have at least the option to assign Read1 to both F1 and F2: > > F1: 1 > F2: 1 > > However, summarizeOverlapse doesn't count Read1 at all in all of its currently > available modes that I am aware of. This lack of an ignore_inter_feature_overlap > option is frequently a problem when working with poorly annotated genomes (high > uncertainty about hidden/incorrect feature overlaps) or various > RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read > assignments than not counting at all as shown above. > > ## Here is a code example illustrating the same case: > library(GenomicRanges); library(Rsamtools) > rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)), > pos = as.integer(c(500)), > cigar = rep("100M", 1), strand = strand(rep("+", 1))) > gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1") > gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2") > gr <- c(gr1, gr2) > > ## All of the three current modes in summarizeOverlaps return a count of zero > ## because of its inter_feature_overlap awareness: > assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts > assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts > assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts > [,1] > [1,] 0 > [2,] 0 > > ## However, countOverlaps handles this correctly, but is not the best choice when > ## counting multiple range features. > countOverlaps(gr, rd) > [1] 1 1 > > > Thomas > >> sessionInfo() > R version 3.0.0 (2013-04-03) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] Rsamtools_1.12.0 Biostrings_2.28.0 GenomicRanges_1.12.1 > [4] IRanges_1.18.0 BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 > > > On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote: >> Hi Thomas, >> >> On 04/08/2013 05:52 PM, Thomas Girke wrote: >>> Dear Valerie, >>> >>> Is there currently any way to run summarizeOverlaps in a feature- overlap >>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently, >>> one can switch back to countOverlaps when feature overlap unawareness is >>> the more appropriate counting mode for a biological question, but then >>> double counting of reads mapping to multiple-range features is not >>> accounted for. It would be really nice to have such a feature- overlap >>> unaware option directly in summarizeOverlaps. >> >> No, we don't currently have an option to ignore feature-overlap. It >> sounds like you want to count with countOverlaps() but still want the >> double counting to be resolved. This is essentially what the other modes >> are doing so I must be missing something. >> >> In this example 2 reads hit feature A, 1 read hits feature B. With >> something like ignorefeature0L=TRUE, what results would you expect to >> see? Maybe you have another, more descriptive example? >> >> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) >> features <- GRanges("chr1", IRanges(c(1, 20), width=10, >> names=c("A", "B"))) >> >> > countOverlaps(features, reads) >> [1] 2 1 >> >> >>> >>> Another question relates to the memory usage of summarizeOverlaps. Has >>> this been optimized yet? On a typical bam file with ~50-100 million >>> reads the memory usage of summarizeOverlaps is often around 10-20GB. To >>> use the function on a desktop computer or in large-scale RNA-Seq >>> projects on a commodity compute cluster, it would be desirable if every >>> counting instance would consume not more than 5GB of RAM. >> >> Have you tried the BamFileList-method? There is an example at the bottom >> of the ?BamFileList man page using summarizeOverlaps(). As Ryan >> mentioned, the key is to set the 'yieldSize' parameter when creating the >> BamFile. This method also makes use of mclapply(). >> >> Valerie >> >>> >>> Thanks in advance for your help and suggestions, >>> >>> Thomas >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>
ADD REPLYlink written 4.6 years ago by Valerie Obenchain ♦♦ 6.4k
0
gravatar for Thomas Girke
4.6 years ago by
Thomas Girke1.6k
United States
Thomas Girke1.6k wrote:
Thanks for the tip. I guess doing it this way reverses the counting mode back to countOverlaps, but how can I use at the same time "IntersectionStrict" or any of the other modes provided by summarizeOverlaps if its mode argument is already used and countOverlaps doesn't accept one? Thomas On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote: > Thanks for the example. You're right, none of the modes will count a > read falling completely within >1 feature. > > You can supply your own function as a 'mode'. Two requirements must be met: > > (1) The function must have 3 arguments that correspond to > 'features', 'reads' and 'ignore.strand', in that order. > > (2) The function must return a vector of counts the same length > as 'features' > > Here is an example using countOverlaps() which I think gets at the > counting you want. > > counter <- function(x, y, ignore.strand) > countOverlaps(y, x, ignore.strand=ignore.strand) > > > assays(summarizeOverlaps(gr, rd, mode=counter))$counts > [,1] > [1,] 1 > [2,] 1 > > > Valerie > > On 04/09/2013 09:37 AM, Thomas Girke wrote: > > Hi Valerie, > > > > Perhaps let's call this more explicitly an ignore_inter_feature_overlap option > > that we often need for cases like this: > > > > Read1 ---- > > F1 ---------------- > > F2 ----------- > > > > where we would like to have at least the option to assign Read1 to both F1 and F2: > > > > F1: 1 > > F2: 1 > > > > However, summarizeOverlapse doesn't count Read1 at all in all of its currently > > available modes that I am aware of. This lack of an ignore_inter_feature_overlap > > option is frequently a problem when working with poorly annotated genomes (high > > uncertainty about hidden/incorrect feature overlaps) or various > > RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read > > assignments than not counting at all as shown above. > > > > ## Here is a code example illustrating the same case: > > library(GenomicRanges); library(Rsamtools) > > rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)), > > pos = as.integer(c(500)), > > cigar = rep("100M", 1), strand = strand(rep("+", 1))) > > gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1") > > gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2") > > gr <- c(gr1, gr2) > > > > ## All of the three current modes in summarizeOverlaps return a count of zero > > ## because of its inter_feature_overlap awareness: > > assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts > > assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts > > assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts > > [,1] > > [1,] 0 > > [2,] 0 > > > > ## However, countOverlaps handles this correctly, but is not the best choice when > > ## counting multiple range features. > > countOverlaps(gr, rd) > > [1] 1 1 > > > > > > Thomas > > > >> sessionInfo() > > R version 3.0.0 (2013-04-03) > > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > > > locale: > > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > > > attached base packages: > > [1] parallel stats graphics grDevices utils datasets methods > > [8] base > > > > other attached packages: > > [1] Rsamtools_1.12.0 Biostrings_2.28.0 GenomicRanges_1.12.1 > > [4] IRanges_1.18.0 BiocGenerics_0.6.0 > > > > loaded via a namespace (and not attached): > > [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 > > > > > > On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote: > >> Hi Thomas, > >> > >> On 04/08/2013 05:52 PM, Thomas Girke wrote: > >>> Dear Valerie, > >>> > >>> Is there currently any way to run summarizeOverlaps in a feature-overlap > >>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently, > >>> one can switch back to countOverlaps when feature overlap unawareness is > >>> the more appropriate counting mode for a biological question, but then > >>> double counting of reads mapping to multiple-range features is not > >>> accounted for. It would be really nice to have such a feature- overlap > >>> unaware option directly in summarizeOverlaps. > >> > >> No, we don't currently have an option to ignore feature-overlap. It > >> sounds like you want to count with countOverlaps() but still want the > >> double counting to be resolved. This is essentially what the other modes > >> are doing so I must be missing something. > >> > >> In this example 2 reads hit feature A, 1 read hits feature B. With > >> something like ignorefeature0L=TRUE, what results would you expect to > >> see? Maybe you have another, more descriptive example? > >> > >> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) > >> features <- GRanges("chr1", IRanges(c(1, 20), width=10, > >> names=c("A", "B"))) > >> > >> > countOverlaps(features, reads) > >> [1] 2 1 > >> > >> > >>> > >>> Another question relates to the memory usage of summarizeOverlaps. Has > >>> this been optimized yet? On a typical bam file with ~50-100 million > >>> reads the memory usage of summarizeOverlaps is often around 10-20GB. To > >>> use the function on a desktop computer or in large-scale RNA-Seq > >>> projects on a commodity compute cluster, it would be desirable if every > >>> counting instance would consume not more than 5GB of RAM. > >> > >> Have you tried the BamFileList-method? There is an example at the bottom > >> of the ?BamFileList man page using summarizeOverlaps(). As Ryan > >> mentioned, the key is to set the 'yieldSize' parameter when creating the > >> BamFile. This method also makes use of mclapply(). > >> > >> Valerie > >> > >>> > >>> Thanks in advance for your help and suggestions, > >>> > >>> Thomas > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor at r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>
ADD COMMENTlink written 4.6 years ago by Thomas Girke1.6k
Ah, I see. You'd like to count with one of the existing modes but have the option to pick up counts for these inter-feature reads (fall completely 'within' >1 feature). These inter-feature reads would be double (triple, quadruple, etc.) counted. Essentially they would add one count to each feature they hit. Right? One more thought about memory usage. If you are working with single- end reads the summarizeOverlaps,BamFileList-method I mentioned below should work fine. The approach is slightly different with paired-end reads. Our current algorithm for pairing paired-end reads requires the whole file to be read into memory. A different approach is currently being developed but in the meantime you can take the qname-sorted approach. The Bam file will need to be sorted by qname and both the 'yieldSize' and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An example is on ?BamFile. Valerie On 04/09/2013 08:01 PM, Thomas Girke wrote: > Thanks for the tip. I guess doing it this way reverses the counting mode > back to countOverlaps, but how can I use at the same time > "IntersectionStrict" or any of the other modes provided by > summarizeOverlaps if its mode argument is already used and countOverlaps > doesn't accept one? > > Thomas > > On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote: >> Thanks for the example. You're right, none of the modes will count a >> read falling completely within >1 feature. >> >> You can supply your own function as a 'mode'. Two requirements must be met: >> >> (1) The function must have 3 arguments that correspond to >> 'features', 'reads' and 'ignore.strand', in that order. >> >> (2) The function must return a vector of counts the same length >> as 'features' >> >> Here is an example using countOverlaps() which I think gets at the >> counting you want. >> >> counter <- function(x, y, ignore.strand) >> countOverlaps(y, x, ignore.strand=ignore.strand) >> >> > assays(summarizeOverlaps(gr, rd, mode=counter))$counts >> [,1] >> [1,] 1 >> [2,] 1 >> >> >> Valerie >> >> On 04/09/2013 09:37 AM, Thomas Girke wrote: >>> Hi Valerie, >>> >>> Perhaps let's call this more explicitly an ignore_inter_feature_overlap option >>> that we often need for cases like this: >>> >>> Read1 ---- >>> F1 ---------------- >>> F2 ----------- >>> >>> where we would like to have at least the option to assign Read1 to both F1 and F2: >>> >>> F1: 1 >>> F2: 1 >>> >>> However, summarizeOverlapse doesn't count Read1 at all in all of its currently >>> available modes that I am aware of. This lack of an ignore_inter_feature_overlap >>> option is frequently a problem when working with poorly annotated genomes (high >>> uncertainty about hidden/incorrect feature overlaps) or various >>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read >>> assignments than not counting at all as shown above. >>> >>> ## Here is a code example illustrating the same case: >>> library(GenomicRanges); library(Rsamtools) >>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)), >>> pos = as.integer(c(500)), >>> cigar = rep("100M", 1), strand = strand(rep("+", 1))) >>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1") >>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2") >>> gr <- c(gr1, gr2) >>> >>> ## All of the three current modes in summarizeOverlaps return a count of zero >>> ## because of its inter_feature_overlap awareness: >>> assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts >>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts >>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts >>> [,1] >>> [1,] 0 >>> [2,] 0 >>> >>> ## However, countOverlaps handles this correctly, but is not the best choice when >>> ## counting multiple range features. >>> countOverlaps(gr, rd) >>> [1] 1 1 >>> >>> >>> Thomas >>> >>>> sessionInfo() >>> R version 3.0.0 (2013-04-03) >>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>> >>> locale: >>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> [8] base >>> >>> other attached packages: >>> [1] Rsamtools_1.12.0 Biostrings_2.28.0 GenomicRanges_1.12.1 >>> [4] IRanges_1.18.0 BiocGenerics_0.6.0 >>> >>> loaded via a namespace (and not attached): >>> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>> >>> >>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote: >>>> Hi Thomas, >>>> >>>> On 04/08/2013 05:52 PM, Thomas Girke wrote: >>>>> Dear Valerie, >>>>> >>>>> Is there currently any way to run summarizeOverlaps in a feature-overlap >>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently, >>>>> one can switch back to countOverlaps when feature overlap unawareness is >>>>> the more appropriate counting mode for a biological question, but then >>>>> double counting of reads mapping to multiple-range features is not >>>>> accounted for. It would be really nice to have such a feature- overlap >>>>> unaware option directly in summarizeOverlaps. >>>> >>>> No, we don't currently have an option to ignore feature-overlap. It >>>> sounds like you want to count with countOverlaps() but still want the >>>> double counting to be resolved. This is essentially what the other modes >>>> are doing so I must be missing something. >>>> >>>> In this example 2 reads hit feature A, 1 read hits feature B. With >>>> something like ignorefeature0L=TRUE, what results would you expect to >>>> see? Maybe you have another, more descriptive example? >>>> >>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) >>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10, >>>> names=c("A", "B"))) >>>> >>>> > countOverlaps(features, reads) >>>> [1] 2 1 >>>> >>>> >>>>> >>>>> Another question relates to the memory usage of summarizeOverlaps. Has >>>>> this been optimized yet? On a typical bam file with ~50-100 million >>>>> reads the memory usage of summarizeOverlaps is often around 10-20GB. To >>>>> use the function on a desktop computer or in large-scale RNA-Seq >>>>> projects on a commodity compute cluster, it would be desirable if every >>>>> counting instance would consume not more than 5GB of RAM. >>>> >>>> Have you tried the BamFileList-method? There is an example at the bottom >>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan >>>> mentioned, the key is to set the 'yieldSize' parameter when creating the >>>> BamFile. This method also makes use of mclapply(). >>>> >>>> Valerie >>>> >>>>> >>>>> Thanks in advance for your help and suggestions, >>>>> >>>>> Thomas >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>
ADD REPLYlink written 4.6 years ago by Valerie Obenchain ♦♦ 6.4k
It sounds like what Thomas wants should be achievable with countOverlaps. For example, IntersectionStrict (if treating all features independently) would equate to type="within". So we just need a summarizeOverlaps- style interface to those simple utilities. I agree this would be very useful. Michael On Wed, Apr 10, 2013 at 8:50 AM, Valerie Obenchain <vobencha@fhcrc.org>wrote: > Ah, I see. You'd like to count with one of the existing modes but have the > option to pick up counts for these inter-feature reads (fall completely > 'within' >1 feature). These inter-feature reads would be double (triple, > quadruple, etc.) counted. Essentially they would add one count to each > feature they hit. Right? > > One more thought about memory usage. If you are working with single- end > reads the summarizeOverlaps,BamFileList-**method I mentioned below should > work fine. The approach is slightly different with paired-end reads. Our > current algorithm for pairing paired-end reads requires the whole file to > be read into memory. A different approach is currently being developed but > in the meantime you can take the qname-sorted approach. The Bam file will > need to be sorted by qname and both the 'yieldSize' and 'obeyQname=TRUE' > set when creating the BamFile/BamFileList. An example is on ?BamFile. > > Valerie > > > > On 04/09/2013 08:01 PM, Thomas Girke wrote: > >> Thanks for the tip. I guess doing it this way reverses the counting mode >> back to countOverlaps, but how can I use at the same time >> "IntersectionStrict" or any of the other modes provided by >> summarizeOverlaps if its mode argument is already used and countOverlaps >> doesn't accept one? >> >> Thomas >> >> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote: >> >>> Thanks for the example. You're right, none of the modes will count a >>> read falling completely within >1 feature. >>> >>> You can supply your own function as a 'mode'. Two requirements must be >>> met: >>> >>> (1) The function must have 3 arguments that correspond to >>> 'features', 'reads' and 'ignore.strand', in that order. >>> >>> (2) The function must return a vector of counts the same length >>> as 'features' >>> >>> Here is an example using countOverlaps() which I think gets at the >>> counting you want. >>> >>> counter <- function(x, y, ignore.strand) >>> countOverlaps(y, x, ignore.strand=ignore.strand) >>> >>> > assays(summarizeOverlaps(gr, rd, mode=counter))$counts >>> [,1] >>> [1,] 1 >>> [2,] 1 >>> >>> >>> Valerie >>> >>> On 04/09/2013 09:37 AM, Thomas Girke wrote: >>> >>>> Hi Valerie, >>>> >>>> Perhaps let's call this more explicitly an ignore_inter_feature_overlap >>>> option >>>> that we often need for cases like this: >>>> >>>> Read1 ---- >>>> F1 ---------------- >>>> F2 ----------- >>>> >>>> where we would like to have at least the option to assign Read1 to both >>>> F1 and F2: >>>> >>>> F1: 1 >>>> F2: 1 >>>> >>>> However, summarizeOverlapse doesn't count Read1 at all in all of its >>>> currently >>>> available modes that I am aware of. This lack of an >>>> ignore_inter_feature_overlap >>>> option is frequently a problem when working with poorly annotated >>>> genomes (high >>>> uncertainty about hidden/incorrect feature overlaps) or various >>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk of >>>> ambiguous read >>>> assignments than not counting at all as shown above. >>>> >>>> ## Here is a code example illustrating the same case: >>>> library(GenomicRanges); library(Rsamtools) >>>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)), >>>> pos = as.integer(c(500)), >>>> cigar = rep("100M", 1), strand = strand(rep("+", 1))) >>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", >>>> ID="feat1") >>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", >>>> ID="feat2") >>>> gr <- c(gr1, gr2) >>>> >>>> ## All of the three current modes in summarizeOverlaps return a count >>>> of zero >>>> ## because of its inter_feature_overlap awareness: >>>> assays(summarizeOverlaps(gr, rd, mode="Union", >>>> ignore.strand=TRUE))$counts >>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", >>>> ignore.strand=TRUE))$counts >>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", >>>> ignore.strand=TRUE))$counts >>>> [,1] >>>> [1,] 0 >>>> [2,] 0 >>>> >>>> ## However, countOverlaps handles this correctly, but is not the best >>>> choice when >>>> ## counting multiple range features. >>>> countOverlaps(gr, rd) >>>> [1] 1 1 >>>> >>>> >>>> Thomas >>>> >>>> sessionInfo() >>>>> >>>> R version 3.0.0 (2013-04-03) >>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>>> >>>> locale: >>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.**UTF-8/C/en_US.UTF-8/en_US.UTF-**8 >>>> >>>> attached base packages: >>>> [1] parallel stats graphics grDevices utils datasets methods >>>> [8] base >>>> >>>> other attached packages: >>>> [1] Rsamtools_1.12.0 Biostrings_2.28.0 GenomicRanges_1.12.1 >>>> [4] IRanges_1.18.0 BiocGenerics_0.6.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>>> >>>> >>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote: >>>> >>>>> Hi Thomas, >>>>> >>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote: >>>>> >>>>>> Dear Valerie, >>>>>> >>>>>> Is there currently any way to run summarizeOverlaps in a >>>>>> feature-overlap >>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? >>>>>> Currently, >>>>>> one can switch back to countOverlaps when feature overlap unawareness >>>>>> is >>>>>> the more appropriate counting mode for a biological question, but then >>>>>> double counting of reads mapping to multiple-range features is not >>>>>> accounted for. It would be really nice to have such a feature- overlap >>>>>> unaware option directly in summarizeOverlaps. >>>>>> >>>>> >>>>> No, we don't currently have an option to ignore feature-overlap. It >>>>> sounds like you want to count with countOverlaps() but still want the >>>>> double counting to be resolved. This is essentially what the other >>>>> modes >>>>> are doing so I must be missing something. >>>>> >>>>> In this example 2 reads hit feature A, 1 read hits feature B. With >>>>> something like ignorefeature0L=TRUE, what results would you expect to >>>>> see? Maybe you have another, more descriptive example? >>>>> >>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) >>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10, >>>>> names=c("A", "B"))) >>>>> >>>>> > countOverlaps(features, reads) >>>>> [1] 2 1 >>>>> >>>>> >>>>> >>>>>> Another question relates to the memory usage of summarizeOverlaps. Has >>>>>> this been optimized yet? On a typical bam file with ~50-100 million >>>>>> reads the memory usage of summarizeOverlaps is often around 10-20GB. >>>>>> To >>>>>> use the function on a desktop computer or in large-scale RNA- Seq >>>>>> projects on a commodity compute cluster, it would be desirable if >>>>>> every >>>>>> counting instance would consume not more than 5GB of RAM. >>>>>> >>>>> >>>>> Have you tried the BamFileList-method? There is an example at the >>>>> bottom >>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan >>>>> mentioned, the key is to set the 'yieldSize' parameter when creating >>>>> the >>>>> BamFile. This method also makes use of mclapply(). >>>>> >>>>> Valerie >>>>> >>>>> >>>>>> Thanks in advance for your help and suggestions, >>>>>> >>>>>> Thomas >>>>>> >>>>>> ______________________________**_________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor@r-project.org >>>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: st="" at.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>>>> Search the archives: http://news.gmane.org/gmane.** >>>>>> science.biology.informatics.**conductor<http: news.gmane.org="" g="" mane.science.biology.informatics.conductor=""> >>>>>> >>>>>> > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]
ADD REPLYlink written 4.6 years ago by Michael Lawrence9.8k
I think he wants both the counts from using a 'mode' and counts for the intra-feature reads. Using countOverlaps(..., type="within") would get the counts for the intra-feature reads only and then you'd need to add them to the counts from using 'mode'. The interface for using findOverlaps/countOverlaps from summarizeOverlaps() alreay exists but may not be user-friendly or intuitive enough. For example, if you wanted to count with countOverlaps(..., type="within"), where you are interested in the reads falling "within" the features, the counter would be something like, counter <- function(x, y, ignore.strand) { counts <- subjectHits(findOverlaps(y, x, type="within", ignore.strand=ignore.strand)) structure(tabulate(counts, NROW(x)), names=names(x)) } Too cryptic? Valerie On 04/10/2013 09:16 AM, Michael Lawrence wrote: > It sounds like what Thomas wants should be achievable with > countOverlaps. For example, IntersectionStrict (if treating all features > independently) would equate to type="within". So we just need a > summarizeOverlaps-style interface to those simple utilities. I agree > this would be very useful. > > Michael > > > On Wed, Apr 10, 2013 at 8:50 AM, Valerie Obenchain <vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org="">> wrote: > > Ah, I see. You'd like to count with one of the existing modes but > have the option to pick up counts for these inter-feature reads > (fall completely 'within' >1 feature). These inter-feature reads > would be double (triple, quadruple, etc.) counted. Essentially they > would add one count to each feature they hit. Right? > > One more thought about memory usage. If you are working with > single-end reads the summarizeOverlaps,BamFileList-__method I > mentioned below should work fine. The approach is slightly different > with paired-end reads. Our current algorithm for pairing paired- end > reads requires the whole file to be read into memory. A different > approach is currently being developed but in the meantime you can > take the qname-sorted approach. The Bam file will need to be sorted > by qname and both the 'yieldSize' and 'obeyQname=TRUE' set when > creating the BamFile/BamFileList. An example is on ?BamFile. > > Valerie > > > > On 04/09/2013 08:01 PM, Thomas Girke wrote: > > Thanks for the tip. I guess doing it this way reverses the > counting mode > back to countOverlaps, but how can I use at the same time > "IntersectionStrict" or any of the other modes provided by > summarizeOverlaps if its mode argument is already used and > countOverlaps > doesn't accept one? > > Thomas > > On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote: > > Thanks for the example. You're right, none of the modes will > count a > read falling completely within >1 feature. > > You can supply your own function as a 'mode'. Two > requirements must be met: > > (1) The function must have 3 arguments that correspond to > 'features', 'reads' and 'ignore.strand', in that order. > > (2) The function must return a vector of counts the same length > as 'features' > > Here is an example using countOverlaps() which I think gets > at the > counting you want. > > counter <- function(x, y, ignore.strand) > countOverlaps(y, x, ignore.strand=ignore.strand) > > > assays(summarizeOverlaps(gr, rd, mode=counter))$counts > [,1] > [1,] 1 > [2,] 1 > > > Valerie > > On 04/09/2013 09:37 AM, Thomas Girke wrote: > > Hi Valerie, > > Perhaps let's call this more explicitly an > ignore_inter_feature_overlap option > that we often need for cases like this: > > Read1 ---- > F1 ---------------- > F2 ----------- > > where we would like to have at least the option to > assign Read1 to both F1 and F2: > > F1: 1 > F2: 1 > > However, summarizeOverlapse doesn't count Read1 at all > in all of its currently > available modes that I am aware of. This lack of an > ignore_inter_feature_overlap > option is frequently a problem when working with poorly > annotated genomes (high > uncertainty about hidden/incorrect feature overlaps) or > various > RNA/ChIP-Seq _engineering_ projects where I rather take > the risk of ambiguous read > assignments than not counting at all as shown above. > > ## Here is a code example illustrating the same case: > library(GenomicRanges); library(Rsamtools) > rd <- GappedAlignments(letters[1], seqnames = > Rle(rep("chr1",1)), > pos = as.integer(c(500)), > cigar = rep("100M", 1), strand = > strand(rep("+", 1))) > gr1 <- GRanges("chr1", IRanges(start=100, width=1001), > strand="+", ID="feat1") > gr2 <- GRanges("chr1", IRanges(start=500, width=101), > strand="+", ID="feat2") > gr <- c(gr1, gr2) > > ## All of the three current modes in summarizeOverlaps > return a count of zero > ## because of its inter_feature_overlap awareness: > assays(summarizeOverlaps(gr, rd, mode="Union", > ignore.strand=TRUE))$counts > assays(summarizeOverlaps(gr, rd, > mode="IntersectionStrict", ignore.strand=TRUE))$counts > assays(summarizeOverlaps(gr, rd, > mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts > [,1] > [1,] 0 > [2,] 0 > > ## However, countOverlaps handles this correctly, but is > not the best choice when > ## counting multiple range features. > countOverlaps(gr, rd) > [1] 1 1 > > > Thomas > > sessionInfo() > > R version 3.0.0 (2013-04-03) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] > en_US.UTF-8/en_US.UTF-8/en_US.__UTF-8/C/en_US.UTF-8/en_US.UTF-__8 > > attached base packages: > [1] parallel stats graphics grDevices utils > datasets methods > [8] base > > other attached packages: > [1] Rsamtools_1.12.0 Biostrings_2.28.0 > GenomicRanges_1.12.1 > [4] IRanges_1.18.0 BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 > zlibbioc_1.6.0 > > > On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie > Obenchain wrote: > > Hi Thomas, > > On 04/08/2013 05:52 PM, Thomas Girke wrote: > > Dear Valerie, > > Is there currently any way to run > summarizeOverlaps in a feature-overlap > unaware mode, e.g with an > ignorefeatureOL=FALSE/TRUE setting? Currently, > one can switch back to countOverlaps when > feature overlap unawareness is > the more appropriate counting mode for a > biological question, but then > double counting of reads mapping to > multiple-range features is not > accounted for. It would be really nice to have > such a feature-overlap > unaware option directly in summarizeOverlaps. > > > No, we don't currently have an option to ignore > feature-overlap. It > sounds like you want to count with countOverlaps() > but still want the > double counting to be resolved. This is essentially > what the other modes > are doing so I must be missing something. > > In this example 2 reads hit feature A, 1 read hits > feature B. With > something like ignorefeature0L=TRUE, what results > would you expect to > see? Maybe you have another, more descriptive example? > > reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) > features <- GRanges("chr1", IRanges(c(1, 20), width=10, > names=c("A", "B"))) > > > countOverlaps(features, reads) > [1] 2 1 > > > > Another question relates to the memory usage of > summarizeOverlaps. Has > this been optimized yet? On a typical bam file > with ~50-100 million > reads the memory usage of summarizeOverlaps is > often around 10-20GB. To > use the function on a desktop computer or in > large-scale RNA-Seq > projects on a commodity compute cluster, it > would be desirable if every > counting instance would consume not more than > 5GB of RAM. > > > Have you tried the BamFileList-method? There is an > example at the bottom > of the ?BamFileList man page using > summarizeOverlaps(). As Ryan > mentioned, the key is to set the 'yieldSize' > parameter when creating the > BamFile. This method also makes use of mclapply(). > > Valerie > > > Thanks in advance for your help and suggestions, > > Thomas > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > >
ADD REPLYlink written 4.6 years ago by Valerie Obenchain ♦♦ 6.4k
On Wed, Apr 10, 2013 at 9:59 AM, Valerie Obenchain <vobencha@fhcrc.org>wrote: > I think he wants both the counts from using a 'mode' and counts for the > intra-feature reads. Using countOverlaps(..., type="within") would get the > counts for the intra-feature reads only and then you'd need to add them to > the counts from using 'mode'. > > Hmm, generating multiple types of counts sounds like this is getting complicated. How many different combinations is the user going to want? I would stick with generating a single set of counts, unless there are obviously useful combinations. > The interface for using findOverlaps/countOverlaps from > summarizeOverlaps() alreay exists but may not be user-friendly or intuitive > enough. For example, if you wanted to count with countOverlaps(..., > type="within"), where you are interested in the reads falling "within" the > features, the counter would be something like, > > > counter <- function(x, y, ignore.strand) > { > counts <- subjectHits(findOverlaps(y, x, type="within", > ignore.strand=ignore.strand)) > structure(tabulate(counts, NROW(x)), names=names(x)) > } > > > Too cryptic? > > Well the simplest cases should be easy. If someone wants something more complicated, then extra work is justified. > Valerie > > > > On 04/10/2013 09:16 AM, Michael Lawrence wrote: > >> It sounds like what Thomas wants should be achievable with >> countOverlaps. For example, IntersectionStrict (if treating all features >> independently) would equate to type="within". So we just need a >> summarizeOverlaps-style interface to those simple utilities. I agree >> this would be very useful. >> >> Michael >> >> >> On Wed, Apr 10, 2013 at 8:50 AM, Valerie Obenchain <vobencha@fhcrc.org>> <mailto:vobencha@fhcrc.org>> wrote: >> >> Ah, I see. You'd like to count with one of the existing modes but >> have the option to pick up counts for these inter-feature reads >> (fall completely 'within' >1 feature). These inter-feature reads >> would be double (triple, quadruple, etc.) counted. Essentially they >> would add one count to each feature they hit. Right? >> >> One more thought about memory usage. If you are working with >> single-end reads the summarizeOverlaps,BamFileList-**__method I >> >> mentioned below should work fine. The approach is slightly different >> with paired-end reads. Our current algorithm for pairing paired-end >> reads requires the whole file to be read into memory. A different >> approach is currently being developed but in the meantime you can >> take the qname-sorted approach. The Bam file will need to be sorted >> by qname and both the 'yieldSize' and 'obeyQname=TRUE' set when >> creating the BamFile/BamFileList. An example is on ?BamFile. >> >> Valerie >> >> >> >> On 04/09/2013 08:01 PM, Thomas Girke wrote: >> >> Thanks for the tip. I guess doing it this way reverses the >> counting mode >> back to countOverlaps, but how can I use at the same time >> "IntersectionStrict" or any of the other modes provided by >> summarizeOverlaps if its mode argument is already used and >> countOverlaps >> doesn't accept one? >> >> Thomas >> >> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote: >> >> Thanks for the example. You're right, none of the modes will >> count a >> read falling completely within >1 feature. >> >> You can supply your own function as a 'mode'. Two >> requirements must be met: >> >> (1) The function must have 3 arguments that correspond to >> 'features', 'reads' and 'ignore.strand', in that order. >> >> (2) The function must return a vector of counts the same >> length >> as 'features' >> >> Here is an example using countOverlaps() which I think gets >> at the >> counting you want. >> >> counter <- function(x, y, ignore.strand) >> countOverlaps(y, x, ignore.strand=ignore.strand) >> >> > assays(summarizeOverlaps(gr, rd, mode=counter))$counts >> [,1] >> [1,] 1 >> [2,] 1 >> >> >> Valerie >> >> On 04/09/2013 09:37 AM, Thomas Girke wrote: >> >> Hi Valerie, >> >> Perhaps let's call this more explicitly an >> ignore_inter_feature_overlap option >> that we often need for cases like this: >> >> Read1 ---- >> F1 ---------------- >> F2 ----------- >> >> where we would like to have at least the option to >> assign Read1 to both F1 and F2: >> >> F1: 1 >> F2: 1 >> >> However, summarizeOverlapse doesn't count Read1 at all >> in all of its currently >> available modes that I am aware of. This lack of an >> ignore_inter_feature_overlap >> option is frequently a problem when working with poorly >> annotated genomes (high >> uncertainty about hidden/incorrect feature overlaps) or >> various >> RNA/ChIP-Seq _engineering_ projects where I rather take >> the risk of ambiguous read >> assignments than not counting at all as shown above. >> >> ## Here is a code example illustrating the same case: >> library(GenomicRanges); library(Rsamtools) >> rd <- GappedAlignments(letters[1], seqnames = >> Rle(rep("chr1",1)), >> pos = as.integer(c(500)), >> cigar = rep("100M", 1), strand = >> strand(rep("+", 1))) >> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), >> strand="+", ID="feat1") >> gr2 <- GRanges("chr1", IRanges(start=500, width=101), >> strand="+", ID="feat2") >> gr <- c(gr1, gr2) >> >> ## All of the three current modes in summarizeOverlaps >> return a count of zero >> ## because of its inter_feature_overlap awareness: >> assays(summarizeOverlaps(gr, rd, mode="Union", >> ignore.strand=TRUE))$counts >> assays(summarizeOverlaps(gr, rd, >> mode="IntersectionStrict", ignore.strand=TRUE))$counts >> assays(summarizeOverlaps(gr, rd, >> mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts >> [,1] >> [1,] 0 >> [2,] 0 >> >> ## However, countOverlaps handles this correctly, but is >> not the best choice when >> ## counting multiple range features. >> countOverlaps(gr, rd) >> [1] 1 1 >> >> >> Thomas >> >> sessionInfo() >> >> R version 3.0.0 (2013-04-03) >> Platform: x86_64-apple-darwin10.8.0 (64-bit) >> >> locale: >> [1] >> en_US.UTF-8/en_US.UTF-8/en_US.** >> __UTF-8/C/en_US.UTF-8/en_US.**UTF-__8 >> >> >> attached base packages: >> [1] parallel stats graphics grDevices utils >> datasets methods >> [8] base >> >> other attached packages: >> [1] Rsamtools_1.12.0 Biostrings_2.28.0 >> GenomicRanges_1.12.1 >> [4] IRanges_1.18.0 BiocGenerics_0.6.0 >> >> loaded via a namespace (and not attached): >> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 >> zlibbioc_1.6.0 >> >> >> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie >> Obenchain wrote: >> >> Hi Thomas, >> >> On 04/08/2013 05:52 PM, Thomas Girke wrote: >> >> Dear Valerie, >> >> Is there currently any way to run >> summarizeOverlaps in a feature-overlap >> unaware mode, e.g with an >> ignorefeatureOL=FALSE/TRUE setting? Currently, >> one can switch back to countOverlaps when >> feature overlap unawareness is >> the more appropriate counting mode for a >> biological question, but then >> double counting of reads mapping to >> multiple-range features is not >> accounted for. It would be really nice to have >> such a feature-overlap >> unaware option directly in summarizeOverlaps. >> >> >> No, we don't currently have an option to ignore >> feature-overlap. It >> sounds like you want to count with countOverlaps() >> but still want the >> double counting to be resolved. This is essentially >> what the other modes >> are doing so I must be missing something. >> >> In this example 2 reads hit feature A, 1 read hits >> feature B. With >> something like ignorefeature0L=TRUE, what results >> would you expect to >> see? Maybe you have another, more descriptive example? >> >> reads <- GRanges("chr1", IRanges(c(1, 5, 20), >> width=3)) >> features <- GRanges("chr1", IRanges(c(1, 20), >> width=10, >> names=c("A", "B"))) >> >> > countOverlaps(features, reads) >> [1] 2 1 >> >> >> >> Another question relates to the memory usage of >> summarizeOverlaps. Has >> this been optimized yet? On a typical bam file >> with ~50-100 million >> reads the memory usage of summarizeOverlaps is >> often around 10-20GB. To >> use the function on a desktop computer or in >> large-scale RNA-Seq >> projects on a commodity compute cluster, it >> would be desirable if every >> counting instance would consume not more than >> 5GB of RAM. >> >> >> Have you tried the BamFileList-method? There is an >> example at the bottom >> of the ?BamFileList man page using >> summarizeOverlaps(). As Ryan >> mentioned, the key is to set the 'yieldSize' >> parameter when creating the >> BamFile. This method also makes use of mclapply(). >> >> Valerie >> >> >> Thanks in advance for your help and suggestions, >> >> Thomas >> >> ______________________________** >> ___________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> > >> https://stat.ethz.ch/mailman/_** >> _listinfo/bioconductor<https: stat.ethz.ch="" mailman="" __listinfo="" bioc="" onductor=""> >> >> <https: stat.ethz.ch="" mailman="" **="">> listinfo/bioconductor<https: stat.ethz.ch="" mailman="" listinfo="" biocond="" uctor=""> >> > >> Search the archives: >> http://news.gmane.org/gmane.__** >> science.biology.informatics.__**conductor<http: news.gmane.org="" gma="" ne.__science.biology.informatics.__conductor=""> >> <http: news.gmane.org="" gmane.**="">> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > >> >> >> ______________________________**___________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> > >> https://stat.ethz.ch/mailman/_**_listinfo/bioconductor<https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> >> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https:="" s="" tat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> > >> Search the archives: >> http://news.gmane.org/gmane.__**science.biology.informatics.__** >> conductor<http: news.gmane.org="" gmane.__science.biology.informatics="" .__conductor="">< >> http://news.gmane.org/gmane.**science.biology.informatics.**conduct or<http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >> > >> >> >> [[alternative HTML version deleted]]
ADD REPLYlink written 4.6 years ago by Michael Lawrence9.8k
0
gravatar for Thomas Girke
4.6 years ago by
Thomas Girke1.6k
United States
Thomas Girke1.6k wrote:
Valerie, Please see my inserted comments below. On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote: > Ah, I see. You'd like to count with one of the existing modes but have > the option to pick up counts for these inter-feature reads (fall > completely 'within' >1 feature). These inter-feature reads would be > double (triple, quadruple, etc.) counted. Essentially they would add one > count to each feature they hit. Right? Correct. Perhaps let's discuss this with a very common example of transcript-level counting rather than counting on the unified (virtual) exonic regions of genes. With the current description provided in the summarizeOverlaps vignette at http://www.bioconductor.org/packages/2.12/bioc/vignettes/GenomicRanges /inst/doc/summarizeOverlaps.pdf I don't see how this can be achieved without falling back to using countOverlaps without loosing the new counting modes provided by summarizeOverlaps? > > One more thought about memory usage. If you are working with single- end > reads the summarizeOverlaps,BamFileList-method I mentioned below should > work fine. The approach is slightly different with paired-end reads. Our > current algorithm for pairing paired-end reads requires the whole file > to be read into memory. A different approach is currently being > developed but in the meantime you can take the qname-sorted approach. > The Bam file will need to be sorted by qname and both the 'yieldSize' > and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An > example is on ?BamFile. Thanks for pointing this out. My fault that I didn't read through all the linked documentation. Perhaps it may not be a bad idea to make the memory restricted bam read instances the default setting in the future. This will definitely help biologists using those utilities without crashing their machines on the first attempt. Thomas > > Valerie > > > On 04/09/2013 08:01 PM, Thomas Girke wrote: > > Thanks for the tip. I guess doing it this way reverses the counting mode > > back to countOverlaps, but how can I use at the same time > > "IntersectionStrict" or any of the other modes provided by > > summarizeOverlaps if its mode argument is already used and countOverlaps > > doesn't accept one? > > > > Thomas > > > > On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote: > >> Thanks for the example. You're right, none of the modes will count a > >> read falling completely within >1 feature. > >> > >> You can supply your own function as a 'mode'. Two requirements must be met: > >> > >> (1) The function must have 3 arguments that correspond to > >> 'features', 'reads' and 'ignore.strand', in that order. > >> > >> (2) The function must return a vector of counts the same length > >> as 'features' > >> > >> Here is an example using countOverlaps() which I think gets at the > >> counting you want. > >> > >> counter <- function(x, y, ignore.strand) > >> countOverlaps(y, x, ignore.strand=ignore.strand) > >> > >> > assays(summarizeOverlaps(gr, rd, mode=counter))$counts > >> [,1] > >> [1,] 1 > >> [2,] 1 > >> > >> > >> Valerie > >> > >> On 04/09/2013 09:37 AM, Thomas Girke wrote: > >>> Hi Valerie, > >>> > >>> Perhaps let's call this more explicitly an ignore_inter_feature_overlap option > >>> that we often need for cases like this: > >>> > >>> Read1 ---- > >>> F1 ---------------- > >>> F2 ----------- > >>> > >>> where we would like to have at least the option to assign Read1 to both F1 and F2: > >>> > >>> F1: 1 > >>> F2: 1 > >>> > >>> However, summarizeOverlapse doesn't count Read1 at all in all of its currently > >>> available modes that I am aware of. This lack of an ignore_inter_feature_overlap > >>> option is frequently a problem when working with poorly annotated genomes (high > >>> uncertainty about hidden/incorrect feature overlaps) or various > >>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read > >>> assignments than not counting at all as shown above. > >>> > >>> ## Here is a code example illustrating the same case: > >>> library(GenomicRanges); library(Rsamtools) > >>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)), > >>> pos = as.integer(c(500)), > >>> cigar = rep("100M", 1), strand = strand(rep("+", 1))) > >>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1") > >>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2") > >>> gr <- c(gr1, gr2) > >>> > >>> ## All of the three current modes in summarizeOverlaps return a count of zero > >>> ## because of its inter_feature_overlap awareness: > >>> assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts > >>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts > >>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts > >>> [,1] > >>> [1,] 0 > >>> [2,] 0 > >>> > >>> ## However, countOverlaps handles this correctly, but is not the best choice when > >>> ## counting multiple range features. > >>> countOverlaps(gr, rd) > >>> [1] 1 1 > >>> > >>> > >>> Thomas > >>> > >>>> sessionInfo() > >>> R version 3.0.0 (2013-04-03) > >>> Platform: x86_64-apple-darwin10.8.0 (64-bit) > >>> > >>> locale: > >>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > >>> > >>> attached base packages: > >>> [1] parallel stats graphics grDevices utils datasets methods > >>> [8] base > >>> > >>> other attached packages: > >>> [1] Rsamtools_1.12.0 Biostrings_2.28.0 GenomicRanges_1.12.1 > >>> [4] IRanges_1.18.0 BiocGenerics_0.6.0 > >>> > >>> loaded via a namespace (and not attached): > >>> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 > >>> > >>> > >>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote: > >>>> Hi Thomas, > >>>> > >>>> On 04/08/2013 05:52 PM, Thomas Girke wrote: > >>>>> Dear Valerie, > >>>>> > >>>>> Is there currently any way to run summarizeOverlaps in a feature-overlap > >>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently, > >>>>> one can switch back to countOverlaps when feature overlap unawareness is > >>>>> the more appropriate counting mode for a biological question, but then > >>>>> double counting of reads mapping to multiple-range features is not > >>>>> accounted for. It would be really nice to have such a feature- overlap > >>>>> unaware option directly in summarizeOverlaps. > >>>> > >>>> No, we don't currently have an option to ignore feature- overlap. It > >>>> sounds like you want to count with countOverlaps() but still want the > >>>> double counting to be resolved. This is essentially what the other modes > >>>> are doing so I must be missing something. > >>>> > >>>> In this example 2 reads hit feature A, 1 read hits feature B. With > >>>> something like ignorefeature0L=TRUE, what results would you expect to > >>>> see? Maybe you have another, more descriptive example? > >>>> > >>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) > >>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10, > >>>> names=c("A", "B"))) > >>>> > >>>> > countOverlaps(features, reads) > >>>> [1] 2 1 > >>>> > >>>> > >>>>> > >>>>> Another question relates to the memory usage of summarizeOverlaps. Has > >>>>> this been optimized yet? On a typical bam file with ~50-100 million > >>>>> reads the memory usage of summarizeOverlaps is often around 10-20GB. To > >>>>> use the function on a desktop computer or in large-scale RNA- Seq > >>>>> projects on a commodity compute cluster, it would be desirable if every > >>>>> counting instance would consume not more than 5GB of RAM. > >>>> > >>>> Have you tried the BamFileList-method? There is an example at the bottom > >>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan > >>>> mentioned, the key is to set the 'yieldSize' parameter when creating the > >>>> BamFile. This method also makes use of mclapply(). > >>>> > >>>> Valerie > >>>> > >>>>> > >>>>> Thanks in advance for your help and suggestions, > >>>>> > >>>>> Thomas > >>>>> > >>>>> _______________________________________________ > >>>>> Bioconductor mailing list > >>>>> Bioconductor at r-project.org > >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>>>
ADD COMMENTlink written 4.6 years ago by Thomas Girke1.6k
Hi Thomas, On 04/10/2013 09:23 AM, Thomas Girke wrote: > Valerie, > > Please see my inserted comments below. > > On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote: >> Ah, I see. You'd like to count with one of the existing modes but have >> the option to pick up counts for these inter-feature reads (fall >> completely 'within' >1 feature). These inter-feature reads would be >> double (triple, quadruple, etc.) counted. Essentially they would add one >> count to each feature they hit. Right? > > Correct. Perhaps let's discuss this with a very common example of > transcript-level counting rather than counting on the unified (virtual) > exonic regions of genes. With the current description provided in the > summarizeOverlaps vignette at > http://www.bioconductor.org/packages/2.12/bioc/vignettes/GenomicRang es/inst/doc/summarizeOverlaps.pdf > I don't see how this can be achieved without falling back to using > countOverlaps without loosing the new counting modes provided by > summarizeOverlaps? It can't be achieved with the function as is but we could add an argument to handle this (as you suggested from the start). If 'inter-feature=TRUE' then these counts would be added to the counts already obtained from using a particular 'mode'. I will move ahead with implementing this argument. > >> >> One more thought about memory usage. If you are working with single-end >> reads the summarizeOverlaps,BamFileList-method I mentioned below should >> work fine. The approach is slightly different with paired-end reads. Our >> current algorithm for pairing paired-end reads requires the whole file >> to be read into memory. A different approach is currently being >> developed but in the meantime you can take the qname-sorted approach. >> The Bam file will need to be sorted by qname and both the 'yieldSize' >> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An >> example is on ?BamFile. > > Thanks for pointing this out. My fault that I didn't read through all > the linked documentation. Perhaps it may not be a bad idea to make the > memory restricted bam read instances the default setting in the future. > This will definitely help biologists using those utilities without > crashing their machines on the first attempt. Good suggestion, thanks. Valerie > > Thomas > > >> >> Valerie >> >> >> On 04/09/2013 08:01 PM, Thomas Girke wrote: >>> Thanks for the tip. I guess doing it this way reverses the counting mode >>> back to countOverlaps, but how can I use at the same time >>> "IntersectionStrict" or any of the other modes provided by >>> summarizeOverlaps if its mode argument is already used and countOverlaps >>> doesn't accept one? >>> >>> Thomas >>> >>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote: >>>> Thanks for the example. You're right, none of the modes will count a >>>> read falling completely within >1 feature. >>>> >>>> You can supply your own function as a 'mode'. Two requirements must be met: >>>> >>>> (1) The function must have 3 arguments that correspond to >>>> 'features', 'reads' and 'ignore.strand', in that order. >>>> >>>> (2) The function must return a vector of counts the same length >>>> as 'features' >>>> >>>> Here is an example using countOverlaps() which I think gets at the >>>> counting you want. >>>> >>>> counter <- function(x, y, ignore.strand) >>>> countOverlaps(y, x, ignore.strand=ignore.strand) >>>> >>>> > assays(summarizeOverlaps(gr, rd, mode=counter))$counts >>>> [,1] >>>> [1,] 1 >>>> [2,] 1 >>>> >>>> >>>> Valerie >>>> >>>> On 04/09/2013 09:37 AM, Thomas Girke wrote: >>>>> Hi Valerie, >>>>> >>>>> Perhaps let's call this more explicitly an ignore_inter_feature_overlap option >>>>> that we often need for cases like this: >>>>> >>>>> Read1 ---- >>>>> F1 ---------------- >>>>> F2 ----------- >>>>> >>>>> where we would like to have at least the option to assign Read1 to both F1 and F2: >>>>> >>>>> F1: 1 >>>>> F2: 1 >>>>> >>>>> However, summarizeOverlapse doesn't count Read1 at all in all of its currently >>>>> available modes that I am aware of. This lack of an ignore_inter_feature_overlap >>>>> option is frequently a problem when working with poorly annotated genomes (high >>>>> uncertainty about hidden/incorrect feature overlaps) or various >>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read >>>>> assignments than not counting at all as shown above. >>>>> >>>>> ## Here is a code example illustrating the same case: >>>>> library(GenomicRanges); library(Rsamtools) >>>>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)), >>>>> pos = as.integer(c(500)), >>>>> cigar = rep("100M", 1), strand = strand(rep("+", 1))) >>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1") >>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2") >>>>> gr <- c(gr1, gr2) >>>>> >>>>> ## All of the three current modes in summarizeOverlaps return a count of zero >>>>> ## because of its inter_feature_overlap awareness: >>>>> assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts >>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts >>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts >>>>> [,1] >>>>> [1,] 0 >>>>> [2,] 0 >>>>> >>>>> ## However, countOverlaps handles this correctly, but is not the best choice when >>>>> ## counting multiple range features. >>>>> countOverlaps(gr, rd) >>>>> [1] 1 1 >>>>> >>>>> >>>>> Thomas >>>>> >>>>>> sessionInfo() >>>>> R version 3.0.0 (2013-04-03) >>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>>>> >>>>> locale: >>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>>>> >>>>> attached base packages: >>>>> [1] parallel stats graphics grDevices utils datasets methods >>>>> [8] base >>>>> >>>>> other attached packages: >>>>> [1] Rsamtools_1.12.0 Biostrings_2.28.0 GenomicRanges_1.12.1 >>>>> [4] IRanges_1.18.0 BiocGenerics_0.6.0 >>>>> >>>>> loaded via a namespace (and not attached): >>>>> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>>>> >>>>> >>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote: >>>>>> Hi Thomas, >>>>>> >>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote: >>>>>>> Dear Valerie, >>>>>>> >>>>>>> Is there currently any way to run summarizeOverlaps in a feature-overlap >>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently, >>>>>>> one can switch back to countOverlaps when feature overlap unawareness is >>>>>>> the more appropriate counting mode for a biological question, but then >>>>>>> double counting of reads mapping to multiple-range features is not >>>>>>> accounted for. It would be really nice to have such a feature- overlap >>>>>>> unaware option directly in summarizeOverlaps. >>>>>> >>>>>> No, we don't currently have an option to ignore feature- overlap. It >>>>>> sounds like you want to count with countOverlaps() but still want the >>>>>> double counting to be resolved. This is essentially what the other modes >>>>>> are doing so I must be missing something. >>>>>> >>>>>> In this example 2 reads hit feature A, 1 read hits feature B. With >>>>>> something like ignorefeature0L=TRUE, what results would you expect to >>>>>> see? Maybe you have another, more descriptive example? >>>>>> >>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) >>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10, >>>>>> names=c("A", "B"))) >>>>>> >>>>>> > countOverlaps(features, reads) >>>>>> [1] 2 1 >>>>>> >>>>>> >>>>>>> >>>>>>> Another question relates to the memory usage of summarizeOverlaps. Has >>>>>>> this been optimized yet? On a typical bam file with ~50-100 million >>>>>>> reads the memory usage of summarizeOverlaps is often around 10-20GB. To >>>>>>> use the function on a desktop computer or in large-scale RNA- Seq >>>>>>> projects on a commodity compute cluster, it would be desirable if every >>>>>>> counting instance would consume not more than 5GB of RAM. >>>>>> >>>>>> Have you tried the BamFileList-method? There is an example at the bottom >>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan >>>>>> mentioned, the key is to set the 'yieldSize' parameter when creating the >>>>>> BamFile. This method also makes use of mclapply(). >>>>>> >>>>>> Valerie >>>>>> >>>>>>> >>>>>>> Thanks in advance for your help and suggestions, >>>>>>> >>>>>>> Thomas >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>
ADD REPLYlink written 4.6 years ago by Valerie Obenchain ♦♦ 6.4k
0
gravatar for Thomas Girke
4.6 years ago by
Thomas Girke1.6k
United States
Thomas Girke1.6k wrote:
Thanks. Adding an inter-feature unaware mode will be very helpful and also broaden summarizeOverlaps' application spectrum for a lot of use cases. Thomas On Wed, Apr 10, 2013 at 04:59:49PM +0000, Valerie Obenchain wrote: > Hi Thomas, > > On 04/10/2013 09:23 AM, Thomas Girke wrote: > > Valerie, > > > > Please see my inserted comments below. > > > > On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote: > >> Ah, I see. You'd like to count with one of the existing modes but have > >> the option to pick up counts for these inter-feature reads (fall > >> completely 'within' >1 feature). These inter-feature reads would be > >> double (triple, quadruple, etc.) counted. Essentially they would add one > >> count to each feature they hit. Right? > > > > Correct. Perhaps let's discuss this with a very common example of > > transcript-level counting rather than counting on the unified (virtual) > > exonic regions of genes. With the current description provided in the > > summarizeOverlaps vignette at > > http://www.bioconductor.org/packages/2.12/bioc/vignettes/GenomicRa nges/inst/doc/summarizeOverlaps.pdf > > I don't see how this can be achieved without falling back to using > > countOverlaps without loosing the new counting modes provided by > > summarizeOverlaps? > > It can't be achieved with the function as is but we could add an > argument to handle this (as you suggested from the start). If > 'inter-feature=TRUE' then these counts would be added to the counts > already obtained from using a particular 'mode'. I will move ahead with > implementing this argument. > > > > >> > >> One more thought about memory usage. If you are working with single-end > >> reads the summarizeOverlaps,BamFileList-method I mentioned below should > >> work fine. The approach is slightly different with paired-end reads. Our > >> current algorithm for pairing paired-end reads requires the whole file > >> to be read into memory. A different approach is currently being > >> developed but in the meantime you can take the qname-sorted approach. > >> The Bam file will need to be sorted by qname and both the 'yieldSize' > >> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An > >> example is on ?BamFile. > > > > Thanks for pointing this out. My fault that I didn't read through all > > the linked documentation. Perhaps it may not be a bad idea to make the > > memory restricted bam read instances the default setting in the future. > > This will definitely help biologists using those utilities without > > crashing their machines on the first attempt. > > Good suggestion, thanks. > > Valerie > > > > > Thomas > > > > > >> > >> Valerie > >> > >> > >> On 04/09/2013 08:01 PM, Thomas Girke wrote: > >>> Thanks for the tip. I guess doing it this way reverses the counting mode > >>> back to countOverlaps, but how can I use at the same time > >>> "IntersectionStrict" or any of the other modes provided by > >>> summarizeOverlaps if its mode argument is already used and countOverlaps > >>> doesn't accept one? > >>> > >>> Thomas > >>> > >>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote: > >>>> Thanks for the example. You're right, none of the modes will count a > >>>> read falling completely within >1 feature. > >>>> > >>>> You can supply your own function as a 'mode'. Two requirements must be met: > >>>> > >>>> (1) The function must have 3 arguments that correspond to > >>>> 'features', 'reads' and 'ignore.strand', in that order. > >>>> > >>>> (2) The function must return a vector of counts the same length > >>>> as 'features' > >>>> > >>>> Here is an example using countOverlaps() which I think gets at the > >>>> counting you want. > >>>> > >>>> counter <- function(x, y, ignore.strand) > >>>> countOverlaps(y, x, ignore.strand=ignore.strand) > >>>> > >>>> > assays(summarizeOverlaps(gr, rd, mode=counter))$counts > >>>> [,1] > >>>> [1,] 1 > >>>> [2,] 1 > >>>> > >>>> > >>>> Valerie > >>>> > >>>> On 04/09/2013 09:37 AM, Thomas Girke wrote: > >>>>> Hi Valerie, > >>>>> > >>>>> Perhaps let's call this more explicitly an ignore_inter_feature_overlap option > >>>>> that we often need for cases like this: > >>>>> > >>>>> Read1 ---- > >>>>> F1 ---------------- > >>>>> F2 ----------- > >>>>> > >>>>> where we would like to have at least the option to assign Read1 to both F1 and F2: > >>>>> > >>>>> F1: 1 > >>>>> F2: 1 > >>>>> > >>>>> However, summarizeOverlapse doesn't count Read1 at all in all of its currently > >>>>> available modes that I am aware of. This lack of an ignore_inter_feature_overlap > >>>>> option is frequently a problem when working with poorly annotated genomes (high > >>>>> uncertainty about hidden/incorrect feature overlaps) or various > >>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read > >>>>> assignments than not counting at all as shown above. > >>>>> > >>>>> ## Here is a code example illustrating the same case: > >>>>> library(GenomicRanges); library(Rsamtools) > >>>>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)), > >>>>> pos = as.integer(c(500)), > >>>>> cigar = rep("100M", 1), strand = strand(rep("+", 1))) > >>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1") > >>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2") > >>>>> gr <- c(gr1, gr2) > >>>>> > >>>>> ## All of the three current modes in summarizeOverlaps return a count of zero > >>>>> ## because of its inter_feature_overlap awareness: > >>>>> assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts > >>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts > >>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts > >>>>> [,1] > >>>>> [1,] 0 > >>>>> [2,] 0 > >>>>> > >>>>> ## However, countOverlaps handles this correctly, but is not the best choice when > >>>>> ## counting multiple range features. > >>>>> countOverlaps(gr, rd) > >>>>> [1] 1 1 > >>>>> > >>>>> > >>>>> Thomas > >>>>> > >>>>>> sessionInfo() > >>>>> R version 3.0.0 (2013-04-03) > >>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) > >>>>> > >>>>> locale: > >>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > >>>>> > >>>>> attached base packages: > >>>>> [1] parallel stats graphics grDevices utils datasets methods > >>>>> [8] base > >>>>> > >>>>> other attached packages: > >>>>> [1] Rsamtools_1.12.0 Biostrings_2.28.0 GenomicRanges_1.12.1 > >>>>> [4] IRanges_1.18.0 BiocGenerics_0.6.0 > >>>>> > >>>>> loaded via a namespace (and not attached): > >>>>> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 > >>>>> > >>>>> > >>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote: > >>>>>> Hi Thomas, > >>>>>> > >>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote: > >>>>>>> Dear Valerie, > >>>>>>> > >>>>>>> Is there currently any way to run summarizeOverlaps in a feature-overlap > >>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently, > >>>>>>> one can switch back to countOverlaps when feature overlap unawareness is > >>>>>>> the more appropriate counting mode for a biological question, but then > >>>>>>> double counting of reads mapping to multiple-range features is not > >>>>>>> accounted for. It would be really nice to have such a feature-overlap > >>>>>>> unaware option directly in summarizeOverlaps. > >>>>>> > >>>>>> No, we don't currently have an option to ignore feature- overlap. It > >>>>>> sounds like you want to count with countOverlaps() but still want the > >>>>>> double counting to be resolved. This is essentially what the other modes > >>>>>> are doing so I must be missing something. > >>>>>> > >>>>>> In this example 2 reads hit feature A, 1 read hits feature B. With > >>>>>> something like ignorefeature0L=TRUE, what results would you expect to > >>>>>> see? Maybe you have another, more descriptive example? > >>>>>> > >>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) > >>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10, > >>>>>> names=c("A", "B"))) > >>>>>> > >>>>>> > countOverlaps(features, reads) > >>>>>> [1] 2 1 > >>>>>> > >>>>>> > >>>>>>> > >>>>>>> Another question relates to the memory usage of summarizeOverlaps. Has > >>>>>>> this been optimized yet? On a typical bam file with ~50-100 million > >>>>>>> reads the memory usage of summarizeOverlaps is often around 10-20GB. To > >>>>>>> use the function on a desktop computer or in large-scale RNA-Seq > >>>>>>> projects on a commodity compute cluster, it would be desirable if every > >>>>>>> counting instance would consume not more than 5GB of RAM. > >>>>>> > >>>>>> Have you tried the BamFileList-method? There is an example at the bottom > >>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan > >>>>>> mentioned, the key is to set the 'yieldSize' parameter when creating the > >>>>>> BamFile. This method also makes use of mclapply(). > >>>>>> > >>>>>> Valerie > >>>>>> > >>>>>>> > >>>>>>> Thanks in advance for your help and suggestions, > >>>>>>> > >>>>>>> Thomas > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> Bioconductor mailing list > >>>>>>> Bioconductor at r-project.org > >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>>>>>
ADD COMMENTlink written 4.6 years ago by Thomas Girke1.6k
On 04/10/2013 12:42 PM, Thomas Girke wrote: > Thanks. Adding an inter-feature unaware mode will be very helpful and also > broaden summarizeOverlaps' application spectrum for a lot of use cases. I'm probably being quite dense here, and am mostly an outside observer. What I hear you saying is that there are currently three modes -- Union, IntersectionStrict, IntersectionNonEmpty. These modes are summarized in the seven rows of figure 1 of vignette("summarizeOverlaps", package="GenomicRanges") Let's say there is a flag inter_feature, and when its value is TRUE then the current counting schemes are obtained. These modes differ in the way counting works for reads illustrated in rows 5, 6, and 7 of the figure. You'd like a count scored where a '1' appears in the table below. With inter_feature=TRUE reads are 'never counted more than once' ('Hits per read' is <= 1) |----------------------+-----+-----------+-----------+---------------| | Mode | Row | Feature 1 | Feature 2 | Hits per read | |----------------------+-----+-----------+-----------+---------------| | Union | 5 | 1 | 0 | 1 | | | 6 | 0 | 0 | 0 | | | 7 | 0 | 0 | 0 | | IntersectionStrict | 5 | 1 | 0 | 1 | | | 6 | 1 | 0 | 1 | | | 7 | 0 | 0 | 0 | | IntersectionNotEmpty | 5 | 1 | 0 | 1 | | | 6 | 1 | 0 | 1 | | | 7 | 0 | 0 | 0 | |----------------------+-----+-----------+-----------+---------------| You'd like to add 3 more modes, with inter_feature=FALSE. Reads are sometimes 'counted twice' |----------------------+-----+-----------+-----------+---------------| | Mode | Row | Feature 1 | Feature 2 | Hits per read | |----------------------+-----+-----------+-----------+---------------| | Union | 5 | 1 | 0 | 1 | | | 6 | 1 | 1 | 2 | | | 7 | 1 | 1 | 2 | | IntersectionStrict | 5 | 1 | 0 | 1 | | | 6 | 1 | 0 | 1 | | | 7 | 1 | 1 | 2 | | IntersectionNotEmpty | 5 | 1 | 0 | 1 | | | 6 | 1 | 1 | 2 | | | 7 | 1 | 1 | 2 | |----------------------+-----+-----------+-----------+---------------| Martin > > Thomas > > > On Wed, Apr 10, 2013 at 04:59:49PM +0000, Valerie Obenchain wrote: >> Hi Thomas, >> >> On 04/10/2013 09:23 AM, Thomas Girke wrote: >>> Valerie, >>> >>> Please see my inserted comments below. >>> >>> On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote: >>>> Ah, I see. You'd like to count with one of the existing modes but have >>>> the option to pick up counts for these inter-feature reads (fall >>>> completely 'within' >1 feature). These inter-feature reads would be >>>> double (triple, quadruple, etc.) counted. Essentially they would add one >>>> count to each feature they hit. Right? >>> >>> Correct. Perhaps let's discuss this with a very common example of >>> transcript-level counting rather than counting on the unified (virtual) >>> exonic regions of genes. With the current description provided in the >>> summarizeOverlaps vignette at >>> http://www.bioconductor.org/packages/2.12/bioc/vignettes/GenomicRa nges/inst/doc/summarizeOverlaps.pdf >>> I don't see how this can be achieved without falling back to using >>> countOverlaps without loosing the new counting modes provided by >>> summarizeOverlaps? >> >> It can't be achieved with the function as is but we could add an >> argument to handle this (as you suggested from the start). If >> 'inter-feature=TRUE' then these counts would be added to the counts >> already obtained from using a particular 'mode'. I will move ahead with >> implementing this argument. >> >>> >>>> >>>> One more thought about memory usage. If you are working with single-end >>>> reads the summarizeOverlaps,BamFileList-method I mentioned below should >>>> work fine. The approach is slightly different with paired-end reads. Our >>>> current algorithm for pairing paired-end reads requires the whole file >>>> to be read into memory. A different approach is currently being >>>> developed but in the meantime you can take the qname-sorted approach. >>>> The Bam file will need to be sorted by qname and both the 'yieldSize' >>>> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An >>>> example is on ?BamFile. >>> >>> Thanks for pointing this out. My fault that I didn't read through all >>> the linked documentation. Perhaps it may not be a bad idea to make the >>> memory restricted bam read instances the default setting in the future. >>> This will definitely help biologists using those utilities without >>> crashing their machines on the first attempt. >> >> Good suggestion, thanks. >> >> Valerie >> >>> >>> Thomas >>> >>> >>>> >>>> Valerie >>>> >>>> >>>> On 04/09/2013 08:01 PM, Thomas Girke wrote: >>>>> Thanks for the tip. I guess doing it this way reverses the counting mode >>>>> back to countOverlaps, but how can I use at the same time >>>>> "IntersectionStrict" or any of the other modes provided by >>>>> summarizeOverlaps if its mode argument is already used and countOverlaps >>>>> doesn't accept one? >>>>> >>>>> Thomas >>>>> >>>>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote: >>>>>> Thanks for the example. You're right, none of the modes will count a >>>>>> read falling completely within >1 feature. >>>>>> >>>>>> You can supply your own function as a 'mode'. Two requirements must be met: >>>>>> >>>>>> (1) The function must have 3 arguments that correspond to >>>>>> 'features', 'reads' and 'ignore.strand', in that order. >>>>>> >>>>>> (2) The function must return a vector of counts the same length >>>>>> as 'features' >>>>>> >>>>>> Here is an example using countOverlaps() which I think gets at the >>>>>> counting you want. >>>>>> >>>>>> counter <- function(x, y, ignore.strand) >>>>>> countOverlaps(y, x, ignore.strand=ignore.strand) >>>>>> >>>>>> > assays(summarizeOverlaps(gr, rd, mode=counter))$counts >>>>>> [,1] >>>>>> [1,] 1 >>>>>> [2,] 1 >>>>>> >>>>>> >>>>>> Valerie >>>>>> >>>>>> On 04/09/2013 09:37 AM, Thomas Girke wrote: >>>>>>> Hi Valerie, >>>>>>> >>>>>>> Perhaps let's call this more explicitly an ignore_inter_feature_overlap option >>>>>>> that we often need for cases like this: >>>>>>> >>>>>>> Read1 ---- >>>>>>> F1 ---------------- >>>>>>> F2 ----------- >>>>>>> >>>>>>> where we would like to have at least the option to assign Read1 to both F1 and F2: >>>>>>> >>>>>>> F1: 1 >>>>>>> F2: 1 >>>>>>> >>>>>>> However, summarizeOverlapse doesn't count Read1 at all in all of its currently >>>>>>> available modes that I am aware of. This lack of an ignore_inter_feature_overlap >>>>>>> option is frequently a problem when working with poorly annotated genomes (high >>>>>>> uncertainty about hidden/incorrect feature overlaps) or various >>>>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read >>>>>>> assignments than not counting at all as shown above. >>>>>>> >>>>>>> ## Here is a code example illustrating the same case: >>>>>>> library(GenomicRanges); library(Rsamtools) >>>>>>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)), >>>>>>> pos = as.integer(c(500)), >>>>>>> cigar = rep("100M", 1), strand = strand(rep("+", 1))) >>>>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1") >>>>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2") >>>>>>> gr <- c(gr1, gr2) >>>>>>> >>>>>>> ## All of the three current modes in summarizeOverlaps return a count of zero >>>>>>> ## because of its inter_feature_overlap awareness: >>>>>>> assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts >>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts >>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts >>>>>>> [,1] >>>>>>> [1,] 0 >>>>>>> [2,] 0 >>>>>>> >>>>>>> ## However, countOverlaps handles this correctly, but is not the best choice when >>>>>>> ## counting multiple range features. >>>>>>> countOverlaps(gr, rd) >>>>>>> [1] 1 1 >>>>>>> >>>>>>> >>>>>>> Thomas >>>>>>> >>>>>>>> sessionInfo() >>>>>>> R version 3.0.0 (2013-04-03) >>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>>>>>> >>>>>>> locale: >>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>>>>>> >>>>>>> attached base packages: >>>>>>> [1] parallel stats graphics grDevices utils datasets methods >>>>>>> [8] base >>>>>>> >>>>>>> other attached packages: >>>>>>> [1] Rsamtools_1.12.0 Biostrings_2.28.0 GenomicRanges_1.12.1 >>>>>>> [4] IRanges_1.18.0 BiocGenerics_0.6.0 >>>>>>> >>>>>>> loaded via a namespace (and not attached): >>>>>>> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>>>>>> >>>>>>> >>>>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote: >>>>>>>> Hi Thomas, >>>>>>>> >>>>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote: >>>>>>>>> Dear Valerie, >>>>>>>>> >>>>>>>>> Is there currently any way to run summarizeOverlaps in a feature-overlap >>>>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently, >>>>>>>>> one can switch back to countOverlaps when feature overlap unawareness is >>>>>>>>> the more appropriate counting mode for a biological question, but then >>>>>>>>> double counting of reads mapping to multiple-range features is not >>>>>>>>> accounted for. It would be really nice to have such a feature-overlap >>>>>>>>> unaware option directly in summarizeOverlaps. >>>>>>>> >>>>>>>> No, we don't currently have an option to ignore feature- overlap. It >>>>>>>> sounds like you want to count with countOverlaps() but still want the >>>>>>>> double counting to be resolved. This is essentially what the other modes >>>>>>>> are doing so I must be missing something. >>>>>>>> >>>>>>>> In this example 2 reads hit feature A, 1 read hits feature B. With >>>>>>>> something like ignorefeature0L=TRUE, what results would you expect to >>>>>>>> see? Maybe you have another, more descriptive example? >>>>>>>> >>>>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) >>>>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10, >>>>>>>> names=c("A", "B"))) >>>>>>>> >>>>>>>> > countOverlaps(features, reads) >>>>>>>> [1] 2 1 >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> Another question relates to the memory usage of summarizeOverlaps. Has >>>>>>>>> this been optimized yet? On a typical bam file with ~50-100 million >>>>>>>>> reads the memory usage of summarizeOverlaps is often around 10-20GB. To >>>>>>>>> use the function on a desktop computer or in large-scale RNA-Seq >>>>>>>>> projects on a commodity compute cluster, it would be desirable if every >>>>>>>>> counting instance would consume not more than 5GB of RAM. >>>>>>>> >>>>>>>> Have you tried the BamFileList-method? There is an example at the bottom >>>>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan >>>>>>>> mentioned, the key is to set the 'yieldSize' parameter when creating the >>>>>>>> BamFile. This method also makes use of mclapply(). >>>>>>>> >>>>>>>> Valerie >>>>>>>> >>>>>>>>> >>>>>>>>> Thanks in advance for your help and suggestions, >>>>>>>>> >>>>>>>>> Thomas >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Bioconductor mailing list >>>>>>>>> Bioconductor at r-project.org >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD REPLYlink written 4.6 years ago by Martin Morgan ♦♦ 20k
0
gravatar for Thomas Girke
4.6 years ago by
Thomas Girke1.6k
United States
Thomas Girke1.6k wrote:
Yes, when trying to switch from countOverlaps to summarizeOverlaps I was hoping the latter would provide all the functionalities of the former plus some of the new features. Achieving the desirable result with a custom function would be fine for me or other more advanced R users, but I don't think it is a substitute for predefined and ~FDA approved~ counting modes where we want to have assurance that scientist A can communicate to scientist B what counter s/he used and both get the same result while not requiring any expert knowledge in R. I feel this is particularly important for read counting since it is so fundamental to many application areas in NGS sequence analysis. Thomas On Wed, Apr 10, 2013 at 04:59:53PM +0000, Valerie Obenchain wrote: > I think he wants both the counts from using a 'mode' and counts for the > intra-feature reads. Using countOverlaps(..., type="within") would get > the counts for the intra-feature reads only and then you'd need to add > them to the counts from using 'mode'. > > The interface for using findOverlaps/countOverlaps from > summarizeOverlaps() alreay exists but may not be user-friendly or > intuitive enough. For example, if you wanted to count with > countOverlaps(..., type="within"), where you are interested in the reads > falling "within" the features, the counter would be something like, > > counter <- function(x, y, ignore.strand) > { > counts <- subjectHits(findOverlaps(y, x, type="within", > ignore.strand=ignore.strand)) > structure(tabulate(counts, NROW(x)), names=names(x)) > } > > > Too cryptic? > > Valerie > > > On 04/10/2013 09:16 AM, Michael Lawrence wrote: > > It sounds like what Thomas wants should be achievable with > > countOverlaps. For example, IntersectionStrict (if treating all features > > independently) would equate to type="within". So we just need a > > summarizeOverlaps-style interface to those simple utilities. I agree > > this would be very useful. > > > > Michael > > > > > > On Wed, Apr 10, 2013 at 8:50 AM, Valerie Obenchain <vobencha at="" fhcrc.org=""> > <mailto:vobencha at="" fhcrc.org="">> wrote: > > > > Ah, I see. You'd like to count with one of the existing modes but > > have the option to pick up counts for these inter-feature reads > > (fall completely 'within' >1 feature). These inter-feature reads > > would be double (triple, quadruple, etc.) counted. Essentially they > > would add one count to each feature they hit. Right? > > > > One more thought about memory usage. If you are working with > > single-end reads the summarizeOverlaps,BamFileList-__method I > > mentioned below should work fine. The approach is slightly different > > with paired-end reads. Our current algorithm for pairing paired-end > > reads requires the whole file to be read into memory. A different > > approach is currently being developed but in the meantime you can > > take the qname-sorted approach. The Bam file will need to be sorted > > by qname and both the 'yieldSize' and 'obeyQname=TRUE' set when > > creating the BamFile/BamFileList. An example is on ?BamFile. > > > > Valerie > > > > > > > > On 04/09/2013 08:01 PM, Thomas Girke wrote: > > > > Thanks for the tip. I guess doing it this way reverses the > > counting mode > > back to countOverlaps, but how can I use at the same time > > "IntersectionStrict" or any of the other modes provided by > > summarizeOverlaps if its mode argument is already used and > > countOverlaps > > doesn't accept one? > > > > Thomas > > > > On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote: > > > > Thanks for the example. You're right, none of the modes will > > count a > > read falling completely within >1 feature. > > > > You can supply your own function as a 'mode'. Two > > requirements must be met: > > > > (1) The function must have 3 arguments that correspond to > > 'features', 'reads' and 'ignore.strand', in that order. > > > > (2) The function must return a vector of counts the same length > > as 'features' > > > > Here is an example using countOverlaps() which I think gets > > at the > > counting you want. > > > > counter <- function(x, y, ignore.strand) > > countOverlaps(y, x, ignore.strand=ignore.strand) > > > > > assays(summarizeOverlaps(gr, rd, mode=counter))$counts > > [,1] > > [1,] 1 > > [2,] 1 > > > > > > Valerie > > > > On 04/09/2013 09:37 AM, Thomas Girke wrote: > > > > Hi Valerie, > > > > Perhaps let's call this more explicitly an > > ignore_inter_feature_overlap option > > that we often need for cases like this: > > > > Read1 ---- > > F1 ---------------- > > F2 ----------- > > > > where we would like to have at least the option to > > assign Read1 to both F1 and F2: > > > > F1: 1 > > F2: 1 > > > > However, summarizeOverlapse doesn't count Read1 at all > > in all of its currently > > available modes that I am aware of. This lack of an > > ignore_inter_feature_overlap > > option is frequently a problem when working with poorly > > annotated genomes (high > > uncertainty about hidden/incorrect feature overlaps) or > > various > > RNA/ChIP-Seq _engineering_ projects where I rather take > > the risk of ambiguous read > > assignments than not counting at all as shown above. > > > > ## Here is a code example illustrating the same case: > > library(GenomicRanges); library(Rsamtools) > > rd <- GappedAlignments(letters[1], seqnames = > > Rle(rep("chr1",1)), > > pos = as.integer(c(500)), > > cigar = rep("100M", 1), strand = > > strand(rep("+", 1))) > > gr1 <- GRanges("chr1", IRanges(start=100, width=1001), > > strand="+", ID="feat1") > > gr2 <- GRanges("chr1", IRanges(start=500, width=101), > > strand="+", ID="feat2") > > gr <- c(gr1, gr2) > > > > ## All of the three current modes in summarizeOverlaps > > return a count of zero > > ## because of its inter_feature_overlap awareness: > > assays(summarizeOverlaps(gr, rd, mode="Union", > > ignore.strand=TRUE))$counts > > assays(summarizeOverlaps(gr, rd, > > mode="IntersectionStrict", ignore.strand=TRUE))$counts > > assays(summarizeOverlaps(gr, rd, > > mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts > > [,1] > > [1,] 0 > > [2,] 0 > > > > ## However, countOverlaps handles this correctly, but is > > not the best choice when > > ## counting multiple range features. > > countOverlaps(gr, rd) > > [1] 1 1 > > > > > > Thomas > > > > sessionInfo() > > > > R version 3.0.0 (2013-04-03) > > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > > > locale: > > [1] > > en_US.UTF-8/en_US.UTF-8/en_US.__UTF-8/C/en_US.UTF-8/en_US.UTF-__8 > > > > attached base packages: > > [1] parallel stats graphics grDevices utils > > datasets methods > > [8] base > > > > other attached packages: > > [1] Rsamtools_1.12.0 Biostrings_2.28.0 > > GenomicRanges_1.12.1 > > [4] IRanges_1.18.0 BiocGenerics_0.6.0 > > > > loaded via a namespace (and not attached): > > [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 > > zlibbioc_1.6.0 > > > > > > On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie > > Obenchain wrote: > > > > Hi Thomas, > > > > On 04/08/2013 05:52 PM, Thomas Girke wrote: > > > > Dear Valerie, > > > > Is there currently any way to run > > summarizeOverlaps in a feature-overlap > > unaware mode, e.g with an > > ignorefeatureOL=FALSE/TRUE setting? Currently, > > one can switch back to countOverlaps when > > feature overlap unawareness is > > the more appropriate counting mode for a > > biological question, but then > > double counting of reads mapping to > > multiple-range features is not > > accounted for. It would be really nice to have > > such a feature-overlap > > unaware option directly in summarizeOverlaps. > > > > > > No, we don't currently have an option to ignore > > feature-overlap. It > > sounds like you want to count with countOverlaps() > > but still want the > > double counting to be resolved. This is essentially > > what the other modes > > are doing so I must be missing something. > > > > In this example 2 reads hit feature A, 1 read hits > > feature B. With > > something like ignorefeature0L=TRUE, what results > > would you expect to > > see? Maybe you have another, more descriptive example? > > > > reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) > > features <- GRanges("chr1", IRanges(c(1, 20), width=10, > > names=c("A", "B"))) > > > > > countOverlaps(features, reads) > > [1] 2 1 > > > > > > > > Another question relates to the memory usage of > > summarizeOverlaps. Has > > this been optimized yet? On a typical bam file > > with ~50-100 million > > reads the memory usage of summarizeOverlaps is > > often around 10-20GB. To > > use the function on a desktop computer or in > > large-scale RNA-Seq > > projects on a commodity compute cluster, it > > would be desirable if every > > counting instance would consume not more than > > 5GB of RAM. > > > > > > Have you tried the BamFileList-method? There is an > > example at the bottom > > of the ?BamFileList man page using > > summarizeOverlaps(). As Ryan > > mentioned, the key is to set the 'yieldSize' > > parameter when creating the > > BamFile. This method also makes use of mclapply(). > > > > Valerie > > > > > > Thanks in advance for your help and suggestions, > > > > Thomas > > > > _________________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > <mailto:bioconductor at="" r-project.org=""> > > https://stat.ethz.ch/mailman/__listinfo/bioconductor > > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > > Search the archives: > > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > > > _________________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > > https://stat.ethz.ch/mailman/__listinfo/bioconductor > > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > > Search the archives: > > http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > >
ADD COMMENTlink written 4.6 years ago by Thomas Girke1.6k
0
gravatar for Thomas Girke
4.6 years ago by
Thomas Girke1.6k
United States
Thomas Girke1.6k wrote:
Hi Martin, Yes, inter_feature=TRUE would maintain the current counting mode(s) that prohibits counting of reads mapping to multiple features. This is a special case of counting that is very useful for counting exonic regions of genes. However, one also wants to be able to turn off this behavior by ignoring inter feature overlaps (just like ignore.strand=T/F). Otherwise we cannot use summarizeOverlaps along with its current modes for important operations like transcript level counting because many transcript variants from the same gene will mask each others reads when inter_feature=TRUE. Providing the option to output the results from both inter_feature=TRUE and inter_feature=FALSE could be a very sensible solution and time saver for users working with new genomes/GFFs, where one cannot trust every nested annotation for various reasons, and inter_feature=TRUE can quickly become a very risky counting strategy since it tends to erase counts. Every biologist will scream in your face if the counter tells them that their favorite gene has zero counts just because of some overlap with some annotation error:). For multiple range features stored in GRangesList objects, I would currently favor making "inter_feature=FALSE" ignore the overlaps occurring among different list components, but not necessarily those within a list component (e.g. exon ranges of a gene). This way one can benefit from the current infrastructure by restricting its feature overlap scope to the range sets stored within individual list components but ignoring those among different list components. Utilities like reduce and other range modifier functions could handle situations where one wants to ignore all feature overlaps within and among list components. However, I am sure there could be other solutions to this. Long story short if inter_feature=TRUE/FALSE could be used in combination with modes=Union/IntersectionStrict/IntersectionNonEmpty resulting in six different counting flavors, I would be happy and, I am sure, many other Bioc users as well. I hope this makes sense. Thomas On Wed, Apr 10, 2013 at 11:37:04PM +0000, Martin Morgan wrote: > On 04/10/2013 12:42 PM, Thomas Girke wrote: > > Thanks. Adding an inter-feature unaware mode will be very helpful and also > > broaden summarizeOverlaps' application spectrum for a lot of use cases. > > I'm probably being quite dense here, and am mostly an outside observer. What I > hear you saying is that there are currently three modes -- Union, > IntersectionStrict, IntersectionNonEmpty. These modes are summarized in the > seven rows of figure 1 of > > vignette("summarizeOverlaps", package="GenomicRanges") > > Let's say there is a flag inter_feature, and when its value is TRUE then the > current counting schemes are obtained. These modes differ in the way counting > works for reads illustrated in rows 5, 6, and 7 of the figure. You'd like a > count scored where a '1' appears in the table below. With inter_feature=TRUE > reads are 'never counted more than once' ('Hits per read' is <= 1) > > |----------------------+-----+-----------+-----------+---------------| > | Mode | Row | Feature 1 | Feature 2 | Hits per read | > |----------------------+-----+-----------+-----------+---------------| > | Union | 5 | 1 | 0 | 1 | > | | 6 | 0 | 0 | 0 | > | | 7 | 0 | 0 | 0 | > | IntersectionStrict | 5 | 1 | 0 | 1 | > | | 6 | 1 | 0 | 1 | > | | 7 | 0 | 0 | 0 | > | IntersectionNotEmpty | 5 | 1 | 0 | 1 | > | | 6 | 1 | 0 | 1 | > | | 7 | 0 | 0 | 0 | > |----------------------+-----+-----------+-----------+---------------| > > > You'd like to add 3 more modes, with inter_feature=FALSE. Reads are sometimes > 'counted twice' > > |----------------------+-----+-----------+-----------+---------------| > | Mode | Row | Feature 1 | Feature 2 | Hits per read | > |----------------------+-----+-----------+-----------+---------------| > | Union | 5 | 1 | 0 | 1 | > | | 6 | 1 | 1 | 2 | > | | 7 | 1 | 1 | 2 | > | IntersectionStrict | 5 | 1 | 0 | 1 | > | | 6 | 1 | 0 | 1 | > | | 7 | 1 | 1 | 2 | > | IntersectionNotEmpty | 5 | 1 | 0 | 1 | > | | 6 | 1 | 1 | 2 | > | | 7 | 1 | 1 | 2 | > |----------------------+-----+-----------+-----------+---------------| > > > Martin > > > > > Thomas > > > > > > On Wed, Apr 10, 2013 at 04:59:49PM +0000, Valerie Obenchain wrote: > >> Hi Thomas, > >> > >> On 04/10/2013 09:23 AM, Thomas Girke wrote: > >>> Valerie, > >>> > >>> Please see my inserted comments below. > >>> > >>> On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote: > >>>> Ah, I see. You'd like to count with one of the existing modes but have > >>>> the option to pick up counts for these inter-feature reads (fall > >>>> completely 'within' >1 feature). These inter-feature reads would be > >>>> double (triple, quadruple, etc.) counted. Essentially they would add one > >>>> count to each feature they hit. Right? > >>> > >>> Correct. Perhaps let's discuss this with a very common example of > >>> transcript-level counting rather than counting on the unified (virtual) > >>> exonic regions of genes. With the current description provided in the > >>> summarizeOverlaps vignette at > >>> http://www.bioconductor.org/packages/2.12/bioc/vignettes/Genomic Ranges/inst/doc/summarizeOverlaps.pdf > >>> I don't see how this can be achieved without falling back to using > >>> countOverlaps without loosing the new counting modes provided by > >>> summarizeOverlaps? > >> > >> It can't be achieved with the function as is but we could add an > >> argument to handle this (as you suggested from the start). If > >> 'inter-feature=TRUE' then these counts would be added to the counts > >> already obtained from using a particular 'mode'. I will move ahead with > >> implementing this argument. > >> > >>> > >>>> > >>>> One more thought about memory usage. If you are working with single-end > >>>> reads the summarizeOverlaps,BamFileList-method I mentioned below should > >>>> work fine. The approach is slightly different with paired-end reads. Our > >>>> current algorithm for pairing paired-end reads requires the whole file > >>>> to be read into memory. A different approach is currently being > >>>> developed but in the meantime you can take the qname-sorted approach. > >>>> The Bam file will need to be sorted by qname and both the 'yieldSize' > >>>> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An > >>>> example is on ?BamFile. > >>> > >>> Thanks for pointing this out. My fault that I didn't read through all > >>> the linked documentation. Perhaps it may not be a bad idea to make the > >>> memory restricted bam read instances the default setting in the future. > >>> This will definitely help biologists using those utilities without > >>> crashing their machines on the first attempt. > >> > >> Good suggestion, thanks. > >> > >> Valerie > >> > >>> > >>> Thomas > >>> > >>> > >>>> > >>>> Valerie > >>>> > >>>> > >>>> On 04/09/2013 08:01 PM, Thomas Girke wrote: > >>>>> Thanks for the tip. I guess doing it this way reverses the counting mode > >>>>> back to countOverlaps, but how can I use at the same time > >>>>> "IntersectionStrict" or any of the other modes provided by > >>>>> summarizeOverlaps if its mode argument is already used and countOverlaps > >>>>> doesn't accept one? > >>>>> > >>>>> Thomas > >>>>> > >>>>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote: > >>>>>> Thanks for the example. You're right, none of the modes will count a > >>>>>> read falling completely within >1 feature. > >>>>>> > >>>>>> You can supply your own function as a 'mode'. Two requirements must be met: > >>>>>> > >>>>>> (1) The function must have 3 arguments that correspond to > >>>>>> 'features', 'reads' and 'ignore.strand', in that order. > >>>>>> > >>>>>> (2) The function must return a vector of counts the same length > >>>>>> as 'features' > >>>>>> > >>>>>> Here is an example using countOverlaps() which I think gets at the > >>>>>> counting you want. > >>>>>> > >>>>>> counter <- function(x, y, ignore.strand) > >>>>>> countOverlaps(y, x, ignore.strand=ignore.strand) > >>>>>> > >>>>>> > assays(summarizeOverlaps(gr, rd, mode=counter))$counts > >>>>>> [,1] > >>>>>> [1,] 1 > >>>>>> [2,] 1 > >>>>>> > >>>>>> > >>>>>> Valerie > >>>>>> > >>>>>> On 04/09/2013 09:37 AM, Thomas Girke wrote: > >>>>>>> Hi Valerie, > >>>>>>> > >>>>>>> Perhaps let's call this more explicitly an ignore_inter_feature_overlap option > >>>>>>> that we often need for cases like this: > >>>>>>> > >>>>>>> Read1 ---- > >>>>>>> F1 ---------------- > >>>>>>> F2 ----------- > >>>>>>> > >>>>>>> where we would like to have at least the option to assign Read1 to both F1 and F2: > >>>>>>> > >>>>>>> F1: 1 > >>>>>>> F2: 1 > >>>>>>> > >>>>>>> However, summarizeOverlapse doesn't count Read1 at all in all of its currently > >>>>>>> available modes that I am aware of. This lack of an ignore_inter_feature_overlap > >>>>>>> option is frequently a problem when working with poorly annotated genomes (high > >>>>>>> uncertainty about hidden/incorrect feature overlaps) or various > >>>>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read > >>>>>>> assignments than not counting at all as shown above. > >>>>>>> > >>>>>>> ## Here is a code example illustrating the same case: > >>>>>>> library(GenomicRanges); library(Rsamtools) > >>>>>>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)), > >>>>>>> pos = as.integer(c(500)), > >>>>>>> cigar = rep("100M", 1), strand = strand(rep("+", 1))) > >>>>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1") > >>>>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2") > >>>>>>> gr <- c(gr1, gr2) > >>>>>>> > >>>>>>> ## All of the three current modes in summarizeOverlaps return a count of zero > >>>>>>> ## because of its inter_feature_overlap awareness: > >>>>>>> assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts > >>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts > >>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts > >>>>>>> [,1] > >>>>>>> [1,] 0 > >>>>>>> [2,] 0 > >>>>>>> > >>>>>>> ## However, countOverlaps handles this correctly, but is not the best choice when > >>>>>>> ## counting multiple range features. > >>>>>>> countOverlaps(gr, rd) > >>>>>>> [1] 1 1 > >>>>>>> > >>>>>>> > >>>>>>> Thomas > >>>>>>> > >>>>>>>> sessionInfo() > >>>>>>> R version 3.0.0 (2013-04-03) > >>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) > >>>>>>> > >>>>>>> locale: > >>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > >>>>>>> > >>>>>>> attached base packages: > >>>>>>> [1] parallel stats graphics grDevices utils datasets methods > >>>>>>> [8] base > >>>>>>> > >>>>>>> other attached packages: > >>>>>>> [1] Rsamtools_1.12.0 Biostrings_2.28.0 GenomicRanges_1.12.1 > >>>>>>> [4] IRanges_1.18.0 BiocGenerics_0.6.0 > >>>>>>> > >>>>>>> loaded via a namespace (and not attached): > >>>>>>> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 > >>>>>>> > >>>>>>> > >>>>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote: > >>>>>>>> Hi Thomas, > >>>>>>>> > >>>>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote: > >>>>>>>>> Dear Valerie, > >>>>>>>>> > >>>>>>>>> Is there currently any way to run summarizeOverlaps in a feature-overlap > >>>>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently, > >>>>>>>>> one can switch back to countOverlaps when feature overlap unawareness is > >>>>>>>>> the more appropriate counting mode for a biological question, but then > >>>>>>>>> double counting of reads mapping to multiple-range features is not > >>>>>>>>> accounted for. It would be really nice to have such a feature-overlap > >>>>>>>>> unaware option directly in summarizeOverlaps. > >>>>>>>> > >>>>>>>> No, we don't currently have an option to ignore feature- overlap. It > >>>>>>>> sounds like you want to count with countOverlaps() but still want the > >>>>>>>> double counting to be resolved. This is essentially what the other modes > >>>>>>>> are doing so I must be missing something. > >>>>>>>> > >>>>>>>> In this example 2 reads hit feature A, 1 read hits feature B. With > >>>>>>>> something like ignorefeature0L=TRUE, what results would you expect to > >>>>>>>> see? Maybe you have another, more descriptive example? > >>>>>>>> > >>>>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) > >>>>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10, > >>>>>>>> names=c("A", "B"))) > >>>>>>>> > >>>>>>>> > countOverlaps(features, reads) > >>>>>>>> [1] 2 1 > >>>>>>>> > >>>>>>>> > >>>>>>>>> > >>>>>>>>> Another question relates to the memory usage of summarizeOverlaps. Has > >>>>>>>>> this been optimized yet? On a typical bam file with ~50-100 million > >>>>>>>>> reads the memory usage of summarizeOverlaps is often around 10-20GB. To > >>>>>>>>> use the function on a desktop computer or in large-scale RNA-Seq > >>>>>>>>> projects on a commodity compute cluster, it would be desirable if every > >>>>>>>>> counting instance would consume not more than 5GB of RAM. > >>>>>>>> > >>>>>>>> Have you tried the BamFileList-method? There is an example at the bottom > >>>>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan > >>>>>>>> mentioned, the key is to set the 'yieldSize' parameter when creating the > >>>>>>>> BamFile. This method also makes use of mclapply(). > >>>>>>>> > >>>>>>>> Valerie > >>>>>>>> > >>>>>>>>> > >>>>>>>>> Thanks in advance for your help and suggestions, > >>>>>>>>> > >>>>>>>>> Thomas > >>>>>>>>> > >>>>>>>>> _______________________________________________ > >>>>>>>>> Bioconductor mailing list > >>>>>>>>> Bioconductor at r-project.org > >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>>>>>>> > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793
ADD COMMENTlink written 4.6 years ago by Thomas Girke1.6k
On Wed, Apr 10, 2013 at 7:09 PM, Thomas Girke <thomas.girke@ucr.edu> wrote: > Hi Martin, > > Yes, inter_feature=TRUE would maintain the current counting mode(s) that > prohibits counting of reads mapping to multiple features. This is a > special case of counting that is very useful for counting exonic regions > of genes. However, one also wants to be able to turn off this behavior > by ignoring inter feature overlaps (just like ignore.strand=T/F). > Otherwise we cannot use summarizeOverlaps along with its current modes > for important operations like transcript level counting because many > transcript variants from the same gene will mask each others reads when > inter_feature=TRUE. Providing the option to output the results from both > inter_feature=TRUE and inter_feature=FALSE could be a very sensible > solution and time saver for users working with new genomes/GFFs, where > one cannot trust every nested annotation for various reasons, and > inter_feature=TRUE can quickly become a very risky counting strategy > since it tends to erase counts. Every biologist will scream in your > face if the counter tells them that their favorite gene has zero counts > just because of some overlap with some annotation error:). > > For multiple range features stored in GRangesList objects, I would > currently favor making "inter_feature=FALSE" ignore the overlaps > occurring among different list components, but not necessarily those > within a list component (e.g. exon ranges of a gene). This way one > can benefit from the current infrastructure by restricting its feature > overlap scope to the range sets stored within individual list components > but ignoring those among different list components. Utilities like reduce > and other range modifier functions could handle situations where one wants > to ignore all feature overlaps within and among list components. However, > I am sure there could be other solutions to this. > > Long story short if inter_feature=TRUE/FALSE could be used in > combination with modes=Union/IntersectionStrict/IntersectionNonEmpty > resulting in six different counting flavors, I would be happy and, I am > sure, many other Bioc users as well. > > Have you taken a look at findSpliceOverlaps? Maybe summarizeSpliceOverlaps could be completed to satisfy your use case? Somehow we need to control the proliferation of counting functions and modes. Having some idea of your high-level use case might help. Also, for spliced alignments, I recommend giving GSNAP a try if you haven't already. It's accessible though the gmapR package. Michael I hope this makes sense. > > Thomas > > > > On Wed, Apr 10, 2013 at 11:37:04PM +0000, Martin Morgan wrote: > > On 04/10/2013 12:42 PM, Thomas Girke wrote: > > > Thanks. Adding an inter-feature unaware mode will be very helpful and > also > > > broaden summarizeOverlaps' application spectrum for a lot of use cases. > > > > I'm probably being quite dense here, and am mostly an outside observer. > What I > > hear you saying is that there are currently three modes -- Union, > > IntersectionStrict, IntersectionNonEmpty. These modes are summarized in > the > > seven rows of figure 1 of > > > > vignette("summarizeOverlaps", package="GenomicRanges") > > > > Let's say there is a flag inter_feature, and when its value is TRUE then > the > > current counting schemes are obtained. These modes differ in the way > counting > > works for reads illustrated in rows 5, 6, and 7 of the figure. You'd > like a > > count scored where a '1' appears in the table below. With > inter_feature=TRUE > > reads are 'never counted more than once' ('Hits per read' is <= 1) > > > > |----------------------+-----+-----------+-----------+---------------| > > | Mode | Row | Feature 1 | Feature 2 | Hits per read | > > |----------------------+-----+-----------+-----------+---------------| > > | Union | 5 | 1 | 0 | 1 | > > | | 6 | 0 | 0 | 0 | > > | | 7 | 0 | 0 | 0 | > > | IntersectionStrict | 5 | 1 | 0 | 1 | > > | | 6 | 1 | 0 | 1 | > > | | 7 | 0 | 0 | 0 | > > | IntersectionNotEmpty | 5 | 1 | 0 | 1 | > > | | 6 | 1 | 0 | 1 | > > | | 7 | 0 | 0 | 0 | > > |----------------------+-----+-----------+-----------+---------------| > > > > > > You'd like to add 3 more modes, with inter_feature=FALSE. Reads are > sometimes > > 'counted twice' > > > > |----------------------+-----+-----------+-----------+---------------| > > | Mode | Row | Feature 1 | Feature 2 | Hits per read | > > |----------------------+-----+-----------+-----------+---------------| > > | Union | 5 | 1 | 0 | 1 | > > | | 6 | 1 | 1 | 2 | > > | | 7 | 1 | 1 | 2 | > > | IntersectionStrict | 5 | 1 | 0 | 1 | > > | | 6 | 1 | 0 | 1 | > > | | 7 | 1 | 1 | 2 | > > | IntersectionNotEmpty | 5 | 1 | 0 | 1 | > > | | 6 | 1 | 1 | 2 | > > | | 7 | 1 | 1 | 2 | > > |----------------------+-----+-----------+-----------+---------------| > > > > > > Martin > > > > > > > > Thomas > > > > > > > > > On Wed, Apr 10, 2013 at 04:59:49PM +0000, Valerie Obenchain wrote: > > >> Hi Thomas, > > >> > > >> On 04/10/2013 09:23 AM, Thomas Girke wrote: > > >>> Valerie, > > >>> > > >>> Please see my inserted comments below. > > >>> > > >>> On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote: > > >>>> Ah, I see. You'd like to count with one of the existing modes but > have > > >>>> the option to pick up counts for these inter-feature reads (fall > > >>>> completely 'within' >1 feature). These inter-feature reads would be > > >>>> double (triple, quadruple, etc.) counted. Essentially they would > add one > > >>>> count to each feature they hit. Right? > > >>> > > >>> Correct. Perhaps let's discuss this with a very common example of > > >>> transcript-level counting rather than counting on the unified > (virtual) > > >>> exonic regions of genes. With the current description provided in the > > >>> summarizeOverlaps vignette at > > >>> > http://www.bioconductor.org/packages/2.12/bioc/vignettes/GenomicRang es/inst/doc/summarizeOverlaps.pdf > > >>> I don't see how this can be achieved without falling back to using > > >>> countOverlaps without loosing the new counting modes provided by > > >>> summarizeOverlaps? > > >> > > >> It can't be achieved with the function as is but we could add an > > >> argument to handle this (as you suggested from the start). If > > >> 'inter-feature=TRUE' then these counts would be added to the counts > > >> already obtained from using a particular 'mode'. I will move ahead > with > > >> implementing this argument. > > >> > > >>> > > >>>> > > >>>> One more thought about memory usage. If you are working with > single-end > > >>>> reads the summarizeOverlaps,BamFileList-method I mentioned below > should > > >>>> work fine. The approach is slightly different with paired-end > reads. Our > > >>>> current algorithm for pairing paired-end reads requires the whole > file > > >>>> to be read into memory. A different approach is currently being > > >>>> developed but in the meantime you can take the qname-sorted > approach. > > >>>> The Bam file will need to be sorted by qname and both the > 'yieldSize' > > >>>> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An > > >>>> example is on ?BamFile. > > >>> > > >>> Thanks for pointing this out. My fault that I didn't read through all > > >>> the linked documentation. Perhaps it may not be a bad idea to make > the > > >>> memory restricted bam read instances the default setting in the > future. > > >>> This will definitely help biologists using those utilities without > > >>> crashing their machines on the first attempt. > > >> > > >> Good suggestion, thanks. > > >> > > >> Valerie > > >> > > >>> > > >>> Thomas > > >>> > > >>> > > >>>> > > >>>> Valerie > > >>>> > > >>>> > > >>>> On 04/09/2013 08:01 PM, Thomas Girke wrote: > > >>>>> Thanks for the tip. I guess doing it this way reverses the > counting mode > > >>>>> back to countOverlaps, but how can I use at the same time > > >>>>> "IntersectionStrict" or any of the other modes provided by > > >>>>> summarizeOverlaps if its mode argument is already used and > countOverlaps > > >>>>> doesn't accept one? > > >>>>> > > >>>>> Thomas > > >>>>> > > >>>>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote: > > >>>>>> Thanks for the example. You're right, none of the modes will > count a > > >>>>>> read falling completely within >1 feature. > > >>>>>> > > >>>>>> You can supply your own function as a 'mode'. Two requirements > must be met: > > >>>>>> > > >>>>>> (1) The function must have 3 arguments that correspond to > > >>>>>> 'features', 'reads' and 'ignore.strand', in that order. > > >>>>>> > > >>>>>> (2) The function must return a vector of counts the same length > > >>>>>> as 'features' > > >>>>>> > > >>>>>> Here is an example using countOverlaps() which I think gets at the > > >>>>>> counting you want. > > >>>>>> > > >>>>>> counter <- function(x, y, ignore.strand) > > >>>>>> countOverlaps(y, x, ignore.strand=ignore.strand) > > >>>>>> > > >>>>>> > assays(summarizeOverlaps(gr, rd, mode=counter))$counts > > >>>>>> [,1] > > >>>>>> [1,] 1 > > >>>>>> [2,] 1 > > >>>>>> > > >>>>>> > > >>>>>> Valerie > > >>>>>> > > >>>>>> On 04/09/2013 09:37 AM, Thomas Girke wrote: > > >>>>>>> Hi Valerie, > > >>>>>>> > > >>>>>>> Perhaps let's call this more explicitly an > ignore_inter_feature_overlap option > > >>>>>>> that we often need for cases like this: > > >>>>>>> > > >>>>>>> Read1 ---- > > >>>>>>> F1 ---------------- > > >>>>>>> F2 ----------- > > >>>>>>> > > >>>>>>> where we would like to have at least the option to assign Read1 > to both F1 and F2: > > >>>>>>> > > >>>>>>> F1: 1 > > >>>>>>> F2: 1 > > >>>>>>> > > >>>>>>> However, summarizeOverlapse doesn't count Read1 at all in all of > its currently > > >>>>>>> available modes that I am aware of. This lack of an > ignore_inter_feature_overlap > > >>>>>>> option is frequently a problem when working with poorly > annotated genomes (high > > >>>>>>> uncertainty about hidden/incorrect feature overlaps) or various > > >>>>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk > of ambiguous read > > >>>>>>> assignments than not counting at all as shown above. > > >>>>>>> > > >>>>>>> ## Here is a code example illustrating the same case: > > >>>>>>> library(GenomicRanges); library(Rsamtools) > > >>>>>>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)), > > >>>>>>> pos = as.integer(c(500)), > > >>>>>>> cigar = rep("100M", 1), strand = strand(rep("+", 1))) > > >>>>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), > strand="+", ID="feat1") > > >>>>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), > strand="+", ID="feat2") > > >>>>>>> gr <- c(gr1, gr2) > > >>>>>>> > > >>>>>>> ## All of the three current modes in summarizeOverlaps return a > count of zero > > >>>>>>> ## because of its inter_feature_overlap awareness: > > >>>>>>> assays(summarizeOverlaps(gr, rd, mode="Union", > ignore.strand=TRUE))$counts > > >>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", > ignore.strand=TRUE))$counts > > >>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", > ignore.strand=TRUE))$counts > > >>>>>>> [,1] > > >>>>>>> [1,] 0 > > >>>>>>> [2,] 0 > > >>>>>>> > > >>>>>>> ## However, countOverlaps handles this correctly, but is not the > best choice when > > >>>>>>> ## counting multiple range features. > > >>>>>>> countOverlaps(gr, rd) > > >>>>>>> [1] 1 1 > > >>>>>>> > > >>>>>>> > > >>>>>>> Thomas > > >>>>>>> > > >>>>>>>> sessionInfo() > > >>>>>>> R version 3.0.0 (2013-04-03) > > >>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) > > >>>>>>> > > >>>>>>> locale: > > >>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > >>>>>>> > > >>>>>>> attached base packages: > > >>>>>>> [1] parallel stats graphics grDevices utils datasets > methods > > >>>>>>> [8] base > > >>>>>>> > > >>>>>>> other attached packages: > > >>>>>>> [1] Rsamtools_1.12.0 Biostrings_2.28.0 > GenomicRanges_1.12.1 > > >>>>>>> [4] IRanges_1.18.0 BiocGenerics_0.6.0 > > >>>>>>> > > >>>>>>> loaded via a namespace (and not attached): > > >>>>>>> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 > > >>>>>>> > > >>>>>>> > > >>>>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain > wrote: > > >>>>>>>> Hi Thomas, > > >>>>>>>> > > >>>>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote: > > >>>>>>>>> Dear Valerie, > > >>>>>>>>> > > >>>>>>>>> Is there currently any way to run summarizeOverlaps in a > feature-overlap > > >>>>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? > Currently, > > >>>>>>>>> one can switch back to countOverlaps when feature overlap > unawareness is > > >>>>>>>>> the more appropriate counting mode for a biological question, > but then > > >>>>>>>>> double counting of reads mapping to multiple-range features is > not > > >>>>>>>>> accounted for. It would be really nice to have such a > feature-overlap > > >>>>>>>>> unaware option directly in summarizeOverlaps. > > >>>>>>>> > > >>>>>>>> No, we don't currently have an option to ignore > feature-overlap. It > > >>>>>>>> sounds like you want to count with countOverlaps() but still > want the > > >>>>>>>> double counting to be resolved. This is essentially what the > other modes > > >>>>>>>> are doing so I must be missing something. > > >>>>>>>> > > >>>>>>>> In this example 2 reads hit feature A, 1 read hits feature B. > With > > >>>>>>>> something like ignorefeature0L=TRUE, what results would you > expect to > > >>>>>>>> see? Maybe you have another, more descriptive example? > > >>>>>>>> > > >>>>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) > > >>>>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10, > > >>>>>>>> names=c("A", "B"))) > > >>>>>>>> > > >>>>>>>> > countOverlaps(features, reads) > > >>>>>>>> [1] 2 1 > > >>>>>>>> > > >>>>>>>> > > >>>>>>>>> > > >>>>>>>>> Another question relates to the memory usage of > summarizeOverlaps. Has > > >>>>>>>>> this been optimized yet? On a typical bam file with ~50-100 > million > > >>>>>>>>> reads the memory usage of summarizeOverlaps is often around > 10-20GB. To > > >>>>>>>>> use the function on a desktop computer or in large-scale > RNA-Seq > > >>>>>>>>> projects on a commodity compute cluster, it would be desirable > if every > > >>>>>>>>> counting instance would consume not more than 5GB of RAM. > > >>>>>>>> > > >>>>>>>> Have you tried the BamFileList-method? There is an example at > the bottom > > >>>>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan > > >>>>>>>> mentioned, the key is to set the 'yieldSize' parameter when > creating the > > >>>>>>>> BamFile. This method also makes use of mclapply(). > > >>>>>>>> > > >>>>>>>> Valerie > > >>>>>>>> > > >>>>>>>>> > > >>>>>>>>> Thanks in advance for your help and suggestions, > > >>>>>>>>> > > >>>>>>>>> Thomas > > >>>>>>>>> > > >>>>>>>>> _______________________________________________ > > >>>>>>>>> Bioconductor mailing list > > >>>>>>>>> Bioconductor@r-project.org > > >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > > >>>>>>>>> Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > >>>>>>>>> > > > > > > _______________________________________________ > > > Bioconductor mailing list > > > Bioconductor@r-project.org > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > -- > > Computational Biology / Fred Hutchinson Cancer Research Center > > 1100 Fairview Ave. N. > > PO Box 19024 Seattle, WA 98109 > > > > Location: Arnold Building M1 B861 > > Phone: (206) 667-2793 > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD REPLYlink written 4.6 years ago by Michael Lawrence9.8k
Hi Michael, I could not find the 'summarizeSpliceOverlaps' function in the GenomicFeatures package (1.12.0). The findSpliceOverlaps function seems to count reads more than once as well. I ran the sample code in the help page for this function and found that the single fragment was assigned to both transcripts: > genes <- GRangesList( + GRanges("chr1", IRanges(c(5, 20), c(10, 25)), "+"), + GRanges("chr1", IRanges(c(5, 22), c(15, 25)), "+")) > genes GRangesList of length 2: [[1]] GRanges with 2 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] chr1 [ 5, 10] + [2] chr1 [20, 25] + [[2]] GRanges with 2 ranges and 0 metadata columns: seqnames ranges strand [1] chr1 [ 5, 15] + [2] chr1 [22, 25] + --- seqlengths: chr1 NA > galp <- GappedAlignmentPairs( + GappedAlignments("chr1", 5L, "11M4N6M", strand("+")), + GappedAlignments("chr1", 50L, "6M", strand("-")), + isProperPair=TRUE) > galp GappedAlignmentPairs with 1 alignment pair and 0 metadata columns: seqnames strand : ranges -- ranges <rle> <rle> : <iranges> -- <iranges> [1] chr1 + : [5, 25] -- [50, 55] --- seqlengths: chr1 NA > findSpliceOverlaps(galp, genes) Hits of length 2 queryLength: 1 subjectLength: 2 queryHits subjectHits compatible unique coding strandSpecific <integer> <integer> <logical> <logical> <logical> <logical> 1 1 1 FALSE FALSE NA TRUE 2 1 2 FALSE FALSE NA TRUE > sessionInfo() R version 3.0.0 (2013-04-03) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] limma_3.16.1 GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 Biobase_2.20.0 GenomicRanges_1.12.1 [6] IRanges_1.18.0 BiocGenerics_0.6.0 Rsubread_1.10.1 BiocInstaller_1.10.0 loaded via a namespace (and not attached): [1] biomaRt_2.16.0 Biostrings_2.28.0 bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-5 RCurl_1.95-4.1 [7] Rsamtools_1.12.0 RSQLite_0.11.2 rtracklayer_1.20.0 stats4_3.0.0 tools_3.0.0 XML_3.95-0.2 [13] zlibbioc_1.6.0 > Wei On Apr 11, 2013, at 11:36 PM, Michael Lawrence wrote: > On Wed, Apr 10, 2013 at 7:09 PM, Thomas Girke <thomas.girke at="" ucr.edu=""> wrote: > >> Hi Martin, >> >> Yes, inter_feature=TRUE would maintain the current counting mode(s) that >> prohibits counting of reads mapping to multiple features. This is a >> special case of counting that is very useful for counting exonic regions >> of genes. However, one also wants to be able to turn off this behavior >> by ignoring inter feature overlaps (just like ignore.strand=T/F). >> Otherwise we cannot use summarizeOverlaps along with its current modes >> for important operations like transcript level counting because many >> transcript variants from the same gene will mask each others reads when >> inter_feature=TRUE. Providing the option to output the results from both >> inter_feature=TRUE and inter_feature=FALSE could be a very sensible >> solution and time saver for users working with new genomes/GFFs, where >> one cannot trust every nested annotation for various reasons, and >> inter_feature=TRUE can quickly become a very risky counting strategy >> since it tends to erase counts. Every biologist will scream in your >> face if the counter tells them that their favorite gene has zero counts >> just because of some overlap with some annotation error:). >> >> For multiple range features stored in GRangesList objects, I would >> currently favor making "inter_feature=FALSE" ignore the overlaps >> occurring among different list components, but not necessarily those >> within a list component (e.g. exon ranges of a gene). This way one >> can benefit from the current infrastructure by restricting its feature >> overlap scope to the range sets stored within individual list components >> but ignoring those among different list components. Utilities like reduce >> and other range modifier functions could handle situations where one wants >> to ignore all feature overlaps within and among list components. However, >> I am sure there could be other solutions to this. >> >> Long story short if inter_feature=TRUE/FALSE could be used in >> combination with modes=Union/IntersectionStrict/IntersectionNonEmpty >> resulting in six different counting flavors, I would be happy and, I am >> sure, many other Bioc users as well. >> >> > Have you taken a look at findSpliceOverlaps? Maybe summarizeSpliceOverlaps > could be completed to satisfy your use case? Somehow we need to control the > proliferation of counting functions and modes. Having some idea of your > high-level use case might help. > > Also, for spliced alignments, I recommend giving GSNAP a try if you haven't > already. It's accessible though the gmapR package. > > Michael > > I hope this makes sense. >> >> Thomas >> >> >> >> On Wed, Apr 10, 2013 at 11:37:04PM +0000, Martin Morgan wrote: >>> On 04/10/2013 12:42 PM, Thomas Girke wrote: >>>> Thanks. Adding an inter-feature unaware mode will be very helpful and >> also >>>> broaden summarizeOverlaps' application spectrum for a lot of use cases. >>> >>> I'm probably being quite dense here, and am mostly an outside observer. >> What I >>> hear you saying is that there are currently three modes -- Union, >>> IntersectionStrict, IntersectionNonEmpty. These modes are summarized in >> the >>> seven rows of figure 1 of >>> >>> vignette("summarizeOverlaps", package="GenomicRanges") >>> >>> Let's say there is a flag inter_feature, and when its value is TRUE then >> the >>> current counting schemes are obtained. These modes differ in the way >> counting >>> works for reads illustrated in rows 5, 6, and 7 of the figure. You'd >> like a >>> count scored where a '1' appears in the table below. With >> inter_feature=TRUE >>> reads are 'never counted more than once' ('Hits per read' is <= 1) >>> >>> |----------------------+-----+-----------+-----------+---------------| >>> | Mode | Row | Feature 1 | Feature 2 | Hits per read | >>> |----------------------+-----+-----------+-----------+---------------| >>> | Union | 5 | 1 | 0 | 1 | >>> | | 6 | 0 | 0 | 0 | >>> | | 7 | 0 | 0 | 0 | >>> | IntersectionStrict | 5 | 1 | 0 | 1 | >>> | | 6 | 1 | 0 | 1 | >>> | | 7 | 0 | 0 | 0 | >>> | IntersectionNotEmpty | 5 | 1 | 0 | 1 | >>> | | 6 | 1 | 0 | 1 | >>> | | 7 | 0 | 0 | 0 | >>> |----------------------+-----+-----------+-----------+---------------| >>> >>> >>> You'd like to add 3 more modes, with inter_feature=FALSE. Reads are >> sometimes >>> 'counted twice' >>> >>> |----------------------+-----+-----------+-----------+---------------| >>> | Mode | Row | Feature 1 | Feature 2 | Hits per read | >>> |----------------------+-----+-----------+-----------+---------------| >>> | Union | 5 | 1 | 0 | 1 | >>> | | 6 | 1 | 1 | 2 | >>> | | 7 | 1 | 1 | 2 | >>> | IntersectionStrict | 5 | 1 | 0 | 1 | >>> | | 6 | 1 | 0 | 1 | >>> | | 7 | 1 | 1 | 2 | >>> | IntersectionNotEmpty | 5 | 1 | 0 | 1 | >>> | | 6 | 1 | 1 | 2 | >>> | | 7 | 1 | 1 | 2 | >>> |----------------------+-----+-----------+-----------+---------------| >>> >>> >>> Martin >>> >>>> >>>> Thomas >>>> >>>> >>>> On Wed, Apr 10, 2013 at 04:59:49PM +0000, Valerie Obenchain wrote: >>>>> Hi Thomas, >>>>> >>>>> On 04/10/2013 09:23 AM, Thomas Girke wrote: >>>>>> Valerie, >>>>>> >>>>>> Please see my inserted comments below. >>>>>> >>>>>> On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote: >>>>>>> Ah, I see. You'd like to count with one of the existing modes but >> have >>>>>>> the option to pick up counts for these inter-feature reads (fall >>>>>>> completely 'within' >1 feature). These inter-feature reads would be >>>>>>> double (triple, quadruple, etc.) counted. Essentially they would >> add one >>>>>>> count to each feature they hit. Right? >>>>>> >>>>>> Correct. Perhaps let's discuss this with a very common example of >>>>>> transcript-level counting rather than counting on the unified >> (virtual) >>>>>> exonic regions of genes. With the current description provided in the >>>>>> summarizeOverlaps vignette at >>>>>> >> http://www.bioconductor.org/packages/2.12/bioc/vignettes/GenomicRan ges/inst/doc/summarizeOverlaps.pdf >>>>>> I don't see how this can be achieved without falling back to using >>>>>> countOverlaps without loosing the new counting modes provided by >>>>>> summarizeOverlaps? >>>>> >>>>> It can't be achieved with the function as is but we could add an >>>>> argument to handle this (as you suggested from the start). If >>>>> 'inter-feature=TRUE' then these counts would be added to the counts >>>>> already obtained from using a particular 'mode'. I will move ahead >> with >>>>> implementing this argument. >>>>> >>>>>> >>>>>>> >>>>>>> One more thought about memory usage. If you are working with >> single-end >>>>>>> reads the summarizeOverlaps,BamFileList-method I mentioned below >> should >>>>>>> work fine. The approach is slightly different with paired-end >> reads. Our >>>>>>> current algorithm for pairing paired-end reads requires the whole >> file >>>>>>> to be read into memory. A different approach is currently being >>>>>>> developed but in the meantime you can take the qname-sorted >> approach. >>>>>>> The Bam file will need to be sorted by qname and both the >> 'yieldSize' >>>>>>> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An >>>>>>> example is on ?BamFile. >>>>>> >>>>>> Thanks for pointing this out. My fault that I didn't read through all >>>>>> the linked documentation. Perhaps it may not be a bad idea to make >> the >>>>>> memory restricted bam read instances the default setting in the >> future. >>>>>> This will definitely help biologists using those utilities without >>>>>> crashing their machines on the first attempt. >>>>> >>>>> Good suggestion, thanks. >>>>> >>>>> Valerie >>>>> >>>>>> >>>>>> Thomas >>>>>> >>>>>> >>>>>>> >>>>>>> Valerie >>>>>>> >>>>>>> >>>>>>> On 04/09/2013 08:01 PM, Thomas Girke wrote: >>>>>>>> Thanks for the tip. I guess doing it this way reverses the >> counting mode >>>>>>>> back to countOverlaps, but how can I use at the same time >>>>>>>> "IntersectionStrict" or any of the other modes provided by >>>>>>>> summarizeOverlaps if its mode argument is already used and >> countOverlaps >>>>>>>> doesn't accept one? >>>>>>>> >>>>>>>> Thomas >>>>>>>> >>>>>>>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote: >>>>>>>>> Thanks for the example. You're right, none of the modes will >> count a >>>>>>>>> read falling completely within >1 feature. >>>>>>>>> >>>>>>>>> You can supply your own function as a 'mode'. Two requirements >> must be met: >>>>>>>>> >>>>>>>>> (1) The function must have 3 arguments that correspond to >>>>>>>>> 'features', 'reads' and 'ignore.strand', in that order. >>>>>>>>> >>>>>>>>> (2) The function must return a vector of counts the same length >>>>>>>>> as 'features' >>>>>>>>> >>>>>>>>> Here is an example using countOverlaps() which I think gets at the >>>>>>>>> counting you want. >>>>>>>>> >>>>>>>>> counter <- function(x, y, ignore.strand) >>>>>>>>> countOverlaps(y, x, ignore.strand=ignore.strand) >>>>>>>>> >>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode=counter))$counts >>>>>>>>> [,1] >>>>>>>>> [1,] 1 >>>>>>>>> [2,] 1 >>>>>>>>> >>>>>>>>> >>>>>>>>> Valerie >>>>>>>>> >>>>>>>>> On 04/09/2013 09:37 AM, Thomas Girke wrote: >>>>>>>>>> Hi Valerie, >>>>>>>>>> >>>>>>>>>> Perhaps let's call this more explicitly an >> ignore_inter_feature_overlap option >>>>>>>>>> that we often need for cases like this: >>>>>>>>>> >>>>>>>>>> Read1 ---- >>>>>>>>>> F1 ---------------- >>>>>>>>>> F2 ----------- >>>>>>>>>> >>>>>>>>>> where we would like to have at least the option to assign Read1 >> to both F1 and F2: >>>>>>>>>> >>>>>>>>>> F1: 1 >>>>>>>>>> F2: 1 >>>>>>>>>> >>>>>>>>>> However, summarizeOverlapse doesn't count Read1 at all in all of >> its currently >>>>>>>>>> available modes that I am aware of. This lack of an >> ignore_inter_feature_overlap >>>>>>>>>> option is frequently a problem when working with poorly >> annotated genomes (high >>>>>>>>>> uncertainty about hidden/incorrect feature overlaps) or various >>>>>>>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk >> of ambiguous read >>>>>>>>>> assignments than not counting at all as shown above. >>>>>>>>>> >>>>>>>>>> ## Here is a code example illustrating the same case: >>>>>>>>>> library(GenomicRanges); library(Rsamtools) >>>>>>>>>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)), >>>>>>>>>> pos = as.integer(c(500)), >>>>>>>>>> cigar = rep("100M", 1), strand = strand(rep("+", 1))) >>>>>>>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), >> strand="+", ID="feat1") >>>>>>>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), >> strand="+", ID="feat2") >>>>>>>>>> gr <- c(gr1, gr2) >>>>>>>>>> >>>>>>>>>> ## All of the three current modes in summarizeOverlaps return a >> count of zero >>>>>>>>>> ## because of its inter_feature_overlap awareness: >>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="Union", >> ignore.strand=TRUE))$counts >>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", >> ignore.strand=TRUE))$counts >>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", >> ignore.strand=TRUE))$counts >>>>>>>>>> [,1] >>>>>>>>>> [1,] 0 >>>>>>>>>> [2,] 0 >>>>>>>>>> >>>>>>>>>> ## However, countOverlaps handles this correctly, but is not the >> best choice when >>>>>>>>>> ## counting multiple range features. >>>>>>>>>> countOverlaps(gr, rd) >>>>>>>>>> [1] 1 1 >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Thomas >>>>>>>>>> >>>>>>>>>>> sessionInfo() >>>>>>>>>> R version 3.0.0 (2013-04-03) >>>>>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>>>>>>>>> >>>>>>>>>> locale: >>>>>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>>>>>>>>> >>>>>>>>>> attached base packages: >>>>>>>>>> [1] parallel stats graphics grDevices utils datasets >> methods >>>>>>>>>> [8] base >>>>>>>>>> >>>>>>>>>> other attached packages: >>>>>>>>>> [1] Rsamtools_1.12.0 Biostrings_2.28.0 >> GenomicRanges_1.12.1 >>>>>>>>>> [4] IRanges_1.18.0 BiocGenerics_0.6.0 >>>>>>>>>> >>>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>>> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain >> wrote: >>>>>>>>>>> Hi Thomas, >>>>>>>>>>> >>>>>>>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote: >>>>>>>>>>>> Dear Valerie, >>>>>>>>>>>> >>>>>>>>>>>> Is there currently any way to run summarizeOverlaps in a >> feature-overlap >>>>>>>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? >> Currently, >>>>>>>>>>>> one can switch back to countOverlaps when feature overlap >> unawareness is >>>>>>>>>>>> the more appropriate counting mode for a biological question, >> but then >>>>>>>>>>>> double counting of reads mapping to multiple-range features is >> not >>>>>>>>>>>> accounted for. It would be really nice to have such a >> feature-overlap >>>>>>>>>>>> unaware option directly in summarizeOverlaps. >>>>>>>>>>> >>>>>>>>>>> No, we don't currently have an option to ignore >> feature-overlap. It >>>>>>>>>>> sounds like you want to count with countOverlaps() but still >> want the >>>>>>>>>>> double counting to be resolved. This is essentially what the >> other modes >>>>>>>>>>> are doing so I must be missing something. >>>>>>>>>>> >>>>>>>>>>> In this example 2 reads hit feature A, 1 read hits feature B. >> With >>>>>>>>>>> something like ignorefeature0L=TRUE, what results would you >> expect to >>>>>>>>>>> see? Maybe you have another, more descriptive example? >>>>>>>>>>> >>>>>>>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) >>>>>>>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10, >>>>>>>>>>> names=c("A", "B"))) >>>>>>>>>>> >>>>>>>>>>>> countOverlaps(features, reads) >>>>>>>>>>> [1] 2 1 >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Another question relates to the memory usage of >> summarizeOverlaps. Has >>>>>>>>>>>> this been optimized yet? On a typical bam file with ~50-100 >> million >>>>>>>>>>>> reads the memory usage of summarizeOverlaps is often around >> 10-20GB. To >>>>>>>>>>>> use the function on a desktop computer or in large-scale >> RNA-Seq >>>>>>>>>>>> projects on a commodity compute cluster, it would be desirable >> if every >>>>>>>>>>>> counting instance would consume not more than 5GB of RAM. >>>>>>>>>>> >>>>>>>>>>> Have you tried the BamFileList-method? There is an example at >> the bottom >>>>>>>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan >>>>>>>>>>> mentioned, the key is to set the 'yieldSize' parameter when >> creating the >>>>>>>>>>> BamFile. This method also makes use of mclapply(). >>>>>>>>>>> >>>>>>>>>>> Valerie >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Thanks in advance for your help and suggestions, >>>>>>>>>>>> >>>>>>>>>>>> Thomas >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> Bioconductor mailing list >>>>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>>>> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>>>>> >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> >>> -- >>> Computational Biology / Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N. >>> PO Box 19024 Seattle, WA 98109 >>> >>> Location: Arnold Building M1 B861 >>> Phone: (206) 667-2793 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}
ADD REPLYlink written 4.6 years ago by Wei Shi2.7k
Hi Wei, summarizeSpliceOverlaps does not yet exist. We held off on that until we determined whether findSpliceOverlaps was useful. And yes, findSpliceOverlaps counts reads on a per-feature basis, so it will double count reads in that way. That's totally intentional (and I'm pretty sure what Thomas was wanting). Michael On Thu, Apr 11, 2013 at 7:11 PM, Wei Shi <shi@wehi.edu.au> wrote: > Hi Michael, > > I could not find the 'summarizeSpliceOverlaps' function in the > GenomicFeatures package (1.12.0). > > The findSpliceOverlaps function seems to count reads more than once as > well. I ran the sample code in the help page for this function and found > that the single fragment was assigned to both transcripts: > > > genes <- GRangesList( > + GRanges("chr1", IRanges(c(5, 20), c(10, 25)), "+"), > + GRanges("chr1", IRanges(c(5, 22), c(15, 25)), "+")) > > genes > GRangesList of length 2: > [[1]] > GRanges with 2 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [ 5, 10] + > [2] chr1 [20, 25] + > > [[2]] > GRanges with 2 ranges and 0 metadata columns: > seqnames ranges strand > [1] chr1 [ 5, 15] + > [2] chr1 [22, 25] + > > --- > seqlengths: > chr1 > NA > > galp <- GappedAlignmentPairs( > + GappedAlignments("chr1", 5L, "11M4N6M", strand("+")), > + GappedAlignments("chr1", 50L, "6M", strand("-")), > + isProperPair=TRUE) > > galp > GappedAlignmentPairs with 1 alignment pair and 0 metadata columns: > seqnames strand : ranges -- ranges > <rle> <rle> : <iranges> -- <iranges> > [1] chr1 + : [5, 25] -- [50, 55] > --- > seqlengths: > chr1 > NA > > findSpliceOverlaps(galp, genes) > Hits of length 2 > queryLength: 1 > subjectLength: 2 > queryHits subjectHits compatible unique coding strandSpecific > <integer> <integer> <logical> <logical> <logical> <logical> > 1 1 1 FALSE FALSE NA TRUE > 2 1 2 FALSE FALSE NA TRUE > > > sessionInfo() > R version 3.0.0 (2013-04-03) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > base > > other attached packages: > [1] limma_3.16.1 GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 > Biobase_2.20.0 GenomicRanges_1.12.1 > [6] IRanges_1.18.0 BiocGenerics_0.6.0 Rsubread_1.10.1 > BiocInstaller_1.10.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.16.0 Biostrings_2.28.0 bitops_1.0-5 > BSgenome_1.28.0 DBI_0.2-5 RCurl_1.95-4.1 > [7] Rsamtools_1.12.0 RSQLite_0.11.2 rtracklayer_1.20.0 stats4_3.0.0 > tools_3.0.0 XML_3.95-0.2 > [13] zlibbioc_1.6.0 > > > > > Wei > > On Apr 11, 2013, at 11:36 PM, Michael Lawrence wrote: > > > On Wed, Apr 10, 2013 at 7:09 PM, Thomas Girke <thomas.girke@ucr.edu> > wrote: > > > >> Hi Martin, > >> > >> Yes, inter_feature=TRUE would maintain the current counting mode(s) that > >> prohibits counting of reads mapping to multiple features. This is a > >> special case of counting that is very useful for counting exonic regions > >> of genes. However, one also wants to be able to turn off this behavior > >> by ignoring inter feature overlaps (just like ignore.strand=T/F). > >> Otherwise we cannot use summarizeOverlaps along with its current modes > >> for important operations like transcript level counting because many > >> transcript variants from the same gene will mask each others reads when > >> inter_feature=TRUE. Providing the option to output the results from both > >> inter_feature=TRUE and inter_feature=FALSE could be a very sensible > >> solution and time saver for users working with new genomes/GFFs, where > >> one cannot trust every nested annotation for various reasons, and > >> inter_feature=TRUE can quickly become a very risky counting strategy > >> since it tends to erase counts. Every biologist will scream in your > >> face if the counter tells them that their favorite gene has zero counts > >> just because of some overlap with some annotation error:). > >> > >> For multiple range features stored in GRangesList objects, I would > >> currently favor making "inter_feature=FALSE" ignore the overlaps > >> occurring among different list components, but not necessarily those > >> within a list component (e.g. exon ranges of a gene). This way one > >> can benefit from the current infrastructure by restricting its feature > >> overlap scope to the range sets stored within individual list components > >> but ignoring those among different list components. Utilities like > reduce > >> and other range modifier functions could handle situations where one > wants > >> to ignore all feature overlaps within and among list components. > However, > >> I am sure there could be other solutions to this. > >> > >> Long story short if inter_feature=TRUE/FALSE could be used in > >> combination with modes=Union/IntersectionStrict/IntersectionNonEmpty > >> resulting in six different counting flavors, I would be happy and, I am > >> sure, many other Bioc users as well. > >> > >> > > Have you taken a look at findSpliceOverlaps? Maybe > summarizeSpliceOverlaps > > could be completed to satisfy your use case? Somehow we need to control > the > > proliferation of counting functions and modes. Having some idea of your > > high-level use case might help. > > > > Also, for spliced alignments, I recommend giving GSNAP a try if you > haven't > > already. It's accessible though the gmapR package. > > > > Michael > > > > I hope this makes sense. > >> > >> Thomas > >> > >> > >> > >> On Wed, Apr 10, 2013 at 11:37:04PM +0000, Martin Morgan wrote: > >>> On 04/10/2013 12:42 PM, Thomas Girke wrote: > >>>> Thanks. Adding an inter-feature unaware mode will be very helpful and > >> also > >>>> broaden summarizeOverlaps' application spectrum for a lot of use > cases. > >>> > >>> I'm probably being quite dense here, and am mostly an outside observer. > >> What I > >>> hear you saying is that there are currently three modes -- Union, > >>> IntersectionStrict, IntersectionNonEmpty. These modes are summarized in > >> the > >>> seven rows of figure 1 of > >>> > >>> vignette("summarizeOverlaps", package="GenomicRanges") > >>> > >>> Let's say there is a flag inter_feature, and when its value is TRUE > then > >> the > >>> current counting schemes are obtained. These modes differ in the way > >> counting > >>> works for reads illustrated in rows 5, 6, and 7 of the figure. You'd > >> like a > >>> count scored where a '1' appears in the table below. With > >> inter_feature=TRUE > >>> reads are 'never counted more than once' ('Hits per read' is <= 1) > >>> > >>> |----------------------+-----+-----------+-----------+---------------| > >>> | Mode | Row | Feature 1 | Feature 2 | Hits per read | > >>> |----------------------+-----+-----------+-----------+---------------| > >>> | Union | 5 | 1 | 0 | 1 | > >>> | | 6 | 0 | 0 | 0 | > >>> | | 7 | 0 | 0 | 0 | > >>> | IntersectionStrict | 5 | 1 | 0 | 1 | > >>> | | 6 | 1 | 0 | 1 | > >>> | | 7 | 0 | 0 | 0 | > >>> | IntersectionNotEmpty | 5 | 1 | 0 | 1 | > >>> | | 6 | 1 | 0 | 1 | > >>> | | 7 | 0 | 0 | 0 | > >>> |----------------------+-----+-----------+-----------+---------------| > >>> > >>> > >>> You'd like to add 3 more modes, with inter_feature=FALSE. Reads are > >> sometimes > >>> 'counted twice' > >>> > >>> |----------------------+-----+-----------+-----------+---------------| > >>> | Mode | Row | Feature 1 | Feature 2 | Hits per read | > >>> |----------------------+-----+-----------+-----------+---------------| > >>> | Union | 5 | 1 | 0 | 1 | > >>> | | 6 | 1 | 1 | 2 | > >>> | | 7 | 1 | 1 | 2 | > >>> | IntersectionStrict | 5 | 1 | 0 | 1 | > >>> | | 6 | 1 | 0 | 1 | > >>> | | 7 | 1 | 1 | 2 | > >>> | IntersectionNotEmpty | 5 | 1 | 0 | 1 | > >>> | | 6 | 1 | 1 | 2 | > >>> | | 7 | 1 | 1 | 2 | > >>> |----------------------+-----+-----------+-----------+---------------| > >>> > >>> > >>> Martin > >>> > >>>> > >>>> Thomas > >>>> > >>>> > >>>> On Wed, Apr 10, 2013 at 04:59:49PM +0000, Valerie Obenchain wrote: > >>>>> Hi Thomas, > >>>>> > >>>>> On 04/10/2013 09:23 AM, Thomas Girke wrote: > >>>>>> Valerie, > >>>>>> > >>>>>> Please see my inserted comments below. > >>>>>> > >>>>>> On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote: > >>>>>>> Ah, I see. You'd like to count with one of the existing modes but > >> have > >>>>>>> the option to pick up counts for these inter-feature reads (fall > >>>>>>> completely 'within' >1 feature). These inter-feature reads would be > >>>>>>> double (triple, quadruple, etc.) counted. Essentially they would > >> add one > >>>>>>> count to each feature they hit. Right? > >>>>>> > >>>>>> Correct. Perhaps let's discuss this with a very common example of > >>>>>> transcript-level counting rather than counting on the unified > >> (virtual) > >>>>>> exonic regions of genes. With the current description provided in > the > >>>>>> summarizeOverlaps vignette at > >>>>>> > >> > http://www.bioconductor.org/packages/2.12/bioc/vignettes/GenomicRang es/inst/doc/summarizeOverlaps.pdf > >>>>>> I don't see how this can be achieved without falling back to using > >>>>>> countOverlaps without loosing the new counting modes provided by > >>>>>> summarizeOverlaps? > >>>>> > >>>>> It can't be achieved with the function as is but we could add an > >>>>> argument to handle this (as you suggested from the start). If > >>>>> 'inter-feature=TRUE' then these counts would be added to the counts > >>>>> already obtained from using a particular 'mode'. I will move ahead > >> with > >>>>> implementing this argument. > >>>>> > >>>>>> > >>>>>>> > >>>>>>> One more thought about memory usage. If you are working with > >> single-end > >>>>>>> reads the summarizeOverlaps,BamFileList-method I mentioned below > >> should > >>>>>>> work fine. The approach is slightly different with paired- end > >> reads. Our > >>>>>>> current algorithm for pairing paired-end reads requires the whole > >> file > >>>>>>> to be read into memory. A different approach is currently being > >>>>>>> developed but in the meantime you can take the qname-sorted > >> approach. > >>>>>>> The Bam file will need to be sorted by qname and both the > >> 'yieldSize' > >>>>>>> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An > >>>>>>> example is on ?BamFile. > >>>>>> > >>>>>> Thanks for pointing this out. My fault that I didn't read through > all > >>>>>> the linked documentation. Perhaps it may not be a bad idea to make > >> the > >>>>>> memory restricted bam read instances the default setting in the > >> future. > >>>>>> This will definitely help biologists using those utilities without > >>>>>> crashing their machines on the first attempt. > >>>>> > >>>>> Good suggestion, thanks. > >>>>> > >>>>> Valerie > >>>>> > >>>>>> > >>>>>> Thomas > >>>>>> > >>>>>> > >>>>>>> > >>>>>>> Valerie > >>>>>>> > >>>>>>> > >>>>>>> On 04/09/2013 08:01 PM, Thomas Girke wrote: > >>>>>>>> Thanks for the tip. I guess doing it this way reverses the > >> counting mode > >>>>>>>> back to countOverlaps, but how can I use at the same time > >>>>>>>> "IntersectionStrict" or any of the other modes provided by > >>>>>>>> summarizeOverlaps if its mode argument is already used and > >> countOverlaps > >>>>>>>> doesn't accept one? > >>>>>>>> > >>>>>>>> Thomas > >>>>>>>> > >>>>>>>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote: > >>>>>>>>> Thanks for the example. You're right, none of the modes will > >> count a > >>>>>>>>> read falling completely within >1 feature. > >>>>>>>>> > >>>>>>>>> You can supply your own function as a 'mode'. Two requirements > >> must be met: > >>>>>>>>> > >>>>>>>>> (1) The function must have 3 arguments that correspond to > >>>>>>>>> 'features', 'reads' and 'ignore.strand', in that order. > >>>>>>>>> > >>>>>>>>> (2) The function must return a vector of counts the same length > >>>>>>>>> as 'features' > >>>>>>>>> > >>>>>>>>> Here is an example using countOverlaps() which I think gets at > the > >>>>>>>>> counting you want. > >>>>>>>>> > >>>>>>>>> counter <- function(x, y, ignore.strand) > >>>>>>>>> countOverlaps(y, x, ignore.strand=ignore.strand) > >>>>>>>>> > >>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode=counter))$counts > >>>>>>>>> [,1] > >>>>>>>>> [1,] 1 > >>>>>>>>> [2,] 1 > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Valerie > >>>>>>>>> > >>>>>>>>> On 04/09/2013 09:37 AM, Thomas Girke wrote: > >>>>>>>>>> Hi Valerie, > >>>>>>>>>> > >>>>>>>>>> Perhaps let's call this more explicitly an > >> ignore_inter_feature_overlap option > >>>>>>>>>> that we often need for cases like this: > >>>>>>>>>> > >>>>>>>>>> Read1 ---- > >>>>>>>>>> F1 ---------------- > >>>>>>>>>> F2 ----------- > >>>>>>>>>> > >>>>>>>>>> where we would like to have at least the option to assign Read1 > >> to both F1 and F2: > >>>>>>>>>> > >>>>>>>>>> F1: 1 > >>>>>>>>>> F2: 1 > >>>>>>>>>> > >>>>>>>>>> However, summarizeOverlapse doesn't count Read1 at all in all of > >> its currently > >>>>>>>>>> available modes that I am aware of. This lack of an > >> ignore_inter_feature_overlap > >>>>>>>>>> option is frequently a problem when working with poorly > >> annotated genomes (high > >>>>>>>>>> uncertainty about hidden/incorrect feature overlaps) or various > >>>>>>>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk > >> of ambiguous read > >>>>>>>>>> assignments than not counting at all as shown above. > >>>>>>>>>> > >>>>>>>>>> ## Here is a code example illustrating the same case: > >>>>>>>>>> library(GenomicRanges); library(Rsamtools) > >>>>>>>>>> rd <- GappedAlignments(letters[1], seqnames = > Rle(rep("chr1",1)), > >>>>>>>>>> pos = as.integer(c(500)), > >>>>>>>>>> cigar = rep("100M", 1), strand = strand(rep("+", 1))) > >>>>>>>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), > >> strand="+", ID="feat1") > >>>>>>>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), > >> strand="+", ID="feat2") > >>>>>>>>>> gr <- c(gr1, gr2) > >>>>>>>>>> > >>>>>>>>>> ## All of the three current modes in summarizeOverlaps return a > >> count of zero > >>>>>>>>>> ## because of its inter_feature_overlap awareness: > >>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="Union", > >> ignore.strand=TRUE))$counts > >>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", > >> ignore.strand=TRUE))$counts > >>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", > >> ignore.strand=TRUE))$counts > >>>>>>>>>> [,1] > >>>>>>>>>> [1,] 0 > >>>>>>>>>> [2,] 0 > >>>>>>>>>> > >>>>>>>>>> ## However, countOverlaps handles this correctly, but is not the > >> best choice when > >>>>>>>>>> ## counting multiple range features. > >>>>>>>>>> countOverlaps(gr, rd) > >>>>>>>>>> [1] 1 1 > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Thomas > >>>>>>>>>> > >>>>>>>>>>> sessionInfo() > >>>>>>>>>> R version 3.0.0 (2013-04-03) > >>>>>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) > >>>>>>>>>> > >>>>>>>>>> locale: > >>>>>>>>>> [1] > en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > >>>>>>>>>> > >>>>>>>>>> attached base packages: > >>>>>>>>>> [1] parallel stats graphics grDevices utils datasets > >> methods > >>>>>>>>>> [8] base > >>>>>>>>>> > >>>>>>>>>> other attached packages: > >>>>>>>>>> [1] Rsamtools_1.12.0 Biostrings_2.28.0 > >> GenomicRanges_1.12.1 > >>>>>>>>>> [4] IRanges_1.18.0 BiocGenerics_0.6.0 > >>>>>>>>>> > >>>>>>>>>> loaded via a namespace (and not attached): > >>>>>>>>>> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain > >> wrote: > >>>>>>>>>>> Hi Thomas, > >>>>>>>>>>> > >>>>>>>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote: > >>>>>>>>>>>> Dear Valerie, > >>>>>>>>>>>> > >>>>>>>>>>>> Is there currently any way to run summarizeOverlaps in a > >> feature-overlap > >>>>>>>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? > >> Currently, > >>>>>>>>>>>> one can switch back to countOverlaps when feature overlap > >> unawareness is > >>>>>>>>>>>> the more appropriate counting mode for a biological question, > >> but then > >>>>>>>>>>>> double counting of reads mapping to multiple-range features is > >> not > >>>>>>>>>>>> accounted for. It would be really nice to have such a > >> feature-overlap > >>>>>>>>>>>> unaware option directly in summarizeOverlaps. > >>>>>>>>>>> > >>>>>>>>>>> No, we don't currently have an option to ignore > >> feature-overlap. It > >>>>>>>>>>> sounds like you want to count with countOverlaps() but still > >> want the > >>>>>>>>>>> double counting to be resolved. This is essentially what the > >> other modes > >>>>>>>>>>> are doing so I must be missing something. > >>>>>>>>>>> > >>>>>>>>>>> In this example 2 reads hit feature A, 1 read hits feature B. > >> With > >>>>>>>>>>> something like ignorefeature0L=TRUE, what results would you > >> expect to > >>>>>>>>>>> see? Maybe you have another, more descriptive example? > >>>>>>>>>>> > >>>>>>>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) > >>>>>>>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10, > >>>>>>>>>>> names=c("A", "B"))) > >>>>>>>>>>> > >>>>>>>>>>>> countOverlaps(features, reads) > >>>>>>>>>>> [1] 2 1 > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Another question relates to the memory usage of > >> summarizeOverlaps. Has > >>>>>>>>>>>> this been optimized yet? On a typical bam file with ~50-100 > >> million > >>>>>>>>>>>> reads the memory usage of summarizeOverlaps is often around > >> 10-20GB. To > >>>>>>>>>>>> use the function on a desktop computer or in large- scale > >> RNA-Seq > >>>>>>>>>>>> projects on a commodity compute cluster, it would be desirable > >> if every > >>>>>>>>>>>> counting instance would consume not more than 5GB of RAM. > >>>>>>>>>>> > >>>>>>>>>>> Have you tried the BamFileList-method? There is an example at > >> the bottom > >>>>>>>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan > >>>>>>>>>>> mentioned, the key is to set the 'yieldSize' parameter when > >> creating the > >>>>>>>>>>> BamFile. This method also makes use of mclapply(). > >>>>>>>>>>> > >>>>>>>>>>> Valerie > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks in advance for your help and suggestions, > >>>>>>>>>>>> > >>>>>>>>>>>> Thomas > >>>>>>>>>>>> > >>>>>>>>>>>> _______________________________________________ > >>>>>>>>>>>> Bioconductor mailing list > >>>>>>>>>>>> Bioconductor@r-project.org > >>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>>>>>>>>> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>>>>>>>>>> > >>>> > >>>> _______________________________________________ > >>>> Bioconductor mailing list > >>>> Bioconductor@r-project.org > >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>> > >>> > >>> > >>> -- > >>> Computational Biology / Fred Hutchinson Cancer Research Center > >>> 1100 Fairview Ave. N. > >>> PO Box 19024 Seattle, WA 98109 > >>> > >>> Location: Arnold Building M1 B861 > >>> Phone: (206) 667-2793 > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor@r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:10}}
ADD REPLYlink written 4.6 years ago by Michael Lawrence9.8k
On 04/12/2013 06:53 AM, Michael Lawrence wrote: > Hi Wei, > > summarizeSpliceOverlaps does not yet exist. We held off on that until we > determined whether findSpliceOverlaps was useful. And yes, > findSpliceOverlaps counts reads on a per-feature basis, so it will double > count reads in that way. That's totally intentional (and I'm pretty sure > what Thomas was wanting). also the object returned can be easily manipulated? > olaps = findSpliceOverlaps(galp, genes) > olaps Hits of length 2 queryLength: 1 subjectLength: 2 queryHits subjectHits compatible unique coding strandSpecific <integer> <integer> <logical> <logical> <logical> <logical> 1 1 1 FALSE FALSE NA TRUE 2 1 2 FALSE FALSE NA TRUE > olaps[mcols(olaps)$unique] Hits of length 0 queryLength: 1 subjectLength: 2 and canonically ulaps = olaps[mcols(olaps)$unique] tabulate(subjectHits(ulaps), subjectLength(ulaps)) see ?Hits Martin > > Michael > > > On Thu, Apr 11, 2013 at 7:11 PM, Wei Shi <shi at="" wehi.edu.au=""> wrote: > >> Hi Michael, >> >> I could not find the 'summarizeSpliceOverlaps' function in the >> GenomicFeatures package (1.12.0). >> >> The findSpliceOverlaps function seems to count reads more than once as >> well. I ran the sample code in the help page for this function and found >> that the single fragment was assigned to both transcripts: >> >>> genes <- GRangesList( >> + GRanges("chr1", IRanges(c(5, 20), c(10, 25)), "+"), >> + GRanges("chr1", IRanges(c(5, 22), c(15, 25)), "+")) >>> genes >> GRangesList of length 2: >> [[1]] >> GRanges with 2 ranges and 0 metadata columns: >> seqnames ranges strand >> <rle> <iranges> <rle> >> [1] chr1 [ 5, 10] + >> [2] chr1 [20, 25] + >> >> [[2]] >> GRanges with 2 ranges and 0 metadata columns: >> seqnames ranges strand >> [1] chr1 [ 5, 15] + >> [2] chr1 [22, 25] + >> >> --- >> seqlengths: >> chr1 >> NA >>> galp <- GappedAlignmentPairs( >> + GappedAlignments("chr1", 5L, "11M4N6M", strand("+")), >> + GappedAlignments("chr1", 50L, "6M", strand("-")), >> + isProperPair=TRUE) >>> galp >> GappedAlignmentPairs with 1 alignment pair and 0 metadata columns: >> seqnames strand : ranges -- ranges >> <rle> <rle> : <iranges> -- <iranges> >> [1] chr1 + : [5, 25] -- [50, 55] >> --- >> seqlengths: >> chr1 >> NA >>> findSpliceOverlaps(galp, genes) >> Hits of length 2 >> queryLength: 1 >> subjectLength: 2 >> queryHits subjectHits compatible unique coding strandSpecific >> <integer> <integer> <logical> <logical> <logical> <logical> >> 1 1 1 FALSE FALSE NA TRUE >> 2 1 2 FALSE FALSE NA TRUE >> >>> sessionInfo() >> R version 3.0.0 (2013-04-03) >> Platform: x86_64-apple-darwin10.8.0 (64-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> base >> >> other attached packages: >> [1] limma_3.16.1 GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 >> Biobase_2.20.0 GenomicRanges_1.12.1 >> [6] IRanges_1.18.0 BiocGenerics_0.6.0 Rsubread_1.10.1 >> BiocInstaller_1.10.0 >> >> loaded via a namespace (and not attached): >> [1] biomaRt_2.16.0 Biostrings_2.28.0 bitops_1.0-5 >> BSgenome_1.28.0 DBI_0.2-5 RCurl_1.95-4.1 >> [7] Rsamtools_1.12.0 RSQLite_0.11.2 rtracklayer_1.20.0 stats4_3.0.0 >> tools_3.0.0 XML_3.95-0.2 >> [13] zlibbioc_1.6.0 >>> >> >> >> Wei >> >> On Apr 11, 2013, at 11:36 PM, Michael Lawrence wrote: >> >>> On Wed, Apr 10, 2013 at 7:09 PM, Thomas Girke <thomas.girke at="" ucr.edu=""> >> wrote: >>> >>>> Hi Martin, >>>> >>>> Yes, inter_feature=TRUE would maintain the current counting mode(s) that >>>> prohibits counting of reads mapping to multiple features. This is a >>>> special case of counting that is very useful for counting exonic regions >>>> of genes. However, one also wants to be able to turn off this behavior >>>> by ignoring inter feature overlaps (just like ignore.strand=T/F). >>>> Otherwise we cannot use summarizeOverlaps along with its current modes >>>> for important operations like transcript level counting because many >>>> transcript variants from the same gene will mask each others reads when >>>> inter_feature=TRUE. Providing the option to output the results from both >>>> inter_feature=TRUE and inter_feature=FALSE could be a very sensible >>>> solution and time saver for users working with new genomes/GFFs, where >>>> one cannot trust every nested annotation for various reasons, and >>>> inter_feature=TRUE can quickly become a very risky counting strategy >>>> since it tends to erase counts. Every biologist will scream in your >>>> face if the counter tells them that their favorite gene has zero counts >>>> just because of some overlap with some annotation error:). >>>> >>>> For multiple range features stored in GRangesList objects, I would >>>> currently favor making "inter_feature=FALSE" ignore the overlaps >>>> occurring among different list components, but not necessarily those >>>> within a list component (e.g. exon ranges of a gene). This way one >>>> can benefit from the current infrastructure by restricting its feature >>>> overlap scope to the range sets stored within individual list components >>>> but ignoring those among different list components. Utilities like >> reduce >>>> and other range modifier functions could handle situations where one >> wants >>>> to ignore all feature overlaps within and among list components. >> However, >>>> I am sure there could be other solutions to this. >>>> >>>> Long story short if inter_feature=TRUE/FALSE could be used in >>>> combination with modes=Union/IntersectionStrict/IntersectionNonEmpty >>>> resulting in six different counting flavors, I would be happy and, I am >>>> sure, many other Bioc users as well. >>>> >>>> >>> Have you taken a look at findSpliceOverlaps? Maybe >> summarizeSpliceOverlaps >>> could be completed to satisfy your use case? Somehow we need to control >> the >>> proliferation of counting functions and modes. Having some idea of your >>> high-level use case might help. >>> >>> Also, for spliced alignments, I recommend giving GSNAP a try if you >> haven't >>> already. It's accessible though the gmapR package. >>> >>> Michael >>> >>> I hope this makes sense. >>>> >>>> Thomas >>>> >>>> >>>> >>>> On Wed, Apr 10, 2013 at 11:37:04PM +0000, Martin Morgan wrote: >>>>> On 04/10/2013 12:42 PM, Thomas Girke wrote: >>>>>> Thanks. Adding an inter-feature unaware mode will be very helpful and >>>> also >>>>>> broaden summarizeOverlaps' application spectrum for a lot of use >> cases. >>>>> >>>>> I'm probably being quite dense here, and am mostly an outside observer. >>>> What I >>>>> hear you saying is that there are currently three modes -- Union, >>>>> IntersectionStrict, IntersectionNonEmpty. These modes are summarized in >>>> the >>>>> seven rows of figure 1 of >>>>> >>>>> vignette("summarizeOverlaps", package="GenomicRanges") >>>>> >>>>> Let's say there is a flag inter_feature, and when its value is TRUE >> then >>>> the >>>>> current counting schemes are obtained. These modes differ in the way >>>> counting >>>>> works for reads illustrated in rows 5, 6, and 7 of the figure. You'd >>>> like a >>>>> count scored where a '1' appears in the table below. With >>>> inter_feature=TRUE >>>>> reads are 'never counted more than once' ('Hits per read' is <= 1) >>>>> >>>>> |----------------------+-----+-----------+-----------+---------------| >>>>> | Mode | Row | Feature 1 | Feature 2 | Hits per read | >>>>> |----------------------+-----+-----------+-----------+---------------| >>>>> | Union | 5 | 1 | 0 | 1 | >>>>> | | 6 | 0 | 0 | 0 | >>>>> | | 7 | 0 | 0 | 0 | >>>>> | IntersectionStrict | 5 | 1 | 0 | 1 | >>>>> | | 6 | 1 | 0 | 1 | >>>>> | | 7 | 0 | 0 | 0 | >>>>> | IntersectionNotEmpty | 5 | 1 | 0 | 1 | >>>>> | | 6 | 1 | 0 | 1 | >>>>> | | 7 | 0 | 0 | 0 | >>>>> |----------------------+-----+-----------+-----------+---------------| >>>>> >>>>> >>>>> You'd like to add 3 more modes, with inter_feature=FALSE. Reads are >>>> sometimes >>>>> 'counted twice' >>>>> >>>>> |----------------------+-----+-----------+-----------+---------------| >>>>> | Mode | Row | Feature 1 | Feature 2 | Hits per read | >>>>> |----------------------+-----+-----------+-----------+---------------| >>>>> | Union | 5 | 1 | 0 | 1 | >>>>> | | 6 | 1 | 1 | 2 | >>>>> | | 7 | 1 | 1 | 2 | >>>>> | IntersectionStrict | 5 | 1 | 0 | 1 | >>>>> | | 6 | 1 | 0 | 1 | >>>>> | | 7 | 1 | 1 | 2 | >>>>> | IntersectionNotEmpty | 5 | 1 | 0 | 1 | >>>>> | | 6 | 1 | 1 | 2 | >>>>> | | 7 | 1 | 1 | 2 | >>>>> |----------------------+-----+-----------+-----------+---------------| >>>>> >>>>> >>>>> Martin >>>>> >>>>>> >>>>>> Thomas >>>>>> >>>>>> >>>>>> On Wed, Apr 10, 2013 at 04:59:49PM +0000, Valerie Obenchain wrote: >>>>>>> Hi Thomas, >>>>>>> >>>>>>> On 04/10/2013 09:23 AM, Thomas Girke wrote: >>>>>>>> Valerie, >>>>>>>> >>>>>>>> Please see my inserted comments below. >>>>>>>> >>>>>>>> On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote: >>>>>>>>> Ah, I see. You'd like to count with one of the existing modes but >>>> have >>>>>>>>> the option to pick up counts for these inter-feature reads (fall >>>>>>>>> completely 'within' >1 feature). These inter-feature reads would be >>>>>>>>> double (triple, quadruple, etc.) counted. Essentially they would >>>> add one >>>>>>>>> count to each feature they hit. Right? >>>>>>>> >>>>>>>> Correct. Perhaps let's discuss this with a very common example of >>>>>>>> transcript-level counting rather than counting on the unified >>>> (virtual) >>>>>>>> exonic regions of genes. With the current description provided in >> the >>>>>>>> summarizeOverlaps vignette at >>>>>>>> >>>> >> http://www.bioconductor.org/packages/2.12/bioc/vignettes/GenomicRan ges/inst/doc/summarizeOverlaps.pdf >>>>>>>> I don't see how this can be achieved without falling back to using >>>>>>>> countOverlaps without loosing the new counting modes provided by >>>>>>>> summarizeOverlaps? >>>>>>> >>>>>>> It can't be achieved with the function as is but we could add an >>>>>>> argument to handle this (as you suggested from the start). If >>>>>>> 'inter-feature=TRUE' then these counts would be added to the counts >>>>>>> already obtained from using a particular 'mode'. I will move ahead >>>> with >>>>>>> implementing this argument. >>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> One more thought about memory usage. If you are working with >>>> single-end >>>>>>>>> reads the summarizeOverlaps,BamFileList-method I mentioned below >>>> should >>>>>>>>> work fine. The approach is slightly different with paired- end >>>> reads. Our >>>>>>>>> current algorithm for pairing paired-end reads requires the whole >>>> file >>>>>>>>> to be read into memory. A different approach is currently being >>>>>>>>> developed but in the meantime you can take the qname-sorted >>>> approach. >>>>>>>>> The Bam file will need to be sorted by qname and both the >>>> 'yieldSize' >>>>>>>>> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An >>>>>>>>> example is on ?BamFile. >>>>>>>> >>>>>>>> Thanks for pointing this out. My fault that I didn't read through >> all >>>>>>>> the linked documentation. Perhaps it may not be a bad idea to make >>>> the >>>>>>>> memory restricted bam read instances the default setting in the >>>> future. >>>>>>>> This will definitely help biologists using those utilities without >>>>>>>> crashing their machines on the first attempt. >>>>>>> >>>>>>> Good suggestion, thanks. >>>>>>> >>>>>>> Valerie >>>>>>> >>>>>>>> >>>>>>>> Thomas >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> Valerie >>>>>>>>> >>>>>>>>> >>>>>>>>> On 04/09/2013 08:01 PM, Thomas Girke wrote: >>>>>>>>>> Thanks for the tip. I guess doing it this way reverses the >>>> counting mode >>>>>>>>>> back to countOverlaps, but how can I use at the same time >>>>>>>>>> "IntersectionStrict" or any of the other modes provided by >>>>>>>>>> summarizeOverlaps if its mode argument is already used and >>>> countOverlaps >>>>>>>>>> doesn't accept one? >>>>>>>>>> >>>>>>>>>> Thomas >>>>>>>>>> >>>>>>>>>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote: >>>>>>>>>>> Thanks for the example. You're right, none of the modes will >>>> count a >>>>>>>>>>> read falling completely within >1 feature. >>>>>>>>>>> >>>>>>>>>>> You can supply your own function as a 'mode'. Two requirements >>>> must be met: >>>>>>>>>>> >>>>>>>>>>> (1) The function must have 3 arguments that correspond to >>>>>>>>>>> 'features', 'reads' and 'ignore.strand', in that order. >>>>>>>>>>> >>>>>>>>>>> (2) The function must return a vector of counts the same length >>>>>>>>>>> as 'features' >>>>>>>>>>> >>>>>>>>>>> Here is an example using countOverlaps() which I think gets at >> the >>>>>>>>>>> counting you want. >>>>>>>>>>> >>>>>>>>>>> counter <- function(x, y, ignore.strand) >>>>>>>>>>> countOverlaps(y, x, ignore.strand=ignore.strand) >>>>>>>>>>> >>>>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode=counter))$counts >>>>>>>>>>> [,1] >>>>>>>>>>> [1,] 1 >>>>>>>>>>> [2,] 1 >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Valerie >>>>>>>>>>> >>>>>>>>>>> On 04/09/2013 09:37 AM, Thomas Girke wrote: >>>>>>>>>>>> Hi Valerie, >>>>>>>>>>>> >>>>>>>>>>>> Perhaps let's call this more explicitly an >>>> ignore_inter_feature_overlap option >>>>>>>>>>>> that we often need for cases like this: >>>>>>>>>>>> >>>>>>>>>>>> Read1 ---- >>>>>>>>>>>> F1 ---------------- >>>>>>>>>>>> F2 ----------- >>>>>>>>>>>> >>>>>>>>>>>> where we would like to have at least the option to assign Read1 >>>> to both F1 and F2: >>>>>>>>>>>> >>>>>>>>>>>> F1: 1 >>>>>>>>>>>> F2: 1 >>>>>>>>>>>> >>>>>>>>>>>> However, summarizeOverlapse doesn't count Read1 at all in all of >>>> its currently >>>>>>>>>>>> available modes that I am aware of. This lack of an >>>> ignore_inter_feature_overlap >>>>>>>>>>>> option is frequently a problem when working with poorly >>>> annotated genomes (high >>>>>>>>>>>> uncertainty about hidden/incorrect feature overlaps) or various >>>>>>>>>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk >>>> of ambiguous read >>>>>>>>>>>> assignments than not counting at all as shown above. >>>>>>>>>>>> >>>>>>>>>>>> ## Here is a code example illustrating the same case: >>>>>>>>>>>> library(GenomicRanges); library(Rsamtools) >>>>>>>>>>>> rd <- GappedAlignments(letters[1], seqnames = >> Rle(rep("chr1",1)), >>>>>>>>>>>> pos = as.integer(c(500)), >>>>>>>>>>>> cigar = rep("100M", 1), strand = strand(rep("+", 1))) >>>>>>>>>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), >>>> strand="+", ID="feat1") >>>>>>>>>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), >>>> strand="+", ID="feat2") >>>>>>>>>>>> gr <- c(gr1, gr2) >>>>>>>>>>>> >>>>>>>>>>>> ## All of the three current modes in summarizeOverlaps return a >>>> count of zero >>>>>>>>>>>> ## because of its inter_feature_overlap awareness: >>>>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="Union", >>>> ignore.strand=TRUE))$counts >>>>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", >>>> ignore.strand=TRUE))$counts >>>>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", >>>> ignore.strand=TRUE))$counts >>>>>>>>>>>> [,1] >>>>>>>>>>>> [1,] 0 >>>>>>>>>>>> [2,] 0 >>>>>>>>>>>> >>>>>>>>>>>> ## However, countOverlaps handles this correctly, but is not the >>>> best choice when >>>>>>>>>>>> ## counting multiple range features. >>>>>>>>>>>> countOverlaps(gr, rd) >>>>>>>>>>>> [1] 1 1 >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Thomas >>>>>>>>>>>> >>>>>>>>>>>>> sessionInfo() >>>>>>>>>>>> R version 3.0.0 (2013-04-03) >>>>>>>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>>>>>>>>>>> >>>>>>>>>>>> locale: >>>>>>>>>>>> [1] >> en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>>>>>>>>>>> >>>>>>>>>>>> attached base packages: >>>>>>>>>>>> [1] parallel stats graphics grDevices utils datasets >>>> methods >>>>>>>>>>>> [8] base >>>>>>>>>>>> >>>>>>>>>>>> other attached packages: >>>>>>>>>>>> [1] Rsamtools_1.12.0 Biostrings_2.28.0 >>>> GenomicRanges_1.12.1 >>>>>>>>>>>> [4] IRanges_1.18.0 BiocGenerics_0.6.0 >>>>>>>>>>>> >>>>>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>>>>> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain >>>> wrote: >>>>>>>>>>>>> Hi Thomas, >>>>>>>>>>>>> >>>>>>>>>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote: >>>>>>>>>>>>>> Dear Valerie, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Is there currently any way to run summarizeOverlaps in a >>>> feature-overlap >>>>>>>>>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? >>>> Currently, >>>>>>>>>>>>>> one can switch back to countOverlaps when feature overlap >>>> unawareness is >>>>>>>>>>>>>> the more appropriate counting mode for a biological question, >>>> but then >>>>>>>>>>>>>> double counting of reads mapping to multiple-range features is >>>> not >>>>>>>>>>>>>> accounted for. It would be really nice to have such a >>>> feature-overlap >>>>>>>>>>>>>> unaware option directly in summarizeOverlaps. >>>>>>>>>>>>> >>>>>>>>>>>>> No, we don't currently have an option to ignore >>>> feature-overlap. It >>>>>>>>>>>>> sounds like you want to count with countOverlaps() but still >>>> want the >>>>>>>>>>>>> double counting to be resolved. This is essentially what the >>>> other modes >>>>>>>>>>>>> are doing so I must be missing something. >>>>>>>>>>>>> >>>>>>>>>>>>> In this example 2 reads hit feature A, 1 read hits feature B. >>>> With >>>>>>>>>>>>> something like ignorefeature0L=TRUE, what results would you >>>> expect to >>>>>>>>>>>>> see? Maybe you have another, more descriptive example? >>>>>>>>>>>>> >>>>>>>>>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) >>>>>>>>>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10, >>>>>>>>>>>>> names=c("A", "B"))) >>>>>>>>>>>>> >>>>>>>>>>>>>> countOverlaps(features, reads) >>>>>>>>>>>>> [1] 2 1 >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Another question relates to the memory usage of >>>> summarizeOverlaps. Has >>>>>>>>>>>>>> this been optimized yet? On a typical bam file with ~50-100 >>>> million >>>>>>>>>>>>>> reads the memory usage of summarizeOverlaps is often around >>>> 10-20GB. To >>>>>>>>>>>>>> use the function on a desktop computer or in large- scale >>>> RNA-Seq >>>>>>>>>>>>>> projects on a commodity compute cluster, it would be desirable >>>> if every >>>>>>>>>>>>>> counting instance would consume not more than 5GB of RAM. >>>>>>>>>>>>> >>>>>>>>>>>>> Have you tried the BamFileList-method? There is an example at >>>> the bottom >>>>>>>>>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan >>>>>>>>>>>>> mentioned, the key is to set the 'yieldSize' parameter when >>>> creating the >>>>>>>>>>>>> BamFile. This method also makes use of mclapply(). >>>>>>>>>>>>> >>>>>>>>>>>>> Valerie >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks in advance for your help and suggestions, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thomas >>>>>>>>>>>>>> >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> Bioconductor mailing list >>>>>>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>>>>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>> >>>>> >>>>> -- >>>>> Computational Biology / Fred Hutchinson Cancer Research Center >>>>> 1100 Fairview Ave. N. >>>>> PO Box 19024 Seattle, WA 98109 >>>>> >>>>> Location: Arnold Building M1 B861 >>>>> Phone: (206) 667-2793 >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> ______________________________________________________________________ >> The information in this email is confidential and inte...{{dropped:10}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD REPLYlink written 4.6 years ago by Martin Morgan ♦♦ 20k
Yes, that's right, there are a few annotations on the result. Compatible means that the splices need to agree between the read and the transcript. Unique means that the read *compatibly* maps to only a single transcript. So unique is only TRUE when compatible is TRUE. Michael On Fri, Apr 12, 2013 at 7:08 AM, Martin Morgan <mtmorgan@fhcrc.org> wrote: > On 04/12/2013 06:53 AM, Michael Lawrence wrote: > >> Hi Wei, >> >> summarizeSpliceOverlaps does not yet exist. We held off on that until we >> determined whether findSpliceOverlaps was useful. And yes, >> findSpliceOverlaps counts reads on a per-feature basis, so it will double >> count reads in that way. That's totally intentional (and I'm pretty sure >> what Thomas was wanting). >> > > also the object returned can be easily manipulated? > > > olaps = findSpliceOverlaps(galp, genes) > > olaps > > Hits of length 2 > queryLength: 1 > subjectLength: 2 > queryHits subjectHits compatible unique coding strandSpecific > <integer> <integer> <logical> <logical> <logical> <logical> > 1 1 1 FALSE FALSE NA TRUE > 2 1 2 FALSE FALSE NA TRUE > > olaps[mcols(olaps)$unique] > Hits of length 0 > queryLength: 1 > subjectLength: 2 > > and canonically > > ulaps = olaps[mcols(olaps)$unique] > tabulate(subjectHits(ulaps), subjectLength(ulaps)) > > see ?Hits > > Martin > >> >> Michael >> >> >> On Thu, Apr 11, 2013 at 7:11 PM, Wei Shi <shi@wehi.edu.au> wrote: >> >> Hi Michael, >>> >>> I could not find the 'summarizeSpliceOverlaps' function in the >>> GenomicFeatures package (1.12.0). >>> >>> The findSpliceOverlaps function seems to count reads more than once as >>> well. I ran the sample code in the help page for this function and found >>> that the single fragment was assigned to both transcripts: >>> >>> genes <- GRangesList( >>>> >>> + GRanges("chr1", IRanges(c(5, 20), c(10, 25)), "+"), >>> + GRanges("chr1", IRanges(c(5, 22), c(15, 25)), "+")) >>> >>>> genes >>>> >>> GRangesList of length 2: >>> [[1]] >>> GRanges with 2 ranges and 0 metadata columns: >>> seqnames ranges strand >>> <rle> <iranges> <rle> >>> [1] chr1 [ 5, 10] + >>> [2] chr1 [20, 25] + >>> >>> [[2]] >>> GRanges with 2 ranges and 0 metadata columns: >>> seqnames ranges strand >>> [1] chr1 [ 5, 15] + >>> [2] chr1 [22, 25] + >>> >>> --- >>> seqlengths: >>> chr1 >>> NA >>> >>>> galp <- GappedAlignmentPairs( >>>> >>> + GappedAlignments("chr1", 5L, "11M4N6M", strand("+")), >>> + GappedAlignments("chr1", 50L, "6M", strand("-")), >>> + isProperPair=TRUE) >>> >>>> galp >>>> >>> GappedAlignmentPairs with 1 alignment pair and 0 metadata columns: >>> seqnames strand : ranges -- ranges >>> <rle> <rle> : <iranges> -- <iranges> >>> [1] chr1 + : [5, 25] -- [50, 55] >>> --- >>> seqlengths: >>> chr1 >>> NA >>> >>>> findSpliceOverlaps(galp, genes) >>>> >>> Hits of length 2 >>> queryLength: 1 >>> subjectLength: 2 >>> queryHits subjectHits compatible unique coding strandSpecific >>> <integer> <integer> <logical> <logical> <logical> <logical> >>> 1 1 1 FALSE FALSE NA TRUE >>> 2 1 2 FALSE FALSE NA TRUE >>> >>> sessionInfo() >>>> >>> R version 3.0.0 (2013-04-03) >>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>> >>> locale: >>> [1] en_US.UTF-8/en_US.UTF-8/en_US.**UTF-8/C/en_US.UTF-8/en_US.UTF-**8 >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> base >>> >>> other attached packages: >>> [1] limma_3.16.1 GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 >>> Biobase_2.20.0 GenomicRanges_1.12.1 >>> [6] IRanges_1.18.0 BiocGenerics_0.6.0 Rsubread_1.10.1 >>> BiocInstaller_1.10.0 >>> >>> loaded via a namespace (and not attached): >>> [1] biomaRt_2.16.0 Biostrings_2.28.0 bitops_1.0-5 >>> BSgenome_1.28.0 DBI_0.2-5 RCurl_1.95-4.1 >>> [7] Rsamtools_1.12.0 RSQLite_0.11.2 rtracklayer_1.20.0 >>> stats4_3.0.0 >>> tools_3.0.0 XML_3.95-0.2 >>> [13] zlibbioc_1.6.0 >>> >>>> >>>> >>> >>> Wei >>> >>> On Apr 11, 2013, at 11:36 PM, Michael Lawrence wrote: >>> >>> On Wed, Apr 10, 2013 at 7:09 PM, Thomas Girke <thomas.girke@ucr.edu> >>>> >>> wrote: >>> >>>> >>>> Hi Martin, >>>>> >>>>> Yes, inter_feature=TRUE would maintain the current counting mode(s) >>>>> that >>>>> prohibits counting of reads mapping to multiple features. This is a >>>>> special case of counting that is very useful for counting exonic >>>>> regions >>>>> of genes. However, one also wants to be able to turn off this behavior >>>>> by ignoring inter feature overlaps (just like ignore.strand=T/F). >>>>> Otherwise we cannot use summarizeOverlaps along with its current modes >>>>> for important operations like transcript level counting because many >>>>> transcript variants from the same gene will mask each others reads when >>>>> inter_feature=TRUE. Providing the option to output the results from >>>>> both >>>>> inter_feature=TRUE and inter_feature=FALSE could be a very sensible >>>>> solution and time saver for users working with new genomes/GFFs, where >>>>> one cannot trust every nested annotation for various reasons, and >>>>> inter_feature=TRUE can quickly become a very risky counting strategy >>>>> since it tends to erase counts. Every biologist will scream in your >>>>> face if the counter tells them that their favorite gene has zero counts >>>>> just because of some overlap with some annotation error:). >>>>> >>>>> For multiple range features stored in GRangesList objects, I would >>>>> currently favor making "inter_feature=FALSE" ignore the overlaps >>>>> occurring among different list components, but not necessarily those >>>>> within a list component (e.g. exon ranges of a gene). This way one >>>>> can benefit from the current infrastructure by restricting its feature >>>>> overlap scope to the range sets stored within individual list >>>>> components >>>>> but ignoring those among different list components. Utilities like >>>>> >>>> reduce >>> >>>> and other range modifier functions could handle situations where one >>>>> >>>> wants >>> >>>> to ignore all feature overlaps within and among list components. >>>>> >>>> However, >>> >>>> I am sure there could be other solutions to this. >>>>> >>>>> Long story short if inter_feature=TRUE/FALSE could be used in >>>>> combination with modes=Union/**IntersectionStrict/** >>>>> IntersectionNonEmpty >>>>> resulting in six different counting flavors, I would be happy and, I am >>>>> sure, many other Bioc users as well. >>>>> >>>>> >>>>> Have you taken a look at findSpliceOverlaps? Maybe >>>> >>> summarizeSpliceOverlaps >>> >>>> could be completed to satisfy your use case? Somehow we need to control >>>> >>> the >>> >>>> proliferation of counting functions and modes. Having some idea of your >>>> high-level use case might help. >>>> >>>> Also, for spliced alignments, I recommend giving GSNAP a try if you >>>> >>> haven't >>> >>>> already. It's accessible though the gmapR package. >>>> >>>> Michael >>>> >>>> I hope this makes sense. >>>> >>>>> >>>>> Thomas >>>>> >>>>> >>>>> >>>>> On Wed, Apr 10, 2013 at 11:37:04PM +0000, Martin Morgan wrote: >>>>> >>>>>> On 04/10/2013 12:42 PM, Thomas Girke wrote: >>>>>> >>>>>>> Thanks. Adding an inter-feature unaware mode will be very helpful and >>>>>>> >>>>>> also >>>>> >>>>>> broaden summarizeOverlaps' application spectrum for a lot of use >>>>>>> >>>>>> cases. >>> >>>> >>>>>> I'm probably being quite dense here, and am mostly an outside >>>>>> observer. >>>>>> >>>>> What I >>>>> >>>>>> hear you saying is that there are currently three modes -- Union, >>>>>> IntersectionStrict, IntersectionNonEmpty. These modes are summarized >>>>>> in >>>>>> >>>>> the >>>>> >>>>>> seven rows of figure 1 of >>>>>> >>>>>> vignette("summarizeOverlaps", package="GenomicRanges") >>>>>> >>>>>> Let's say there is a flag inter_feature, and when its value is TRUE >>>>>> >>>>> then >>> >>>> the >>>>> >>>>>> current counting schemes are obtained. These modes differ in the way >>>>>> >>>>> counting >>>>> >>>>>> works for reads illustrated in rows 5, 6, and 7 of the figure. You'd >>>>>> >>>>> like a >>>>> >>>>>> count scored where a '1' appears in the table below. With >>>>>> >>>>> inter_feature=TRUE >>>>> >>>>>> reads are 'never counted more than once' ('Hits per read' is <= 1) >>>>>> >>>>>> |----------------------+-----+**-----------+-----------+------** >>>>>> ---------| >>>>>> | Mode | Row | Feature 1 | Feature 2 | Hits per read | >>>>>> |----------------------+-----+**-----------+-----------+------** >>>>>> ---------| >>>>>> | Union | 5 | 1 | 0 | 1 | >>>>>> | | 6 | 0 | 0 | 0 | >>>>>> | | 7 | 0 | 0 | 0 | >>>>>> | IntersectionStrict | 5 | 1 | 0 | 1 | >>>>>> | | 6 | 1 | 0 | 1 | >>>>>> | | 7 | 0 | 0 | 0 | >>>>>> | IntersectionNotEmpty | 5 | 1 | 0 | 1 | >>>>>> | | 6 | 1 | 0 | 1 | >>>>>> | | 7 | 0 | 0 | 0 | >>>>>> |----------------------+-----+**-----------+-----------+------** >>>>>> ---------| >>>>>> >>>>>> >>>>>> You'd like to add 3 more modes, with inter_feature=FALSE. Reads are >>>>>> >>>>> sometimes >>>>> >>>>>> 'counted twice' >>>>>> >>>>>> |----------------------+-----+**-----------+-----------+------** >>>>>> ---------| >>>>>> | Mode | Row | Feature 1 | Feature 2 | Hits per read | >>>>>> |----------------------+-----+**-----------+-----------+------** >>>>>> ---------| >>>>>> | Union | 5 | 1 | 0 | 1 | >>>>>> | | 6 | 1 | 1 | 2 | >>>>>> | | 7 | 1 | 1 | 2 | >>>>>> | IntersectionStrict | 5 | 1 | 0 | 1 | >>>>>> | | 6 | 1 | 0 | 1 | >>>>>> | | 7 | 1 | 1 | 2 | >>>>>> | IntersectionNotEmpty | 5 | 1 | 0 | 1 | >>>>>> | | 6 | 1 | 1 | 2 | >>>>>> | | 7 | 1 | 1 | 2 | >>>>>> |----------------------+-----+**-----------+-----------+------** >>>>>> ---------| >>>>>> >>>>>> >>>>>> Martin >>>>>> >>>>>> >>>>>>> Thomas >>>>>>> >>>>>>> >>>>>>> On Wed, Apr 10, 2013 at 04:59:49PM +0000, Valerie Obenchain wrote: >>>>>>> >>>>>>>> Hi Thomas, >>>>>>>> >>>>>>>> On 04/10/2013 09:23 AM, Thomas Girke wrote: >>>>>>>> >>>>>>>>> Valerie, >>>>>>>>> >>>>>>>>> Please see my inserted comments below. >>>>>>>>> >>>>>>>>> On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote: >>>>>>>>> >>>>>>>>>> Ah, I see. You'd like to count with one of the existing modes but >>>>>>>>>> >>>>>>>>> have >>>>> >>>>>> the option to pick up counts for these inter-feature reads (fall >>>>>>>>>> completely 'within' >1 feature). These inter-feature reads would >>>>>>>>>> be >>>>>>>>>> double (triple, quadruple, etc.) counted. Essentially they would >>>>>>>>>> >>>>>>>>> add one >>>>> >>>>>> count to each feature they hit. Right? >>>>>>>>>> >>>>>>>>> >>>>>>>>> Correct. Perhaps let's discuss this with a very common example of >>>>>>>>> transcript-level counting rather than counting on the unified >>>>>>>>> >>>>>>>> (virtual) >>>>> >>>>>> exonic regions of genes. With the current description provided in >>>>>>>>> >>>>>>>> the >>> >>>> summarizeOverlaps vignette at >>>>>>>>> >>>>>>>>> >>>>> http://www.bioconductor.org/**packages/2.12/bioc/vignettes/** >>> GenomicRanges/inst/doc/**summarizeOverlaps.pdf<http: www.biocondu="" ctor.org="" packages="" 2.12="" bioc="" vignettes="" genomicranges="" inst="" doc="" summarize="" overlaps.pdf=""> >>> >>>> I don't see how this can be achieved without falling back to using >>>>>>>>> countOverlaps without loosing the new counting modes provided by >>>>>>>>> summarizeOverlaps? >>>>>>>>> >>>>>>>> >>>>>>>> It can't be achieved with the function as is but we could add an >>>>>>>> argument to handle this (as you suggested from the start). If >>>>>>>> 'inter-feature=TRUE' then these counts would be added to the counts >>>>>>>> already obtained from using a particular 'mode'. I will move ahead >>>>>>>> >>>>>>> with >>>>> >>>>>> implementing this argument. >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>>> One more thought about memory usage. If you are working with >>>>>>>>>> >>>>>>>>> single-end >>>>> >>>>>> reads the summarizeOverlaps,BamFileList-**method I mentioned below >>>>>>>>>> >>>>>>>>> should >>>>> >>>>>> work fine. The approach is slightly different with paired-end >>>>>>>>>> >>>>>>>>> reads. Our >>>>> >>>>>> current algorithm for pairing paired-end reads requires the whole >>>>>>>>>> >>>>>>>>> file >>>>> >>>>>> to be read into memory. A different approach is currently being >>>>>>>>>> developed but in the meantime you can take the qname-sorted >>>>>>>>>> >>>>>>>>> approach. >>>>> >>>>>> The Bam file will need to be sorted by qname and both the >>>>>>>>>> >>>>>>>>> 'yieldSize' >>>>> >>>>>> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An >>>>>>>>>> example is on ?BamFile. >>>>>>>>>> >>>>>>>>> >>>>>>>>> Thanks for pointing this out. My fault that I didn't read through >>>>>>>>> >>>>>>>> all >>> >>>> the linked documentation. Perhaps it may not be a bad idea to make >>>>>>>>> >>>>>>>> the >>>>> >>>>>> memory restricted bam read instances the default setting in the >>>>>>>>> >>>>>>>> future. >>>>> >>>>>> This will definitely help biologists using those utilities without >>>>>>>>> crashing their machines on the first attempt. >>>>>>>>> >>>>>>>> >>>>>>>> Good suggestion, thanks. >>>>>>>> >>>>>>>> Valerie >>>>>>>> >>>>>>>> >>>>>>>>> Thomas >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> Valerie >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 04/09/2013 08:01 PM, Thomas Girke wrote: >>>>>>>>>> >>>>>>>>>>> Thanks for the tip. I guess doing it this way reverses the >>>>>>>>>>> >>>>>>>>>> counting mode >>>>> >>>>>> back to countOverlaps, but how can I use at the same time >>>>>>>>>>> "IntersectionStrict" or any of the other modes provided by >>>>>>>>>>> summarizeOverlaps if its mode argument is already used and >>>>>>>>>>> >>>>>>>>>> countOverlaps >>>>> >>>>>> doesn't accept one? >>>>>>>>>>> >>>>>>>>>>> Thomas >>>>>>>>>>> >>>>>>>>>>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Thanks for the example. You're right, none of the modes will >>>>>>>>>>>> >>>>>>>>>>> count a >>>>> >>>>>> read falling completely within >1 feature. >>>>>>>>>>>> >>>>>>>>>>>> You can supply your own function as a 'mode'. Two requirements >>>>>>>>>>>> >>>>>>>>>>> must be met: >>>>> >>>>>> >>>>>>>>>>>> (1) The function must have 3 arguments that correspond to >>>>>>>>>>>> 'features', 'reads' and 'ignore.strand', in that order. >>>>>>>>>>>> >>>>>>>>>>>> (2) The function must return a vector of counts the same length >>>>>>>>>>>> as 'features' >>>>>>>>>>>> >>>>>>>>>>>> Here is an example using countOverlaps() which I think gets at >>>>>>>>>>>> >>>>>>>>>>> the >>> >>>> counting you want. >>>>>>>>>>>> >>>>>>>>>>>> counter <- function(x, y, ignore.strand) >>>>>>>>>>>> countOverlaps(y, x, ignore.strand=ignore.strand) >>>>>>>>>>>> >>>>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode=counter))$counts >>>>>>>>>>>>> >>>>>>>>>>>> [,1] >>>>>>>>>>>> [1,] 1 >>>>>>>>>>>> [2,] 1 >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Valerie >>>>>>>>>>>> >>>>>>>>>>>> On 04/09/2013 09:37 AM, Thomas Girke wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Valerie, >>>>>>>>>>>>> >>>>>>>>>>>>> Perhaps let's call this more explicitly an >>>>>>>>>>>>> >>>>>>>>>>>> ignore_inter_feature_overlap option >>>>> >>>>>> that we often need for cases like this: >>>>>>>>>>>>> >>>>>>>>>>>>> Read1 ---- >>>>>>>>>>>>> F1 ---------------- >>>>>>>>>>>>> F2 ----------- >>>>>>>>>>>>> >>>>>>>>>>>>> where we would like to have at least the option to assign Read1 >>>>>>>>>>>>> >>>>>>>>>>>> to both F1 and F2: >>>>> >>>>>> >>>>>>>>>>>>> F1: 1 >>>>>>>>>>>>> F2: 1 >>>>>>>>>>>>> >>>>>>>>>>>>> However, summarizeOverlapse doesn't count Read1 at all in all >>>>>>>>>>>>> of >>>>>>>>>>>>> >>>>>>>>>>>> its currently >>>>> >>>>>> available modes that I am aware of. This lack of an >>>>>>>>>>>>> >>>>>>>>>>>> ignore_inter_feature_overlap >>>>> >>>>>> option is frequently a problem when working with poorly >>>>>>>>>>>>> >>>>>>>>>>>> annotated genomes (high >>>>> >>>>>> uncertainty about hidden/incorrect feature overlaps) or various >>>>>>>>>>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the >>>>>>>>>>>>> risk >>>>>>>>>>>>> >>>>>>>>>>>> of ambiguous read >>>>> >>>>>> assignments than not counting at all as shown above. >>>>>>>>>>>>> >>>>>>>>>>>>> ## Here is a code example illustrating the same case: >>>>>>>>>>>>> library(GenomicRanges); library(Rsamtools) >>>>>>>>>>>>> rd <- GappedAlignments(letters[1], seqnames = >>>>>>>>>>>>> >>>>>>>>>>>> Rle(rep("chr1",1)), >>> >>>> pos = as.integer(c(500)), >>>>>>>>>>>>> cigar = rep("100M", 1), strand = strand(rep("+", 1))) >>>>>>>>>>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), >>>>>>>>>>>>> >>>>>>>>>>>> strand="+", ID="feat1") >>>>> >>>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), >>>>>>>>>>>>> >>>>>>>>>>>> strand="+", ID="feat2") >>>>> >>>>>> gr <- c(gr1, gr2) >>>>>>>>>>>>> >>>>>>>>>>>>> ## All of the three current modes in summarizeOverlaps return a >>>>>>>>>>>>> >>>>>>>>>>>> count of zero >>>>> >>>>>> ## because of its inter_feature_overlap awareness: >>>>>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="Union", >>>>>>>>>>>>> >>>>>>>>>>>> ignore.strand=TRUE))$counts >>>>> >>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", >>>>>>>>>>>>> >>>>>>>>>>>> ignore.strand=TRUE))$counts >>>>> >>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", >>>>>>>>>>>>> >>>>>>>>>>>> ignore.strand=TRUE))$counts >>>>> >>>>>> [,1] >>>>>>>>>>>>> [1,] 0 >>>>>>>>>>>>> [2,] 0 >>>>>>>>>>>>> >>>>>>>>>>>>> ## However, countOverlaps handles this correctly, but is not >>>>>>>>>>>>> the >>>>>>>>>>>>> >>>>>>>>>>>> best choice when >>>>> >>>>>> ## counting multiple range features. >>>>>>>>>>>>> countOverlaps(gr, rd) >>>>>>>>>>>>> [1] 1 1 >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Thomas >>>>>>>>>>>>> >>>>>>>>>>>>> sessionInfo() >>>>>>>>>>>>>> >>>>>>>>>>>>> R version 3.0.0 (2013-04-03) >>>>>>>>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>>>>>>>>>>>> >>>>>>>>>>>>> locale: >>>>>>>>>>>>> [1] >>>>>>>>>>>>> >>>>>>>>>>>> en_US.UTF-8/en_US.UTF-8/en_US.**UTF-8/C/en_US.UTF-8/en_US.UTF-* >>> *8 >>> >>>> >>>>>>>>>>>>> attached base packages: >>>>>>>>>>>>> [1] parallel stats graphics grDevices utils datasets >>>>>>>>>>>>> >>>>>>>>>>>> methods >>>>> >>>>>> [8] base >>>>>>>>>>>>> >>>>>>>>>>>>> other attached packages: >>>>>>>>>>>>> [1] Rsamtools_1.12.0 Biostrings_2.28.0 >>>>>>>>>>>>> >>>>>>>>>>>> GenomicRanges_1.12.1 >>>>> >>>>>> [4] IRanges_1.18.0 BiocGenerics_0.6.0 >>>>>>>>>>>>> >>>>>>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>>>>>> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain >>>>>>>>>>>>> >>>>>>>>>>>> wrote: >>>>> >>>>>> Hi Thomas, >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Dear Valerie, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Is there currently any way to run summarizeOverlaps in a >>>>>>>>>>>>>>> >>>>>>>>>>>>>> feature-overlap >>>>> >>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? >>>>>>>>>>>>>>> >>>>>>>>>>>>>> Currently, >>>>> >>>>>> one can switch back to countOverlaps when feature overlap >>>>>>>>>>>>>>> >>>>>>>>>>>>>> unawareness is >>>>> >>>>>> the more appropriate counting mode for a biological question, >>>>>>>>>>>>>>> >>>>>>>>>>>>>> but then >>>>> >>>>>> double counting of reads mapping to multiple-range features is >>>>>>>>>>>>>>> >>>>>>>>>>>>>> not >>>>> >>>>>> accounted for. It would be really nice to have such a >>>>>>>>>>>>>>> >>>>>>>>>>>>>> feature-overlap >>>>> >>>>>> unaware option directly in summarizeOverlaps. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> No, we don't currently have an option to ignore >>>>>>>>>>>>>> >>>>>>>>>>>>> feature-overlap. It >>>>> >>>>>> sounds like you want to count with countOverlaps() but still >>>>>>>>>>>>>> >>>>>>>>>>>>> want the >>>>> >>>>>> double counting to be resolved. This is essentially what the >>>>>>>>>>>>>> >>>>>>>>>>>>> other modes >>>>> >>>>>> are doing so I must be missing something. >>>>>>>>>>>>>> >>>>>>>>>>>>>> In this example 2 reads hit feature A, 1 read hits feature B. >>>>>>>>>>>>>> >>>>>>>>>>>>> With >>>>> >>>>>> something like ignorefeature0L=TRUE, what results would you >>>>>>>>>>>>>> >>>>>>>>>>>>> expect to >>>>> >>>>>> see? Maybe you have another, more descriptive example? >>>>>>>>>>>>>> >>>>>>>>>>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) >>>>>>>>>>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10, >>>>>>>>>>>>>> names=c("A", "B"))) >>>>>>>>>>>>>> >>>>>>>>>>>>>> countOverlaps(features, reads) >>>>>>>>>>>>>>> >>>>>>>>>>>>>> [1] 2 1 >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Another question relates to the memory usage of >>>>>>>>>>>>>>> >>>>>>>>>>>>>> summarizeOverlaps. Has >>>>> >>>>>> this been optimized yet? On a typical bam file with ~50-100 >>>>>>>>>>>>>>> >>>>>>>>>>>>>> million >>>>> >>>>>> reads the memory usage of summarizeOverlaps is often around >>>>>>>>>>>>>>> >>>>>>>>>>>>>> 10-20GB. To >>>>> >>>>>> use the function on a desktop computer or in large-scale >>>>>>>>>>>>>>> >>>>>>>>>>>>>> RNA-Seq >>>>> >>>>>> projects on a commodity compute cluster, it would be desirable >>>>>>>>>>>>>>> >>>>>>>>>>>>>> if every >>>>> >>>>>> counting instance would consume not more than 5GB of RAM. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Have you tried the BamFileList-method? There is an example at >>>>>>>>>>>>>> >>>>>>>>>>>>> the bottom >>>>> >>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan >>>>>>>>>>>>>> mentioned, the key is to set the 'yieldSize' parameter when >>>>>>>>>>>>>> >>>>>>>>>>>>> creating the >>>>> >>>>>> BamFile. This method also makes use of mclapply(). >>>>>>>>>>>>>> >>>>>>>>>>>>>> Valerie >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks in advance for your help and suggestions, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thomas >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> ______________________________**_________________ >>>>>>>>>>>>>>> Bioconductor mailing list >>>>>>>>>>>>>>> Bioconductor@r-project.org >>>>>>>>>>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<h ttps:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>>>>>>>>>>>>> Search the archives: >>>>>>>>>>>>>>> >>>>>>>>>>>>>> http://news.gmane.org/gmane.**science.biology.informatics.** >>>>> conductor<http: news.gmane.org="" gmane.science.biology.informatic="" s.conductor=""> >>>>> >>>>>> >>>>>>>>>>>>>>> >>>>>>> ______________________________**_________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor@r-project.org >>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: s="" tat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>>>>> Search the archives: >>>>>>> >>>>>> http://news.gmane.org/gmane.**science.biology.informatics.**con ductor<http: news.gmane.org="" gmane.science.biology.informatics.conduct="" or=""> >>>>> >>>>>> >>>>>>> >>>>>> >>>>>> -- >>>>>> Computational Biology / Fred Hutchinson Cancer Research Center >>>>>> 1100 Fairview Ave. N. >>>>>> PO Box 19024 Seattle, WA 98109 >>>>>> >>>>>> Location: Arnold Building M1 B861 >>>>>> Phone: (206) 667-2793 >>>>>> >>>>> >>>>> ______________________________**_________________ >>>>> Bioconductor mailing list >>>>> Bioconductor@r-project.org >>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: sta="" t.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.**science.biology.informatics.**cond uctor<http: news.gmane.org="" gmane.science.biology.informatics.conducto="" r=""> >>>>> >>>>> >>>> [[alternative HTML version deleted]] >>>> >>>> ______________________________**_________________ >>>> Bioconductor mailing list >>>> Bioconductor@r-project.org >>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat="" .ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>> Search the archives: >>>> >>> http://news.gmane.org/gmane.**science.biology.informatics.**conduc tor<http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>> >>> ______________________________**______________________________** >>> __________ >>> The information in this email is confidential and inte...{{dropped:10}} >>> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> >> > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > [[alternative HTML version deleted]]
ADD REPLYlink written 4.6 years ago by Michael Lawrence9.8k
On Apr 13, 2013, at 12:08 AM, Martin Morgan wrote: > On 04/12/2013 06:53 AM, Michael Lawrence wrote: >> Hi Wei, >> >> summarizeSpliceOverlaps does not yet exist. We held off on that until we >> determined whether findSpliceOverlaps was useful. And yes, >> findSpliceOverlaps counts reads on a per-feature basis, so it will double >> count reads in that way. That's totally intentional (and I'm pretty sure >> what Thomas was wanting). > > also the object returned can be easily manipulated? > > > olaps = findSpliceOverlaps(galp, genes) > > olaps > Hits of length 2 > queryLength: 1 > subjectLength: 2 > queryHits subjectHits compatible unique coding strandSpecific > <integer> <integer> <logical> <logical> <logical> <logical> > 1 1 1 FALSE FALSE NA TRUE > 2 1 2 FALSE FALSE NA TRUE > > olaps[mcols(olaps)$unique] > Hits of length 0 > queryLength: 1 > subjectLength: 2 > > and canonically > > ulaps = olaps[mcols(olaps)$unique] > tabulate(subjectHits(ulaps), subjectLength(ulaps)) > > see ?Hits > > Martin But I got no fragments assigned to transcripts. Is this what you meant to do? > ulaps = olaps[mcols(olaps)$unique] > tabulate(subjectHits(ulaps), subjectLength(ulaps)) [1] 0 0 Cheers Wei ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
ADD REPLYlink written 4.6 years ago by Wei Shi2.7k
On 04/12/2013 03:57 PM, Wei Shi wrote: > > On Apr 13, 2013, at 12:08 AM, Martin Morgan wrote: > >> On 04/12/2013 06:53 AM, Michael Lawrence wrote: >>> Hi Wei, >>> >>> summarizeSpliceOverlaps does not yet exist. We held off on that until we >>> determined whether findSpliceOverlaps was useful. And yes, >>> findSpliceOverlaps counts reads on a per-feature basis, so it will >>> double count reads in that way. That's totally intentional (and I'm >>> pretty sure what Thomas was wanting). >> >> also the object returned can be easily manipulated? >> >>> olaps = findSpliceOverlaps(galp, genes) olaps >> Hits of length 2 queryLength: 1 subjectLength: 2 queryHits subjectHits >> compatible unique coding strandSpecific <integer> <integer> >> <logical> <logical> <logical> <logical> 1 1 1 >> FALSE FALSE NA TRUE 2 1 2 FALSE >> FALSE NA TRUE >>> olaps[mcols(olaps)$unique] >> Hits of length 0 queryLength: 1 subjectLength: 2 >> >> and canonically >> >> ulaps = olaps[mcols(olaps)$unique] tabulate(subjectHits(ulaps), >> subjectLength(ulaps)) >> >> see ?Hits >> >> Martin > > But I got no fragments assigned to transcripts. Is this what you meant to > do? > >> ulaps = olaps[mcols(olaps)$unique] tabulate(subjectHits(ulaps), >> subjectLength(ulaps)) > [1] 0 0 I meant only to illustrate that the object is easily manipulated to implement different counting schemes. You said > The findSpliceOverlaps function seems to count reads more than once as well. > I ran the sample code in the help page for this function and found that the > single fragment was assigned to both transcripts: and indeed the function returns all 'hits' in the sense of non-zero overlap between read and gene. You're interested in counting 'no more than once', and the transcript does not align uniquely (or in a way that is compatible with each gene model), so yes, it seems reasonable to count 0 hits for each feature. On the other hand galp <- GAlignmentPairs( GAlignments("chr1", 5L, "6M", strand("+")), GAlignments("chr1", 20L, "6M", strand("-")), isProperPair=TRUE) and > (olaps <- findSpliceOverlaps(galp, genes)) Hits of length 2 queryLength: 1 subjectLength: 2 queryHits subjectHits compatible unique coding strandSpecific <integer> <integer> <logical> <logical> <logical> <logical> 1 1 1 TRUE TRUE NA TRUE 2 1 2 FALSE FALSE NA TRUE > (ulaps <- olaps[mcols(olaps)$unique]) Hits of length 1 queryLength: 1 subjectLength: 2 queryHits subjectHits compatible unique coding strandSpecific <integer> <integer> <logical> <logical> <logical> <logical> 1 1 1 TRUE TRUE NA TRUE > tabulate(subjectHits(ulaps), subjectLength(ulaps)) [1] 1 0 Martin > > Cheers Wei > > > ______________________________________________________________________ The > information in this email is confidential and intended solely for the > addressee. You must not disclose, forward, print or use it without the > permission of the sender. > ______________________________________________________________________ > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD REPLYlink written 4.6 years ago by Martin Morgan ♦♦ 20k
Hi Michael, I think each read should be counted at most once when they are being assigned to genes or transcripts because a read could not originate from two or more genes/transcripts. This is what featureCounts and htseq-count do. Cheers, Wei On Apr 12, 2013, at 11:53 PM, Michael Lawrence wrote: > Hi Wei, > > summarizeSpliceOverlaps does not yet exist. We held off on that until we determined whether findSpliceOverlaps was useful. And yes, findSpliceOverlaps counts reads on a per-feature basis, so it will double count reads in that way. That's totally intentional (and I'm pretty sure what Thomas was wanting). > > Michael > > > On Thu, Apr 11, 2013 at 7:11 PM, Wei Shi <shi@wehi.edu.au> wrote: > Hi Michael, > > I could not find the 'summarizeSpliceOverlaps' function in the GenomicFeatures package (1.12.0). > > The findSpliceOverlaps function seems to count reads more than once as well. I ran the sample code in the help page for this function and found that the single fragment was assigned to both transcripts: > > > genes <- GRangesList( > + GRanges("chr1", IRanges(c(5, 20), c(10, 25)), "+"), > + GRanges("chr1", IRanges(c(5, 22), c(15, 25)), "+")) > > genes > GRangesList of length 2: > [[1]] > GRanges with 2 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chr1 [ 5, 10] + > [2] chr1 [20, 25] + > > [[2]] > GRanges with 2 ranges and 0 metadata columns: > seqnames ranges strand > [1] chr1 [ 5, 15] + > [2] chr1 [22, 25] + > > --- > seqlengths: > chr1 > NA > > galp <- GappedAlignmentPairs( > + GappedAlignments("chr1", 5L, "11M4N6M", strand("+")), > + GappedAlignments("chr1", 50L, "6M", strand("-")), > + isProperPair=TRUE) > > galp > GappedAlignmentPairs with 1 alignment pair and 0 metadata columns: > seqnames strand : ranges -- ranges > <rle> <rle> : <iranges> -- <iranges> > [1] chr1 + : [5, 25] -- [50, 55] > --- > seqlengths: > chr1 > NA > > findSpliceOverlaps(galp, genes) > Hits of length 2 > queryLength: 1 > subjectLength: 2 > queryHits subjectHits compatible unique coding strandSpecific > <integer> <integer> <logical> <logical> <logical> <logical> > 1 1 1 FALSE FALSE NA TRUE > 2 1 2 FALSE FALSE NA TRUE > > > sessionInfo() > R version 3.0.0 (2013-04-03) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] limma_3.16.1 GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 Biobase_2.20.0 GenomicRanges_1.12.1 > [6] IRanges_1.18.0 BiocGenerics_0.6.0 Rsubread_1.10.1 BiocInstaller_1.10.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.16.0 Biostrings_2.28.0 bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-5 RCurl_1.95-4.1 > [7] Rsamtools_1.12.0 RSQLite_0.11.2 rtracklayer_1.20.0 stats4_3.0.0 tools_3.0.0 XML_3.95-0.2 > [13] zlibbioc_1.6.0 > > > > > Wei > > On Apr 11, 2013, at 11:36 PM, Michael Lawrence wrote: > > > On Wed, Apr 10, 2013 at 7:09 PM, Thomas Girke <thomas.girke@ucr.edu> wrote: > > > >> Hi Martin, > >> > >> Yes, inter_feature=TRUE would maintain the current counting mode(s) that > >> prohibits counting of reads mapping to multiple features. This is a > >> special case of counting that is very useful for counting exonic regions > >> of genes. However, one also wants to be able to turn off this behavior > >> by ignoring inter feature overlaps (just like ignore.strand=T/F). > >> Otherwise we cannot use summarizeOverlaps along with its current modes > >> for important operations like transcript level counting because many > >> transcript variants from the same gene will mask each others reads when > >> inter_feature=TRUE. Providing the option to output the results from both > >> inter_feature=TRUE and inter_feature=FALSE could be a very sensible > >> solution and time saver for users working with new genomes/GFFs, where > >> one cannot trust every nested annotation for various reasons, and > >> inter_feature=TRUE can quickly become a very risky counting strategy > >> since it tends to erase counts. Every biologist will scream in your > >> face if the counter tells them that their favorite gene has zero counts > >> just because of some overlap with some annotation error:). > >> > >> For multiple range features stored in GRangesList objects, I would > >> currently favor making "inter_feature=FALSE" ignore the overlaps > >> occurring among different list components, but not necessarily those > >> within a list component (e.g. exon ranges of a gene). This way one > >> can benefit from the current infrastructure by restricting its feature > >> overlap scope to the range sets stored within individual list components > >> but ignoring those among different list components. Utilities like reduce > >> and other range modifier functions could handle situations where one wants > >> to ignore all feature overlaps within and among list components. However, > >> I am sure there could be other solutions to this. > >> > >> Long story short if inter_feature=TRUE/FALSE could be used in > >> combination with modes=Union/IntersectionStrict/IntersectionNonEmpty > >> resulting in six different counting flavors, I would be happy and, I am > >> sure, many other Bioc users as well. > >> > >> > > Have you taken a look at findSpliceOverlaps? Maybe summarizeSpliceOverlaps > > could be completed to satisfy your use case? Somehow we need to control the > > proliferation of counting functions and modes. Having some idea of your > > high-level use case might help. > > > > Also, for spliced alignments, I recommend giving GSNAP a try if you haven't > > already. It's accessible though the gmapR package. > > > > Michael > > > > I hope this makes sense. > >> > >> Thomas > >> > >> > >> > >> On Wed, Apr 10, 2013 at 11:37:04PM +0000, Martin Morgan wrote: > >>> On 04/10/2013 12:42 PM, Thomas Girke wrote: > >>>> Thanks. Adding an inter-feature unaware mode will be very helpful and > >> also > >>>> broaden summarizeOverlaps' application spectrum for a lot of use cases. > >>> > >>> I'm probably being quite dense here, and am mostly an outside observer. > >> What I > >>> hear you saying is that there are currently three modes -- Union, > >>> IntersectionStrict, IntersectionNonEmpty. These modes are summarized in > >> the > >>> seven rows of figure 1 of > >>> > >>> vignette("summarizeOverlaps", package="GenomicRanges") > >>> > >>> Let's say there is a flag inter_feature, and when its value is TRUE then > >> the > >>> current counting schemes are obtained. These modes differ in the way > >> counting > >>> works for reads illustrated in rows 5, 6, and 7 of the figure. You'd > >> like a > >>> count scored where a '1' appears in the table below. With > >> inter_feature=TRUE > >>> reads are 'never counted more than once' ('Hits per read' is <= 1) > >>> > >>> |----------------------+-----+-----------+-----------+---------------| > >>> | Mode | Row | Feature 1 | Feature 2 | Hits per read | > >>> |----------------------+-----+-----------+-----------+---------------| > >>> | Union | 5 | 1 | 0 | 1 | > >>> | | 6 | 0 | 0 | 0 | > >>> | | 7 | 0 | 0 | 0 | > >>> | IntersectionStrict | 5 | 1 | 0 | 1 | > >>> | | 6 | 1 | 0 | 1 | > >>> | | 7 | 0 | 0 | 0 | > >>> | IntersectionNotEmpty | 5 | 1 | 0 | 1 | > >>> | | 6 | 1 | 0 | 1 | > >>> | | 7 | 0 | 0 | 0 | > >>> |----------------------+-----+-----------+-----------+---------------| > >>> > >>> > >>> You'd like to add 3 more modes, with inter_feature=FALSE. Reads are > >> sometimes > >>> 'counted twice' > >>> > >>> |----------------------+-----+-----------+-----------+---------------| > >>> | Mode | Row | Feature 1 | Feature 2 | Hits per read | > >>> |----------------------+-----+-----------+-----------+---------------| > >>> | Union | 5 | 1 | 0 | 1 | > >>> | | 6 | 1 | 1 | 2 | > >>> | | 7 | 1 | 1 | 2 | > >>> | IntersectionStrict | 5 | 1 | 0 | 1 | > >>> | | 6 | 1 | 0 | 1 | > >>> | | 7 | 1 | 1 | 2 | > >>> | IntersectionNotEmpty | 5 | 1 | 0 | 1 | > >>> | | 6 | 1 | 1 | 2 | > >>> | | 7 | 1 | 1 | 2 | > >>> |----------------------+-----+-----------+-----------+---------------| > >>> > >>> > >>> Martin > >>> > >>>> > >>>> Thomas > >>>> > >>>> > >>>> On Wed, Apr 10, 2013 at 04:59:49PM +0000, Valerie Obenchain wrote: > >>>>> Hi Thomas, > >>>>> > >>>>> On 04/10/2013 09:23 AM, Thomas Girke wrote: > >>>>>> Valerie, > >>>>>> > >>>>>> Please see my inserted comments below. > >>>>>> > >>>>>> On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote: > >>>>>>> Ah, I see. You'd like to count with one of the existing modes but > >> have > >>>>>>> the option to pick up counts for these inter-feature reads (fall > >>>>>>> completely 'within' >1 feature). These inter-feature reads would be > >>>>>>> double (triple, quadruple, etc.) counted. Essentially they would > >> add one > >>>>>>> count to each feature they hit. Right? > >>>>>> > >>>>>> Correct. Perhaps let's discuss this with a very common example of > >>>>>> transcript-level counting rather than counting on the unified > >> (virtual) > >>>>>> exonic regions of genes. With the current description provided in the > >>>>>> summarizeOverlaps vignette at > >>>>>> > >> http://www.bioconductor.org/packages/2.12/bioc/vignettes/GenomicR anges/inst/doc/summarizeOverlaps.pdf > >>>>>> I don't see how this can be achieved without falling back to using > >>>>>> countOverlaps without loosing the new counting modes provided by > >>>>>> summarizeOverlaps? > >>>>> > >>>>> It can't be achieved with the function as is but we could add an > >>>>> argument to handle this (as you suggested from the start). If > >>>>> 'inter-feature=TRUE' then these counts would be added to the counts > >>>>> already obtained from using a particular 'mode'. I will move ahead > >> with > >>>>> implementing this argument. > >>>>> > >>>>>> > >>>>>>> > >>>>>>> One more thought about memory usage. If you are working with > >> single-end > >>>>>>> reads the summarizeOverlaps,BamFileList-method I mentioned below > >> should > >>>>>>> work fine. The approach is slightly different with paired- end > >> reads. Our > >>>>>>> current algorithm for pairing paired-end reads requires the whole > >> file > >>>>>>> to be read into memory. A different approach is currently being > >>>>>>> developed but in the meantime you can take the qname-sorted > >> approach. > >>>>>>> The Bam file will need to be sorted by qname and both the > >> 'yieldSize' > >>>>>>> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An > >>>>>>> example is on ?BamFile. > >>>>>> > >>>>>> Thanks for pointing this out. My fault that I didn't read through all > >>>>>> the linked documentation. Perhaps it may not be a bad idea to make > >> the > >>>>>> memory restricted bam read instances the default setting in the > >> future. > >>>>>> This will definitely help biologists using those utilities without > >>>>>> crashing their machines on the first attempt. > >>>>> > >>>>> Good suggestion, thanks. > >>>>> > >>>>> Valerie > >>>>> > >>>>>> > >>>>>> Thomas > >>>>>> > >>>>>> > >>>>>>> > >>>>>>> Valerie > >>>>>>> > >>>>>>> > >>>>>>> On 04/09/2013 08:01 PM, Thomas Girke wrote: > >>>>>>>> Thanks for the tip. I guess doing it this way reverses the > >> counting mode > >>>>>>>> back to countOverlaps, but how can I use at the same time > >>>>>>>> "IntersectionStrict" or any of the other modes provided by > >>>>>>>> summarizeOverlaps if its mode argument is already used and > >> countOverlaps > >>>>>>>> doesn't accept one? > >>>>>>>> > >>>>>>>> Thomas > >>>>>>>> > >>>>>>>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote: > >>>>>>>>> Thanks for the example. You're right, none of the modes will > >> count a > >>>>>>>>> read falling completely within >1 feature. > >>>>>>>>> > >>>>>>>>> You can supply your own function as a 'mode'. Two requirements > >> must be met: > >>>>>>>>> > >>>>>>>>> (1) The function must have 3 arguments that correspond to > >>>>>>>>> 'features', 'reads' and 'ignore.strand', in that order. > >>>>>>>>> > >>>>>>>>> (2) The function must return a vector of counts the same length > >>>>>>>>> as 'features' > >>>>>>>>> > >>>>>>>>> Here is an example using countOverlaps() which I think gets at the > >>>>>>>>> counting you want. > >>>>>>>>> > >>>>>>>>> counter <- function(x, y, ignore.strand) > >>>>>>>>> countOverlaps(y, x, ignore.strand=ignore.strand) > >>>>>>>>> > >>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode=counter))$counts > >>>>>>>>> [,1] > >>>>>>>>> [1,] 1 > >>>>>>>>> [2,] 1 > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Valerie > >>>>>>>>> > >>>>>>>>> On 04/09/2013 09:37 AM, Thomas Girke wrote: > >>>>>>>>>> Hi Valerie, > >>>>>>>>>> > >>>>>>>>>> Perhaps let's call this more explicitly an > >> ignore_inter_feature_overlap option > >>>>>>>>>> that we often need for cases like this: > >>>>>>>>>> > >>>>>>>>>> Read1 ---- > >>>>>>>>>> F1 ---------------- > >>>>>>>>>> F2 ----------- > >>>>>>>>>> > >>>>>>>>>> where we would like to have at least the option to assign Read1 > >> to both F1 and F2: > >>>>>>>>>> > >>>>>>>>>> F1: 1 > >>>>>>>>>> F2: 1 > >>>>>>>>>> > >>>>>>>>>> However, summarizeOverlapse doesn't count Read1 at all in all of > >> its currently > >>>>>>>>>> available modes that I am aware of. This lack of an > >> ignore_inter_feature_overlap > >>>>>>>>>> option is frequently a problem when working with poorly > >> annotated genomes (high > >>>>>>>>>> uncertainty about hidden/incorrect feature overlaps) or various > >>>>>>>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk > >> of ambiguous read > >>>>>>>>>> assignments than not counting at all as shown above. > >>>>>>>>>> > >>>>>>>>>> ## Here is a code example illustrating the same case: > >>>>>>>>>> library(GenomicRanges); library(Rsamtools) > >>>>>>>>>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)), > >>>>>>>>>> pos = as.integer(c(500)), > >>>>>>>>>> cigar = rep("100M", 1), strand = strand(rep("+", 1))) > >>>>>>>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), > >> strand="+", ID="feat1") > >>>>>>>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), > >> strand="+", ID="feat2") > >>>>>>>>>> gr <- c(gr1, gr2) > >>>>>>>>>> > >>>>>>>>>> ## All of the three current modes in summarizeOverlaps return a > >> count of zero > >>>>>>>>>> ## because of its inter_feature_overlap awareness: > >>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="Union", > >> ignore.strand=TRUE))$counts > >>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", > >> ignore.strand=TRUE))$counts > >>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", > >> ignore.strand=TRUE))$counts > >>>>>>>>>> [,1] > >>>>>>>>>> [1,] 0 > >>>>>>>>>> [2,] 0 > >>>>>>>>>> > >>>>>>>>>> ## However, countOverlaps handles this correctly, but is not the > >> best choice when > >>>>>>>>>> ## counting multiple range features. > >>>>>>>>>> countOverlaps(gr, rd) > >>>>>>>>>> [1] 1 1 > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Thomas > >>>>>>>>>> > >>>>>>>>>>> sessionInfo() > >>>>>>>>>> R version 3.0.0 (2013-04-03) > >>>>>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) > >>>>>>>>>> > >>>>>>>>>> locale: > >>>>>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > >>>>>>>>>> > >>>>>>>>>> attached base packages: > >>>>>>>>>> [1] parallel stats graphics grDevices utils datasets > >> methods > >>>>>>>>>> [8] base > >>>>>>>>>> > >>>>>>>>>> other attached packages: > >>>>>>>>>> [1] Rsamtools_1.12.0 Biostrings_2.28.0 > >> GenomicRanges_1.12.1 > >>>>>>>>>> [4] IRanges_1.18.0 BiocGenerics_0.6.0 > >>>>>>>>>> > >>>>>>>>>> loaded via a namespace (and not attached): > >>>>>>>>>> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain > >> wrote: > >>>>>>>>>>> Hi Thomas, > >>>>>>>>>>> > >>>>>>>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote: > >>>>>>>>>>>> Dear Valerie, > >>>>>>>>>>>> > >>>>>>>>>>>> Is there currently any way to run summarizeOverlaps in a > >> feature-overlap > >>>>>>>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? > >> Currently, > >>>>>>>>>>>> one can switch back to countOverlaps when feature overlap > >> unawareness is > >>>>>>>>>>>> the more appropriate counting mode for a biological question, > >> but then > >>>>>>>>>>>> double counting of reads mapping to multiple-range features is > >> not > >>>>>>>>>>>> accounted for. It would be really nice to have such a > >> feature-overlap > >>>>>>>>>>>> unaware option directly in summarizeOverlaps. > >>>>>>>>>>> > >>>>>>>>>>> No, we don't currently have an option to ignore > >> feature-overlap. It > >>>>>>>>>>> sounds like you want to count with countOverlaps() but still > >> want the > >>>>>>>>>>> double counting to be resolved. This is essentially what the > >> other modes > >>>>>>>>>>> are doing so I must be missing something. > >>>>>>>>>>> > >>>>>>>>>>> In this example 2 reads hit feature A, 1 read hits feature B. > >> With > >>>>>>>>>>> something like ignorefeature0L=TRUE, what results would you > >> expect to > >>>>>>>>>>> see? Maybe you have another, more descriptive example? > >>>>>>>>>>> > >>>>>>>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) > >>>>>>>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10, > >>>>>>>>>>> names=c("A", "B"))) > >>>>>>>>>>> > >>>>>>>>>>>> countOverlaps(features, reads) > >>>>>>>>>>> [1] 2 1 > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Another question relates to the memory usage of > >> summarizeOverlaps. Has > >>>>>>>>>>>> this been optimized yet? On a typical bam file with ~50-100 > >> million > >>>>>>>>>>>> reads the memory usage of summarizeOverlaps is often around > >> 10-20GB. To > >>>>>>>>>>>> use the function on a desktop computer or in large- scale > >> RNA-Seq > >>>>>>>>>>>> projects on a commodity compute cluster, it would be desirable > >> if every > >>>>>>>>>>>> counting instance would consume not more than 5GB of RAM. > >>>>>>>>>>> > >>>>>>>>>>> Have you tried the BamFileList-method? There is an example at > >> the bottom > >>>>>>>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan > >>>>>>>>>>> mentioned, the key is to set the 'yieldSize' parameter when > >> creating the > >>>>>>>>>>> BamFile. This method also makes use of mclapply(). > >>>>>>>>>>> > >>>>>>>>>>> Valerie > >>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks in advance for your help and suggestions, > >>>>>>>>>>>> > >>>>>>>>>>>> Thomas > >>>>>>>>>>>> > >>>>>>>>>>>> _______________________________________________ > >>>>>>>>>>>> Bioconductor mailing list > >>>>>>>>>>>> Bioconductor@r-project.org > >>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>>>>>>>>> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>>>>>>>>>> > >>>> > >>>> _______________________________________________ > >>>> Bioconductor mailing list > >>>> Bioconductor@r-project.org > >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>> > >>> > >>> > >>> -- > >>> Computational Biology / Fred Hutchinson Cancer Research Center > >>> 1100 Fairview Ave. N. > >>> PO Box 19024 Seattle, WA 98109 > >>> > >>> Location: Arnold Building M1 B861 > >>> Phone: (206) 667-2793 > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor@r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:17}}
ADD REPLYlink written 4.6 years ago by Wei Shi2.7k
Hi Tom, The featureCounts function in Rsubread package goes some way like what you suggested. It is different from countOverlaps in that each read is counted at most once. It is also different from summarizeOverlaps in that it assigns the read to one of the features if it overlaps with >1 features. featureCounts assigns reads to exons first and then summarize exon- level read counts to gene-level read counts. Therefore, if a read overlaps with >1 exons within the same gene it will be counted only once for the gene. featureCounts includes in-built annotations for human and mouse genomes (NCBI RefSeq annotation), but you can provide your own annotation as well. You can also use it for the general-purpose read assignment. It is not restricted to the read counting for RNA-seq data. Cheers, Wei On Apr 11, 2013, at 12:09 PM, Thomas Girke wrote: > Hi Martin, > > Yes, inter_feature=TRUE would maintain the current counting mode(s) that > prohibits counting of reads mapping to multiple features. This is a > special case of counting that is very useful for counting exonic regions > of genes. However, one also wants to be able to turn off this behavior > by ignoring inter feature overlaps (just like ignore.strand=T/F). > Otherwise we cannot use summarizeOverlaps along with its current modes > for important operations like transcript level counting because many > transcript variants from the same gene will mask each others reads when > inter_feature=TRUE. Providing the option to output the results from both > inter_feature=TRUE and inter_feature=FALSE could be a very sensible > solution and time saver for users working with new genomes/GFFs, where > one cannot trust every nested annotation for various reasons, and > inter_feature=TRUE can quickly become a very risky counting strategy > since it tends to erase counts. Every biologist will scream in your > face if the counter tells them that their favorite gene has zero counts > just because of some overlap with some annotation error:). > > For multiple range features stored in GRangesList objects, I would > currently favor making "inter_feature=FALSE" ignore the overlaps > occurring among different list components, but not necessarily those > within a list component (e.g. exon ranges of a gene). This way one > can benefit from the current infrastructure by restricting its feature > overlap scope to the range sets stored within individual list components > but ignoring those among different list components. Utilities like reduce > and other range modifier functions could handle situations where one wants > to ignore all feature overlaps within and among list components. However, > I am sure there could be other solutions to this. > > Long story short if inter_feature=TRUE/FALSE could be used in > combination with modes=Union/IntersectionStrict/IntersectionNonEmpty > resulting in six different counting flavors, I would be happy and, I am > sure, many other Bioc users as well. > > I hope this makes sense. > > Thomas > > > > On Wed, Apr 10, 2013 at 11:37:04PM +0000, Martin Morgan wrote: >> On 04/10/2013 12:42 PM, Thomas Girke wrote: >>> Thanks. Adding an inter-feature unaware mode will be very helpful and also >>> broaden summarizeOverlaps' application spectrum for a lot of use cases. >> >> I'm probably being quite dense here, and am mostly an outside observer. What I >> hear you saying is that there are currently three modes -- Union, >> IntersectionStrict, IntersectionNonEmpty. These modes are summarized in the >> seven rows of figure 1 of >> >> vignette("summarizeOverlaps", package="GenomicRanges") >> >> Let's say there is a flag inter_feature, and when its value is TRUE then the >> current counting schemes are obtained. These modes differ in the way counting >> works for reads illustrated in rows 5, 6, and 7 of the figure. You'd like a >> count scored where a '1' appears in the table below. With inter_feature=TRUE >> reads are 'never counted more than once' ('Hits per read' is <= 1) >> >> |----------------------+-----+-----------+-----------+---------------| >> | Mode | Row | Feature 1 | Feature 2 | Hits per read | >> |----------------------+-----+-----------+-----------+---------------| >> | Union | 5 | 1 | 0 | 1 | >> | | 6 | 0 | 0 | 0 | >> | | 7 | 0 | 0 | 0 | >> | IntersectionStrict | 5 | 1 | 0 | 1 | >> | | 6 | 1 | 0 | 1 | >> | | 7 | 0 | 0 | 0 | >> | IntersectionNotEmpty | 5 | 1 | 0 | 1 | >> | | 6 | 1 | 0 | 1 | >> | | 7 | 0 | 0 | 0 | >> |----------------------+-----+-----------+-----------+---------------| >> >> >> You'd like to add 3 more modes, with inter_feature=FALSE. Reads are sometimes >> 'counted twice' >> >> |----------------------+-----+-----------+-----------+---------------| >> | Mode | Row | Feature 1 | Feature 2 | Hits per read | >> |----------------------+-----+-----------+-----------+---------------| >> | Union | 5 | 1 | 0 | 1 | >> | | 6 | 1 | 1 | 2 | >> | | 7 | 1 | 1 | 2 | >> | IntersectionStrict | 5 | 1 | 0 | 1 | >> | | 6 | 1 | 0 | 1 | >> | | 7 | 1 | 1 | 2 | >> | IntersectionNotEmpty | 5 | 1 | 0 | 1 | >> | | 6 | 1 | 1 | 2 | >> | | 7 | 1 | 1 | 2 | >> |----------------------+-----+-----------+-----------+---------------| >> >> >> Martin >> >>> >>> Thomas >>> >>> >>> On Wed, Apr 10, 2013 at 04:59:49PM +0000, Valerie Obenchain wrote: >>>> Hi Thomas, >>>> >>>> On 04/10/2013 09:23 AM, Thomas Girke wrote: >>>>> Valerie, >>>>> >>>>> Please see my inserted comments below. >>>>> >>>>> On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote: >>>>>> Ah, I see. You'd like to count with one of the existing modes but have >>>>>> the option to pick up counts for these inter-feature reads (fall >>>>>> completely 'within' >1 feature). These inter-feature reads would be >>>>>> double (triple, quadruple, etc.) counted. Essentially they would add one >>>>>> count to each feature they hit. Right? >>>>> >>>>> Correct. Perhaps let's discuss this with a very common example of >>>>> transcript-level counting rather than counting on the unified (virtual) >>>>> exonic regions of genes. With the current description provided in the >>>>> summarizeOverlaps vignette at >>>>> http://www.bioconductor.org/packages/2.12/bioc/vignettes/Genomic Ranges/inst/doc/summarizeOverlaps.pdf >>>>> I don't see how this can be achieved without falling back to using >>>>> countOverlaps without loosing the new counting modes provided by >>>>> summarizeOverlaps? >>>> >>>> It can't be achieved with the function as is but we could add an >>>> argument to handle this (as you suggested from the start). If >>>> 'inter-feature=TRUE' then these counts would be added to the counts >>>> already obtained from using a particular 'mode'. I will move ahead with >>>> implementing this argument. >>>> >>>>> >>>>>> >>>>>> One more thought about memory usage. If you are working with single-end >>>>>> reads the summarizeOverlaps,BamFileList-method I mentioned below should >>>>>> work fine. The approach is slightly different with paired-end reads. Our >>>>>> current algorithm for pairing paired-end reads requires the whole file >>>>>> to be read into memory. A different approach is currently being >>>>>> developed but in the meantime you can take the qname-sorted approach. >>>>>> The Bam file will need to be sorted by qname and both the 'yieldSize' >>>>>> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An >>>>>> example is on ?BamFile. >>>>> >>>>> Thanks for pointing this out. My fault that I didn't read through all >>>>> the linked documentation. Perhaps it may not be a bad idea to make the >>>>> memory restricted bam read instances the default setting in the future. >>>>> This will definitely help biologists using those utilities without >>>>> crashing their machines on the first attempt. >>>> >>>> Good suggestion, thanks. >>>> >>>> Valerie >>>> >>>>> >>>>> Thomas >>>>> >>>>> >>>>>> >>>>>> Valerie >>>>>> >>>>>> >>>>>> On 04/09/2013 08:01 PM, Thomas Girke wrote: >>>>>>> Thanks for the tip. I guess doing it this way reverses the counting mode >>>>>>> back to countOverlaps, but how can I use at the same time >>>>>>> "IntersectionStrict" or any of the other modes provided by >>>>>>> summarizeOverlaps if its mode argument is already used and countOverlaps >>>>>>> doesn't accept one? >>>>>>> >>>>>>> Thomas >>>>>>> >>>>>>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote: >>>>>>>> Thanks for the example. You're right, none of the modes will count a >>>>>>>> read falling completely within >1 feature. >>>>>>>> >>>>>>>> You can supply your own function as a 'mode'. Two requirements must be met: >>>>>>>> >>>>>>>> (1) The function must have 3 arguments that correspond to >>>>>>>> 'features', 'reads' and 'ignore.strand', in that order. >>>>>>>> >>>>>>>> (2) The function must return a vector of counts the same length >>>>>>>> as 'features' >>>>>>>> >>>>>>>> Here is an example using countOverlaps() which I think gets at the >>>>>>>> counting you want. >>>>>>>> >>>>>>>> counter <- function(x, y, ignore.strand) >>>>>>>> countOverlaps(y, x, ignore.strand=ignore.strand) >>>>>>>> >>>>>>>>> assays(summarizeOverlaps(gr, rd, mode=counter))$counts >>>>>>>> [,1] >>>>>>>> [1,] 1 >>>>>>>> [2,] 1 >>>>>>>> >>>>>>>> >>>>>>>> Valerie >>>>>>>> >>>>>>>> On 04/09/2013 09:37 AM, Thomas Girke wrote: >>>>>>>>> Hi Valerie, >>>>>>>>> >>>>>>>>> Perhaps let's call this more explicitly an ignore_inter_feature_overlap option >>>>>>>>> that we often need for cases like this: >>>>>>>>> >>>>>>>>> Read1 ---- >>>>>>>>> F1 ---------------- >>>>>>>>> F2 ----------- >>>>>>>>> >>>>>>>>> where we would like to have at least the option to assign Read1 to both F1 and F2: >>>>>>>>> >>>>>>>>> F1: 1 >>>>>>>>> F2: 1 >>>>>>>>> >>>>>>>>> However, summarizeOverlapse doesn't count Read1 at all in all of its currently >>>>>>>>> available modes that I am aware of. This lack of an ignore_inter_feature_overlap >>>>>>>>> option is frequently a problem when working with poorly annotated genomes (high >>>>>>>>> uncertainty about hidden/incorrect feature overlaps) or various >>>>>>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read >>>>>>>>> assignments than not counting at all as shown above. >>>>>>>>> >>>>>>>>> ## Here is a code example illustrating the same case: >>>>>>>>> library(GenomicRanges); library(Rsamtools) >>>>>>>>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)), >>>>>>>>> pos = as.integer(c(500)), >>>>>>>>> cigar = rep("100M", 1), strand = strand(rep("+", 1))) >>>>>>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1") >>>>>>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2") >>>>>>>>> gr <- c(gr1, gr2) >>>>>>>>> >>>>>>>>> ## All of the three current modes in summarizeOverlaps return a count of zero >>>>>>>>> ## because of its inter_feature_overlap awareness: >>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts >>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts >>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts >>>>>>>>> [,1] >>>>>>>>> [1,] 0 >>>>>>>>> [2,] 0 >>>>>>>>> >>>>>>>>> ## However, countOverlaps handles this correctly, but is not the best choice when >>>>>>>>> ## counting multiple range features. >>>>>>>>> countOverlaps(gr, rd) >>>>>>>>> [1] 1 1 >>>>>>>>> >>>>>>>>> >>>>>>>>> Thomas >>>>>>>>> >>>>>>>>>> sessionInfo() >>>>>>>>> R version 3.0.0 (2013-04-03) >>>>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>>>>>>>> >>>>>>>>> locale: >>>>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>>>>>>>> >>>>>>>>> attached base packages: >>>>>>>>> [1] parallel stats graphics grDevices utils datasets methods >>>>>>>>> [8] base >>>>>>>>> >>>>>>>>> other attached packages: >>>>>>>>> [1] Rsamtools_1.12.0 Biostrings_2.28.0 GenomicRanges_1.12.1 >>>>>>>>> [4] IRanges_1.18.0 BiocGenerics_0.6.0 >>>>>>>>> >>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote: >>>>>>>>>> Hi Thomas, >>>>>>>>>> >>>>>>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote: >>>>>>>>>>> Dear Valerie, >>>>>>>>>>> >>>>>>>>>>> Is there currently any way to run summarizeOverlaps in a feature-overlap >>>>>>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently, >>>>>>>>>>> one can switch back to countOverlaps when feature overlap unawareness is >>>>>>>>>>> the more appropriate counting mode for a biological question, but then >>>>>>>>>>> double counting of reads mapping to multiple-range features is not >>>>>>>>>>> accounted for. It would be really nice to have such a feature-overlap >>>>>>>>>>> unaware option directly in summarizeOverlaps. >>>>>>>>>> >>>>>>>>>> No, we don't currently have an option to ignore feature- overlap. It >>>>>>>>>> sounds like you want to count with countOverlaps() but still want the >>>>>>>>>> double counting to be resolved. This is essentially what the other modes >>>>>>>>>> are doing so I must be missing something. >>>>>>>>>> >>>>>>>>>> In this example 2 reads hit feature A, 1 read hits feature B. With >>>>>>>>>> something like ignorefeature0L=TRUE, what results would you expect to >>>>>>>>>> see? Maybe you have another, more descriptive example? >>>>>>>>>> >>>>>>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) >>>>>>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10, >>>>>>>>>> names=c("A", "B"))) >>>>>>>>>> >>>>>>>>>>> countOverlaps(features, reads) >>>>>>>>>> [1] 2 1 >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Another question relates to the memory usage of summarizeOverlaps. Has >>>>>>>>>>> this been optimized yet? On a typical bam file with ~50-100 million >>>>>>>>>>> reads the memory usage of summarizeOverlaps is often around 10-20GB. To >>>>>>>>>>> use the function on a desktop computer or in large-scale RNA-Seq >>>>>>>>>>> projects on a commodity compute cluster, it would be desirable if every >>>>>>>>>>> counting instance would consume not more than 5GB of RAM. >>>>>>>>>> >>>>>>>>>> Have you tried the BamFileList-method? There is an example at the bottom >>>>>>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan >>>>>>>>>> mentioned, the key is to set the 'yieldSize' parameter when creating the >>>>>>>>>> BamFile. This method also makes use of mclapply(). >>>>>>>>>> >>>>>>>>>> Valerie >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Thanks in advance for your help and suggestions, >>>>>>>>>>> >>>>>>>>>>> Thomas >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Bioconductor mailing list >>>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>>>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> >> -- >> Computational Biology / Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. >> PO Box 19024 Seattle, WA 98109 >> >> Location: Arnold Building M1 B861 >> Phone: (206) 667-2793 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}
ADD REPLYlink written 4.6 years ago by Wei Shi2.7k
0
gravatar for Thomas Girke
4.6 years ago by
Thomas Girke1.6k
United States
Thomas Girke1.6k wrote:
Dear Wei, Thanks. I guess this makes sense, but is it possible this way to perform transcript level counting in addition to exonic gene-level counting? Also, this reminds me of one question related to Rsubread for which I couldn't find an answer in the documentation (haven't had time to read your NAR paper though). Is it possible to use Rsubread for aligning directly against transcriptome sequences where we want to set no restrictions on multiple mappings of equal quality in order to obtain transcript read counts? Right I am often doing this with Bowtie2 for our own transcriptome assemblies where we don't have a reference genome and then use Rsamtools for counting the reads mapping to individual transcripts. With Rsubread I was not sure if the algorithm can return any number of alternative hits? If it can then I would love to give it a try for our transcriptome assembly projects. Thomas On Thu, Apr 11, 2013 at 02:32:03AM +0000, Wei Shi wrote: > Hi Tom, > > The featureCounts function in Rsubread package goes some way like what you suggested. It is different from countOverlaps in that each read is counted at most once. It is also different from summarizeOverlaps in that it assigns the read to one of the features if it overlaps with >1 features. > > featureCounts assigns reads to exons first and then summarize exon- level read counts to gene-level read counts. Therefore, if a read overlaps with >1 exons within the same gene it will be counted only once for the gene. > > featureCounts includes in-built annotations for human and mouse genomes (NCBI RefSeq annotation), but you can provide your own annotation as well. You can also use it for the general-purpose read assignment. It is not restricted to the read counting for RNA-seq data. > > Cheers, > Wei > > On Apr 11, 2013, at 12:09 PM, Thomas Girke wrote: > > > Hi Martin, > > > > Yes, inter_feature=TRUE would maintain the current counting mode(s) that > > prohibits counting of reads mapping to multiple features. This is a > > special case of counting that is very useful for counting exonic regions > > of genes. However, one also wants to be able to turn off this behavior > > by ignoring inter feature overlaps (just like ignore.strand=T/F). > > Otherwise we cannot use summarizeOverlaps along with its current modes > > for important operations like transcript level counting because many > > transcript variants from the same gene will mask each others reads when > > inter_feature=TRUE. Providing the option to output the results from both > > inter_feature=TRUE and inter_feature=FALSE could be a very sensible > > solution and time saver for users working with new genomes/GFFs, where > > one cannot trust every nested annotation for various reasons, and > > inter_feature=TRUE can quickly become a very risky counting strategy > > since it tends to erase counts. Every biologist will scream in your > > face if the counter tells them that their favorite gene has zero counts > > just because of some overlap with some annotation error:). > > > > For multiple range features stored in GRangesList objects, I would > > currently favor making "inter_feature=FALSE" ignore the overlaps > > occurring among different list components, but not necessarily those > > within a list component (e.g. exon ranges of a gene). This way one > > can benefit from the current infrastructure by restricting its feature > > overlap scope to the range sets stored within individual list components > > but ignoring those among different list components. Utilities like reduce > > and other range modifier functions could handle situations where one wants > > to ignore all feature overlaps within and among list components. However, > > I am sure there could be other solutions to this. > > > > Long story short if inter_feature=TRUE/FALSE could be used in > > combination with modes=Union/IntersectionStrict/IntersectionNonEmpty > > resulting in six different counting flavors, I would be happy and, I am > > sure, many other Bioc users as well. > > > > I hope this makes sense. > > > > Thomas > > > > > > > > On Wed, Apr 10, 2013 at 11:37:04PM +0000, Martin Morgan wrote: > >> On 04/10/2013 12:42 PM, Thomas Girke wrote: > >>> Thanks. Adding an inter-feature unaware mode will be very helpful and also > >>> broaden summarizeOverlaps' application spectrum for a lot of use cases. > >> > >> I'm probably being quite dense here, and am mostly an outside observer. What I > >> hear you saying is that there are currently three modes -- Union, > >> IntersectionStrict, IntersectionNonEmpty. These modes are summarized in the > >> seven rows of figure 1 of > >> > >> vignette("summarizeOverlaps", package="GenomicRanges") > >> > >> Let's say there is a flag inter_feature, and when its value is TRUE then the > >> current counting schemes are obtained. These modes differ in the way counting > >> works for reads illustrated in rows 5, 6, and 7 of the figure. You'd like a > >> count scored where a '1' appears in the table below. With inter_feature=TRUE > >> reads are 'never counted more than once' ('Hits per read' is <= 1) > >> > >> |----------------------+-----+-----------+-----------+---------------| > >> | Mode | Row | Feature 1 | Feature 2 | Hits per read | > >> |----------------------+-----+-----------+-----------+---------------| > >> | Union | 5 | 1 | 0 | 1 | > >> | | 6 | 0 | 0 | 0 | > >> | | 7 | 0 | 0 | 0 | > >> | IntersectionStrict | 5 | 1 | 0 | 1 | > >> | | 6 | 1 | 0 | 1 | > >> | | 7 | 0 | 0 | 0 | > >> | IntersectionNotEmpty | 5 | 1 | 0 | 1 | > >> | | 6 | 1 | 0 | 1 | > >> | | 7 | 0 | 0 | 0 | > >> |----------------------+-----+-----------+-----------+---------------| > >> > >> > >> You'd like to add 3 more modes, with inter_feature=FALSE. Reads are sometimes > >> 'counted twice' > >> > >> |----------------------+-----+-----------+-----------+---------------| > >> | Mode | Row | Feature 1 | Feature 2 | Hits per read | > >> |----------------------+-----+-----------+-----------+---------------| > >> | Union | 5 | 1 | 0 | 1 | > >> | | 6 | 1 | 1 | 2 | > >> | | 7 | 1 | 1 | 2 | > >> | IntersectionStrict | 5 | 1 | 0 | 1 | > >> | | 6 | 1 | 0 | 1 | > >> | | 7 | 1 | 1 | 2 | > >> | IntersectionNotEmpty | 5 | 1 | 0 | 1 | > >> | | 6 | 1 | 1 | 2 | > >> | | 7 | 1 | 1 | 2 | > >> |----------------------+-----+-----------+-----------+---------------| > >> > >> > >> Martin > >> > >>> > >>> Thomas > >>> > >>> > >>> On Wed, Apr 10, 2013 at 04:59:49PM +0000, Valerie Obenchain wrote: > >>>> Hi Thomas, > >>>> > >>>> On 04/10/2013 09:23 AM, Thomas Girke wrote: > >>>>> Valerie, > >>>>> > >>>>> Please see my inserted comments below. > >>>>> > >>>>> On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote: > >>>>>> Ah, I see. You'd like to count with one of the existing modes but have > >>>>>> the option to pick up counts for these inter-feature reads (fall > >>>>>> completely 'within' >1 feature). These inter-feature reads would be > >>>>>> double (triple, quadruple, etc.) counted. Essentially they would add one > >>>>>> count to each feature they hit. Right? > >>>>> > >>>>> Correct. Perhaps let's discuss this with a very common example of > >>>>> transcript-level counting rather than counting on the unified (virtual) > >>>>> exonic regions of genes. With the current description provided in the > >>>>> summarizeOverlaps vignette at > >>>>> http://www.bioconductor.org/packages/2.12/bioc/vignettes/Genom icRanges/inst/doc/summarizeOverlaps.pdf > >>>>> I don't see how this can be achieved without falling back to using > >>>>> countOverlaps without loosing the new counting modes provided by > >>>>> summarizeOverlaps? > >>>> > >>>> It can't be achieved with the function as is but we could add an > >>>> argument to handle this (as you suggested from the start). If > >>>> 'inter-feature=TRUE' then these counts would be added to the counts > >>>> already obtained from using a particular 'mode'. I will move ahead with > >>>> implementing this argument. > >>>> > >>>>> > >>>>>> > >>>>>> One more thought about memory usage. If you are working with single-end > >>>>>> reads the summarizeOverlaps,BamFileList-method I mentioned below should > >>>>>> work fine. The approach is slightly different with paired-end reads. Our > >>>>>> current algorithm for pairing paired-end reads requires the whole file > >>>>>> to be read into memory. A different approach is currently being > >>>>>> developed but in the meantime you can take the qname-sorted approach. > >>>>>> The Bam file will need to be sorted by qname and both the 'yieldSize' > >>>>>> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An > >>>>>> example is on ?BamFile. > >>>>> > >>>>> Thanks for pointing this out. My fault that I didn't read through all > >>>>> the linked documentation. Perhaps it may not be a bad idea to make the > >>>>> memory restricted bam read instances the default setting in the future. > >>>>> This will definitely help biologists using those utilities without > >>>>> crashing their machines on the first attempt. > >>>> > >>>> Good suggestion, thanks. > >>>> > >>>> Valerie > >>>> > >>>>> > >>>>> Thomas > >>>>> > >>>>> > >>>>>> > >>>>>> Valerie > >>>>>> > >>>>>> > >>>>>> On 04/09/2013 08:01 PM, Thomas Girke wrote: > >>>>>>> Thanks for the tip. I guess doing it this way reverses the counting mode > >>>>>>> back to countOverlaps, but how can I use at the same time > >>>>>>> "IntersectionStrict" or any of the other modes provided by > >>>>>>> summarizeOverlaps if its mode argument is already used and countOverlaps > >>>>>>> doesn't accept one? > >>>>>>> > >>>>>>> Thomas > >>>>>>> > >>>>>>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote: > >>>>>>>> Thanks for the example. You're right, none of the modes will count a > >>>>>>>> read falling completely within >1 feature. > >>>>>>>> > >>>>>>>> You can supply your own function as a 'mode'. Two requirements must be met: > >>>>>>>> > >>>>>>>> (1) The function must have 3 arguments that correspond to > >>>>>>>> 'features', 'reads' and 'ignore.strand', in that order. > >>>>>>>> > >>>>>>>> (2) The function must return a vector of counts the same length > >>>>>>>> as 'features' > >>>>>>>> > >>>>>>>> Here is an example using countOverlaps() which I think gets at the > >>>>>>>> counting you want. > >>>>>>>> > >>>>>>>> counter <- function(x, y, ignore.strand) > >>>>>>>> countOverlaps(y, x, ignore.strand=ignore.strand) > >>>>>>>> > >>>>>>>>> assays(summarizeOverlaps(gr, rd, mode=counter))$counts > >>>>>>>> [,1] > >>>>>>>> [1,] 1 > >>>>>>>> [2,] 1 > >>>>>>>> > >>>>>>>> > >>>>>>>> Valerie > >>>>>>>> > >>>>>>>> On 04/09/2013 09:37 AM, Thomas Girke wrote: > >>>>>>>>> Hi Valerie, > >>>>>>>>> > >>>>>>>>> Perhaps let's call this more explicitly an ignore_inter_feature_overlap option > >>>>>>>>> that we often need for cases like this: > >>>>>>>>> > >>>>>>>>> Read1 ---- > >>>>>>>>> F1 ---------------- > >>>>>>>>> F2 ----------- > >>>>>>>>> > >>>>>>>>> where we would like to have at least the option to assign Read1 to both F1 and F2: > >>>>>>>>> > >>>>>>>>> F1: 1 > >>>>>>>>> F2: 1 > >>>>>>>>> > >>>>>>>>> However, summarizeOverlapse doesn't count Read1 at all in all of its currently > >>>>>>>>> available modes that I am aware of. This lack of an ignore_inter_feature_overlap > >>>>>>>>> option is frequently a problem when working with poorly annotated genomes (high > >>>>>>>>> uncertainty about hidden/incorrect feature overlaps) or various > >>>>>>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read > >>>>>>>>> assignments than not counting at all as shown above. > >>>>>>>>> > >>>>>>>>> ## Here is a code example illustrating the same case: > >>>>>>>>> library(GenomicRanges); library(Rsamtools) > >>>>>>>>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)), > >>>>>>>>> pos = as.integer(c(500)), > >>>>>>>>> cigar = rep("100M", 1), strand = strand(rep("+", 1))) > >>>>>>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1") > >>>>>>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2") > >>>>>>>>> gr <- c(gr1, gr2) > >>>>>>>>> > >>>>>>>>> ## All of the three current modes in summarizeOverlaps return a count of zero > >>>>>>>>> ## because of its inter_feature_overlap awareness: > >>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts > >>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts > >>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts > >>>>>>>>> [,1] > >>>>>>>>> [1,] 0 > >>>>>>>>> [2,] 0 > >>>>>>>>> > >>>>>>>>> ## However, countOverlaps handles this correctly, but is not the best choice when > >>>>>>>>> ## counting multiple range features. > >>>>>>>>> countOverlaps(gr, rd) > >>>>>>>>> [1] 1 1 > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Thomas > >>>>>>>>> > >>>>>>>>>> sessionInfo() > >>>>>>>>> R version 3.0.0 (2013-04-03) > >>>>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) > >>>>>>>>> > >>>>>>>>> locale: > >>>>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > >>>>>>>>> > >>>>>>>>> attached base packages: > >>>>>>>>> [1] parallel stats graphics grDevices utils datasets methods > >>>>>>>>> [8] base > >>>>>>>>> > >>>>>>>>> other attached packages: > >>>>>>>>> [1] Rsamtools_1.12.0 Biostrings_2.28.0 GenomicRanges_1.12.1 > >>>>>>>>> [4] IRanges_1.18.0 BiocGenerics_0.6.0 > >>>>>>>>> > >>>>>>>>> loaded via a namespace (and not attached): > >>>>>>>>> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote: > >>>>>>>>>> Hi Thomas, > >>>>>>>>>> > >>>>>>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote: > >>>>>>>>>>> Dear Valerie, > >>>>>>>>>>> > >>>>>>>>>>> Is there currently any way to run summarizeOverlaps in a feature-overlap > >>>>>>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently, > >>>>>>>>>>> one can switch back to countOverlaps when feature overlap unawareness is > >>>>>>>>>>> the more appropriate counting mode for a biological question, but then > >>>>>>>>>>> double counting of reads mapping to multiple-range features is not > >>>>>>>>>>> accounted for. It would be really nice to have such a feature-overlap > >>>>>>>>>>> unaware option directly in summarizeOverlaps. > >>>>>>>>>> > >>>>>>>>>> No, we don't currently have an option to ignore feature- overlap. It > >>>>>>>>>> sounds like you want to count with countOverlaps() but still want the > >>>>>>>>>> double counting to be resolved. This is essentially what the other modes > >>>>>>>>>> are doing so I must be missing something. > >>>>>>>>>> > >>>>>>>>>> In this example 2 reads hit feature A, 1 read hits feature B. With > >>>>>>>>>> something like ignorefeature0L=TRUE, what results would you expect to > >>>>>>>>>> see? Maybe you have another, more descriptive example? > >>>>>>>>>> > >>>>>>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) > >>>>>>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10, > >>>>>>>>>> names=c("A", "B"))) > >>>>>>>>>> > >>>>>>>>>>> countOverlaps(features, reads) > >>>>>>>>>> [1] 2 1 > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Another question relates to the memory usage of summarizeOverlaps. Has > >>>>>>>>>>> this been optimized yet? On a typical bam file with ~50-100 million > >>>>>>>>>>> reads the memory usage of summarizeOverlaps is often around 10-20GB. To > >>>>>>>>>>> use the function on a desktop computer or in large-scale RNA-Seq > >>>>>>>>>>> projects on a commodity compute cluster, it would be desirable if every > >>>>>>>>>>> counting instance would consume not more than 5GB of RAM. > >>>>>>>>>> > >>>>>>>>>> Have you tried the BamFileList-method? There is an example at the bottom > >>>>>>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan > >>>>>>>>>> mentioned, the key is to set the 'yieldSize' parameter when creating the > >>>>>>>>>> BamFile. This method also makes use of mclapply(). > >>>>>>>>>> > >>>>>>>>>> Valerie > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Thanks in advance for your help and suggestions, > >>>>>>>>>>> > >>>>>>>>>>> Thomas > >>>>>>>>>>> > >>>>>>>>>>> _______________________________________________ > >>>>>>>>>>> Bioconductor mailing list > >>>>>>>>>>> Bioconductor at r-project.org > >>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>>>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>>>>>>>>> > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor at r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >>> > >> > >> > >> -- > >> Computational Biology / Fred Hutchinson Cancer Research Center > >> 1100 Fairview Ave. N. > >> PO Box 19024 Seattle, WA 98109 > >> > >> Location: Arnold Building M1 B861 > >> Phone: (206) 667-2793 > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:4}}
ADD COMMENTlink written 4.6 years ago by Thomas Girke1.6k
Dear Tom, Please see my reply below. On Apr 11, 2013, at 12:55 PM, Thomas Girke wrote: > Dear Wei, > > Thanks. I guess this makes sense, but is it possible this way to perform > transcript level counting in addition to exonic gene-level counting? > Yes, it can do that. But you will have to provide your own transcript annotation. Rsubread only includes RefSeq annotation for exons. > Also, this reminds me of one question related to Rsubread for which I > couldn't find an answer in the documentation (haven't had time to read > your NAR paper though). Is it possible to use Rsubread for aligning > directly against transcriptome sequences where we want to set no > restrictions on multiple mappings of equal quality in order to obtain > transcript read counts? Right I am often doing this with Bowtie2 for our > own transcriptome assemblies where we don't have a reference genome and > then use Rsamtools for counting the reads mapping to individual > transcripts. With Rsubread I was not sure if the algorithm can return > any number of alternative hits? If it can then I would love to give it > a try for our transcriptome assembly projects. > > Thomas > > Rsubread reports no more than one mapping location for each read. But we probably should provide an option to allow multiple mapping locations to be reported for the reads. This should be pretty straightforward to do. I will let you know once we added this option (probably in a couple of days). Cheers, Wei ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}
ADD REPLYlink written 4.6 years ago by Wei Shi2.7k
0
gravatar for Thomas Girke
4.6 years ago by
Thomas Girke1.6k
United States
Thomas Girke1.6k wrote:
Hi Michael, Both findSpliceOverlaps and summarizeSpliceOverlaps look really interesting. I am looking forward using them for some of our splice variant profiling experiments. Thanks, Thomas On Thu, Apr 11, 2013 at 01:36:03PM +0000, Michael Lawrence wrote: > > > > On Wed, Apr 10, 2013 at 7:09 PM, Thomas Girke <thomas.girke at="" ucr.edu<mailto:thomas.girke="" at="" ucr.edu="">> wrote: > Hi Martin, > > Yes, inter_feature=TRUE would maintain the current counting mode(s) that > prohibits counting of reads mapping to multiple features. This is a > special case of counting that is very useful for counting exonic regions > of genes. However, one also wants to be able to turn off this behavior > by ignoring inter feature overlaps (just like ignore.strand=T/F). > Otherwise we cannot use summarizeOverlaps along with its current modes > for important operations like transcript level counting because many > transcript variants from the same gene will mask each others reads when > inter_feature=TRUE. Providing the option to output the results from both > inter_feature=TRUE and inter_feature=FALSE could be a very sensible > solution and time saver for users working with new genomes/GFFs, where > one cannot trust every nested annotation for various reasons, and > inter_feature=TRUE can quickly become a very risky counting strategy > since it tends to erase counts. Every biologist will scream in your > face if the counter tells them that their favorite gene has zero counts > just because of some overlap with some annotation error:). > > For multiple range features stored in GRangesList objects, I would > currently favor making "inter_feature=FALSE" ignore the overlaps > occurring among different list components, but not necessarily those > within a list component (e.g. exon ranges of a gene). This way one > can benefit from the current infrastructure by restricting its feature > overlap scope to the range sets stored within individual list components > but ignoring those among different list components. Utilities like reduce > and other range modifier functions could handle situations where one wants > to ignore all feature overlaps within and among list components. However, > I am sure there could be other solutions to this. > > Long story short if inter_feature=TRUE/FALSE could be used in > combination with modes=Union/IntersectionStrict/IntersectionNonEmpty > resulting in six different counting flavors, I would be happy and, I am > sure, many other Bioc users as well. > > > Have you taken a look at findSpliceOverlaps? Maybe summarizeSpliceOverlaps could be completed to satisfy your use case? Somehow we need to control the proliferation of counting functions and modes. Having some idea of your high-level use case might help. > > Also, for spliced alignments, I recommend giving GSNAP a try if you haven't already. It's accessible though the gmapR package. > > Michael > > I hope this makes sense. > > Thomas > > > > On Wed, Apr 10, 2013 at 11:37:04PM +0000, Martin Morgan wrote: > > On 04/10/2013 12:42 PM, Thomas Girke wrote: > > > Thanks. Adding an inter-feature unaware mode will be very helpful and also > > > broaden summarizeOverlaps' application spectrum for a lot of use cases. > > > > I'm probably being quite dense here, and am mostly an outside observer. What I > > hear you saying is that there are currently three modes -- Union, > > IntersectionStrict, IntersectionNonEmpty. These modes are summarized in the > > seven rows of figure 1 of > > > > vignette("summarizeOverlaps", package="GenomicRanges") > > > > Let's say there is a flag inter_feature, and when its value is TRUE then the > > current counting schemes are obtained. These modes differ in the way counting > > works for reads illustrated in rows 5, 6, and 7 of the figure. You'd like a > > count scored where a '1' appears in the table below. With inter_feature=TRUE > > reads are 'never counted more than once' ('Hits per read' is <= 1) > > > > |----------------------+-----+-----------+-----------+---------------| > > | Mode | Row | Feature 1 | Feature 2 | Hits per read | > > |----------------------+-----+-----------+-----------+---------------| > > | Union | 5 | 1 | 0 | 1 | > > | | 6 | 0 | 0 | 0 | > > | | 7 | 0 | 0 | 0 | > > | IntersectionStrict | 5 | 1 | 0 | 1 | > > | | 6 | 1 | 0 | 1 | > > | | 7 | 0 | 0 | 0 | > > | IntersectionNotEmpty | 5 | 1 | 0 | 1 | > > | | 6 | 1 | 0 | 1 | > > | | 7 | 0 | 0 | 0 | > > |----------------------+-----+-----------+-----------+---------------| > > > > > > You'd like to add 3 more modes, with inter_feature=FALSE. Reads are sometimes > > 'counted twice' > > > > |----------------------+-----+-----------+-----------+---------------| > > | Mode | Row | Feature 1 | Feature 2 | Hits per read | > > |----------------------+-----+-----------+-----------+---------------| > > | Union | 5 | 1 | 0 | 1 | > > | | 6 | 1 | 1 | 2 | > > | | 7 | 1 | 1 | 2 | > > | IntersectionStrict | 5 | 1 | 0 | 1 | > > | | 6 | 1 | 0 | 1 | > > | | 7 | 1 | 1 | 2 | > > | IntersectionNotEmpty | 5 | 1 | 0 | 1 | > > | | 6 | 1 | 1 | 2 | > > | | 7 | 1 | 1 | 2 | > > |----------------------+-----+-----------+-----------+---------------| > > > > > > Martin > > > > > > > > Thomas > > > > > > > > > On Wed, Apr 10, 2013 at 04:59:49PM +0000, Valerie Obenchain wrote: > > >> Hi Thomas, > > >> > > >> On 04/10/2013 09:23 AM, Thomas Girke wrote: > > >>> Valerie, > > >>> > > >>> Please see my inserted comments below. > > >>> > > >>> On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote: > > >>>> Ah, I see. You'd like to count with one of the existing modes but have > > >>>> the option to pick up counts for these inter-feature reads (fall > > >>>> completely 'within' >1 feature). These inter-feature reads would be > > >>>> double (triple, quadruple, etc.) counted. Essentially they would add one > > >>>> count to each feature they hit. Right? > > >>> > > >>> Correct. Perhaps let's discuss this with a very common example of > > >>> transcript-level counting rather than counting on the unified (virtual) > > >>> exonic regions of genes. With the current description provided in the > > >>> summarizeOverlaps vignette at > > >>> http://www.bioconductor.org/packages/2.12/bioc/vignettes/Genom icRanges/inst/doc/summarizeOverlaps.pdf > > >>> I don't see how this can be achieved without falling back to using > > >>> countOverlaps without loosing the new counting modes provided by > > >>> summarizeOverlaps? > > >> > > >> It can't be achieved with the function as is but we could add an > > >> argument to handle this (as you suggested from the start). If > > >> 'inter-feature=TRUE' then these counts would be added to the counts > > >> already obtained from using a particular 'mode'. I will move ahead with > > >> implementing this argument. > > >> > > >>> > > >>>> > > >>>> One more thought about memory usage. If you are working with single-end > > >>>> reads the summarizeOverlaps,BamFileList-method I mentioned below should > > >>>> work fine. The approach is slightly different with paired-end reads. Our > > >>>> current algorithm for pairing paired-end reads requires the whole file > > >>>> to be read into memory. A different approach is currently being > > >>>> developed but in the meantime you can take the qname-sorted approach. > > >>>> The Bam file will need to be sorted by qname and both the 'yieldSize' > > >>>> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An > > >>>> example is on ?BamFile. > > >>> > > >>> Thanks for pointing this out. My fault that I didn't read through all > > >>> the linked documentation. Perhaps it may not be a bad idea to make the > > >>> memory restricted bam read instances the default setting in the future. > > >>> This will definitely help biologists using those utilities without > > >>> crashing their machines on the first attempt. > > >> > > >> Good suggestion, thanks. > > >> > > >> Valerie > > >> > > >>> > > >>> Thomas > > >>> > > >>> > > >>>> > > >>>> Valerie > > >>>> > > >>>> > > >>>> On 04/09/2013 08:01 PM, Thomas Girke wrote: > > >>>>> Thanks for the tip. I guess doing it this way reverses the counting mode > > >>>>> back to countOverlaps, but how can I use at the same time > > >>>>> "IntersectionStrict" or any of the other modes provided by > > >>>>> summarizeOverlaps if its mode argument is already used and countOverlaps > > >>>>> doesn't accept one? > > >>>>> > > >>>>> Thomas > > >>>>> > > >>>>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote: > > >>>>>> Thanks for the example. You're right, none of the modes will count a > > >>>>>> read falling completely within >1 feature. > > >>>>>> > > >>>>>> You can supply your own function as a 'mode'. Two requirements must be met: > > >>>>>> > > >>>>>> (1) The function must have 3 arguments that correspond to > > >>>>>> 'features', 'reads' and 'ignore.strand', in that order. > > >>>>>> > > >>>>>> (2) The function must return a vector of counts the same length > > >>>>>> as 'features' > > >>>>>> > > >>>>>> Here is an example using countOverlaps() which I think gets at the > > >>>>>> counting you want. > > >>>>>> > > >>>>>> counter <- function(x, y, ignore.strand) > > >>>>>> countOverlaps(y, x, ignore.strand=ignore.strand) > > >>>>>> > > >>>>>> > assays(summarizeOverlaps(gr, rd, mode=counter))$counts > > >>>>>> [,1] > > >>>>>> [1,] 1 > > >>>>>> [2,] 1 > > >>>>>> > > >>>>>> > > >>>>>> Valerie > > >>>>>> > > >>>>>> On 04/09/2013 09:37 AM, Thomas Girke wrote: > > >>>>>>> Hi Valerie, > > >>>>>>> > > >>>>>>> Perhaps let's call this more explicitly an ignore_inter_feature_overlap option > > >>>>>>> that we often need for cases like this: > > >>>>>>> > > >>>>>>> Read1 ---- > > >>>>>>> F1 ---------------- > > >>>>>>> F2 ----------- > > >>>>>>> > > >>>>>>> where we would like to have at least the option to assign Read1 to both F1 and F2: > > >>>>>>> > > >>>>>>> F1: 1 > > >>>>>>> F2: 1 > > >>>>>>> > > >>>>>>> However, summarizeOverlapse doesn't count Read1 at all in all of its currently > > >>>>>>> available modes that I am aware of. This lack of an ignore_inter_feature_overlap > > >>>>>>> option is frequently a problem when working with poorly annotated genomes (high > > >>>>>>> uncertainty about hidden/incorrect feature overlaps) or various > > >>>>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read > > >>>>>>> assignments than not counting at all as shown above. > > >>>>>>> > > >>>>>>> ## Here is a code example illustrating the same case: > > >>>>>>> library(GenomicRanges); library(Rsamtools) > > >>>>>>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)), > > >>>>>>> pos = as.integer(c(500)), > > >>>>>>> cigar = rep("100M", 1), strand = strand(rep("+", 1))) > > >>>>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1") > > >>>>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2") > > >>>>>>> gr <- c(gr1, gr2) > > >>>>>>> > > >>>>>>> ## All of the three current modes in summarizeOverlaps return a count of zero > > >>>>>>> ## because of its inter_feature_overlap awareness: > > >>>>>>> assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts > > >>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts > > >>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts > > >>>>>>> [,1] > > >>>>>>> [1,] 0 > > >>>>>>> [2,] 0 > > >>>>>>> > > >>>>>>> ## However, countOverlaps handles this correctly, but is not the best choice when > > >>>>>>> ## counting multiple range features. > > >>>>>>> countOverlaps(gr, rd) > > >>>>>>> [1] 1 1 > > >>>>>>> > > >>>>>>> > > >>>>>>> Thomas > > >>>>>>> > > >>>>>>>> sessionInfo() > > >>>>>>> R version 3.0.0 (2013-04-03) > > >>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit) > > >>>>>>> > > >>>>>>> locale: > > >>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > >>>>>>> > > >>>>>>> attached base packages: > > >>>>>>> [1] parallel stats graphics grDevices utils datasets methods > > >>>>>>> [8] base > > >>>>>>> > > >>>>>>> other attached packages: > > >>>>>>> [1] Rsamtools_1.12.0 Biostrings_2.28.0 GenomicRanges_1.12.1 > > >>>>>>> [4] IRanges_1.18.0 BiocGenerics_0.6.0 > > >>>>>>> > > >>>>>>> loaded via a namespace (and not attached): > > >>>>>>> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 > > >>>>>>> > > >>>>>>> > > >>>>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote: > > >>>>>>>> Hi Thomas, > > >>>>>>>> > > >>>>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote: > > >>>>>>>>> Dear Valerie, > > >>>>>>>>> > > >>>>>>>>> Is there currently any way to run summarizeOverlaps in a feature-overlap > > >>>>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently, > > >>>>>>>>> one can switch back to countOverlaps when feature overlap unawareness is > > >>>>>>>>> the more appropriate counting mode for a biological question, but then > > >>>>>>>>> double counting of reads mapping to multiple-range features is not > > >>>>>>>>> accounted for. It would be really nice to have such a feature-overlap > > >>>>>>>>> unaware option directly in summarizeOverlaps. > > >>>>>>>> > > >>>>>>>> No, we don't currently have an option to ignore feature- overlap. It > > >>>>>>>> sounds like you want to count with countOverlaps() but still want the > > >>>>>>>> double counting to be resolved. This is essentially what the other modes > > >>>>>>>> are doing so I must be missing something. > > >>>>>>>> > > >>>>>>>> In this example 2 reads hit feature A, 1 read hits feature B. With > > >>>>>>>> something like ignorefeature0L=TRUE, what results would you expect to > > >>>>>>>> see? Maybe you have another, more descriptive example? > > >>>>>>>> > > >>>>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3)) > > >>>>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10, > > >>>>>>>> names=c("A", "B"))) > > >>>>>>>> > > >>>>>>>> > countOverlaps(features, reads) > > >>>>>>>> [1] 2 1 > > >>>>>>>> > > >>>>>>>> > > >>>>>>>>> > > >>>>>>>>> Another question relates to the memory usage of summarizeOverlaps. Has > > >>>>>>>>> this been optimized yet? On a typical bam file with ~50-100 million > > >>>>>>>>> reads the memory usage of summarizeOverlaps is often around 10-20GB. To > > >>>>>>>>> use the function on a desktop computer or in large-scale RNA-Seq > > >>>>>>>>> projects on a commodity compute cluster, it would be desirable if every > > >>>>>>>>> counting instance would consume not more than 5GB of RAM. > > >>>>>>>> > > >>>>>>>> Have you tried the BamFileList-method? There is an example at the bottom > > >>>>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan > > >>>>>>>> mentioned, the key is to set the 'yieldSize' parameter when creating the > > >>>>>>>> BamFile. This method also makes use of mclapply(). > > >>>>>>>> > > >>>>>>>> Valerie > > >>>>>>>> > > >>>>>>>>> > > >>>>>>>>> Thanks in advance for your help and suggestions, > > >>>>>>>>> > > >>>>>>>>> Thomas > > >>>>>>>>> > > >>>>>>>>> _______________________________________________ > > >>>>>>>>> Bioconductor mailing list > > >>>>>>>>> Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> > > >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > > >>>>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > >>>>>>>>> > > > > > > _______________________________________________ > > > Bioconductor mailing list > > > Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > -- > > Computational Biology / Fred Hutchinson Cancer Research Center > > 1100 Fairview Ave. N. > > PO Box 19024 Seattle, WA 98109 > > > > Location: Arnold Building M1 B861 > > Phone: (206) 667-2793<tel:%28206%29%20667-2793> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD COMMENTlink written 4.6 years ago by Thomas Girke1.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 261 users visited in the last hour