multiple feature/mode counting with summarizeOverlaps
2
0
Entering edit mode
Thomas Girke ★ 1.7k
@thomas-girke-993
Last seen 8 months ago
United States
I apologize up front to Valarie for posting additional questions related to her excellent read counter. This time my questions are related to the economics of accessing big files as few times as possible to improve workflow performance in large RNA-Seq projects and to subsequently allow deleting all input bam files since BIG DATA is killing us :). (1) What is currently the intended approach to obtain counts for multiple feature types in a single pass-through of one or many bam files? For instance, often we want to count the reads overlapping with many different features types (not just one), such as exonBygenes, cdsBygenes, UTRs, introns and intergenics. Computing all those feature counts sequentially in a loop is easy, but can be time consuming and heavy on disk I/O (the most expensive resource in NGS analysis) and internal networks when running many counting processes in parallel. As a partial solution to this, I often organize all feature types in one big GRanges/GRangesList container, perform the counting and then split the resulting count tables accordingly. However, is this really the best way of doing this? A major limitation of this approach is that it does not support feature overlap aware counting modes. (2) Similarly, what if I want to generate counts for multiple counting modes in a single call to summarizeOverlaps, such as different overlap modes, or sense and antisense counts which we often want to report when working with strand specific RNA-Seq data? Are there plans to support this? BTW: currently, I generate antisense read counts by inversing the strand information of the features which works fine just not for both ways in one step. (3) Is it currently possible to return the total numbers of aligned reads in bam files with summarizeOverlaps and store them in the resulting counting object? There are many places from where the numbers of aligned reads can be obtained, but to me the most obvious step to generate and store them would be right here. They are useful parameters for alignment QC'ing and reporting proper RPKM/FPKM values to biologists. Thomas
Alignment Alignment • 1.4k views
0
Entering edit mode
@valerie-obenchain-4275
Last seen 23 months ago
United States
Hi Thomas, On 05/20/2013 10:12 PM, Thomas Girke wrote: > I apologize up front to Valarie for posting additional questions related > to her excellent read counter. This time my questions are related to the > economics of accessing big files as few times as possible to improve > workflow performance in large RNA-Seq projects and to subsequently allow > deleting all input bam files since BIG DATA is killing us :). > > (1) What is currently the intended approach to obtain counts for > multiple feature types in a single pass-through of one or many bam > files? For instance, often we want to count the reads overlapping with > many different features types (not just one), such as exonBygenes, > cdsBygenes, UTRs, introns and intergenics. Computing all those feature > counts sequentially in a loop is easy, but can be time consuming and > heavy on disk I/O (the most expensive resource in NGS analysis) and > internal networks when running many counting processes in parallel. As a > partial solution to this, I often organize all feature types in one big > GRanges/GRangesList container, perform the counting and then split the > resulting count tables accordingly. However, is this really the best way > of doing this? A major limitation of this approach is that it does not > support feature overlap aware counting modes. I would not recommend combining different annotations into a single GRanges/GRangesList. To summarizeOverlaps(), each row of a GRanges or each list element of a GRangesList represents a feature. If a read hits more than one feature we have a double hit situation that must be resolved. If you're interested in counting both exonsByGene and transcriptsByGene you probably want these to be independent counts. By combining them into one object they can be confounding and create multiple hits situations where there really aren't any. Toy example, but let's say 'gr1' comes from exonsBy() and 'gr2' from 3UTRsByTranscript(). gr1 <- GRanges(c("chr1", "chr1"), IRanges(c(1, 10), width=10)) gr2 <- GRanges("chr1", IRanges(25, width=10)) grl <- GRangesList(gr1, gr2) reads <- GAlignments("chr1", pos=18L, cigar="10M", strand=strand("*")) The read hits gr1 and gr2. If counted separately, > assays(summarizeOverlaps(grl[1], reads, Union))$counts reads [1,] 1 > assays(summarizeOverlaps(grl[2], reads, Union))$counts reads [1,] 1 Now count as GRangesList, > assays(summarizeOverlaps(grl, reads, Union))$counts reads [1,] 0 [2,] 0 Using 'inter.feature=FALSE' will regain the hits with modes Union and IntersectionStrict but not necessarily IntersectionNotEmpty because shared regions of the annotation are removed before counting. > assays(summarizeOverlaps(grl, reads, Union, inter.feature=FALSE))$counts reads [1,] 1 [2,] 1 > > (2) Similarly, what if I want to generate counts for multiple counting > modes in a single call to summarizeOverlaps, such as different overlap > modes, or sense and antisense counts which we often want to report when > working with strand specific RNA-Seq data? Are there plans to support > this? BTW: currently, I generate antisense read counts by inversing the > strand information of the features which works fine just not for both > ways in one step. We don't have a built-in method to count Bams over multiple annotations and multiple count modes. There are several approaches you could take. After some discussion this was one we came up with. ## Count a single file. count1bam <- function(file, features, mode, ..., yieldSize=1000000L, reader=readGAlignments, param=ScanBamParam()) { ## initialize bf <- open(BamFile(file, yieldSize=yieldSize)) counts <- lapply(sapply(features, length), integer) ## iterate while(length(reads <- reader(bf, param=param))) for (i in seq_along(features)) counts[[i]] <- counts[[i]] + mode(features[[i]], reads, ...) close(bf) counts } ## Count over files in parallel. countbams <- function(features, files, ...) { counts0 <- mclapply(files, count1bam, features=features, ...) counts <- lapply(names(features), function(i, counts) { sapply(counts0, "[[", i) }, counts0) } ## In practice. library(parallel); options(mc.cores=detectCores()) library(GenomicRanges) library(Rsamtools) features <- list(coarse=GRanges("seq1", successiveIRanges(rep(100, 15))), fine=GRanges("seq2", successiveIRanges(rep(20, 45)))) fl <- system.file("extdata", "ex1.bam", package="Rsamtools") files <- list(fl, fl) ## usually different files ## As 'mode' you can directly call Union, IntersectionStrict, ## and IntersectionNotEmpty. result <- countbams(features, files, mode=Union, ignore.strand=TRUE, inter.feature=FALSE) ## 'reader' specifies how the data are read in. For paired-end ## use readGAlignmentsList (qname sorted or readGAlignmentPairs. result <- countbams(features, files, mode=Union, reader=readGAlignmentPairs) This code handles the case of multiple Bam with multiple annotations. To handle combinations for multiple arguments ('mode', 'ignore.strand', 'inter.feature', etc.) you could add a Map() in the 'while' loop in count1bam. Instead of passing a single value you would pass lists the same length as the list of annotations. For example, modeList <- c(Union, IntersectionStrict) ignoreStrandList <- c(TRUE, TRUE) interFeatureList <- c(TRUE, FALSE) > > (3) Is it currently possible to return the total numbers of aligned > reads in bam files with summarizeOverlaps and store them in the > resulting counting object? There are many places from where the numbers > of aligned reads can be obtained, but to me the most obvious step to > generate and store them would be right here. They are useful parameters > for alignment QC'ing and reporting proper RPKM/FPKM values to biologists. > I've added the output of countBam() to the 'colData' slot of the SummarizedExperiment. The 'mapped' counts are computed with countBam() and a param where 'isUnmappedQuery=FALSE'. This isn't a great example but does demonstrate the new output. library(Rsamtools) library(pasillaBamSubset) library("TxDb.Dmelanogaster.UCSC.dm3.ensGene") exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene) se <- summarizeOverlaps(exbygene, BamFile(untreated1_chr4())) > colData(se) DataFrame with 1 row and 3 columns records nucleotides mapped <integer> <numeric> <integer> untreated1_chr4.bam 204355 15326625 204355 Code updates are in GenomicRanges 1.13.15 and Rsamtools 1.13.16. Valerie > Thomas > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
0
Entering edit mode
Thomas Girke ★ 1.7k
@thomas-girke-993
Last seen 8 months ago
United States
Thanks Varlerie for looking into these suggestions and improvements. I really appreciate the time and effort you keep investing into my unsolicited questions. - Obviously, read counting has become a major activity for many of us and your software will definitley put to great use. In our projects, flexiblity and accuracy of the read counter is most important to us, which summarizeOverlaps and associates address very well. Since research facilities like ours have to perform those analyses on very large disk storage devices with several hundred TBs of space that are shared among many NGS users, disk I/O and network speeds on compute clusters have become a major bottleneck. Thus, we are also very interested in solutions that collect counts for many feature types and counting modes in a single pass-through of these big files, as opposed to accessing them over and over again when it is not really necessary. Usually, I also like to delete the bam files early on in most of our RNA-Seq workflows to save storage space. However, our collaborators often keep coming up with all kinds of creative and exciting new ideas what features/modes to count on where the best analysis strategy is often to generate by default counts for many feature types right away (even if I don't intend and recommend or to use them) rather than only one prescibed solution that usually ends up not being enough. I guess RNA-Seq is still very much in an engineering stage where most of us depend on software design that provides a very high level of flexiblitly. Again, thanks a lot. Thomas On Thu, May 23, 2013 at 09:06:10PM +0000, Valerie Obenchain wrote: > Hi Thomas, > > On 05/20/2013 10:12 PM, Thomas Girke wrote: > > I apologize up front to Valarie for posting additional questions related > > to her excellent read counter. This time my questions are related to the > > economics of accessing big files as few times as possible to improve > > workflow performance in large RNA-Seq projects and to subsequently allow > > deleting all input bam files since BIG DATA is killing us :). > > > > (1) What is currently the intended approach to obtain counts for > > multiple feature types in a single pass-through of one or many bam > > files? For instance, often we want to count the reads overlapping with > > many different features types (not just one), such as exonBygenes, > > cdsBygenes, UTRs, introns and intergenics. Computing all those feature > > counts sequentially in a loop is easy, but can be time consuming and > > heavy on disk I/O (the most expensive resource in NGS analysis) and > > internal networks when running many counting processes in parallel. As a > > partial solution to this, I often organize all feature types in one big > > GRanges/GRangesList container, perform the counting and then split the > > resulting count tables accordingly. However, is this really the best way > > of doing this? A major limitation of this approach is that it does not > > support feature overlap aware counting modes. > > I would not recommend combining different annotations into a single > GRanges/GRangesList. To summarizeOverlaps(), each row of a GRanges or > each list element of a GRangesList represents a feature. If a read hits > more than one feature we have a double hit situation that must be > resolved. If you're interested in counting both exonsByGene and > transcriptsByGene you probably want these to be independent counts. By > combining them into one object they can be confounding and create > multiple hits situations where there really aren't any. > > Toy example, but let's say 'gr1' comes from exonsBy() and 'gr2' from > 3UTRsByTranscript(). > > gr1 <- GRanges(c("chr1", "chr1"), IRanges(c(1, 10), width=10)) > gr2 <- GRanges("chr1", IRanges(25, width=10)) > grl <- GRangesList(gr1, gr2) > reads <- GAlignments("chr1", pos=18L, cigar="10M", strand=strand("*")) > > > The read hits gr1 and gr2. If counted separately, > > > assays(summarizeOverlaps(grl[1], reads, Union))$counts > reads > [1,] 1 > > assays(summarizeOverlaps(grl[2], reads, Union))$counts > reads > [1,] 1 > > Now count as GRangesList, > > > assays(summarizeOverlaps(grl, reads, Union))$counts > reads > [1,] 0 > [2,] 0 > > > Using 'inter.feature=FALSE' will regain the hits with modes Union and > IntersectionStrict but not necessarily IntersectionNotEmpty because > shared regions of the annotation are removed before counting. > > > assays(summarizeOverlaps(grl, reads, Union, inter.feature=FALSE))$counts > reads > [1,] 1 > [2,] 1 > > > > > (2) Similarly, what if I want to generate counts for multiple counting > > modes in a single call to summarizeOverlaps, such as different overlap > > modes, or sense and antisense counts which we often want to report when > > working with strand specific RNA-Seq data? Are there plans to support > > this? BTW: currently, I generate antisense read counts by inversing the > > strand information of the features which works fine just not for both > > ways in one step. > > We don't have a built-in method to count Bams over multiple annotations > and multiple count modes. There are several approaches you could take. > After some discussion this was one we came up with. > > ## Count a single file. > count1bam <- > function(file, features, mode, ..., > yieldSize=1000000L, reader=readGAlignments, > param=ScanBamParam()) > { > ## initialize > bf <- open(BamFile(file, yieldSize=yieldSize)) > counts <- lapply(sapply(features, length), integer) > > ## iterate > while(length(reads <- reader(bf, param=param))) > for (i in seq_along(features)) > counts[[i]] <- > counts[[i]] + mode(features[[i]], reads, ...) > close(bf) > counts > } > > ## Count over files in parallel. > countbams <- > function(features, files, ...) > { > counts0 <- mclapply(files, count1bam, features=features, ...) > counts <- lapply(names(features), function(i, counts) { > sapply(counts0, "[[", i) > }, counts0) > } > > ## In practice. > library(parallel); options(mc.cores=detectCores()) > library(GenomicRanges) > library(Rsamtools) > features <- > list(coarse=GRanges("seq1", successiveIRanges(rep(100, 15))), > fine=GRanges("seq2", successiveIRanges(rep(20, 45)))) > fl <- system.file("extdata", "ex1.bam", package="Rsamtools") > files <- list(fl, fl) ## usually different files > > ## As 'mode' you can directly call Union, IntersectionStrict, > ## and IntersectionNotEmpty. > result <- countbams(features, files, mode=Union, > ignore.strand=TRUE, inter.feature=FALSE) > > ## 'reader' specifies how the data are read in. For paired-end > ## use readGAlignmentsList (qname sorted or readGAlignmentPairs. > result <- countbams(features, files, mode=Union, > reader=readGAlignmentPairs) > > This code handles the case of multiple Bam with multiple annotations. To > handle combinations for multiple arguments ('mode', 'ignore.strand', > 'inter.feature', etc.) you could add a Map() in the 'while' loop in > count1bam. Instead of passing a single value you would pass lists the > same length as the list of annotations. For example, > > modeList <- c(Union, IntersectionStrict) > ignoreStrandList <- c(TRUE, TRUE) > interFeatureList <- c(TRUE, FALSE) > > > > > (3) Is it currently possible to return the total numbers of aligned > > reads in bam files with summarizeOverlaps and store them in the > > resulting counting object? There are many places from where the numbers > > of aligned reads can be obtained, but to me the most obvious step to > > generate and store them would be right here. They are useful parameters > > for alignment QC'ing and reporting proper RPKM/FPKM values to biologists. > > > > I've added the output of countBam() to the 'colData' slot of the > SummarizedExperiment. The 'mapped' counts are computed with countBam() > and a param where 'isUnmappedQuery=FALSE'. This isn't a great example > but does demonstrate the new output. > > library(Rsamtools) > library(pasillaBamSubset) > library("TxDb.Dmelanogaster.UCSC.dm3.ensGene") > exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene) > se <- summarizeOverlaps(exbygene, BamFile(untreated1_chr4())) > > > colData(se) > DataFrame with 1 row and 3 columns > records nucleotides mapped > <integer> <numeric> <integer> > untreated1_chr4.bam 204355 15326625 204355 > > > > Code updates are in GenomicRanges 1.13.15 and Rsamtools 1.13.16. > > Valerie > > > Thomas > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >
0
Entering edit mode
Thanks Thomas. I appreciate the thoughts and the back-story details. These are helpful in shaping future decisions. Valerie On 05/23/2013 03:39 PM, Thomas Girke wrote: > Thanks Varlerie for looking into these suggestions and improvements. I > really appreciate the time and effort you keep investing into my > unsolicited questions. - Obviously, read counting has become a major > activity for many of us and your software will definitley put to great > use. In our projects, flexiblity and accuracy of the read counter is > most important to us, which summarizeOverlaps and associates address > very well. Since research facilities like ours have to perform those > analyses on very large disk storage devices with several hundred TBs of > space that are shared among many NGS users, disk I/O and network speeds > on compute clusters have become a major bottleneck. Thus, we are also > very interested in solutions that collect counts for many feature types > and counting modes in a single pass-through of these big files, as > opposed to accessing them over and over again when it is not really > necessary. Usually, I also like to delete the bam files early on in most > of our RNA-Seq workflows to save storage space. However, our > collaborators often keep coming up with all kinds of creative and > exciting new ideas what features/modes to count on where the best > analysis strategy is often to generate by default counts for many > feature types right away (even if I don't intend and recommend or to use > them) rather than only one prescibed solution that usually ends up not > being enough. I guess RNA-Seq is still very much in an engineering stage > where most of us depend on software design that provides a very high > level of flexiblitly. > > Again, thanks a lot. > > Thomas > > > > On Thu, May 23, 2013 at 09:06:10PM +0000, Valerie Obenchain wrote: >> Hi Thomas, >> >> On 05/20/2013 10:12 PM, Thomas Girke wrote: >>> I apologize up front to Valarie for posting additional questions related >>> to her excellent read counter. This time my questions are related to the >>> economics of accessing big files as few times as possible to improve >>> workflow performance in large RNA-Seq projects and to subsequently allow >>> deleting all input bam files since BIG DATA is killing us :). >>> >>> (1) What is currently the intended approach to obtain counts for >>> multiple feature types in a single pass-through of one or many bam >>> files? For instance, often we want to count the reads overlapping with >>> many different features types (not just one), such as exonBygenes, >>> cdsBygenes, UTRs, introns and intergenics. Computing all those feature >>> counts sequentially in a loop is easy, but can be time consuming and >>> heavy on disk I/O (the most expensive resource in NGS analysis) and >>> internal networks when running many counting processes in parallel. As a >>> partial solution to this, I often organize all feature types in one big >>> GRanges/GRangesList container, perform the counting and then split the >>> resulting count tables accordingly. However, is this really the best way >>> of doing this? A major limitation of this approach is that it does not >>> support feature overlap aware counting modes. >> >> I would not recommend combining different annotations into a single >> GRanges/GRangesList. To summarizeOverlaps(), each row of a GRanges or >> each list element of a GRangesList represents a feature. If a read hits >> more than one feature we have a double hit situation that must be >> resolved. If you're interested in counting both exonsByGene and >> transcriptsByGene you probably want these to be independent counts. By >> combining them into one object they can be confounding and create >> multiple hits situations where there really aren't any. >> >> Toy example, but let's say 'gr1' comes from exonsBy() and 'gr2' from >> 3UTRsByTranscript(). >> >> gr1 <- GRanges(c("chr1", "chr1"), IRanges(c(1, 10), width=10)) >> gr2 <- GRanges("chr1", IRanges(25, width=10)) >> grl <- GRangesList(gr1, gr2) >> reads <- GAlignments("chr1", pos=18L, cigar="10M", strand=strand("*")) >> >> >> The read hits gr1 and gr2. If counted separately, >> >> > assays(summarizeOverlaps(grl[1], reads, Union))$counts >> reads >> [1,] 1 >> > assays(summarizeOverlaps(grl[2], reads, Union))$counts >> reads >> [1,] 1 >> >> Now count as GRangesList, >> >> > assays(summarizeOverlaps(grl, reads, Union))$counts >> reads >> [1,] 0 >> [2,] 0 >> >> >> Using 'inter.feature=FALSE' will regain the hits with modes Union and >> IntersectionStrict but not necessarily IntersectionNotEmpty because >> shared regions of the annotation are removed before counting. >> >> > assays(summarizeOverlaps(grl, reads, Union, inter.feature=FALSE))$counts >> reads >> [1,] 1 >> [2,] 1 >> >>> >>> (2) Similarly, what if I want to generate counts for multiple counting >>> modes in a single call to summarizeOverlaps, such as different overlap >>> modes, or sense and antisense counts which we often want to report when >>> working with strand specific RNA-Seq data? Are there plans to support >>> this? BTW: currently, I generate antisense read counts by inversing the >>> strand information of the features which works fine just not for both >>> ways in one step. >> >> We don't have a built-in method to count Bams over multiple annotations >> and multiple count modes. There are several approaches you could take. >> After some discussion this was one we came up with. >> >> ## Count a single file. >> count1bam <- >> function(file, features, mode, ..., >> yieldSize=1000000L, reader=readGAlignments, >> param=ScanBamParam()) >> { >> ## initialize >> bf <- open(BamFile(file, yieldSize=yieldSize)) >> counts <- lapply(sapply(features, length), integer) >> >> ## iterate >> while(length(reads <- reader(bf, param=param))) >> for (i in seq_along(features)) >> counts[[i]] <- >> counts[[i]] + mode(features[[i]], reads, ...) >> close(bf) >> counts >> } >> >> ## Count over files in parallel. >> countbams <- >> function(features, files, ...) >> { >> counts0 <- mclapply(files, count1bam, features=features, ...) >> counts <- lapply(names(features), function(i, counts) { >> sapply(counts0, "[[", i) >> }, counts0) >> } >> >> ## In practice. >> library(parallel); options(mc.cores=detectCores()) >> library(GenomicRanges) >> library(Rsamtools) >> features <- >> list(coarse=GRanges("seq1", successiveIRanges(rep(100, 15))), >> fine=GRanges("seq2", successiveIRanges(rep(20, 45)))) >> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") >> files <- list(fl, fl) ## usually different files >> >> ## As 'mode' you can directly call Union, IntersectionStrict, >> ## and IntersectionNotEmpty. >> result <- countbams(features, files, mode=Union, >> ignore.strand=TRUE, inter.feature=FALSE) >> >> ## 'reader' specifies how the data are read in. For paired- end >> ## use readGAlignmentsList (qname sorted or readGAlignmentPairs. >> result <- countbams(features, files, mode=Union, >> reader=readGAlignmentPairs) >> >> This code handles the case of multiple Bam with multiple annotations. To >> handle combinations for multiple arguments ('mode', 'ignore.strand', >> 'inter.feature', etc.) you could add a Map() in the 'while' loop in >> count1bam. Instead of passing a single value you would pass lists the >> same length as the list of annotations. For example, >> >> modeList <- c(Union, IntersectionStrict) >> ignoreStrandList <- c(TRUE, TRUE) >> interFeatureList <- c(TRUE, FALSE) >> >>> >>> (3) Is it currently possible to return the total numbers of aligned >>> reads in bam files with summarizeOverlaps and store them in the >>> resulting counting object? There are many places from where the numbers >>> of aligned reads can be obtained, but to me the most obvious step to >>> generate and store them would be right here. They are useful parameters >>> for alignment QC'ing and reporting proper RPKM/FPKM values to biologists. >>> >> >> I've added the output of countBam() to the 'colData' slot of the >> SummarizedExperiment. The 'mapped' counts are computed with countBam() >> and a param where 'isUnmappedQuery=FALSE'. This isn't a great example >> but does demonstrate the new output. >> >> library(Rsamtools) >> library(pasillaBamSubset) >> library("TxDb.Dmelanogaster.UCSC.dm3.ensGene") >> exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene) >> se <- summarizeOverlaps(exbygene, BamFile(untreated1_chr4())) >> >> > colData(se) >> DataFrame with 1 row and 3 columns >> records nucleotides mapped >> <integer> <numeric> <integer> >> untreated1_chr4.bam 204355 15326625 204355 >> >> >> >> Code updates are in GenomicRanges 1.13.15 and Rsamtools 1.13.16. >> >> Valerie >> >>> Thomas >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>