Search
Question: GenomicAlignments - Speeding up readGAlignmentPairs()
0
gravatar for Valerie Obenchain
3.5 years ago by
Valerie Obenchain ♦♦ 6.4k
United States
Valerie Obenchain ♦♦ 6.4k wrote:
Hi, You don't need to call findMateAlignment() when using readGAlignmentPairs(). In a previous iteration, readGAlignmentPairs() called findMateAlignment() but this is no longer the case. readGAlignmentPairs() does have more overhead than readGAlignments() because of the mate paring. These numbers are for a BAM file with 800484 records (400054 final pairs) on my not-so-speedy laptop. library(microbenchmark) library(RNAseqData.HNRNPC.bam.chr14) fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES >> microbenchmark(readGAlignments(fls[1]), times=5) > Unit: seconds > expr min lq median uq max neval > readGAlignments(fls[1]) 1.753653 1.788292 1.85527 2.019071 2.121211 5 >> ## mate pairing >> microbenchmark(readGAlignmentPairs(fls[1]), times=5) > Unit: seconds > expr min lq median uq max neval > readGAlignmentPairs(fls[1]) 9.283966 9.300328 9.535194 9.843282 11.37441 5 ## mate pairing >> microbenchmark(readGAlignmentsList(fls[1]), times=5) > Unit: seconds > expr min lq median uq max neval > readGAlignmentsList(fls[1]) 8.409743 8.457232 8.803696 9.692845 9.867628 5 You can set a 'yieldSize' when creating the BamFile. This will chunk through the file 'yieldSize' records at a time and is useful when processing a complete file and when there are memory limitations. bf <- BamFile(myfile, yieldSize = 1000000) Herve reported some additional timings for readGAlignmentPairs() on this post: https://stat.ethz.ch/pipermail/bioconductor/2014-May/059695.html Can you provide some numbers for how long your run is taking and the number of records you're trying to read in? I assume 'allExons' is used for counting against the BAM files in later code? Maybe this is why your were looking into summarizeOverlaps(). I should point out that if you're only interested in the regions overlapping 'allExons' you could call range(allExons) and use that as the 'which' in the param. This would focus the area of interest and speed up the reading. Valerie On 06/30/14 12:22, Fong Chun Chan wrote: > Hi, > > I've been using the GenomicFeatures R package to read in some RNA- seq > paired-end read data. While the readGAlignments() function reads in the bam > file within a minute, I've noticed that the readGAlignmentPairs() function > is extremely slow. > > This is even after restricting the space to just a single chromosome along > with setting the flags isDuplicate = FALSE and isNotPrimaryRead = FALSE > (the latter being suggested in the reference): > > "An easy way to avoid this degradation is to load only primary alignments > by setting the isNotPrimaryRead ???ag to FALSE in ScanBamParam()" > > Based on my understanding, it seems that the ???ndMateAlignment() function > has to be run first. Is there any other way to speed it up? Perhaps I am > doing something wrong here? > > Code and sessionInfo() below. > > Thanks! > > ---- > library("GenomicFeatures") > library('GenomicAlignments') > library('Rsamtools') > si <- seqinfo(BamFile(arguments$bamFile)) > gr <- GRanges(arguments$chr, IRanges(100, > seqlengths(si)[[arguments$chr]]-100)) > allExons <- exons(txdb, columns = c('gene_id', 'exon_id', 'exon_name')) > scf <- scanBamFlag( isDuplicate = FALSE, isNotPrimaryRead = FALSE ) # > remove duplicate reads, remove non-primary reads (i.e. multi-aligned reads) > reads <- readGAlignmentPairs( arguments$bamFile, param = ScanBamParam( > which = gr, flag = scf ) ); # grab reads in specific region > >> sessionInfo() > R version 3.1.0 (2014-04-10) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 > LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C > LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > base > > other attached packages: > [1] argparse_1.0.1 proto_0.3-10 > GenomicAlignments_1.0.1 BSgenome_1.32.0 Rsamtools_1.16.1 > Biostrings_2.32.0 XVector_0.4.0 GenomicFeatures_1.16.2 > AnnotationDbi_1.26.0 Biobase_2.24.0 GenomicRanges_1.16.3 > GenomeInfoDb_1.0.2 IRanges_1.22.9 BiocGenerics_0.10.0 > vimcom.plus_0.9-93 > [16] setwidth_1.0-3 colorout_1.0-2 > > loaded via a namespace (and not attached): > [1] BatchJobs_1.2 BBmisc_1.7 BiocParallel_0.6.1 > biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 checkmate_1.0 > codetools_0.2-8 DBI_0.2-7 digest_0.6.4 fail_1.2 > findpython_1.0.1 foreach_1.4.2 getopt_1.20.0 iterators_1.0.7 > plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1 > [19] rjson_0.2.14 RSQLite_0.11.4 rtracklayer_1.24.2 > sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 tools_3.1.0 > XML_3.98-1.1 zlibbioc_1.10.0 >> > > [[alternative HTML version deleted]] > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, Seattle, WA 98109 Email: vobencha at fhcrc.org Phone: (206) 667-3158
ADD COMMENTlink modified 3.5 years ago by Fong Chun Chan320 • written 3.5 years ago by Valerie Obenchain ♦♦ 6.4k
0
gravatar for Fong Chun Chan
3.5 years ago by
Fong Chun Chan320 wrote:
Hi Valerie, Thanks for the reply. I'll post some runtime as soon as it finishes running the readGAlignments and readGAlignmentPairs. Thanks for the tip on passing range(allExons) into the which param. There is only one issue in that if I wanted to then generate say RPKM values where I need to know the "total number of aligned reads" in a bam file. Because I've read in just a subset of the bam (i.e. the reads that aligned just to exons) I would have technically produced an incorrection calculation of RPKM values (assuming you normalize using the number of aligned reads). On that note, is there a quick and easy way to just count the number of aligned reads or aligned paired reads so I can quickly normalize by this way. This way I can still use the which parameter in readGAlignments and readGAlignmentPairs to speed it up? Also another thing I notice is that when I use the readGAlignmentPairs function, I lose my ability to cancel the command (ctrl-c no longer works). Not sure this is a bug, or if it is something to do with the internal workings of the function. The loss of ctrl-c also occurs when trying to run as a Rscript. I can seemingly ctrl-c anytime before, but it hits this step it just refuses to cancel. Thanks, On Tue, Jul 1, 2014 at 10:06 AM, Valerie Obenchain <vobencha@fhcrc.org> wrote: > Hi, > > You don't need to call findMateAlignment() when using > readGAlignmentPairs(). In a previous iteration, readGAlignmentPairs() > called findMateAlignment() but this is no longer the case. > > readGAlignmentPairs() does have more overhead than readGAlignments() > because of the mate paring. These numbers are for a BAM file with 800484 > records (400054 final pairs) on my not-so-speedy laptop. > > library(microbenchmark) > library(RNAseqData.HNRNPC.bam.chr14) > fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES > > microbenchmark(readGAlignments(fls[1]), times=5) >>> >> Unit: seconds >> expr min lq median uq max neval >> readGAlignments(fls[1]) 1.753653 1.788292 1.85527 2.019071 2.121211 5 >> >>> >>> > ## mate pairing > >> microbenchmark(readGAlignmentPairs(fls[1]), times=5) >>> >> Unit: seconds >> expr min lq median uq max >> neval >> readGAlignmentPairs(fls[1]) 9.283966 9.300328 9.535194 9.843282 11.37441 >> 5 >> > > ## mate pairing > >> microbenchmark(readGAlignmentsList(fls[1]), times=5) >>> >> Unit: seconds >> expr min lq median uq max >> neval >> readGAlignmentsList(fls[1]) 8.409743 8.457232 8.803696 9.692845 9.867628 >> 5 >> > > > You can set a 'yieldSize' when creating the BamFile. This will chunk > through the file 'yieldSize' records at a time and is useful when > processing a complete file and when there are memory limitations. > > bf <- BamFile(myfile, yieldSize = 1000000) > > Herve reported some additional timings for readGAlignmentPairs() on this > post: > > https://stat.ethz.ch/pipermail/bioconductor/2014-May/059695.html > > > Can you provide some numbers for how long your run is taking and the > number of records you're trying to read in? > > I assume 'allExons' is used for counting against the BAM files in later > code? Maybe this is why your were looking into summarizeOverlaps(). I > should point out that if you're only interested in the regions overlapping > 'allExons' you could call range(allExons) and use that as the 'which' in > the param. This would focus the area of interest and speed up the reading. > > Valerie > > > > > On 06/30/14 12:22, Fong Chun Chan wrote: > >> Hi, >> >> I've been using the GenomicFeatures R package to read in some RNA- seq >> paired-end read data. While the readGAlignments() function reads in the >> bam >> file within a minute, I've noticed that the readGAlignmentPairs() function >> is extremely slow. >> >> This is even after restricting the space to just a single chromosome along >> with setting the flags isDuplicate = FALSE and isNotPrimaryRead = FALSE >> (the latter being suggested in the reference): >> >> "An easy way to avoid this degradation is to load only primary alignments >> by setting the isNotPrimaryRead ï¬'ag to FALSE in ScanBamParam()" >> >> Based on my understanding, it seems that the ï¬ ndMateAlignment() function >> >> has to be run first. Is there any other way to speed it up? Perhaps I am >> doing something wrong here? >> >> Code and sessionInfo() below. >> >> Thanks! >> >> ---- >> library("GenomicFeatures") >> library('GenomicAlignments') >> library('Rsamtools') >> si <- seqinfo(BamFile(arguments$bamFile)) >> gr <- GRanges(arguments$chr, IRanges(100, >> seqlengths(si)[[arguments$chr]]-100)) >> allExons <- exons(txdb, columns = c('gene_id', 'exon_id', 'exon_name')) >> scf <- scanBamFlag( isDuplicate = FALSE, isNotPrimaryRead = FALSE ) # >> remove duplicate reads, remove non-primary reads (i.e. multi- aligned >> reads) >> reads <- readGAlignmentPairs( arguments$bamFile, param = ScanBamParam( >> which = gr, flag = scf ) ); # grab reads in specific region >> >> sessionInfo() >>> >> R version 3.1.0 (2014-04-10) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 >> LC_NAME=C LC_ADDRESS=C >> LC_TELEPHONE=C >> LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> base >> >> other attached packages: >> [1] argparse_1.0.1 proto_0.3-10 >> GenomicAlignments_1.0.1 BSgenome_1.32.0 Rsamtools_1.16.1 >> Biostrings_2.32.0 XVector_0.4.0 GenomicFeatures_1.16.2 >> AnnotationDbi_1.26.0 Biobase_2.24.0 GenomicRanges_1.16.3 >> GenomeInfoDb_1.0.2 IRanges_1.22.9 BiocGenerics_0.10.0 >> vimcom.plus_0.9-93 >> [16] setwidth_1.0-3 colorout_1.0-2 >> >> loaded via a namespace (and not attached): >> [1] BatchJobs_1.2 BBmisc_1.7 BiocParallel_0.6.1 >> biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 checkmate_1.0 >> codetools_0.2-8 DBI_0.2-7 digest_0.6.4 fail_1.2 >> findpython_1.0.1 foreach_1.4.2 getopt_1.20.0 >> iterators_1.0.7 >> plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1 >> [19] rjson_0.2.14 RSQLite_0.11.4 rtracklayer_1.24.2 >> sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 tools_3.1.0 >> XML_3.98-1.1 zlibbioc_1.10.0 >> >>> >>> >> [[alternative HTML version deleted]] >> >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane. >> science.biology.informatics.conductor >> >> > > -- > Valerie Obenchain > Program in Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, Seattle, WA 98109 > > Email: vobencha@fhcrc.org > Phone: (206) 667-3158 > [[alternative HTML version deleted]]
ADD COMMENTlink written 3.5 years ago by Fong Chun Chan320
Hi, On 07/02/2014 12:56 PM, Fong Chun Chan wrote: > Hi Valerie, > > Thanks for the reply. I'll post some runtime as soon as it finishes > running the readGAlignments and readGAlignmentPairs. > > Thanks for the tip on passing range(allExons) into the which param. > There is only one issue in that if I wanted to then generate say RPKM > values where I need to know the "total number of aligned reads" in a bam > file. Because I've read in just a subset of the bam (i.e. the reads that > aligned just to exons) I would have technically produced an incorrection > calculation of RPKM values (assuming you normalize using the number of > aligned reads). On that note, is there a quick and easy way to just > count the number of aligned reads or aligned paired reads so I can > quickly normalize by this way. This way I can still use the which > parameter in readGAlignments and readGAlignmentPairs to speed it up? For single-end you can use countBam(): > fl <- system.file("extdata", "ex1.bam", package = "Rsamtools") > param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery = FALSE)) > countBam(fl, param = param) space start end width file records nucleotides 1 NA NA NA NA ex1.bam 3271 115286 It's not so clear for paired-end. The mate-pairing algorithm used in the readGAlignment* functions is described on the man page and uses a combination of BAM fields and flags. You can't really get at this simply by creating a param with a flag combination such as isPaired = TRUE isProperPair = TRUE isUnmappedQuery = FALSE hasUnmappedMate = FALSE For example, if you're interested in this region gr <- GRanges("seq1", IRanges(100, 500)) and set flags, flags <- scanBamFlag(isPaired = TRUE, isProperPair = TRUE, isUnmappedQuery = FALSE, hasUnmappedMate = FALSE) countBam() reports 335 records. These are individual records so the mate pairs (parts) are each counted separately. Only records that fall within the region defined by 'gr' are counted. > param <- ScanBamParam(which = gr, flag = flags) > countBam(fl, param = param) space start end width file records nucleotides 1 seq1 100 500 401 ex1.bam 335 11785 readGAlignmentPairs() reports 223 pairs (446 individual records). All records in 'gr' are mated if possible. In the case where one mate part falls inside 'gr' and one outside, the other mate will be retrieved (i.e., outside the range of 'gr'). This makes the 335 above confusing for counting the number of aligned pairs in the region. The number is a mixture of (1) mate parts both in 'gr' that belong together (2) mate parts with mate outside 'gr' and (3) mates or mate parts that don't meet other mate-pairing criteria. > galp <- readGAlignmentPairs(fl, param=param) > length(galp) [1] 223 Off hand I can't think of a good way to count aligned pairs w/in a specified region. > > Also another thing I notice is that when I use the > readGAlignmentPairs function, I lose my ability to cancel the command > (ctrl-c no longer works). It's implemented in C/C++. You can try ctrl+\ in the session or kill with the top command from another terminal. Maybe others have suggestions they'll add ... Valerie Not sure this is a bug, or if it is something > to do with the internal workings of the function. The loss of ctrl-c > also occurs when trying to run as a Rscript. I can seemingly ctrl-c > anytime before, but it hits this step it just refuses to cancel. > > Thanks, > > > > > > > On Tue, Jul 1, 2014 at 10:06 AM, Valerie Obenchain <vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org="">> wrote: > > Hi, > > You don't need to call findMateAlignment() when using > readGAlignmentPairs(). In a previous iteration, > readGAlignmentPairs() called findMateAlignment() but this is no > longer the case. > > readGAlignmentPairs() does have more overhead than readGAlignments() > because of the mate paring. These numbers are for a BAM file with > 800484 records (400054 final pairs) on my not-so-speedy laptop. > > library(microbenchmark) > library(RNAseqData.HNRNPC.bam.__chr14) > fls <- RNAseqData.HNRNPC.bam.chr14___BAMFILES > > microbenchmark(__readGAlignments(fls[1]), times=5) > > Unit: seconds > expr min lq median uq > max neval > readGAlignments(fls[1]) 1.753653 1.788292 1.85527 2.019071 > 2.121211 5 > > > > ## mate pairing > > microbenchmark(__readGAlignmentPairs(fls[1]), times=5) > > Unit: seconds > expr min lq median > uq max neval > readGAlignmentPairs(fls[1]) 9.283966 9.300328 9.535194 > 9.843282 11.37441 5 > > > ## mate pairing > > microbenchmark(__readGAlignmentsList(fls[1]), times=5) > > Unit: seconds > expr min lq median > uq max neval > readGAlignmentsList(fls[1]) 8.409743 8.457232 8.803696 > 9.692845 9.867628 5 > > > > You can set a 'yieldSize' when creating the BamFile. This will chunk > through the file 'yieldSize' records at a time and is useful when > processing a complete file and when there are memory limitations. > > bf <- BamFile(myfile, yieldSize = 1000000) > > Herve reported some additional timings for readGAlignmentPairs() on > this post: > > https://stat.ethz.ch/__pipermail/bioconductor/2014-__May/059695.html > <https: stat.ethz.ch="" pipermail="" bioconductor="" 2014-may="" 059695.html=""> > > > Can you provide some numbers for how long your run is taking and the > number of records you're trying to read in? > > I assume 'allExons' is used for counting against the BAM files in > later code? Maybe this is why your were looking into > summarizeOverlaps(). I should point out that if you're only > interested in the regions overlapping 'allExons' you could call > range(allExons) and use that as the 'which' in the param. This would > focus the area of interest and speed up the reading. > > Valerie > > > > > On 06/30/14 12:22, Fong Chun Chan wrote: > > Hi, > > I've been using the GenomicFeatures R package to read in some > RNA-seq > paired-end read data. While the readGAlignments() function reads > in the bam > file within a minute, I've noticed that the > readGAlignmentPairs() function > is extremely slow. > > This is even after restricting the space to just a single > chromosome along > with setting the flags isDuplicate = FALSE and isNotPrimaryRead > = FALSE > (the latter being suggested in the reference): > > "An easy way to avoid this degradation is to load only primary > alignments > by setting the isNotPrimaryRead ???ag to FALSE in ScanBamParam()" > > Based on my understanding, it seems that the ?? > ndMateAlignment() function > > has to be run first. Is there any other way to speed it up? > Perhaps I am > doing something wrong here? > > Code and sessionInfo() below. > > Thanks! > > ---- > library("GenomicFeatures") > library('GenomicAlignments') > library('Rsamtools') > si <- seqinfo(BamFile(arguments$__bamFile)) > gr <- GRanges(arguments$chr, IRanges(100, > seqlengths(si)[[arguments$chr]__]-100)) > allExons <- exons(txdb, columns = c('gene_id', 'exon_id', > 'exon_name')) > scf <- scanBamFlag( isDuplicate = FALSE, isNotPrimaryRead = > FALSE ) # > remove duplicate reads, remove non-primary reads (i.e. > multi-aligned reads) > reads <- readGAlignmentPairs( arguments$bamFile, param = > ScanBamParam( > which = gr, flag = scf ) ); # grab reads in specific region > > sessionInfo() > > R version 3.1.0 (2014-04-10) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > LC_PAPER=en_US.UTF-8 > LC_NAME=C LC_ADDRESS=C > LC_TELEPHONE=C > LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets > methods > base > > other attached packages: > [1] argparse_1.0.1 proto_0.3-10 > GenomicAlignments_1.0.1 BSgenome_1.32.0 Rsamtools_1.16.1 > Biostrings_2.32.0 XVector_0.4.0 > GenomicFeatures_1.16.2 > AnnotationDbi_1.26.0 Biobase_2.24.0 > GenomicRanges_1.16.3 > GenomeInfoDb_1.0.2 IRanges_1.22.9 > BiocGenerics_0.10.0 > vimcom.plus_0.9-93 > [16] setwidth_1.0-3 colorout_1.0-2 > > loaded via a namespace (and not attached): > [1] BatchJobs_1.2 BBmisc_1.7 BiocParallel_0.6.1 > biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 > checkmate_1.0 > codetools_0.2-8 DBI_0.2-7 digest_0.6.4 fail_1.2 > findpython_1.0.1 foreach_1.4.2 getopt_1.20.0 > iterators_1.0.7 > plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1 > [19] rjson_0.2.14 RSQLite_0.11.4 rtracklayer_1.24.2 > sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 tools_3.1.0 > XML_3.98-1.1 zlibbioc_1.10.0 > > > > [[alternative HTML version deleted]] > > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > -- > Valerie Obenchain > Program in Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, Seattle, WA 98109 > > Email: vobencha at fhcrc.org <mailto:vobencha at="" fhcrc.org=""> > Phone: (206) 667-3158 <tel:%28206%29%20667-3158> > >
ADD REPLYlink written 3.5 years ago by Valerie Obenchain ♦♦ 6.4k
Correction below. On 07/03/2014 09:28 AM, Valerie Obenchain wrote: > Hi, > > On 07/02/2014 12:56 PM, Fong Chun Chan wrote: >> Hi Valerie, >> >> Thanks for the reply. I'll post some runtime as soon as it finishes >> running the readGAlignments and readGAlignmentPairs. >> >> Thanks for the tip on passing range(allExons) into the which param. >> There is only one issue in that if I wanted to then generate say RPKM >> values where I need to know the "total number of aligned reads" in a bam >> file. Because I've read in just a subset of the bam (i.e. the reads that >> aligned just to exons) I would have technically produced an incorrection >> calculation of RPKM values (assuming you normalize using the number of >> aligned reads). On that note, is there a quick and easy way to just >> count the number of aligned reads or aligned paired reads so I can >> quickly normalize by this way. This way I can still use the which >> parameter in readGAlignments and readGAlignmentPairs to speed it up? > > For single-end you can use countBam(): > > > fl <- system.file("extdata", "ex1.bam", package = "Rsamtools") > > param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery = FALSE)) > > countBam(fl, param = param) > space start end width file records nucleotides > 1 NA NA NA NA ex1.bam 3271 115286 > > It's not so clear for paired-end. The mate-pairing algorithm used in the > readGAlignment* functions is described on the man page and uses a > combination of BAM fields and flags. You can't really get at this simply > by creating a param with a flag combination such as > > isPaired = TRUE > isProperPair = TRUE > isUnmappedQuery = FALSE > hasUnmappedMate = FALSE > > For example, if you're interested in this region > > gr <- GRanges("seq1", IRanges(100, 500)) > > and set flags, > > flags <- scanBamFlag(isPaired = TRUE, > isProperPair = TRUE, > isUnmappedQuery = FALSE, > hasUnmappedMate = FALSE) > > countBam() reports 335 records. These are individual records so the mate > pairs (parts) are each counted separately. Only records that fall within > the region defined by 'gr' are counted. > > > param <- ScanBamParam(which = gr, flag = flags) > > countBam(fl, param = param) > space start end width file records nucleotides > 1 seq1 100 500 401 ex1.bam 335 11785 > > > readGAlignmentPairs() reports 223 pairs (446 individual records). All > records in 'gr' are mated if possible. In the case where one mate part > falls inside 'gr' and one outside, the other mate will be retrieved > (i.e., outside the range of 'gr'). This makes the 335 above confusing > for counting the number of aligned pairs in the region. The number is a > mixture of (1) mate parts both in 'gr' that belong together (2) mate > parts with mate outside 'gr' and (3) mates or mate parts that don't meet > other mate-pairing criteria. > > > galp <- readGAlignmentPairs(fl, param=param) > > length(galp) > [1] 223 > > Off hand I can't think of a good way to count aligned pairs w/in a > specified region. What I should have said here is I can't think of a good way to accurately count aligned pairs (1 count per pair) in a BAM file w/o using the pairing algo. Valerie > >> >> Also another thing I notice is that when I use the >> readGAlignmentPairs function, I lose my ability to cancel the command >> (ctrl-c no longer works). > > It's implemented in C/C++. You can try ctrl+\ in the session or kill > with the top command from another terminal. Maybe others have > suggestions they'll add ... > > Valerie > > Not sure this is a bug, or if it is something >> to do with the internal workings of the function. The loss of ctrl-c >> also occurs when trying to run as a Rscript. I can seemingly ctrl-c >> anytime before, but it hits this step it just refuses to cancel. >> >> Thanks, >> >> >> >> >> >> >> On Tue, Jul 1, 2014 at 10:06 AM, Valerie Obenchain <vobencha at="" fhcrc.org="">> <mailto:vobencha at="" fhcrc.org="">> wrote: >> >> Hi, >> >> You don't need to call findMateAlignment() when using >> readGAlignmentPairs(). In a previous iteration, >> readGAlignmentPairs() called findMateAlignment() but this is no >> longer the case. >> >> readGAlignmentPairs() does have more overhead than readGAlignments() >> because of the mate paring. These numbers are for a BAM file with >> 800484 records (400054 final pairs) on my not-so-speedy laptop. >> >> library(microbenchmark) >> library(RNAseqData.HNRNPC.bam.__chr14) >> fls <- RNAseqData.HNRNPC.bam.chr14___BAMFILES >> >> microbenchmark(__readGAlignments(fls[1]), times=5) >> >> Unit: seconds >> expr min lq median uq >> max neval >> readGAlignments(fls[1]) 1.753653 1.788292 1.85527 2.019071 >> 2.121211 5 >> >> >> >> ## mate pairing >> >> microbenchmark(__readGAlignmentPairs(fls[1]), times=5) >> >> Unit: seconds >> expr min lq median >> uq max neval >> readGAlignmentPairs(fls[1]) 9.283966 9.300328 9.535194 >> 9.843282 11.37441 5 >> >> >> ## mate pairing >> >> microbenchmark(__readGAlignmentsList(fls[1]), times=5) >> >> Unit: seconds >> expr min lq median >> uq max neval >> readGAlignmentsList(fls[1]) 8.409743 8.457232 8.803696 >> 9.692845 9.867628 5 >> >> >> >> You can set a 'yieldSize' when creating the BamFile. This will chunk >> through the file 'yieldSize' records at a time and is useful when >> processing a complete file and when there are memory limitations. >> >> bf <- BamFile(myfile, yieldSize = 1000000) >> >> Herve reported some additional timings for readGAlignmentPairs() on >> this post: >> >> https://stat.ethz.ch/__pipermail/bioconductor/2014-__May/059695.html >> <https: stat.ethz.ch="" pipermail="" bioconductor="" 2014-may="" 059695.html=""> >> >> >> Can you provide some numbers for how long your run is taking and the >> number of records you're trying to read in? >> >> I assume 'allExons' is used for counting against the BAM files in >> later code? Maybe this is why your were looking into >> summarizeOverlaps(). I should point out that if you're only >> interested in the regions overlapping 'allExons' you could call >> range(allExons) and use that as the 'which' in the param. This would >> focus the area of interest and speed up the reading. >> >> Valerie >> >> >> >> >> On 06/30/14 12:22, Fong Chun Chan wrote: >> >> Hi, >> >> I've been using the GenomicFeatures R package to read in some >> RNA-seq >> paired-end read data. While the readGAlignments() function reads >> in the bam >> file within a minute, I've noticed that the >> readGAlignmentPairs() function >> is extremely slow. >> >> This is even after restricting the space to just a single >> chromosome along >> with setting the flags isDuplicate = FALSE and isNotPrimaryRead >> = FALSE >> (the latter being suggested in the reference): >> >> "An easy way to avoid this degradation is to load only primary >> alignments >> by setting the isNotPrimaryRead ???ag to FALSE in ScanBamParam()" >> >> Based on my understanding, it seems that the ?? >> ndMateAlignment() function >> >> has to be run first. Is there any other way to speed it up? >> Perhaps I am >> doing something wrong here? >> >> Code and sessionInfo() below. >> >> Thanks! >> >> ---- >> library("GenomicFeatures") >> library('GenomicAlignments') >> library('Rsamtools') >> si <- seqinfo(BamFile(arguments$__bamFile)) >> gr <- GRanges(arguments$chr, IRanges(100, >> seqlengths(si)[[arguments$chr]__]-100)) >> allExons <- exons(txdb, columns = c('gene_id', 'exon_id', >> 'exon_name')) >> scf <- scanBamFlag( isDuplicate = FALSE, isNotPrimaryRead = >> FALSE ) # >> remove duplicate reads, remove non-primary reads (i.e. >> multi-aligned reads) >> reads <- readGAlignmentPairs( arguments$bamFile, param = >> ScanBamParam( >> which = gr, flag = scf ) ); # grab reads in specific region >> >> sessionInfo() >> >> R version 3.1.0 (2014-04-10) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> LC_PAPER=en_US.UTF-8 >> LC_NAME=C LC_ADDRESS=C >> LC_TELEPHONE=C >> LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets >> methods >> base >> >> other attached packages: >> [1] argparse_1.0.1 proto_0.3-10 >> GenomicAlignments_1.0.1 BSgenome_1.32.0 >> Rsamtools_1.16.1 >> Biostrings_2.32.0 XVector_0.4.0 >> GenomicFeatures_1.16.2 >> AnnotationDbi_1.26.0 Biobase_2.24.0 >> GenomicRanges_1.16.3 >> GenomeInfoDb_1.0.2 IRanges_1.22.9 >> BiocGenerics_0.10.0 >> vimcom.plus_0.9-93 >> [16] setwidth_1.0-3 colorout_1.0-2 >> >> loaded via a namespace (and not attached): >> [1] BatchJobs_1.2 BBmisc_1.7 BiocParallel_0.6.1 >> biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 >> checkmate_1.0 >> codetools_0.2-8 DBI_0.2-7 digest_0.6.4 >> fail_1.2 >> findpython_1.0.1 foreach_1.4.2 getopt_1.20.0 >> iterators_1.0.7 >> plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1 >> [19] rjson_0.2.14 RSQLite_0.11.4 rtracklayer_1.24.2 >> sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 >> tools_3.1.0 >> XML_3.98-1.1 zlibbioc_1.10.0 >> >> >> >> [[alternative HTML version deleted]] >> >> >> >> _________________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >> https://stat.ethz.ch/mailman/__listinfo/bioconductor >> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> >> http://news.gmane.org/gmane.__science.biology.informatics.__conductor >> >> <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >> >> >> >> -- >> Valerie Obenchain >> Program in Computational Biology >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, Seattle, WA 98109 >> >> Email: vobencha at fhcrc.org <mailto:vobencha at="" fhcrc.org=""> >> Phone: (206) 667-3158 <tel:%28206%29%20667-3158> >> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLYlink written 3.5 years ago by Valerie Obenchain ♦♦ 6.4k
Thanks for the reply. I understand what you mean about counting aligned pairs within a region and how that would give incorrect results. However, if I was purely interested in just getting all aligned read- pairs within the whole genomic space (no restriction to any region since this is what I need to normalize by), wouldn't the following code work? library('Rsamtools') library('GenomicAlignments') fl <- system.file("extdata", "ex1.bam", package = "Rsamtools") flags <- scanBamFlag(isPaired = TRUE, isProperPair = TRUE, isUnmappedQuery = FALSE, hasUnmappedMate = FALSE) param <- ScanBamParam(flag = flags) countBam(fl, param = param)[['records']]/2 [1]1572 galp <- readGAlignmentPairs(fl, param=param) length(galp) [1]1572 On Thu, Jul 3, 2014 at 9:40 AM, Valerie Obenchain <vobencha@fhcrc.org> wrote: > Correction below. > > > On 07/03/2014 09:28 AM, Valerie Obenchain wrote: > >> Hi, >> >> On 07/02/2014 12:56 PM, Fong Chun Chan wrote: >> >>> Hi Valerie, >>> >>> Thanks for the reply. I'll post some runtime as soon as it finishes >>> running the readGAlignments and readGAlignmentPairs. >>> >>> Thanks for the tip on passing range(allExons) into the which param. >>> There is only one issue in that if I wanted to then generate say RPKM >>> values where I need to know the "total number of aligned reads" in a bam >>> file. Because I've read in just a subset of the bam (i.e. the reads that >>> aligned just to exons) I would have technically produced an incorrection >>> calculation of RPKM values (assuming you normalize using the number of >>> aligned reads). On that note, is there a quick and easy way to just >>> count the number of aligned reads or aligned paired reads so I can >>> quickly normalize by this way. This way I can still use the which >>> parameter in readGAlignments and readGAlignmentPairs to speed it up? >>> >> >> For single-end you can use countBam(): >> >> > fl <- system.file("extdata", "ex1.bam", package = "Rsamtools") >> > param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery = FALSE)) >> > countBam(fl, param = param) >> space start end width file records nucleotides >> 1 NA NA NA NA ex1.bam 3271 115286 >> >> It's not so clear for paired-end. The mate-pairing algorithm used in the >> readGAlignment* functions is described on the man page and uses a >> combination of BAM fields and flags. You can't really get at this simply >> by creating a param with a flag combination such as >> >> isPaired = TRUE >> isProperPair = TRUE >> isUnmappedQuery = FALSE >> hasUnmappedMate = FALSE >> >> For example, if you're interested in this region >> >> gr <- GRanges("seq1", IRanges(100, 500)) >> >> and set flags, >> >> flags <- scanBamFlag(isPaired = TRUE, >> isProperPair = TRUE, >> isUnmappedQuery = FALSE, >> hasUnmappedMate = FALSE) >> >> countBam() reports 335 records. These are individual records so the mate >> pairs (parts) are each counted separately. Only records that fall within >> the region defined by 'gr' are counted. >> >> > param <- ScanBamParam(which = gr, flag = flags) >> > countBam(fl, param = param) >> space start end width file records nucleotides >> 1 seq1 100 500 401 ex1.bam 335 11785 >> >> >> readGAlignmentPairs() reports 223 pairs (446 individual records). All >> records in 'gr' are mated if possible. In the case where one mate part >> falls inside 'gr' and one outside, the other mate will be retrieved >> (i.e., outside the range of 'gr'). This makes the 335 above confusing >> for counting the number of aligned pairs in the region. The number is a >> mixture of (1) mate parts both in 'gr' that belong together (2) mate >> parts with mate outside 'gr' and (3) mates or mate parts that don't meet >> other mate-pairing criteria. >> >> > galp <- readGAlignmentPairs(fl, param=param) >> > length(galp) >> [1] 223 >> >> Off hand I can't think of a good way to count aligned pairs w/in a >> specified region. >> > > What I should have said here is I can't think of a good way to accurately > count aligned pairs (1 count per pair) in a BAM file w/o using the pairing > algo. > > Valerie > > >> >>> Also another thing I notice is that when I use the >>> readGAlignmentPairs function, I lose my ability to cancel the command >>> (ctrl-c no longer works). >>> >> >> It's implemented in C/C++. You can try ctrl+\ in the session or kill >> with the top command from another terminal. Maybe others have >> suggestions they'll add ... >> >> Valerie >> >> Not sure this is a bug, or if it is something >> >>> to do with the internal workings of the function. The loss of ctrl-c >>> also occurs when trying to run as a Rscript. I can seemingly ctrl-c >>> anytime before, but it hits this step it just refuses to cancel. >>> >>> Thanks, >>> >>> >>> >>> >>> >>> >>> On Tue, Jul 1, 2014 at 10:06 AM, Valerie Obenchain <vobencha@fhcrc.org>>> <mailto:vobencha@fhcrc.org>> wrote: >>> >>> Hi, >>> >>> You don't need to call findMateAlignment() when using >>> readGAlignmentPairs(). In a previous iteration, >>> readGAlignmentPairs() called findMateAlignment() but this is no >>> longer the case. >>> >>> readGAlignmentPairs() does have more overhead than readGAlignments() >>> because of the mate paring. These numbers are for a BAM file with >>> 800484 records (400054 final pairs) on my not-so-speedy laptop. >>> >>> library(microbenchmark) >>> library(RNAseqData.HNRNPC.bam.__chr14) >>> fls <- RNAseqData.HNRNPC.bam.chr14___BAMFILES >>> >>> microbenchmark(__readGAlignments(fls[1]), times=5) >>> >>> Unit: seconds >>> expr min lq median uq >>> max neval >>> readGAlignments(fls[1]) 1.753653 1.788292 1.85527 2.019071 >>> 2.121211 5 >>> >>> >>> >>> ## mate pairing >>> >>> microbenchmark(__readGAlignmentPairs(fls[1]), times=5) >>> >>> Unit: seconds >>> expr min lq median >>> uq max neval >>> readGAlignmentPairs(fls[1]) 9.283966 9.300328 9.535194 >>> 9.843282 11.37441 5 >>> >>> >>> ## mate pairing >>> >>> microbenchmark(__readGAlignmentsList(fls[1]), times=5) >>> >>> Unit: seconds >>> expr min lq median >>> uq max neval >>> readGAlignmentsList(fls[1]) 8.409743 8.457232 8.803696 >>> 9.692845 9.867628 5 >>> >>> >>> >>> You can set a 'yieldSize' when creating the BamFile. This will chunk >>> through the file 'yieldSize' records at a time and is useful when >>> processing a complete file and when there are memory limitations. >>> >>> bf <- BamFile(myfile, yieldSize = 1000000) >>> >>> Herve reported some additional timings for readGAlignmentPairs() on >>> this post: >>> >>> https://stat.ethz.ch/__pipermail/bioconductor/2014-__May/059695.html >>> <https: stat.ethz.ch="" pipermail="" bioconductor="" 2014-may="" 059695.html=""> >>> >>> >>> Can you provide some numbers for how long your run is taking and the >>> number of records you're trying to read in? >>> >>> I assume 'allExons' is used for counting against the BAM files in >>> later code? Maybe this is why your were looking into >>> summarizeOverlaps(). I should point out that if you're only >>> interested in the regions overlapping 'allExons' you could call >>> range(allExons) and use that as the 'which' in the param. This would >>> focus the area of interest and speed up the reading. >>> >>> Valerie >>> >>> >>> >>> >>> On 06/30/14 12:22, Fong Chun Chan wrote: >>> >>> Hi, >>> >>> I've been using the GenomicFeatures R package to read in some >>> RNA-seq >>> paired-end read data. While the readGAlignments() function reads >>> in the bam >>> file within a minute, I've noticed that the >>> readGAlignmentPairs() function >>> is extremely slow. >>> >>> This is even after restricting the space to just a single >>> chromosome along >>> with setting the flags isDuplicate = FALSE and isNotPrimaryRead >>> = FALSE >>> (the latter being suggested in the reference): >>> >>> "An easy way to avoid this degradation is to load only primary >>> alignments >>> by setting the isNotPrimaryRead ï¬'ag to FALSE in ScanBamParam()" >>> >>> Based on my understanding, it seems that the ï¬ >>> ndMateAlignment() function >>> >>> has to be run first. Is there any other way to speed it up? >>> Perhaps I am >>> doing something wrong here? >>> >>> Code and sessionInfo() below. >>> >>> Thanks! >>> >>> ---- >>> library("GenomicFeatures") >>> library('GenomicAlignments') >>> library('Rsamtools') >>> si <- seqinfo(BamFile(arguments$__bamFile)) >>> gr <- GRanges(arguments$chr, IRanges(100, >>> seqlengths(si)[[arguments$chr]__]-100)) >>> allExons <- exons(txdb, columns = c('gene_id', 'exon_id', >>> 'exon_name')) >>> scf <- scanBamFlag( isDuplicate = FALSE, isNotPrimaryRead = >>> FALSE ) # >>> remove duplicate reads, remove non-primary reads (i.e. >>> multi-aligned reads) >>> reads <- readGAlignmentPairs( arguments$bamFile, param = >>> ScanBamParam( >>> which = gr, flag = scf ) ); # grab reads in specific region >>> >>> sessionInfo() >>> >>> R version 3.1.0 (2014-04-10) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> LC_PAPER=en_US.UTF-8 >>> LC_NAME=C LC_ADDRESS=C >>> LC_TELEPHONE=C >>> LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets >>> methods >>> base >>> >>> other attached packages: >>> [1] argparse_1.0.1 proto_0.3-10 >>> GenomicAlignments_1.0.1 BSgenome_1.32.0 >>> Rsamtools_1.16.1 >>> Biostrings_2.32.0 XVector_0.4.0 >>> GenomicFeatures_1.16.2 >>> AnnotationDbi_1.26.0 Biobase_2.24.0 >>> GenomicRanges_1.16.3 >>> GenomeInfoDb_1.0.2 IRanges_1.22.9 >>> BiocGenerics_0.10.0 >>> vimcom.plus_0.9-93 >>> [16] setwidth_1.0-3 colorout_1.0-2 >>> >>> loaded via a namespace (and not attached): >>> [1] BatchJobs_1.2 BBmisc_1.7 BiocParallel_0.6.1 >>> biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 >>> checkmate_1.0 >>> codetools_0.2-8 DBI_0.2-7 digest_0.6.4 >>> fail_1.2 >>> findpython_1.0.1 foreach_1.4.2 getopt_1.20.0 >>> iterators_1.0.7 >>> plyr_1.8.1 Rcpp_0.11.2 RCurl_1.95-4.1 >>> [19] rjson_0.2.14 RSQLite_0.11.4 rtracklayer_1.24.2 >>> sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 >>> tools_3.1.0 >>> XML_3.98-1.1 zlibbioc_1.10.0 >>> >>> >>> >>> [[alternative HTML version deleted]] >>> >>> >>> >>> _________________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org <mailto:bioconductor@r-project.org> >>> https://stat.ethz.ch/mailman/__listinfo/bioconductor >>> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: >>> >>> http://news.gmane.org/gmane.__science.biology.informatics.__conductor >>> >>> <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>> >>> >>> >>> -- >>> Valerie Obenchain >>> Program in Computational Biology >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, Seattle, WA 98109 >>> >>> Email: vobencha@fhcrc.org <mailto:vobencha@fhcrc.org> >>> Phone: (206) 667-3158 <tel:%28206%29%20667-3158> >>> >>> >>> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]]
ADD REPLYlink written 3.5 years ago by Fong Chun Chan320
No, the division by 2 isn't always safe. 'ex1.bam' is a well-behaved file. Files with more diversity may have more than 2 segments per template. counts <- lapply(fls, function(fl) countBam(fl, param=param)$records) pairs <- lapply(fls, function(fl) length(readGAlignmentPairs(fl, param=param))) >> unlist(counts)/2 > ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304 ERR127305 > 363326 392420 397158 349970 355376 381493 389176 355834 >> unlist(pairs) > ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304 ERR127305 > 363147 392252 396933 349822 355215 381274 388991 355682 To see what's going on use readGAlignementsList() on the first file. This function adds some flexibility in that it reads in fragments, pairs w/ no mates, etc. galist <- readGAlignmentsList(fl, param=param) The length matches that of countBam() because it reads all records. >> length(galist) > [1] 363226 'mate_status' categorizes records by 'mated', 'ambiguous' and 'unmated'. Here we see the 363147 match to output from readGAlignmentsPairs() and the rest are ambiguous. >> table(mcols(galist)$mate_status) > > mated ambiguous unmated > 363147 79 0 elementLengths() shows the number of segments per template. Again we see the 363147 match and it shows exactly 2 segments for those. >> table(elementLengths(galist)) > > 2 4 6 8 > 363147 65 7 7 Valerie On 07/03/2014 10:08 AM, Fong Chun Chan wrote: > Thanks for the reply. I understand what you mean about counting aligned > pairs within a region and how that would give incorrect results. > > However, if I was purely interested in just getting all aligned > read-pairs within the whole genomic space (no restriction to any region > since this is what I need to normalize by), wouldn't the following code > work? > > library('Rsamtools') > library('GenomicAlignments') > > fl <- system.file("extdata", "ex1.bam", package = "Rsamtools") > > flags <- scanBamFlag(isPaired = TRUE, > isProperPair = TRUE, > isUnmappedQuery = FALSE, > hasUnmappedMate = FALSE) > > param <- ScanBamParam(flag = flags) > > countBam(fl, param = param)[['records']]/2 > [1]1572 > > galp <- readGAlignmentPairs(fl, param=param) > length(galp) > [1]1572 > > > On Thu, Jul 3, 2014 at 9:40 AM, Valerie Obenchain <vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org="">> wrote: > > Correction below. > > > On 07/03/2014 09:28 AM, Valerie Obenchain wrote: > > Hi, > > On 07/02/2014 12:56 PM, Fong Chun Chan wrote: > > Hi Valerie, > > Thanks for the reply. I'll post some runtime as soon as it > finishes > running the readGAlignments and readGAlignmentPairs. > > Thanks for the tip on passing range(allExons) into the which > param. > There is only one issue in that if I wanted to then generate > say RPKM > values where I need to know the "total number of aligned > reads" in a bam > file. Because I've read in just a subset of the bam (i.e. > the reads that > aligned just to exons) I would have technically produced an > incorrection > calculation of RPKM values (assuming you normalize using the > number of > aligned reads). On that note, is there a quick and easy way > to just > count the number of aligned reads or aligned paired reads so > I can > quickly normalize by this way. This way I can still use the > which > parameter in readGAlignments and readGAlignmentPairs to > speed it up? > > > For single-end you can use countBam(): > > > fl <- system.file("extdata", "ex1.bam", package = "Rsamtools") > > param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery = > FALSE)) > > countBam(fl, param = param) > space start end width file records nucleotides > 1 NA NA NA NA ex1.bam 3271 115286 > > It's not so clear for paired-end. The mate-pairing algorithm > used in the > readGAlignment* functions is described on the man page and uses a > combination of BAM fields and flags. You can't really get at > this simply > by creating a param with a flag combination such as > > isPaired = TRUE > isProperPair = TRUE > isUnmappedQuery = FALSE > hasUnmappedMate = FALSE > > For example, if you're interested in this region > > gr <- GRanges("seq1", IRanges(100, 500)) > > and set flags, > > flags <- scanBamFlag(isPaired = TRUE, > isProperPair = TRUE, > isUnmappedQuery = FALSE, > hasUnmappedMate = FALSE) > > countBam() reports 335 records. These are individual records so > the mate > pairs (parts) are each counted separately. Only records that > fall within > the region defined by 'gr' are counted. > > > param <- ScanBamParam(which = gr, flag = flags) > > countBam(fl, param = param) > space start end width file records nucleotides > 1 seq1 100 500 401 ex1.bam 335 11785 > > > readGAlignmentPairs() reports 223 pairs (446 individual > records). All > records in 'gr' are mated if possible. In the case where one > mate part > falls inside 'gr' and one outside, the other mate will be retrieved > (i.e., outside the range of 'gr'). This makes the 335 above > confusing > for counting the number of aligned pairs in the region. The > number is a > mixture of (1) mate parts both in 'gr' that belong together (2) mate > parts with mate outside 'gr' and (3) mates or mate parts that > don't meet > other mate-pairing criteria. > > > galp <- readGAlignmentPairs(fl, param=param) > > length(galp) > [1] 223 > > Off hand I can't think of a good way to count aligned pairs w/in a > specified region. > > > What I should have said here is I can't think of a good way to > accurately count aligned pairs (1 count per pair) in a BAM file w/o > using the pairing algo. > > Valerie > > > > Also another thing I notice is that when I use the > readGAlignmentPairs function, I lose my ability to cancel > the command > (ctrl-c no longer works). > > > It's implemented in C/C++. You can try ctrl+\ in the session or kill > with the top command from another terminal. Maybe others have > suggestions they'll add ... > > Valerie > > Not sure this is a bug, or if it is something > > to do with the internal workings of the function. The loss > of ctrl-c > also occurs when trying to run as a Rscript. I can seemingly > ctrl-c > anytime before, but it hits this step it just refuses to cancel. > > Thanks, > > > > > > > On Tue, Jul 1, 2014 at 10:06 AM, Valerie Obenchain > <vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org=""> > <mailto:vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">>> wrote: > > Hi, > > You don't need to call findMateAlignment() when using > readGAlignmentPairs(). In a previous iteration, > readGAlignmentPairs() called findMateAlignment() but > this is no > longer the case. > > readGAlignmentPairs() does have more overhead than > readGAlignments() > because of the mate paring. These numbers are for a BAM > file with > 800484 records (400054 final pairs) on my not-so- speedy > laptop. > > library(microbenchmark) > library(RNAseqData.HNRNPC.bam.____chr14) > fls <- RNAseqData.HNRNPC.bam.chr14_____BAMFILES > > microbenchmark(____readGAlignments(fls[1]), > times=5) > > Unit: seconds > expr min lq median > uq > max neval > readGAlignments(fls[1]) 1.753653 1.788292 1.85527 > 2.019071 > 2.121211 5 > > > > ## mate pairing > > > microbenchmark(____readGAlignmentPairs(fls[1]), times=5) > > Unit: seconds > expr min lq > median > uq max neval > readGAlignmentPairs(fls[1]) 9.283966 9.300328 > 9.535194 > 9.843282 11.37441 5 > > > ## mate pairing > > > microbenchmark(____readGAlignmentsList(fls[1]), times=5) > > Unit: seconds > expr min lq > median > uq max neval > readGAlignmentsList(fls[1]) 8.409743 8.457232 > 8.803696 > 9.692845 9.867628 5 > > > > You can set a 'yieldSize' when creating the BamFile. > This will chunk > through the file 'yieldSize' records at a time and is > useful when > processing a complete file and when there are memory > limitations. > > bf <- BamFile(myfile, yieldSize = 1000000) > > Herve reported some additional timings for > readGAlignmentPairs() on > this post: > > https://stat.ethz.ch/____pipermail/bioconductor/2014-___ _May/059695.html > <https: stat.ethz.ch="" __pipermail="" bioconductor="" 2014-__may="" 059695.html=""> > > <https: stat.ethz.ch="" __pipermail="" bioconductor="" 2014-__may="" 059695.html=""> <https: stat.ethz.ch="" pipermail="" bioconductor="" 2014-may="" 059695.html="">> > > > Can you provide some numbers for how long your run is > taking and the > number of records you're trying to read in? > > I assume 'allExons' is used for counting against the > BAM files in > later code? Maybe this is why your were looking into > summarizeOverlaps(). I should point out that if you're only > interested in the regions overlapping 'allExons' you > could call > range(allExons) and use that as the 'which' in the > param. This would > focus the area of interest and speed up the reading. > > Valerie > > > > > On 06/30/14 12:22, Fong Chun Chan wrote: > > Hi, > > I've been using the GenomicFeatures R package to > read in some > RNA-seq > paired-end read data. While the readGAlignments() > function reads > in the bam > file within a minute, I've noticed that the > readGAlignmentPairs() function > is extremely slow. > > This is even after restricting the space to just a > single > chromosome along > with setting the flags isDuplicate = FALSE and > isNotPrimaryRead > = FALSE > (the latter being suggested in the reference): > > "An easy way to avoid this degradation is to load > only primary > alignments > by setting the isNotPrimaryRead ???ag to FALSE in > ScanBamParam()" > > Based on my understanding, it seems that the ?? > ndMateAlignment() function > > has to be run first. Is there any other way to > speed it up? > Perhaps I am > doing something wrong here? > > Code and sessionInfo() below. > > Thanks! > > ---- > library("GenomicFeatures") > library('GenomicAlignments') > library('Rsamtools') > si <- seqinfo(BamFile(arguments$____bamFile)) > gr <- GRanges(arguments$chr, IRanges(100, > seqlengths(si)[[arguments$chr]____]-100)) > allExons <- exons(txdb, columns = c('gene_id', > 'exon_id', > 'exon_name')) > scf <- scanBamFlag( isDuplicate = FALSE, > isNotPrimaryRead = > FALSE ) # > remove duplicate reads, remove non-primary reads (i.e. > multi-aligned reads) > reads <- readGAlignmentPairs( arguments$bamFile, > param = > ScanBamParam( > which = gr, flag = scf ) ); # grab reads in > specific region > > sessionInfo() > > R version 3.1.0 (2014-04-10) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > LC_PAPER=en_US.UTF-8 > LC_NAME=C LC_ADDRESS=C > LC_TELEPHONE=C > LC_MEASUREMENT=en_US.UTF-8 > LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils > datasets > methods > base > > other attached packages: > [1] argparse_1.0.1 proto_0.3-10 > GenomicAlignments_1.0.1 BSgenome_1.32.0 > Rsamtools_1.16.1 > Biostrings_2.32.0 XVector_0.4.0 > GenomicFeatures_1.16.2 > AnnotationDbi_1.26.0 Biobase_2.24.0 > GenomicRanges_1.16.3 > GenomeInfoDb_1.0.2 IRanges_1.22.9 > BiocGenerics_0.10.0 > vimcom.plus_0.9-93 > [16] setwidth_1.0-3 colorout_1.0-2 > > loaded via a namespace (and not attached): > [1] BatchJobs_1.2 BBmisc_1.7 > BiocParallel_0.6.1 > biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 > checkmate_1.0 > codetools_0.2-8 DBI_0.2-7 digest_0.6.4 > fail_1.2 > findpython_1.0.1 foreach_1.4.2 getopt_1.20.0 > iterators_1.0.7 > plyr_1.8.1 Rcpp_0.11.2 > RCurl_1.95-4.1 > [19] rjson_0.2.14 RSQLite_0.11.4 > rtracklayer_1.24.2 > sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 > tools_3.1.0 > XML_3.98-1.1 zlibbioc_1.10.0 > > > > [[alternative HTML version deleted]] > > > > ___________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > https://stat.ethz.ch/mailman/____listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> > > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> > Search the archives: > > http://news.gmane.org/gmane.____science.biology.informat ics.____conductor > <http: news.gmane.org="" gmane.__science.biology.informati="" cs.__conductor=""> > > <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> > > > > -- > Valerie Obenchain > Program in Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, Seattle, WA 98109 > > Email: vobencha at fhcrc.org <mailto:vobencha at="" fhcrc.org=""> > <mailto:vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">> > Phone: (206) 667-3158 <tel:%28206%29%20667-3158> > <tel:%28206%29%20667-3158> > > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > >
ADD REPLYlink written 3.5 years ago by Valerie Obenchain ♦♦ 6.4k
fyi the test files ('fls') came from, library(RNAseqData.HNRNPC.bam.chr14) fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES Valerie On 07/03/2014 10:42 AM, Valerie Obenchain wrote: > No, the division by 2 isn't always safe. 'ex1.bam' is a well-behaved > file. Files with more diversity may have more than 2 segments per template. > > counts <- lapply(fls, function(fl) > countBam(fl, param=param)$records) > > pairs <- lapply(fls, function(fl) > length(readGAlignmentPairs(fl, param=param))) > >>> unlist(counts)/2 >> ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304 >> ERR127305 >> 363326 392420 397158 349970 355376 381493 >> 389176 355834 >>> unlist(pairs) >> ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304 >> ERR127305 >> 363147 392252 396933 349822 355215 381274 >> 388991 355682 > > To see what's going on use readGAlignementsList() on the first file. > This function adds some flexibility in that it reads in fragments, pairs > w/ no mates, etc. > > galist <- readGAlignmentsList(fl, param=param) > > The length matches that of countBam() because it reads all records. > >>> length(galist) >> [1] 363226 > > 'mate_status' categorizes records by 'mated', 'ambiguous' and 'unmated'. > Here we see the 363147 match to output from readGAlignmentsPairs() and > the rest are ambiguous. > >>> table(mcols(galist)$mate_status) >> >> mated ambiguous unmated >> 363147 79 0 > > elementLengths() shows the number of segments per template. Again we see > the 363147 match and it shows exactly 2 segments for those. > >>> table(elementLengths(galist)) >> >> 2 4 6 8 >> 363147 65 7 7 > > > Valerie > > > On 07/03/2014 10:08 AM, Fong Chun Chan wrote: >> Thanks for the reply. I understand what you mean about counting aligned >> pairs within a region and how that would give incorrect results. >> >> However, if I was purely interested in just getting all aligned >> read-pairs within the whole genomic space (no restriction to any region >> since this is what I need to normalize by), wouldn't the following code >> work? >> >> library('Rsamtools') >> library('GenomicAlignments') >> >> fl <- system.file("extdata", "ex1.bam", package = "Rsamtools") >> >> flags <- scanBamFlag(isPaired = TRUE, >> isProperPair = TRUE, >> isUnmappedQuery = FALSE, >> hasUnmappedMate = FALSE) >> >> param <- ScanBamParam(flag = flags) >> >> countBam(fl, param = param)[['records']]/2 >> [1]1572 >> >> galp <- readGAlignmentPairs(fl, param=param) >> length(galp) >> [1]1572 >> >> >> On Thu, Jul 3, 2014 at 9:40 AM, Valerie Obenchain <vobencha at="" fhcrc.org="">> <mailto:vobencha at="" fhcrc.org="">> wrote: >> >> Correction below. >> >> >> On 07/03/2014 09:28 AM, Valerie Obenchain wrote: >> >> Hi, >> >> On 07/02/2014 12:56 PM, Fong Chun Chan wrote: >> >> Hi Valerie, >> >> Thanks for the reply. I'll post some runtime as soon as it >> finishes >> running the readGAlignments and readGAlignmentPairs. >> >> Thanks for the tip on passing range(allExons) into the which >> param. >> There is only one issue in that if I wanted to then generate >> say RPKM >> values where I need to know the "total number of aligned >> reads" in a bam >> file. Because I've read in just a subset of the bam (i.e. >> the reads that >> aligned just to exons) I would have technically produced an >> incorrection >> calculation of RPKM values (assuming you normalize using the >> number of >> aligned reads). On that note, is there a quick and easy way >> to just >> count the number of aligned reads or aligned paired reads so >> I can >> quickly normalize by this way. This way I can still use the >> which >> parameter in readGAlignments and readGAlignmentPairs to >> speed it up? >> >> >> For single-end you can use countBam(): >> >> > fl <- system.file("extdata", "ex1.bam", package = >> "Rsamtools") >> > param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery = >> FALSE)) >> > countBam(fl, param = param) >> space start end width file records nucleotides >> 1 NA NA NA NA ex1.bam 3271 115286 >> >> It's not so clear for paired-end. The mate-pairing algorithm >> used in the >> readGAlignment* functions is described on the man page and uses a >> combination of BAM fields and flags. You can't really get at >> this simply >> by creating a param with a flag combination such as >> >> isPaired = TRUE >> isProperPair = TRUE >> isUnmappedQuery = FALSE >> hasUnmappedMate = FALSE >> >> For example, if you're interested in this region >> >> gr <- GRanges("seq1", IRanges(100, 500)) >> >> and set flags, >> >> flags <- scanBamFlag(isPaired = TRUE, >> isProperPair = TRUE, >> isUnmappedQuery = FALSE, >> hasUnmappedMate = FALSE) >> >> countBam() reports 335 records. These are individual records so >> the mate >> pairs (parts) are each counted separately. Only records that >> fall within >> the region defined by 'gr' are counted. >> >> > param <- ScanBamParam(which = gr, flag = flags) >> > countBam(fl, param = param) >> space start end width file records nucleotides >> 1 seq1 100 500 401 ex1.bam 335 11785 >> >> >> readGAlignmentPairs() reports 223 pairs (446 individual >> records). All >> records in 'gr' are mated if possible. In the case where one >> mate part >> falls inside 'gr' and one outside, the other mate will be >> retrieved >> (i.e., outside the range of 'gr'). This makes the 335 above >> confusing >> for counting the number of aligned pairs in the region. The >> number is a >> mixture of (1) mate parts both in 'gr' that belong together >> (2) mate >> parts with mate outside 'gr' and (3) mates or mate parts that >> don't meet >> other mate-pairing criteria. >> >> > galp <- readGAlignmentPairs(fl, param=param) >> > length(galp) >> [1] 223 >> >> Off hand I can't think of a good way to count aligned pairs >> w/in a >> specified region. >> >> >> What I should have said here is I can't think of a good way to >> accurately count aligned pairs (1 count per pair) in a BAM file w/o >> using the pairing algo. >> >> Valerie >> >> >> >> Also another thing I notice is that when I use the >> readGAlignmentPairs function, I lose my ability to cancel >> the command >> (ctrl-c no longer works). >> >> >> It's implemented in C/C++. You can try ctrl+\ in the session >> or kill >> with the top command from another terminal. Maybe others have >> suggestions they'll add ... >> >> Valerie >> >> Not sure this is a bug, or if it is something >> >> to do with the internal workings of the function. The loss >> of ctrl-c >> also occurs when trying to run as a Rscript. I can seemingly >> ctrl-c >> anytime before, but it hits this step it just refuses to >> cancel. >> >> Thanks, >> >> >> >> >> >> >> On Tue, Jul 1, 2014 at 10:06 AM, Valerie Obenchain >> <vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org=""> >> <mailto:vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">>> >> wrote: >> >> Hi, >> >> You don't need to call findMateAlignment() when using >> readGAlignmentPairs(). In a previous iteration, >> readGAlignmentPairs() called findMateAlignment() but >> this is no >> longer the case. >> >> readGAlignmentPairs() does have more overhead than >> readGAlignments() >> because of the mate paring. These numbers are for a BAM >> file with >> 800484 records (400054 final pairs) on my not-so- speedy >> laptop. >> >> library(microbenchmark) >> library(RNAseqData.HNRNPC.bam.____chr14) >> fls <- RNAseqData.HNRNPC.bam.chr14_____BAMFILES >> >> microbenchmark(____readGAlignments(fls[1]), >> times=5) >> >> Unit: seconds >> expr min lq median >> uq >> max neval >> readGAlignments(fls[1]) 1.753653 1.788292 1.85527 >> 2.019071 >> 2.121211 5 >> >> >> >> ## mate pairing >> >> >> microbenchmark(____readGAlignmentPairs(fls[1]), times=5) >> >> Unit: seconds >> expr min lq >> median >> uq max neval >> readGAlignmentPairs(fls[1]) 9.283966 9.300328 >> 9.535194 >> 9.843282 11.37441 5 >> >> >> ## mate pairing >> >> >> microbenchmark(____readGAlignmentsList(fls[1]), times=5) >> >> Unit: seconds >> expr min lq >> median >> uq max neval >> readGAlignmentsList(fls[1]) 8.409743 8.457232 >> 8.803696 >> 9.692845 9.867628 5 >> >> >> >> You can set a 'yieldSize' when creating the BamFile. >> This will chunk >> through the file 'yieldSize' records at a time and is >> useful when >> processing a complete file and when there are memory >> limitations. >> >> bf <- BamFile(myfile, yieldSize = 1000000) >> >> Herve reported some additional timings for >> readGAlignmentPairs() on >> this post: >> >> >> https://stat.ethz.ch/____pipermail/bioconductor/2014-____May/059695 .html >> >> <https: stat.ethz.ch="" __pipermail="" bioconductor="" 2014-__may="" 059695.html=""> >> >> >> <https: stat.ethz.ch="" __pipermail="" bioconductor="" 2014-__may="" 059695.html="">> >> <https: stat.ethz.ch="" pipermail="" bioconductor="" 2014-may="" 059695.html="">> >> >> >> Can you provide some numbers for how long your run is >> taking and the >> number of records you're trying to read in? >> >> I assume 'allExons' is used for counting against the >> BAM files in >> later code? Maybe this is why your were looking into >> summarizeOverlaps(). I should point out that if >> you're only >> interested in the regions overlapping 'allExons' you >> could call >> range(allExons) and use that as the 'which' in the >> param. This would >> focus the area of interest and speed up the reading. >> >> Valerie >> >> >> >> >> On 06/30/14 12:22, Fong Chun Chan wrote: >> >> Hi, >> >> I've been using the GenomicFeatures R package to >> read in some >> RNA-seq >> paired-end read data. While the readGAlignments() >> function reads >> in the bam >> file within a minute, I've noticed that the >> readGAlignmentPairs() function >> is extremely slow. >> >> This is even after restricting the space to just a >> single >> chromosome along >> with setting the flags isDuplicate = FALSE and >> isNotPrimaryRead >> = FALSE >> (the latter being suggested in the reference): >> >> "An easy way to avoid this degradation is to load >> only primary >> alignments >> by setting the isNotPrimaryRead ???ag to FALSE in >> ScanBamParam()" >> >> Based on my understanding, it seems that the ?? >> ndMateAlignment() function >> >> has to be run first. Is there any other way to >> speed it up? >> Perhaps I am >> doing something wrong here? >> >> Code and sessionInfo() below. >> >> Thanks! >> >> ---- >> library("GenomicFeatures") >> library('GenomicAlignments') >> library('Rsamtools') >> si <- seqinfo(BamFile(arguments$____bamFile)) >> gr <- GRanges(arguments$chr, IRanges(100, >> seqlengths(si)[[arguments$chr]____]-100)) >> allExons <- exons(txdb, columns = c('gene_id', >> 'exon_id', >> 'exon_name')) >> scf <- scanBamFlag( isDuplicate = FALSE, >> isNotPrimaryRead = >> FALSE ) # >> remove duplicate reads, remove non-primary reads >> (i.e. >> multi-aligned reads) >> reads <- readGAlignmentPairs( arguments$bamFile, >> param = >> ScanBamParam( >> which = gr, flag = scf ) ); # grab reads in >> specific region >> >> sessionInfo() >> >> R version 3.1.0 (2014-04-10) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> LC_PAPER=en_US.UTF-8 >> LC_NAME=C LC_ADDRESS=C >> LC_TELEPHONE=C >> LC_MEASUREMENT=en_US.UTF-8 >> LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils >> datasets >> methods >> base >> >> other attached packages: >> [1] argparse_1.0.1 proto_0.3-10 >> GenomicAlignments_1.0.1 BSgenome_1.32.0 >> Rsamtools_1.16.1 >> Biostrings_2.32.0 XVector_0.4.0 >> GenomicFeatures_1.16.2 >> AnnotationDbi_1.26.0 Biobase_2.24.0 >> GenomicRanges_1.16.3 >> GenomeInfoDb_1.0.2 IRanges_1.22.9 >> BiocGenerics_0.10.0 >> vimcom.plus_0.9-93 >> [16] setwidth_1.0-3 colorout_1.0-2 >> >> loaded via a namespace (and not attached): >> [1] BatchJobs_1.2 BBmisc_1.7 >> BiocParallel_0.6.1 >> biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 >> checkmate_1.0 >> codetools_0.2-8 DBI_0.2-7 >> digest_0.6.4 >> fail_1.2 >> findpython_1.0.1 foreach_1.4.2 >> getopt_1.20.0 >> iterators_1.0.7 >> plyr_1.8.1 Rcpp_0.11.2 >> RCurl_1.95-4.1 >> [19] rjson_0.2.14 RSQLite_0.11.4 >> rtracklayer_1.24.2 >> sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 >> tools_3.1.0 >> XML_3.98-1.1 zlibbioc_1.10.0 >> >> >> >> [[alternative HTML version deleted]] >> >> >> >> ___________________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> <mailto:bioconductor at="" r-project.org=""> >> <mailto:bioconductor at="" r-__project.org="">> <mailto:bioconductor at="" r-project.org="">> >> https://stat.ethz.ch/mailman/____listinfo/bioconductor >> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> >> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> >> Search the archives: >> >> >> http://news.gmane.org/gmane.____science.biology.informatics.____con ductor >> >> <http: news.gmane.org="" gmane.__science.biology.informatics.__conduc="" tor=""> >> >> >> <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor="">> >> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> >> >> >> >> -- >> Valerie Obenchain >> Program in Computational Biology >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, Seattle, WA 98109 >> >> Email: vobencha at fhcrc.org <mailto:vobencha at="" fhcrc.org=""> >> <mailto:vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">> >> Phone: (206) 667-3158 <tel:%28206%29%20667-3158> >> <tel:%28206%29%20667-3158> >> >> >> >> _________________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >> https://stat.ethz.ch/mailman/__listinfo/bioconductor >> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> >> http://news.gmane.org/gmane.__science.biology.informatics.__conductor >> >> <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >> >> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLYlink written 3.5 years ago by Valerie Obenchain ♦♦ 6.4k
As promised, I put some of the run-time details for a particular bam file comparing readGAlignments vs. readGAlignmentsPairs. readGAlignmentPairs has been running for like over a week I believe and doesn't look like it will actually stop running...so I don't have any run-time of this function unfortunately. ------ group | nb of | nb of | mean / max of | records | unique | records per records | in group | QNAMEs | unique QNAME All records........................ A |110625579 | 55302565 | 2 / 4 o template has single segment.... S | 0 | 0 | NA / NA o template has multiple segments. M |110625579 | 55302565 | 2 / 4 - first segment.............. F | 55312870 | 55302565 | 1 / 2 - last segment............... L | 55312709 | 55302565 | 1 / 2 - other segment.............. O | 0 | 0 | NA / NA Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M. Indentation reflects this. [1] "Read using readGAlignments..." Unit: seconds expr reads <- readGAlignments(bamFile, param = ScanBamParam(which = gr, flag = scf)) min lq median uq max neval 209.1448 216.8919 221.3251 225.7923 243.3946 100 [1] "Read using readGAlignmentsPairs..." ------ On Thu, Jul 3, 2014 at 12:11 PM, Valerie Obenchain <vobencha@fhcrc.org> wrote: > fyi the test files ('fls') came from, > > library(RNAseqData.HNRNPC.bam.chr14) > fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES > > Valerie > > > > On 07/03/2014 10:42 AM, Valerie Obenchain wrote: > >> No, the division by 2 isn't always safe. 'ex1.bam' is a well- behaved >> file. Files with more diversity may have more than 2 segments per >> template. >> >> counts <- lapply(fls, function(fl) >> countBam(fl, param=param)$records) >> >> pairs <- lapply(fls, function(fl) >> length(readGAlignmentPairs(fl, param=param))) >> >> unlist(counts)/2 >>>> >>> ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304 >>> ERR127305 >>> 363326 392420 397158 349970 355376 381493 >>> 389176 355834 >>> >>>> unlist(pairs) >>>> >>> ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304 >>> ERR127305 >>> 363147 392252 396933 349822 355215 381274 >>> 388991 355682 >>> >> >> To see what's going on use readGAlignementsList() on the first file. >> This function adds some flexibility in that it reads in fragments, pairs >> w/ no mates, etc. >> >> galist <- readGAlignmentsList(fl, param=param) >> >> The length matches that of countBam() because it reads all records. >> >> length(galist) >>>> >>> [1] 363226 >>> >> >> 'mate_status' categorizes records by 'mated', 'ambiguous' and 'unmated'. >> Here we see the 363147 match to output from readGAlignmentsPairs() and >> the rest are ambiguous. >> >> table(mcols(galist)$mate_status) >>>> >>> >>> mated ambiguous unmated >>> 363147 79 0 >>> >> >> elementLengths() shows the number of segments per template. Again we see >> the 363147 match and it shows exactly 2 segments for those. >> >> table(elementLengths(galist)) >>>> >>> >>> 2 4 6 8 >>> 363147 65 7 7 >>> >> >> >> Valerie >> >> >> On 07/03/2014 10:08 AM, Fong Chun Chan wrote: >> >>> Thanks for the reply. I understand what you mean about counting aligned >>> pairs within a region and how that would give incorrect results. >>> >>> However, if I was purely interested in just getting all aligned >>> read-pairs within the whole genomic space (no restriction to any region >>> since this is what I need to normalize by), wouldn't the following code >>> work? >>> >>> library('Rsamtools') >>> library('GenomicAlignments') >>> >>> fl <- system.file("extdata", "ex1.bam", package = "Rsamtools") >>> >>> flags <- scanBamFlag(isPaired = TRUE, >>> isProperPair = TRUE, >>> isUnmappedQuery = FALSE, >>> hasUnmappedMate = FALSE) >>> >>> param <- ScanBamParam(flag = flags) >>> >>> countBam(fl, param = param)[['records']]/2 >>> [1]1572 >>> >>> galp <- readGAlignmentPairs(fl, param=param) >>> length(galp) >>> [1]1572 >>> >>> >>> On Thu, Jul 3, 2014 at 9:40 AM, Valerie Obenchain <vobencha@fhcrc.org>>> <mailto:vobencha@fhcrc.org>> wrote: >>> >>> Correction below. >>> >>> >>> On 07/03/2014 09:28 AM, Valerie Obenchain wrote: >>> >>> Hi, >>> >>> On 07/02/2014 12:56 PM, Fong Chun Chan wrote: >>> >>> Hi Valerie, >>> >>> Thanks for the reply. I'll post some runtime as soon as it >>> finishes >>> running the readGAlignments and readGAlignmentPairs. >>> >>> Thanks for the tip on passing range(allExons) into the which >>> param. >>> There is only one issue in that if I wanted to then generate >>> say RPKM >>> values where I need to know the "total number of aligned >>> reads" in a bam >>> file. Because I've read in just a subset of the bam (i.e. >>> the reads that >>> aligned just to exons) I would have technically produced an >>> incorrection >>> calculation of RPKM values (assuming you normalize using the >>> number of >>> aligned reads). On that note, is there a quick and easy way >>> to just >>> count the number of aligned reads or aligned paired reads so >>> I can >>> quickly normalize by this way. This way I can still use the >>> which >>> parameter in readGAlignments and readGAlignmentPairs to >>> speed it up? >>> >>> >>> For single-end you can use countBam(): >>> >>> > fl <- system.file("extdata", "ex1.bam", package = >>> "Rsamtools") >>> > param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery = >>> FALSE)) >>> > countBam(fl, param = param) >>> space start end width file records nucleotides >>> 1 NA NA NA NA ex1.bam 3271 115286 >>> >>> It's not so clear for paired-end. The mate-pairing algorithm >>> used in the >>> readGAlignment* functions is described on the man page and uses a >>> combination of BAM fields and flags. You can't really get at >>> this simply >>> by creating a param with a flag combination such as >>> >>> isPaired = TRUE >>> isProperPair = TRUE >>> isUnmappedQuery = FALSE >>> hasUnmappedMate = FALSE >>> >>> For example, if you're interested in this region >>> >>> gr <- GRanges("seq1", IRanges(100, 500)) >>> >>> and set flags, >>> >>> flags <- scanBamFlag(isPaired = TRUE, >>> isProperPair = TRUE, >>> isUnmappedQuery = FALSE, >>> hasUnmappedMate = FALSE) >>> >>> countBam() reports 335 records. These are individual records so >>> the mate >>> pairs (parts) are each counted separately. Only records that >>> fall within >>> the region defined by 'gr' are counted. >>> >>> > param <- ScanBamParam(which = gr, flag = flags) >>> > countBam(fl, param = param) >>> space start end width file records nucleotides >>> 1 seq1 100 500 401 ex1.bam 335 11785 >>> >>> >>> readGAlignmentPairs() reports 223 pairs (446 individual >>> records). All >>> records in 'gr' are mated if possible. In the case where one >>> mate part >>> falls inside 'gr' and one outside, the other mate will be >>> retrieved >>> (i.e., outside the range of 'gr'). This makes the 335 above >>> confusing >>> for counting the number of aligned pairs in the region. The >>> number is a >>> mixture of (1) mate parts both in 'gr' that belong together >>> (2) mate >>> parts with mate outside 'gr' and (3) mates or mate parts that >>> don't meet >>> other mate-pairing criteria. >>> >>> > galp <- readGAlignmentPairs(fl, param=param) >>> > length(galp) >>> [1] 223 >>> >>> Off hand I can't think of a good way to count aligned pairs >>> w/in a >>> specified region. >>> >>> >>> What I should have said here is I can't think of a good way to >>> accurately count aligned pairs (1 count per pair) in a BAM file w/o >>> using the pairing algo. >>> >>> Valerie >>> >>> >>> >>> Also another thing I notice is that when I use the >>> readGAlignmentPairs function, I lose my ability to cancel >>> the command >>> (ctrl-c no longer works). >>> >>> >>> It's implemented in C/C++. You can try ctrl+\ in the session >>> or kill >>> with the top command from another terminal. Maybe others have >>> suggestions they'll add ... >>> >>> Valerie >>> >>> Not sure this is a bug, or if it is something >>> >>> to do with the internal workings of the function. The loss >>> of ctrl-c >>> also occurs when trying to run as a Rscript. I can seemingly >>> ctrl-c >>> anytime before, but it hits this step it just refuses to >>> cancel. >>> >>> Thanks, >>> >>> >>> >>> >>> >>> >>> On Tue, Jul 1, 2014 at 10:06 AM, Valerie Obenchain >>> <vobencha@fhcrc.org <mailto:vobencha@fhcrc.org=""> >>> <mailto:vobencha@fhcrc.org <mailto:vobencha@fhcrc.org="">>> >>> wrote: >>> >>> Hi, >>> >>> You don't need to call findMateAlignment() when using >>> readGAlignmentPairs(). In a previous iteration, >>> readGAlignmentPairs() called findMateAlignment() but >>> this is no >>> longer the case. >>> >>> readGAlignmentPairs() does have more overhead than >>> readGAlignments() >>> because of the mate paring. These numbers are for a BAM >>> file with >>> 800484 records (400054 final pairs) on my not-so- speedy >>> laptop. >>> >>> library(microbenchmark) >>> library(RNAseqData.HNRNPC.bam.____chr14) >>> fls <- RNAseqData.HNRNPC.bam.chr14_____BAMFILES >>> >>> microbenchmark(____readGAlignments(fls[1]), >>> times=5) >>> >>> Unit: seconds >>> expr min lq median >>> uq >>> max neval >>> readGAlignments(fls[1]) 1.753653 1.788292 1.85527 >>> 2.019071 >>> 2.121211 5 >>> >>> >>> >>> ## mate pairing >>> >>> >>> microbenchmark(____readGAlignmentPairs(fls[1]), times=5) >>> >>> Unit: seconds >>> expr min lq >>> median >>> uq max neval >>> readGAlignmentPairs(fls[1]) 9.283966 9.300328 >>> 9.535194 >>> 9.843282 11.37441 5 >>> >>> >>> ## mate pairing >>> >>> >>> microbenchmark(____readGAlignmentsList(fls[1]), times=5) >>> >>> Unit: seconds >>> expr min lq >>> median >>> uq max neval >>> readGAlignmentsList(fls[1]) 8.409743 8.457232 >>> 8.803696 >>> 9.692845 9.867628 5 >>> >>> >>> >>> You can set a 'yieldSize' when creating the BamFile. >>> This will chunk >>> through the file 'yieldSize' records at a time and is >>> useful when >>> processing a complete file and when there are memory >>> limitations. >>> >>> bf <- BamFile(myfile, yieldSize = 1000000) >>> >>> Herve reported some additional timings for >>> readGAlignmentPairs() on >>> this post: >>> >>> >>> https://stat.ethz.ch/____pipermail/bioconductor/2014-____May/05969 5.html >>> >>> <https: stat.ethz.ch="" __pipermail="" bioconductor="" 2014-__may="" 059695.html=""> >>> >>> >>> <https: stat.ethz.ch="" __pipermail="" bioconductor="" 2014-__may="" 059695.html="">>> >>> <https: stat.ethz.ch="" pipermail="" bioconductor="" 2014-may="" 059695.html="">> >>> >>> >>> Can you provide some numbers for how long your run is >>> taking and the >>> number of records you're trying to read in? >>> >>> I assume 'allExons' is used for counting against the >>> BAM files in >>> later code? Maybe this is why your were looking into >>> summarizeOverlaps(). I should point out that if >>> you're only >>> interested in the regions overlapping 'allExons' you >>> could call >>> range(allExons) and use that as the 'which' in the >>> param. This would >>> focus the area of interest and speed up the reading. >>> >>> Valerie >>> >>> >>> >>> >>> On 06/30/14 12:22, Fong Chun Chan wrote: >>> >>> Hi, >>> >>> I've been using the GenomicFeatures R package to >>> read in some >>> RNA-seq >>> paired-end read data. While the readGAlignments() >>> function reads >>> in the bam >>> file within a minute, I've noticed that the >>> readGAlignmentPairs() function >>> is extremely slow. >>> >>> This is even after restricting the space to just a >>> single >>> chromosome along >>> with setting the flags isDuplicate = FALSE and >>> isNotPrimaryRead >>> = FALSE >>> (the latter being suggested in the reference): >>> >>> "An easy way to avoid this degradation is to load >>> only primary >>> alignments >>> by setting the isNotPrimaryRead ï¬'ag to FALSE in >>> ScanBamParam()" >>> >>> Based on my understanding, it seems that the ï¬ >>> ndMateAlignment() function >>> >>> has to be run first. Is there any other way to >>> speed it up? >>> Perhaps I am >>> doing something wrong here? >>> >>> Code and sessionInfo() below. >>> >>> Thanks! >>> >>> ---- >>> library("GenomicFeatures") >>> library('GenomicAlignments') >>> library('Rsamtools') >>> si <- seqinfo(BamFile(arguments$____bamFile)) >>> gr <- GRanges(arguments$chr, IRanges(100, >>> seqlengths(si)[[arguments$chr]____]-100)) >>> allExons <- exons(txdb, columns = c('gene_id', >>> 'exon_id', >>> 'exon_name')) >>> scf <- scanBamFlag( isDuplicate = FALSE, >>> isNotPrimaryRead = >>> FALSE ) # >>> remove duplicate reads, remove non-primary reads >>> (i.e. >>> multi-aligned reads) >>> reads <- readGAlignmentPairs( arguments$bamFile, >>> param = >>> ScanBamParam( >>> which = gr, flag = scf ) ); # grab reads in >>> specific region >>> >>> sessionInfo() >>> >>> R version 3.1.0 (2014-04-10) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> LC_PAPER=en_US.UTF-8 >>> LC_NAME=C LC_ADDRESS=C >>> LC_TELEPHONE=C >>> LC_MEASUREMENT=en_US.UTF-8 >>> LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils >>> datasets >>> methods >>> base >>> >>> other attached packages: >>> [1] argparse_1.0.1 proto_0.3-10 >>> GenomicAlignments_1.0.1 BSgenome_1.32.0 >>> Rsamtools_1.16.1 >>> Biostrings_2.32.0 XVector_0.4.0 >>> GenomicFeatures_1.16.2 >>> AnnotationDbi_1.26.0 Biobase_2.24.0 >>> GenomicRanges_1.16.3 >>> GenomeInfoDb_1.0.2 IRanges_1.22.9 >>> BiocGenerics_0.10.0 >>> vimcom.plus_0.9-93 >>> [16] setwidth_1.0-3 colorout_1.0-2 >>> >>> loaded via a namespace (and not attached): >>> [1] BatchJobs_1.2 BBmisc_1.7 >>> BiocParallel_0.6.1 >>> biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 >>> checkmate_1.0 >>> codetools_0.2-8 DBI_0.2-7 >>> digest_0.6.4 >>> fail_1.2 >>> findpython_1.0.1 foreach_1.4.2 >>> getopt_1.20.0 >>> iterators_1.0.7 >>> plyr_1.8.1 Rcpp_0.11.2 >>> RCurl_1.95-4.1 >>> [19] rjson_0.2.14 RSQLite_0.11.4 >>> rtracklayer_1.24.2 >>> sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 >>> tools_3.1.0 >>> XML_3.98-1.1 zlibbioc_1.10.0 >>> >>> >>> >>> [[alternative HTML version deleted]] >>> >>> >>> >>> ___________________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> <mailto:bioconductor@r-project.org> >>> <mailto:bioconductor@r-__project.org>>> <mailto:bioconductor@r-project.org>> >>> https://stat.ethz.ch/mailman/____listinfo/bioconductor >>> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >>> >>> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">>> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> >>> Search the archives: >>> >>> >>> http://news.gmane.org/gmane.____science.biology.informatics. >>> ____conductor >>> >>> <http: news.gmane.org="" gmane.__science.biology.informatics.__condu="" ctor=""> >>> >>> >>> <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor="">>> >>> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> >>> >>> >>> >>> -- >>> Valerie Obenchain >>> Program in Computational Biology >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, Seattle, WA 98109 >>> >>> Email: vobencha@fhcrc.org <mailto:vobencha@fhcrc.org> >>> <mailto:vobencha@fhcrc.org <mailto:vobencha@fhcrc.org="">> >>> Phone: (206) 667-3158 <tel:%28206%29%20667-3158> >>> <tel:%28206%29%20667-3158> >>> >>> >>> >>> _________________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org <mailto:bioconductor@r-project.org> >>> https://stat.ethz.ch/mailman/__listinfo/bioconductor >>> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: >>> >>> http://news.gmane.org/gmane.__science.biology.informatics.__conductor >>> >>> <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>> >>> >>> >>> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]]
ADD REPLYlink written 3.5 years ago by Fong Chun Chan320
Hi Fong, On 07/14/2014 01:02 PM, Fong Chun Chan wrote: > As promised, I put some of the run-time details for a particular bam file > comparing readGAlignments vs. readGAlignmentsPairs. > > readGAlignmentPairs has been running for like over a week I believe and > doesn't look like it will actually stop running...so I don't have any > run-time of this function unfortunately. Could it be that the R process is eating up all your memory so now it's running in "swapping mode"? This tends to slow things down *a lot* (typically by a factor of 100 or 1000 or more). You can check that by running the Unix commend 'top' in a separate terminal. How much RAM do you have? You probably need at least 16Gb of RAM (rough estimate) to read in a BAM file with 55 million pairs with readGAlignmentPairs(). And possibly more than that if you're loading the QNAME field (with 'use.names=TRUE') and/or other BAM fields or tags. Cheers, H. > > ------ > group | nb of | nb of | mean / max > of | records | unique | records per > records | in group | QNAMEs | unique QNAME > All records........................ A |110625579 | 55302565 | 2 / 4 > o template has single segment.... S | 0 | 0 | NA / NA > o template has multiple segments. M |110625579 | 55302565 | 2 / 4 > - first segment.............. F | 55312870 | 55302565 | 1 / 2 > - last segment............... L | 55312709 | 55302565 | 1 / 2 > - other segment.............. O | 0 | 0 | NA / NA > > Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of > M. > Indentation reflects this. > [1] "Read using readGAlignments..." > Unit: seconds > > expr > reads <- readGAlignments(bamFile, param = ScanBamParam(which = gr, > flag = scf)) > min lq median uq max neval > 209.1448 216.8919 221.3251 225.7923 243.3946 100 > > [1] "Read using readGAlignmentsPairs..." > ------ > > > On Thu, Jul 3, 2014 at 12:11 PM, Valerie Obenchain <vobencha at="" fhcrc.org=""> > wrote: > >> fyi the test files ('fls') came from, >> >> library(RNAseqData.HNRNPC.bam.chr14) >> fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES >> >> Valerie >> >> >> >> On 07/03/2014 10:42 AM, Valerie Obenchain wrote: >> >>> No, the division by 2 isn't always safe. 'ex1.bam' is a well- behaved >>> file. Files with more diversity may have more than 2 segments per >>> template. >>> >>> counts <- lapply(fls, function(fl) >>> countBam(fl, param=param)$records) >>> >>> pairs <- lapply(fls, function(fl) >>> length(readGAlignmentPairs(fl, param=param))) >>> >>> unlist(counts)/2 >>>>> >>>> ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304 >>>> ERR127305 >>>> 363326 392420 397158 349970 355376 381493 >>>> 389176 355834 >>>> >>>>> unlist(pairs) >>>>> >>>> ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304 >>>> ERR127305 >>>> 363147 392252 396933 349822 355215 381274 >>>> 388991 355682 >>>> >>> >>> To see what's going on use readGAlignementsList() on the first file. >>> This function adds some flexibility in that it reads in fragments, pairs >>> w/ no mates, etc. >>> >>> galist <- readGAlignmentsList(fl, param=param) >>> >>> The length matches that of countBam() because it reads all records. >>> >>> length(galist) >>>>> >>>> [1] 363226 >>>> >>> >>> 'mate_status' categorizes records by 'mated', 'ambiguous' and 'unmated'. >>> Here we see the 363147 match to output from readGAlignmentsPairs() and >>> the rest are ambiguous. >>> >>> table(mcols(galist)$mate_status) >>>>> >>>> >>>> mated ambiguous unmated >>>> 363147 79 0 >>>> >>> >>> elementLengths() shows the number of segments per template. Again we see >>> the 363147 match and it shows exactly 2 segments for those. >>> >>> table(elementLengths(galist)) >>>>> >>>> >>>> 2 4 6 8 >>>> 363147 65 7 7 >>>> >>> >>> >>> Valerie >>> >>> >>> On 07/03/2014 10:08 AM, Fong Chun Chan wrote: >>> >>>> Thanks for the reply. I understand what you mean about counting aligned >>>> pairs within a region and how that would give incorrect results. >>>> >>>> However, if I was purely interested in just getting all aligned >>>> read-pairs within the whole genomic space (no restriction to any region >>>> since this is what I need to normalize by), wouldn't the following code >>>> work? >>>> >>>> library('Rsamtools') >>>> library('GenomicAlignments') >>>> >>>> fl <- system.file("extdata", "ex1.bam", package = "Rsamtools") >>>> >>>> flags <- scanBamFlag(isPaired = TRUE, >>>> isProperPair = TRUE, >>>> isUnmappedQuery = FALSE, >>>> hasUnmappedMate = FALSE) >>>> >>>> param <- ScanBamParam(flag = flags) >>>> >>>> countBam(fl, param = param)[['records']]/2 >>>> [1]1572 >>>> >>>> galp <- readGAlignmentPairs(fl, param=param) >>>> length(galp) >>>> [1]1572 >>>> >>>> >>>> On Thu, Jul 3, 2014 at 9:40 AM, Valerie Obenchain <vobencha at="" fhcrc.org="">>>> <mailto:vobencha at="" fhcrc.org="">> wrote: >>>> >>>> Correction below. >>>> >>>> >>>> On 07/03/2014 09:28 AM, Valerie Obenchain wrote: >>>> >>>> Hi, >>>> >>>> On 07/02/2014 12:56 PM, Fong Chun Chan wrote: >>>> >>>> Hi Valerie, >>>> >>>> Thanks for the reply. I'll post some runtime as soon as it >>>> finishes >>>> running the readGAlignments and readGAlignmentPairs. >>>> >>>> Thanks for the tip on passing range(allExons) into the which >>>> param. >>>> There is only one issue in that if I wanted to then generate >>>> say RPKM >>>> values where I need to know the "total number of aligned >>>> reads" in a bam >>>> file. Because I've read in just a subset of the bam (i.e. >>>> the reads that >>>> aligned just to exons) I would have technically produced an >>>> incorrection >>>> calculation of RPKM values (assuming you normalize using the >>>> number of >>>> aligned reads). On that note, is there a quick and easy way >>>> to just >>>> count the number of aligned reads or aligned paired reads so >>>> I can >>>> quickly normalize by this way. This way I can still use the >>>> which >>>> parameter in readGAlignments and readGAlignmentPairs to >>>> speed it up? >>>> >>>> >>>> For single-end you can use countBam(): >>>> >>>> > fl <- system.file("extdata", "ex1.bam", package = >>>> "Rsamtools") >>>> > param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery = >>>> FALSE)) >>>> > countBam(fl, param = param) >>>> space start end width file records nucleotides >>>> 1 NA NA NA NA ex1.bam 3271 115286 >>>> >>>> It's not so clear for paired-end. The mate-pairing algorithm >>>> used in the >>>> readGAlignment* functions is described on the man page and uses a >>>> combination of BAM fields and flags. You can't really get at >>>> this simply >>>> by creating a param with a flag combination such as >>>> >>>> isPaired = TRUE >>>> isProperPair = TRUE >>>> isUnmappedQuery = FALSE >>>> hasUnmappedMate = FALSE >>>> >>>> For example, if you're interested in this region >>>> >>>> gr <- GRanges("seq1", IRanges(100, 500)) >>>> >>>> and set flags, >>>> >>>> flags <- scanBamFlag(isPaired = TRUE, >>>> isProperPair = TRUE, >>>> isUnmappedQuery = FALSE, >>>> hasUnmappedMate = FALSE) >>>> >>>> countBam() reports 335 records. These are individual records so >>>> the mate >>>> pairs (parts) are each counted separately. Only records that >>>> fall within >>>> the region defined by 'gr' are counted. >>>> >>>> > param <- ScanBamParam(which = gr, flag = flags) >>>> > countBam(fl, param = param) >>>> space start end width file records nucleotides >>>> 1 seq1 100 500 401 ex1.bam 335 11785 >>>> >>>> >>>> readGAlignmentPairs() reports 223 pairs (446 individual >>>> records). All >>>> records in 'gr' are mated if possible. In the case where one >>>> mate part >>>> falls inside 'gr' and one outside, the other mate will be >>>> retrieved >>>> (i.e., outside the range of 'gr'). This makes the 335 above >>>> confusing >>>> for counting the number of aligned pairs in the region. The >>>> number is a >>>> mixture of (1) mate parts both in 'gr' that belong together >>>> (2) mate >>>> parts with mate outside 'gr' and (3) mates or mate parts that >>>> don't meet >>>> other mate-pairing criteria. >>>> >>>> > galp <- readGAlignmentPairs(fl, param=param) >>>> > length(galp) >>>> [1] 223 >>>> >>>> Off hand I can't think of a good way to count aligned pairs >>>> w/in a >>>> specified region. >>>> >>>> >>>> What I should have said here is I can't think of a good way to >>>> accurately count aligned pairs (1 count per pair) in a BAM file w/o >>>> using the pairing algo. >>>> >>>> Valerie >>>> >>>> >>>> >>>> Also another thing I notice is that when I use the >>>> readGAlignmentPairs function, I lose my ability to cancel >>>> the command >>>> (ctrl-c no longer works). >>>> >>>> >>>> It's implemented in C/C++. You can try ctrl+\ in the session >>>> or kill >>>> with the top command from another terminal. Maybe others have >>>> suggestions they'll add ... >>>> >>>> Valerie >>>> >>>> Not sure this is a bug, or if it is something >>>> >>>> to do with the internal workings of the function. The loss >>>> of ctrl-c >>>> also occurs when trying to run as a Rscript. I can seemingly >>>> ctrl-c >>>> anytime before, but it hits this step it just refuses to >>>> cancel. >>>> >>>> Thanks, >>>> >>>> >>>> >>>> >>>> >>>> >>>> On Tue, Jul 1, 2014 at 10:06 AM, Valerie Obenchain >>>> <vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org=""> >>>> <mailto:vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">>> >>>> wrote: >>>> >>>> Hi, >>>> >>>> You don't need to call findMateAlignment() when using >>>> readGAlignmentPairs(). In a previous iteration, >>>> readGAlignmentPairs() called findMateAlignment() but >>>> this is no >>>> longer the case. >>>> >>>> readGAlignmentPairs() does have more overhead than >>>> readGAlignments() >>>> because of the mate paring. These numbers are for a BAM >>>> file with >>>> 800484 records (400054 final pairs) on my not- so-speedy >>>> laptop. >>>> >>>> library(microbenchmark) >>>> library(RNAseqData.HNRNPC.bam.____chr14) >>>> fls <- RNAseqData.HNRNPC.bam.chr14_____BAMFILES >>>> >>>> microbenchmark(____readGAlignments(fls[1]), >>>> times=5) >>>> >>>> Unit: seconds >>>> expr min lq median >>>> uq >>>> max neval >>>> readGAlignments(fls[1]) 1.753653 1.788292 1.85527 >>>> 2.019071 >>>> 2.121211 5 >>>> >>>> >>>> >>>> ## mate pairing >>>> >>>> >>>> microbenchmark(____readGAlignmentPairs(fls[1]), times=5) >>>> >>>> Unit: seconds >>>> expr min lq >>>> median >>>> uq max neval >>>> readGAlignmentPairs(fls[1]) 9.283966 9.300328 >>>> 9.535194 >>>> 9.843282 11.37441 5 >>>> >>>> >>>> ## mate pairing >>>> >>>> >>>> microbenchmark(____readGAlignmentsList(fls[1]), times=5) >>>> >>>> Unit: seconds >>>> expr min lq >>>> median >>>> uq max neval >>>> readGAlignmentsList(fls[1]) 8.409743 8.457232 >>>> 8.803696 >>>> 9.692845 9.867628 5 >>>> >>>> >>>> >>>> You can set a 'yieldSize' when creating the BamFile. >>>> This will chunk >>>> through the file 'yieldSize' records at a time and is >>>> useful when >>>> processing a complete file and when there are memory >>>> limitations. >>>> >>>> bf <- BamFile(myfile, yieldSize = 1000000) >>>> >>>> Herve reported some additional timings for >>>> readGAlignmentPairs() on >>>> this post: >>>> >>>> >>>> https://stat.ethz.ch/____pipermail/bioconductor/2014-____May/0596 95.html >>>> >>>> <https: stat.ethz.ch="" __pipermail="" bioconductor="" 2014-__may="" 059695.html=""> >>>> >>>> >>>> <https: stat.ethz.ch="" __pipermail="" bioconductor="" 2014-__may="" 059695.html="">>>> >>>> <https: stat.ethz.ch="" pipermail="" bioconductor="" 2014-may="" 059695.html="">> >>>> >>>> >>>> Can you provide some numbers for how long your run is >>>> taking and the >>>> number of records you're trying to read in? >>>> >>>> I assume 'allExons' is used for counting against the >>>> BAM files in >>>> later code? Maybe this is why your were looking into >>>> summarizeOverlaps(). I should point out that if >>>> you're only >>>> interested in the regions overlapping 'allExons' you >>>> could call >>>> range(allExons) and use that as the 'which' in the >>>> param. This would >>>> focus the area of interest and speed up the reading. >>>> >>>> Valerie >>>> >>>> >>>> >>>> >>>> On 06/30/14 12:22, Fong Chun Chan wrote: >>>> >>>> Hi, >>>> >>>> I've been using the GenomicFeatures R package to >>>> read in some >>>> RNA-seq >>>> paired-end read data. While the readGAlignments() >>>> function reads >>>> in the bam >>>> file within a minute, I've noticed that the >>>> readGAlignmentPairs() function >>>> is extremely slow. >>>> >>>> This is even after restricting the space to just a >>>> single >>>> chromosome along >>>> with setting the flags isDuplicate = FALSE and >>>> isNotPrimaryRead >>>> = FALSE >>>> (the latter being suggested in the reference): >>>> >>>> "An easy way to avoid this degradation is to load >>>> only primary >>>> alignments >>>> by setting the isNotPrimaryRead ??'ag to FALSE in >>>> ScanBamParam()" >>>> >>>> Based on my understanding, it seems that the ?? >>>> ndMateAlignment() function >>>> >>>> has to be run first. Is there any other way to >>>> speed it up? >>>> Perhaps I am >>>> doing something wrong here? >>>> >>>> Code and sessionInfo() below. >>>> >>>> Thanks! >>>> >>>> ---- >>>> library("GenomicFeatures") >>>> library('GenomicAlignments') >>>> library('Rsamtools') >>>> si <- seqinfo(BamFile(arguments$____bamFile)) >>>> gr <- GRanges(arguments$chr, IRanges(100, >>>> seqlengths(si)[[arguments$chr]____]-100)) >>>> allExons <- exons(txdb, columns = c('gene_id', >>>> 'exon_id', >>>> 'exon_name')) >>>> scf <- scanBamFlag( isDuplicate = FALSE, >>>> isNotPrimaryRead = >>>> FALSE ) # >>>> remove duplicate reads, remove non-primary reads >>>> (i.e. >>>> multi-aligned reads) >>>> reads <- readGAlignmentPairs( arguments$bamFile, >>>> param = >>>> ScanBamParam( >>>> which = gr, flag = scf ) ); # grab reads in >>>> specific region >>>> >>>> sessionInfo() >>>> >>>> R version 3.1.0 (2014-04-10) >>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>> LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>> LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>> LC_PAPER=en_US.UTF-8 >>>> LC_NAME=C LC_ADDRESS=C >>>> LC_TELEPHONE=C >>>> LC_MEASUREMENT=en_US.UTF-8 >>>> LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] parallel stats graphics grDevices utils >>>> datasets >>>> methods >>>> base >>>> >>>> other attached packages: >>>> [1] argparse_1.0.1 proto_0.3-10 >>>> GenomicAlignments_1.0.1 BSgenome_1.32.0 >>>> Rsamtools_1.16.1 >>>> Biostrings_2.32.0 XVector_0.4.0 >>>> GenomicFeatures_1.16.2 >>>> AnnotationDbi_1.26.0 Biobase_2.24.0 >>>> GenomicRanges_1.16.3 >>>> GenomeInfoDb_1.0.2 IRanges_1.22.9 >>>> BiocGenerics_0.10.0 >>>> vimcom.plus_0.9-93 >>>> [16] setwidth_1.0-3 colorout_1.0-2 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] BatchJobs_1.2 BBmisc_1.7 >>>> BiocParallel_0.6.1 >>>> biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 >>>> checkmate_1.0 >>>> codetools_0.2-8 DBI_0.2-7 >>>> digest_0.6.4 >>>> fail_1.2 >>>> findpython_1.0.1 foreach_1.4.2 >>>> getopt_1.20.0 >>>> iterators_1.0.7 >>>> plyr_1.8.1 Rcpp_0.11.2 >>>> RCurl_1.95-4.1 >>>> [19] rjson_0.2.14 RSQLite_0.11.4 >>>> rtracklayer_1.24.2 >>>> sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 >>>> tools_3.1.0 >>>> XML_3.98-1.1 zlibbioc_1.10.0 >>>> >>>> >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> >>>> >>>> ___________________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> <mailto:bioconductor at="" r-project.org=""> >>>> <mailto:bioconductor at="" r-__project.org="">>>> <mailto:bioconductor at="" r-project.org="">> >>>> https://stat.ethz.ch/mailman/____listinfo/bioconductor >>>> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >>>> >>>> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">>>> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> >>>> Search the archives: >>>> >>>> >>>> http://news.gmane.org/gmane.____science.biology.informatics. >>>> ____conductor >>>> >>>> <http: news.gmane.org="" gmane.__science.biology.informatics.__cond="" uctor=""> >>>> >>>> >>>> <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor="">>>> >>>> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> >>>> >>>> >>>> >>>> -- >>>> Valerie Obenchain >>>> Program in Computational Biology >>>> Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Ave. N, Seattle, WA 98109 >>>> >>>> Email: vobencha at fhcrc.org <mailto:vobencha at="" fhcrc.org=""> >>>> <mailto:vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">> >>>> Phone: (206) 667-3158 <tel:%28206%29%20667-3158> >>>> <tel:%28206%29%20667-3158> >>>> >>>> >>>> >>>> _________________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >>>> https://stat.ethz.ch/mailman/__listinfo/bioconductor >>>> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>> Search the archives: >>>> >>>> http://news.gmane.org/gmane.__science.biology.informatics.__conductor >>>> >>>> <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>>> >>>> >>>> >>>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > > [[alternative HTML version deleted]] > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLYlink written 3.5 years ago by Hervé Pagès ♦♦ 13k
Hi Herve, The R job is using 24g right now. I am operating on a machine with 128G so I don't think it is swapping as there isn't much running on it either. I am not using the use.names parameter. Fong On Mon, Jul 14, 2014 at 3:06 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Fong, > > > On 07/14/2014 01:02 PM, Fong Chun Chan wrote: > >> As promised, I put some of the run-time details for a particular bam file >> comparing readGAlignments vs. readGAlignmentsPairs. >> >> readGAlignmentPairs has been running for like over a week I believe and >> doesn't look like it will actually stop running...so I don't have any >> run-time of this function unfortunately. >> > > Could it be that the R process is eating up all your memory so now > it's running in "swapping mode"? This tends to slow things down *a lot* > (typically by a factor of 100 or 1000 or more). You can check that > by running the Unix commend 'top' in a separate terminal. > > How much RAM do you have? You probably need at least 16Gb of RAM > (rough estimate) to read in a BAM file with 55 million pairs with > readGAlignmentPairs(). And possibly more than that if you're > loading the QNAME field (with 'use.names=TRUE') and/or other BAM > fields or tags. > > Cheers, > H. > > >> ------ >> group | nb of | nb of | mean / max >> of | records | unique | records per >> records | in group | QNAMEs | unique >> QNAME >> All records........................ A |110625579 | 55302565 | 2 / 4 >> o template has single segment.... S | 0 | 0 | NA / NA >> o template has multiple segments. M |110625579 | 55302565 | 2 / 4 >> - first segment.............. F | 55312870 | 55302565 | 1 / 2 >> - last segment............... L | 55312709 | 55302565 | 1 / 2 >> - other segment.............. O | 0 | 0 | NA / NA >> >> Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning >> of >> M. >> Indentation reflects this. >> [1] "Read using readGAlignments..." >> Unit: seconds >> >> expr >> reads <- readGAlignments(bamFile, param = ScanBamParam(which = gr, >> flag = scf)) >> min lq median uq max neval >> 209.1448 216.8919 221.3251 225.7923 243.3946 100 >> >> [1] "Read using readGAlignmentsPairs..." >> ------ >> >> >> On Thu, Jul 3, 2014 at 12:11 PM, Valerie Obenchain <vobencha@fhcrc.org> >> wrote: >> >> fyi the test files ('fls') came from, >>> >>> library(RNAseqData.HNRNPC.bam.chr14) >>> fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES >>> >>> Valerie >>> >>> >>> >>> On 07/03/2014 10:42 AM, Valerie Obenchain wrote: >>> >>> No, the division by 2 isn't always safe. 'ex1.bam' is a well- behaved >>>> file. Files with more diversity may have more than 2 segments per >>>> template. >>>> >>>> counts <- lapply(fls, function(fl) >>>> countBam(fl, param=param)$records) >>>> >>>> pairs <- lapply(fls, function(fl) >>>> length(readGAlignmentPairs(fl, param=param))) >>>> >>>> unlist(counts)/2 >>>> >>>>> >>>>>> ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304 >>>>> ERR127305 >>>>> 363326 392420 397158 349970 355376 381493 >>>>> 389176 355834 >>>>> >>>>> unlist(pairs) >>>>>> >>>>>> ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304 >>>>> ERR127305 >>>>> 363147 392252 396933 349822 355215 381274 >>>>> 388991 355682 >>>>> >>>>> >>>> To see what's going on use readGAlignementsList() on the first file. >>>> This function adds some flexibility in that it reads in fragments, pairs >>>> w/ no mates, etc. >>>> >>>> galist <- readGAlignmentsList(fl, param=param) >>>> >>>> The length matches that of countBam() because it reads all records. >>>> >>>> length(galist) >>>> >>>>> >>>>>> [1] 363226 >>>>> >>>>> >>>> 'mate_status' categorizes records by 'mated', 'ambiguous' and 'unmated'. >>>> Here we see the 363147 match to output from readGAlignmentsPairs() and >>>> the rest are ambiguous. >>>> >>>> table(mcols(galist)$mate_status) >>>> >>>>> >>>>>> >>>>> mated ambiguous unmated >>>>> 363147 79 0 >>>>> >>>>> >>>> elementLengths() shows the number of segments per template. Again we see >>>> the 363147 match and it shows exactly 2 segments for those. >>>> >>>> table(elementLengths(galist)) >>>> >>>>> >>>>>> >>>>> 2 4 6 8 >>>>> 363147 65 7 7 >>>>> >>>>> >>>> >>>> Valerie >>>> >>>> >>>> On 07/03/2014 10:08 AM, Fong Chun Chan wrote: >>>> >>>> Thanks for the reply. I understand what you mean about counting aligned >>>>> pairs within a region and how that would give incorrect results. >>>>> >>>>> However, if I was purely interested in just getting all aligned >>>>> read-pairs within the whole genomic space (no restriction to any region >>>>> since this is what I need to normalize by), wouldn't the following code >>>>> work? >>>>> >>>>> library('Rsamtools') >>>>> library('GenomicAlignments') >>>>> >>>>> fl <- system.file("extdata", "ex1.bam", package = "Rsamtools") >>>>> >>>>> flags <- scanBamFlag(isPaired = TRUE, >>>>> isProperPair = TRUE, >>>>> isUnmappedQuery = FALSE, >>>>> hasUnmappedMate = FALSE) >>>>> >>>>> param <- ScanBamParam(flag = flags) >>>>> >>>>> countBam(fl, param = param)[['records']]/2 >>>>> [1]1572 >>>>> >>>>> galp <- readGAlignmentPairs(fl, param=param) >>>>> length(galp) >>>>> [1]1572 >>>>> >>>>> >>>>> On Thu, Jul 3, 2014 at 9:40 AM, Valerie Obenchain <vobencha@fhcrc.org>>>>> <mailto:vobencha@fhcrc.org>> wrote: >>>>> >>>>> Correction below. >>>>> >>>>> >>>>> On 07/03/2014 09:28 AM, Valerie Obenchain wrote: >>>>> >>>>> Hi, >>>>> >>>>> On 07/02/2014 12:56 PM, Fong Chun Chan wrote: >>>>> >>>>> Hi Valerie, >>>>> >>>>> Thanks for the reply. I'll post some runtime as soon as it >>>>> finishes >>>>> running the readGAlignments and readGAlignmentPairs. >>>>> >>>>> Thanks for the tip on passing range(allExons) into the >>>>> which >>>>> param. >>>>> There is only one issue in that if I wanted to then >>>>> generate >>>>> say RPKM >>>>> values where I need to know the "total number of aligned >>>>> reads" in a bam >>>>> file. Because I've read in just a subset of the bam (i.e. >>>>> the reads that >>>>> aligned just to exons) I would have technically produced >>>>> an >>>>> incorrection >>>>> calculation of RPKM values (assuming you normalize using >>>>> the >>>>> number of >>>>> aligned reads). On that note, is there a quick and easy >>>>> way >>>>> to just >>>>> count the number of aligned reads or aligned paired reads >>>>> so >>>>> I can >>>>> quickly normalize by this way. This way I can still use >>>>> the >>>>> which >>>>> parameter in readGAlignments and readGAlignmentPairs to >>>>> speed it up? >>>>> >>>>> >>>>> For single-end you can use countBam(): >>>>> >>>>> > fl <- system.file("extdata", "ex1.bam", package = >>>>> "Rsamtools") >>>>> > param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery = >>>>> FALSE)) >>>>> > countBam(fl, param = param) >>>>> space start end width file records nucleotides >>>>> 1 NA NA NA NA ex1.bam 3271 115286 >>>>> >>>>> It's not so clear for paired-end. The mate-pairing algorithm >>>>> used in the >>>>> readGAlignment* functions is described on the man page and >>>>> uses a >>>>> combination of BAM fields and flags. You can't really get at >>>>> this simply >>>>> by creating a param with a flag combination such as >>>>> >>>>> isPaired = TRUE >>>>> isProperPair = TRUE >>>>> isUnmappedQuery = FALSE >>>>> hasUnmappedMate = FALSE >>>>> >>>>> For example, if you're interested in this region >>>>> >>>>> gr <- GRanges("seq1", IRanges(100, 500)) >>>>> >>>>> and set flags, >>>>> >>>>> flags <- scanBamFlag(isPaired = TRUE, >>>>> isProperPair = TRUE, >>>>> isUnmappedQuery = FALSE, >>>>> hasUnmappedMate = FALSE) >>>>> >>>>> countBam() reports 335 records. These are individual records >>>>> so >>>>> the mate >>>>> pairs (parts) are each counted separately. Only records that >>>>> fall within >>>>> the region defined by 'gr' are counted. >>>>> >>>>> > param <- ScanBamParam(which = gr, flag = flags) >>>>> > countBam(fl, param = param) >>>>> space start end width file records nucleotides >>>>> 1 seq1 100 500 401 ex1.bam 335 11785 >>>>> >>>>> >>>>> readGAlignmentPairs() reports 223 pairs (446 individual >>>>> records). All >>>>> records in 'gr' are mated if possible. In the case where one >>>>> mate part >>>>> falls inside 'gr' and one outside, the other mate will be >>>>> retrieved >>>>> (i.e., outside the range of 'gr'). This makes the 335 above >>>>> confusing >>>>> for counting the number of aligned pairs in the region. The >>>>> number is a >>>>> mixture of (1) mate parts both in 'gr' that belong together >>>>> (2) mate >>>>> parts with mate outside 'gr' and (3) mates or mate parts that >>>>> don't meet >>>>> other mate-pairing criteria. >>>>> >>>>> > galp <- readGAlignmentPairs(fl, param=param) >>>>> > length(galp) >>>>> [1] 223 >>>>> >>>>> Off hand I can't think of a good way to count aligned pairs >>>>> w/in a >>>>> specified region. >>>>> >>>>> >>>>> What I should have said here is I can't think of a good way to >>>>> accurately count aligned pairs (1 count per pair) in a BAM file >>>>> w/o >>>>> using the pairing algo. >>>>> >>>>> Valerie >>>>> >>>>> >>>>> >>>>> Also another thing I notice is that when I use the >>>>> readGAlignmentPairs function, I lose my ability to cancel >>>>> the command >>>>> (ctrl-c no longer works). >>>>> >>>>> >>>>> It's implemented in C/C++. You can try ctrl+\ in the session >>>>> or kill >>>>> with the top command from another terminal. Maybe others have >>>>> suggestions they'll add ... >>>>> >>>>> Valerie >>>>> >>>>> Not sure this is a bug, or if it is something >>>>> >>>>> to do with the internal workings of the function. The loss >>>>> of ctrl-c >>>>> also occurs when trying to run as a Rscript. I can >>>>> seemingly >>>>> ctrl-c >>>>> anytime before, but it hits this step it just refuses to >>>>> cancel. >>>>> >>>>> Thanks, >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Tue, Jul 1, 2014 at 10:06 AM, Valerie Obenchain >>>>> <vobencha@fhcrc.org <mailto:vobencha@fhcrc.org=""> >>>>> <mailto:vobencha@fhcrc.org <mailto:vobencha@fhcrc.org="">>> >>>>> wrote: >>>>> >>>>> Hi, >>>>> >>>>> You don't need to call findMateAlignment() when using >>>>> readGAlignmentPairs(). In a previous iteration, >>>>> readGAlignmentPairs() called findMateAlignment() but >>>>> this is no >>>>> longer the case. >>>>> >>>>> readGAlignmentPairs() does have more overhead than >>>>> readGAlignments() >>>>> because of the mate paring. These numbers are for a >>>>> BAM >>>>> file with >>>>> 800484 records (400054 final pairs) on my >>>>> not-so-speedy >>>>> laptop. >>>>> >>>>> library(microbenchmark) >>>>> library(RNAseqData.HNRNPC.bam.____chr14) >>>>> fls <- RNAseqData.HNRNPC.bam.chr14_____BAMFILES >>>>> >>>>> microbenchmark(____ >>>>> readGAlignments(fls[1]), >>>>> times=5) >>>>> >>>>> Unit: seconds >>>>> expr min lq >>>>> median >>>>> uq >>>>> max neval >>>>> readGAlignments(fls[1]) 1.753653 1.788292 >>>>> 1.85527 >>>>> 2.019071 >>>>> 2.121211 5 >>>>> >>>>> >>>>> >>>>> ## mate pairing >>>>> >>>>> >>>>> microbenchmark(____readGAlignmentPairs(fls[1]), times=5) >>>>> >>>>> Unit: seconds >>>>> expr min lq >>>>> median >>>>> uq max neval >>>>> readGAlignmentPairs(fls[1]) 9.283966 9.300328 >>>>> 9.535194 >>>>> 9.843282 11.37441 5 >>>>> >>>>> >>>>> ## mate pairing >>>>> >>>>> >>>>> microbenchmark(____readGAlignmentsList(fls[1]), times=5) >>>>> >>>>> Unit: seconds >>>>> expr min lq >>>>> median >>>>> uq max neval >>>>> readGAlignmentsList(fls[1]) 8.409743 8.457232 >>>>> 8.803696 >>>>> 9.692845 9.867628 5 >>>>> >>>>> >>>>> >>>>> You can set a 'yieldSize' when creating the BamFile. >>>>> This will chunk >>>>> through the file 'yieldSize' records at a time and is >>>>> useful when >>>>> processing a complete file and when there are memory >>>>> limitations. >>>>> >>>>> bf <- BamFile(myfile, yieldSize = 1000000) >>>>> >>>>> Herve reported some additional timings for >>>>> readGAlignmentPairs() on >>>>> this post: >>>>> >>>>> >>>>> https://stat.ethz.ch/____pipermail/bioconductor/2014-__ >>>>> __May/059695.html >>>>> >>>>> <https: stat.ethz.ch="" __pipermail="" bioconductor="" 2014-__may="" 059695.html=""> >>>>> >>>>> >>>>> <https: stat.ethz.ch="" __pipermail="" bioconductor="" 2014-__may="" 059695.html="">>>>> >>>>> <https: stat.ethz.ch="" pipermail="" bioconductor="" 2014-may="" 059695.html="">> >>>>> >>>>> >>>>> Can you provide some numbers for how long your run is >>>>> taking and the >>>>> number of records you're trying to read in? >>>>> >>>>> I assume 'allExons' is used for counting against the >>>>> BAM files in >>>>> later code? Maybe this is why your were looking into >>>>> summarizeOverlaps(). I should point out that if >>>>> you're only >>>>> interested in the regions overlapping 'allExons' you >>>>> could call >>>>> range(allExons) and use that as the 'which' in the >>>>> param. This would >>>>> focus the area of interest and speed up the reading. >>>>> >>>>> Valerie >>>>> >>>>> >>>>> >>>>> >>>>> On 06/30/14 12:22, Fong Chun Chan wrote: >>>>> >>>>> Hi, >>>>> >>>>> I've been using the GenomicFeatures R package to >>>>> read in some >>>>> RNA-seq >>>>> paired-end read data. While the readGAlignments() >>>>> function reads >>>>> in the bam >>>>> file within a minute, I've noticed that the >>>>> readGAlignmentPairs() function >>>>> is extremely slow. >>>>> >>>>> This is even after restricting the space to just >>>>> a >>>>> single >>>>> chromosome along >>>>> with setting the flags isDuplicate = FALSE and >>>>> isNotPrimaryRead >>>>> = FALSE >>>>> (the latter being suggested in the reference): >>>>> >>>>> "An easy way to avoid this degradation is to load >>>>> only primary >>>>> alignments >>>>> by setting the isNotPrimaryRead ï¬'ag to FALSE in >>>>> ScanBamParam()" >>>>> >>>>> Based on my understanding, it seems that the ï¬ >>>>> ndMateAlignment() function >>>>> >>>>> has to be run first. Is there any other way to >>>>> speed it up? >>>>> Perhaps I am >>>>> doing something wrong here? >>>>> >>>>> Code and sessionInfo() below. >>>>> >>>>> Thanks! >>>>> >>>>> ---- >>>>> library("GenomicFeatures") >>>>> library('GenomicAlignments') >>>>> library('Rsamtools') >>>>> si <- seqinfo(BamFile(arguments$____bamFile)) >>>>> gr <- GRanges(arguments$chr, IRanges(100, >>>>> seqlengths(si)[[arguments$chr]____]-100)) >>>>> allExons <- exons(txdb, columns = c('gene_id', >>>>> 'exon_id', >>>>> 'exon_name')) >>>>> scf <- scanBamFlag( isDuplicate = FALSE, >>>>> isNotPrimaryRead = >>>>> FALSE ) # >>>>> remove duplicate reads, remove non-primary reads >>>>> (i.e. >>>>> multi-aligned reads) >>>>> reads <- readGAlignmentPairs( arguments$bamFile, >>>>> param = >>>>> ScanBamParam( >>>>> which = gr, flag = scf ) ); # grab reads in >>>>> specific region >>>>> >>>>> sessionInfo() >>>>> >>>>> R version 3.1.0 (2014-04-10) >>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>> >>>>> locale: >>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>> LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>> LC_MONETARY=en_US.UTF-8 >>>>> LC_MESSAGES=en_US.UTF-8 >>>>> LC_PAPER=en_US.UTF-8 >>>>> LC_NAME=C LC_ADDRESS=C >>>>> LC_TELEPHONE=C >>>>> LC_MEASUREMENT=en_US.UTF-8 >>>>> LC_IDENTIFICATION=C >>>>> >>>>> attached base packages: >>>>> [1] parallel stats graphics grDevices utils >>>>> datasets >>>>> methods >>>>> base >>>>> >>>>> other attached packages: >>>>> [1] argparse_1.0.1 proto_0.3-10 >>>>> GenomicAlignments_1.0.1 BSgenome_1.32.0 >>>>> Rsamtools_1.16.1 >>>>> Biostrings_2.32.0 XVector_0.4.0 >>>>> GenomicFeatures_1.16.2 >>>>> AnnotationDbi_1.26.0 Biobase_2.24.0 >>>>> GenomicRanges_1.16.3 >>>>> GenomeInfoDb_1.0.2 IRanges_1.22.9 >>>>> BiocGenerics_0.10.0 >>>>> vimcom.plus_0.9-93 >>>>> [16] setwidth_1.0-3 colorout_1.0-2 >>>>> >>>>> loaded via a namespace (and not attached): >>>>> [1] BatchJobs_1.2 BBmisc_1.7 >>>>> BiocParallel_0.6.1 >>>>> biomaRt_2.20.0 bitops_1.0-6 brew_1.0-6 >>>>> checkmate_1.0 >>>>> codetools_0.2-8 DBI_0.2-7 >>>>> digest_0.6.4 >>>>> fail_1.2 >>>>> findpython_1.0.1 foreach_1.4.2 >>>>> getopt_1.20.0 >>>>> iterators_1.0.7 >>>>> plyr_1.8.1 Rcpp_0.11.2 >>>>> RCurl_1.95-4.1 >>>>> [19] rjson_0.2.14 RSQLite_0.11.4 >>>>> rtracklayer_1.24.2 >>>>> sendmailR_1.1-2 stats4_3.1.0 >>>>> stringr_0.6.2 >>>>> tools_3.1.0 >>>>> XML_3.98-1.1 zlibbioc_1.10.0 >>>>> >>>>> >>>>> >>>>> [[alternative HTML version deleted]] >>>>> >>>>> >>>>> >>>>> ______________________________ >>>>> _____________________ >>>>> Bioconductor mailing list >>>>> Bioconductor@r-project.org >>>>> <mailto:bioconductor@r-project.org> >>>>> <mailto:bioconductor@r-__project.org>>>>> <mailto:bioconductor@r-project.org>> >>>>> https://stat.ethz.ch/mailman/____listinfo/bioconductor >>>>> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >>>>> >>>>> <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">>>>> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> >>>>> Search the archives: >>>>> >>>>> >>>>> http://news.gmane.org/gmane.____science.biology.informatics. >>>>> ____conductor >>>>> >>>>> <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor="">>>>> > >>>>> >>>>> >>>>> <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor="">>>>> >>>>> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> >>>>> >>>>> >>>>> >>>>> -- >>>>> Valerie Obenchain >>>>> Program in Computational Biology >>>>> Fred Hutchinson Cancer Research Center >>>>> 1100 Fairview Ave. N, Seattle, WA 98109 >>>>> >>>>> Email: vobencha@fhcrc.org <mailto:vobencha@fhcrc.org>>>>> > >>>>> <mailto:vobencha@fhcrc.org <mailto:vobencha@fhcrc.org="">> >>>>> Phone: (206) 667-3158 <tel:%28206%29%20667-3158> >>>>> <tel:%28206%29%20667-3158> >>>>> >>>>> >>>>> >>>>> _________________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor@r-project.org <mailto:bioconductor@r-project.org>>>>> > >>>>> https://stat.ethz.ch/mailman/__listinfo/bioconductor >>>>> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>>>> Search the archives: >>>>> >>>>> http://news.gmane.org/gmane.__science.biology.informatics.__conductor >>>>> >>>>> <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor@r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> >>> >> [[alternative HTML version deleted]] >> >> >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane. >> science.biology.informatics.conductor >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]
ADD REPLYlink written 3.5 years ago by Fong Chun Chan320
On 07/14/2014 03:19 PM, Fong Chun Chan wrote: > Hi Herve, > > The R job is using 24g right now. I am operating on a machine with 128G > so I don't think it is swapping as there isn't much running on it either. > > I am not using the use.names parameter. Not sure why it's taking so long. Your file is big but not huge. Puzzling! Just did some timings with a BAM file containing 81 million pairs (the file itself is 12 Gb). Here is the code I ran: bamfile <- "/home/hpages/rhino04-loc/E-GEOD-49538/STAR_alignments/SRR947892.out/A ligned.out.sorted.bam" library(GenomicAlignments) flag <- scanBamFlag(isNotPrimaryRead=FALSE, isDuplicate=FALSE) system.time(galp <- readGAlignmentPairs(bamfile, param=ScanBamParam(flag=flag))) gc() galp I used the latest GenomicAlignments + Rsamtools from BioC devel (see my sessionInfo at the bottom). Here is the output I got: > system.time(galp <- readGAlignmentPairs(bamfile, param=ScanBamParam(flag=flag))) user system elapsed 1270.079 55.003 1325.520 > gc() used (Mb) gc trigger (Mb) max used (Mb) Ncells 3597450 192.2 5241317 280.0 3794227 202.7 Vcells 293262720 2237.5 2101496948 16033.2 2365791455 18049.6 > galp GAlignmentPairs with 81349223 alignment pairs and 0 metadata columns: seqnames strand : ranges -- ranges <rle> <rle> : <iranges> -- <iranges> [1] chr1 + : [3016554, 3025097] -- [3025054, 3025115] [2] chr1 - : [3042064, 3042163] -- [3042028, 3042127] [3] chr1 - : [3043193, 3043292] -- [3043183, 3043249] [4] chr1 + : [3058287, 3058386] -- [3058341, 3058440] [5] chr1 - : [3139237, 3139336] -- [3139143, 3139242] ... ... ... ... ... ... ... [81349219] chrUn_JH584304 - : [ 99839, 99938] -- [ 78007, 78086] [81349220] chrUn_JH584304 + : [101259, 101336] -- [103894, 103993] [81349221] chrUn_JH584304 + : [101259, 101336] -- [103894, 103993] [81349222] chrUn_JH584304 + : [103827, 103924] -- [104644, 104742] [81349223] chrUn_JH584304 + : [106458, 106557] -- [106495, 106552] --- seqlengths: chr1 chr2 ... chrUn_JH584304 195471971 182113224 ... 114452 So it took 22 min and used 18 Gb of RAM. The top command reported similar memory usage (almost 20 Gb). Make sure your file is sorted and indexed (mine is). I don't have timings to back this up but that could make a difference... and a big one apparently! Otherwise maybe there is something unusual about your file. If you don't mind making it available to us then we could take a look. Thanks, H. > sessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] GenomicAlignments_1.1.20 Rsamtools_1.17.29 Biostrings_2.33.12 [4] XVector_0.5.7 GenomicRanges_1.17.23 GenomeInfoDb_1.1.10 [7] IRanges_1.99.21 S4Vectors_0.1.2 BiocGenerics_0.11.3 loaded via a namespace (and not attached): [1] BatchJobs_1.3 BBmisc_1.7 BiocParallel_0.7.7 bitops_1.0-6 [5] brew_1.0-6 checkmate_1.1 codetools_0.2-8 DBI_0.2-7 [9] digest_0.6.4 fail_1.2 foreach_1.4.2 iterators_1.0.7 [13] RSQLite_0.11.4 sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 [17] tools_3.1.0 zlibbioc_1.11.1 > > Fong > > > On Mon, Jul 14, 2014 at 3:06 PM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> wrote: > > Hi Fong, > > > On 07/14/2014 01:02 PM, Fong Chun Chan wrote: > > As promised, I put some of the run-time details for a particular > bam file > comparing readGAlignments vs. readGAlignmentsPairs. > > readGAlignmentPairs has been running for like over a week I > believe and > doesn't look like it will actually stop running...so I don't > have any > run-time of this function unfortunately. > > > Could it be that the R process is eating up all your memory so now > it's running in "swapping mode"? This tends to slow things down *a lot* > (typically by a factor of 100 or 1000 or more). You can check that > by running the Unix commend 'top' in a separate terminal. > > How much RAM do you have? You probably need at least 16Gb of RAM > (rough estimate) to read in a BAM file with 55 million pairs with > readGAlignmentPairs(). And possibly more than that if you're > loading the QNAME field (with 'use.names=TRUE') and/or other BAM > fields or tags. > > Cheers, > H. > > > ------ > group | nb of | nb of | mean / max > of | records | unique | > records per > records | in group | QNAMEs | > unique QNAME > All records.......................__. A |110625579 | 55302565 | > 2 / 4 > o template has single segment.... S | 0 | 0 | > NA / NA > o template has multiple segments. M |110625579 | 55302565 | > 2 / 4 > - first segment.............. F | 55312870 | 55302565 | > 1 / 2 > - last segment............... L | 55312709 | 55302565 | > 1 / 2 > - other segment.............. O | 0 | 0 | > NA / NA > > Note that (S, M) is a partitioning of A, and (F, L, O) is a > partitioning of > M. > Indentation reflects this. > [1] "Read using readGAlignments..." > Unit: seconds > > expr > reads <- readGAlignments(bamFile, param = ScanBamParam(which > = gr, > flag = scf)) > min lq median uq max neval > 209.1448 216.8919 221.3251 225.7923 243.3946 100 > > [1] "Read using readGAlignmentsPairs..." > ------ > > > On Thu, Jul 3, 2014 at 12:11 PM, Valerie Obenchain > <vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">> > wrote: > > fyi the test files ('fls') came from, > > library(RNAseqData.HNRNPC.bam.__chr14) > fls <- RNAseqData.HNRNPC.bam.chr14___BAMFILES > > Valerie > > > > On 07/03/2014 10:42 AM, Valerie Obenchain wrote: > > No, the division by 2 isn't always safe. 'ex1.bam' is a > well-behaved > file. Files with more diversity may have more than 2 > segments per > template. > > counts <- lapply(fls, function(fl) > countBam(fl, param=param)$records) > > pairs <- lapply(fls, function(fl) > length(readGAlignmentPairs(fl, param=param))) > > unlist(counts)/2 > > > ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 > ERR127303 ERR127304 > ERR127305 > 363326 392420 397158 349970 355376 > 381493 > 389176 355834 > > unlist(pairs) > > ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 > ERR127303 ERR127304 > ERR127305 > 363147 392252 396933 349822 355215 > 381274 > 388991 355682 > > > To see what's going on use readGAlignementsList() on the > first file. > This function adds some flexibility in that it reads in > fragments, pairs > w/ no mates, etc. > > galist <- readGAlignmentsList(fl, param=param) > > The length matches that of countBam() because it reads > all records. > > length(galist) > > > [1] 363226 > > > 'mate_status' categorizes records by 'mated', > 'ambiguous' and 'unmated'. > Here we see the 363147 match to output from > readGAlignmentsPairs() and > the rest are ambiguous. > > table(mcols(galist)$mate___status) > > > > mated ambiguous unmated > 363147 79 0 > > > elementLengths() shows the number of segments per > template. Again we see > the 363147 match and it shows exactly 2 segments for those. > > table(elementLengths(galist)) > > > > 2 4 6 8 > 363147 65 7 7 > > > > Valerie > > > On 07/03/2014 10:08 AM, Fong Chun Chan wrote: > > Thanks for the reply. I understand what you mean > about counting aligned > pairs within a region and how that would give > incorrect results. > > However, if I was purely interested in just getting > all aligned > read-pairs within the whole genomic space (no > restriction to any region > since this is what I need to normalize by), wouldn't > the following code > work? > > library('Rsamtools') > library('GenomicAlignments') > > fl <- system.file("extdata", "ex1.bam", package = > "Rsamtools") > > flags <- scanBamFlag(isPaired = TRUE, > isProperPair = TRUE, > isUnmappedQuery = FALSE, > hasUnmappedMate = FALSE) > > param <- ScanBamParam(flag = flags) > > countBam(fl, param = param)[['records']]/2 > [1]1572 > > galp <- readGAlignmentPairs(fl, param=param) > length(galp) > [1]1572 > > > On Thu, Jul 3, 2014 at 9:40 AM, Valerie Obenchain > <vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org=""> > <mailto:vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org="">>> wrote: > > Correction below. > > > On 07/03/2014 09:28 AM, Valerie Obenchain wrote: > > Hi, > > On 07/02/2014 12:56 PM, Fong Chun Chan wrote: > > Hi Valerie, > > Thanks for the reply. I'll post some > runtime as soon as it > finishes > running the readGAlignments and > readGAlignmentPairs. > > Thanks for the tip on passing > range(allExons) into the which > param. > There is only one issue in that if I > wanted to then generate > say RPKM > values where I need to know the "total > number of aligned > reads" in a bam > file. Because I've read in just a > subset of the bam (i.e. > the reads that > aligned just to exons) I would have > technically produced an > incorrection > calculation of RPKM values (assuming > you normalize using the > number of > aligned reads). On that note, is there > a quick and easy way > to just > count the number of aligned reads or > aligned paired reads so > I can > quickly normalize by this way. This > way I can still use the > which > parameter in readGAlignments and > readGAlignmentPairs to > speed it up? > > > For single-end you can use countBam(): > > > fl <- system.file("extdata", > "ex1.bam", package = > "Rsamtools") > > param <- ScanBamParam(flag = > scanBamFlag(isUnmappedQuery = > FALSE)) > > countBam(fl, param = param) > space start end width file records > nucleotides > 1 NA NA NA NA ex1.bam 3271 > 115286 > > It's not so clear for paired-end. The > mate-pairing algorithm > used in the > readGAlignment* functions is described on > the man page and uses a > combination of BAM fields and flags. You > can't really get at > this simply > by creating a param with a flag > combination such as > > isPaired = TRUE > isProperPair = TRUE > isUnmappedQuery = FALSE > hasUnmappedMate = FALSE > > For example, if you're interested in this > region > > gr <- GRanges("seq1", IRanges(100, 500)) > > and set flags, > > flags <- scanBamFlag(isPaired = TRUE, > isProperPair = TRUE, > isUnmappedQuery = > FALSE, > hasUnmappedMate = > FALSE) > > countBam() reports 335 records. These are > individual records so > the mate > pairs (parts) are each counted separately. > Only records that > fall within > the region defined by 'gr' are counted. > > > param <- ScanBamParam(which = gr, flag > = flags) > > countBam(fl, param = param) > space start end width file records > nucleotides > 1 seq1 100 500 401 ex1.bam 335 > 11785 > > > readGAlignmentPairs() reports 223 pairs > (446 individual > records). All > records in 'gr' are mated if possible. In > the case where one > mate part > falls inside 'gr' and one outside, the > other mate will be > retrieved > (i.e., outside the range of 'gr'). This > makes the 335 above > confusing > for counting the number of aligned pairs > in the region. The > number is a > mixture of (1) mate parts both in 'gr' > that belong together > (2) mate > parts with mate outside 'gr' and (3) mates > or mate parts that > don't meet > other mate-pairing criteria. > > > galp <- readGAlignmentPairs(fl, > param=param) > > length(galp) > [1] 223 > > Off hand I can't think of a good way to > count aligned pairs > w/in a > specified region. > > > What I should have said here is I can't think > of a good way to > accurately count aligned pairs (1 count per > pair) in a BAM file w/o > using the pairing algo. > > Valerie > > > > Also another thing I notice is that > when I use the > readGAlignmentPairs function, I lose > my ability to cancel > the command > (ctrl-c no longer works). > > > It's implemented in C/C++. You can try > ctrl+\ in the session > or kill > with the top command from another > terminal. Maybe others have > suggestions they'll add ... > > Valerie > > Not sure this is a bug, or if it is something > > to do with the internal workings of > the function. The loss > of ctrl-c > also occurs when trying to run as a > Rscript. I can seemingly > ctrl-c > anytime before, but it hits this step > it just refuses to > cancel. > > Thanks, > > > > > > > On Tue, Jul 1, 2014 at 10:06 AM, > Valerie Obenchain > <vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org=""> > <mailto:vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">> > <mailto:vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org=""> > <mailto:vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org="">>>> > wrote: > > Hi, > > You don't need to call > findMateAlignment() when using > readGAlignmentPairs(). In a > previous iteration, > readGAlignmentPairs() called > findMateAlignment() but > this is no > longer the case. > > readGAlignmentPairs() does have > more overhead than > readGAlignments() > because of the mate paring. These > numbers are for a BAM > file with > 800484 records (400054 final > pairs) on my not-so-speedy > laptop. > > library(microbenchmark) > > library(RNAseqData.HNRNPC.bam.______chr14) > fls <- > RNAseqData.HNRNPC.bam.chr14_______BAMFILES > > > microbenchmark(______readGAlignments(fls[1]), > times=5) > > Unit: seconds > expr > min lq median > uq > max neval > readGAlignments(fls[1]) > 1.753653 1.788292 1.85527 > 2.019071 > 2.121211 5 > > > > ## mate pairing > > > > microbenchmark(______readGAlignmentPairs(fls[1]), > times=5) > > Unit: seconds > expr > min lq > median > uq max neval > readGAlignmentPairs(fls[1]) > 9.283966 9.300328 > 9.535194 > 9.843282 11.37441 5 > > > ## mate pairing > > > > microbenchmark(______readGAlignmentsList(fls[1]), > times=5) > > Unit: seconds > expr > min lq > median > uq max neval > readGAlignmentsList(fls[1]) > 8.409743 8.457232 > 8.803696 > 9.692845 9.867628 5 > > > > You can set a 'yieldSize' when > creating the BamFile. > This will chunk > through the file 'yieldSize' > records at a time and is > useful when > processing a complete file and > when there are memory > limitations. > > bf <- BamFile(myfile, > yieldSize = 1000000) > > Herve reported some additional > timings for > readGAlignmentPairs() on > this post: > > > https://stat.ethz.ch/______pipermail/bioconducto r/2014-______May/059695.html > <https: stat.ethz.ch="" ____pipermail="" bioconductor="" 2014-____may="" 059695.html=""> > > <https: stat.ethz.ch="" ____pipermail="" bioconductor="" 2014-____may="" 059695.html=""> <https: stat.ethz.ch="" __pipermail="" bioconductor="" 2="" 014-__may="" 059695.html="">> > > > <https: stat.ethz.ch="" ____pipermail="" bioconductor="" 2014-____may="" 059695.html=""> <https: stat.ethz.ch="" __pipermail="" bioconductor="" 2014-__may="" 059695.html=""> > > <https: stat.ethz.ch="" __pipermail="" bioconductor="" 2014-__may="" 059695.html=""> <https: stat.ethz.ch="" pipermail="" bioconductor="" 2014-may="" 059695.html="">>> > > > Can you provide some numbers for > how long your run is > taking and the > number of records you're trying > to read in? > > I assume 'allExons' is used for > counting against the > BAM files in > later code? Maybe this is why > your were looking into > summarizeOverlaps(). I should > point out that if > you're only > interested in the regions > overlapping 'allExons' you > could call > range(allExons) and use that as > the 'which' in the > param. This would > focus the area of interest and > speed up the reading. > > Valerie > > > > > On 06/30/14 12:22, Fong Chun Chan > wrote: > > Hi, > > I've been using the > GenomicFeatures R package to > read in some > RNA-seq > paired-end read data. While > the readGAlignments() > function reads > in the bam > file within a minute, I've > noticed that the > readGAlignmentPairs() function > is extremely slow. > > This is even after > restricting the space to just a > single > chromosome along > with setting the flags > isDuplicate = FALSE and > isNotPrimaryRead > = FALSE > (the latter being suggested > in the reference): > > "An easy way to avoid this > degradation is to load > only primary > alignments > by setting the > isNotPrimaryRead ??'ag to FALSE in > ScanBamParam()" > > Based on my understanding, it > seems that the ?? > ndMateAlignment() function > > has to be run first. Is there > any other way to > speed it up? > Perhaps I am > doing something wrong here? > > Code and sessionInfo() below. > > Thanks! > > ---- > library("GenomicFeatures") > library('GenomicAlignments') > library('Rsamtools') > si <- > seqinfo(BamFile(arguments$______bamFile)) > gr <- GRanges(arguments$chr, > IRanges(100, > > seqlengths(si)[[arguments$chr]______]-100)) > allExons <- exons(txdb, > columns = c('gene_id', > 'exon_id', > 'exon_name')) > scf <- scanBamFlag( > isDuplicate = FALSE, > isNotPrimaryRead = > FALSE ) # > remove duplicate reads, > remove non-primary reads > (i.e. > multi-aligned reads) > reads <- readGAlignmentPairs( > arguments$bamFile, > param = > ScanBamParam( > which = gr, flag = scf ) ); # > grab reads in > specific region > > sessionInfo() > > R version 3.1.0 (2014-04-10) > Platform: > x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 > LC_NUMERIC=C > LC_TIME=en_US.UTF-8 > LC_COLLATE=en_US.UTF-8 > LC_MONETARY=en_US.UTF-8 > LC_MESSAGES=en_US.UTF-8 > LC_PAPER=en_US.UTF-8 > LC_NAME=C > LC_ADDRESS=C > LC_TELEPHONE=C > > LC_MEASUREMENT=en_US.UTF-8 > LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats > graphics grDevices utils > datasets > methods > base > > other attached packages: > [1] argparse_1.0.1 > proto_0.3-10 > GenomicAlignments_1.0.1 > BSgenome_1.32.0 > Rsamtools_1.16.1 > Biostrings_2.32.0 > XVector_0.4.0 > GenomicFeatures_1.16.2 > AnnotationDbi_1.26.0 > Biobase_2.24.0 > GenomicRanges_1.16.3 > GenomeInfoDb_1.0.2 > IRanges_1.22.9 > BiocGenerics_0.10.0 > vimcom.plus_0.9-93 > [16] setwidth_1.0-3 > colorout_1.0-2 > > loaded via a namespace (and > not attached): > [1] BatchJobs_1.2 > BBmisc_1.7 > BiocParallel_0.6.1 > biomaRt_2.20.0 > bitops_1.0-6 brew_1.0-6 > checkmate_1.0 > codetools_0.2-8 DBI_0.2-7 > digest_0.6.4 > fail_1.2 > findpython_1.0.1 > foreach_1.4.2 > getopt_1.20.0 > iterators_1.0.7 > plyr_1.8.1 > Rcpp_0.11.2 > RCurl_1.95-4.1 > [19] rjson_0.2.14 > RSQLite_0.11.4 > rtracklayer_1.24.2 > sendmailR_1.1-2 > stats4_3.1.0 stringr_0.6.2 > tools_3.1.0 > XML_3.98-1.1 > zlibbioc_1.10.0 > > > > [[alternative HTML > version deleted]] > > > > > _____________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > <mailto:bioconductor at="" r-____project.org=""> <mailto:bioconductor at="" r-__project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">>> > https://stat.ethz.ch/mailman/______listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor=""> > > <https: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor="" <https:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor="">> > > > <https: stat.ethz.ch="" mailman="" ____listinfo="" bioconductor="" <https:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> > > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="" <https:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor="">>> > Search the archives: > > > http://news.gmane.org/gmane.______science.biology.informatics > <http: news.gmane.org="" gmane.____science.biology.informatics="">. > ____conductor > > <http: news.gmane.org="" gmane.____science.biology="" .informatics.____conductor=""> <http: news.gmane.org="" gmane.__science.biology.i="" nformatics.__conductor="">> > > > <http: news.gmane.org="" gmane.____science.biology="" .informatics.____conductor=""> <http: news.gmane.org="" gmane.__science.biology.i="" nformatics.__conductor=""> > > <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">>> > > > > -- > Valerie Obenchain > Program in Computational Biology > Fred Hutchinson Cancer Research > Center > 1100 Fairview Ave. N, Seattle, WA > 98109 > > Email: vobencha at fhcrc.org > <mailto:vobencha at="" fhcrc.org=""> > <mailto:vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">> > <mailto:vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org=""> > <mailto:vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">>> > Phone: (206) 667-3158 > <tel:%28206%29%20667-3158> <tel:%28206%29%20667-3158> > <tel:%28206%29%20667-3158> > > > > > ___________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > https://stat.ethz.ch/mailman/____listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> > > <https: stat.ethz.ch="" mailman="" __listinfo="" bioconductor="" <https:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor="">> > Search the archives: > > http://news.gmane.org/gmane.____science.biology. informatics.____conductor > <http: news.gmane.org="" gmane.__science.biology.i="" nformatics.__conductor=""> > > <http: news.gmane.org="" gmane.__science.biology.informatics.__conductor=""> <http: news.gmane.org="" gmane.science.biology.informatics.conductor="">> > > > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > > [[alternative HTML version deleted]] > > > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLYlink written 3.5 years ago by Hervé Pagès ♦♦ 13k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 260 users visited in the last hour