Thanks, Valerie, for looking into these suggestions and improvements. I
really appreciate the time and effort you keep investing into my
unsolicited questions. Obviously, read counting has become a major
activity for many of us, and your software will definitely be put to
great use. In our projects, flexibility and accuracy of the read counter
are most important to us, which summarizeOverlaps and associates address
very well. Since research facilities like ours have to perform these
analyses on very large disk storage devices with several hundred TBs of
space that are shared among many NGS users, disk I/O and network speeds
on compute clusters have become a major bottleneck. Thus, we are also
very interested in solutions that collect counts for many feature types
and counting modes in a single pass through these big files, as opposed
to accessing them over and over again when it is not really necessary.
Usually, I also like to delete the bam files early on in most of our
RNA-Seq workflows to save storage space. However, our collaborators
often keep coming up with all kinds of creative and exciting new ideas
about which features/modes to count, where the best analysis strategy is
often to generate counts for many feature types by default right away
(even if I don't intend or recommend to use them) rather than relying on
only one prescribed solution that usually ends up not being enough. I
guess RNA-Seq is still very much in an engineering stage where most of
us depend on software design that provides a very high level of
flexibility.
Again, thanks a lot.
Thomas
On Thu, May 23, 2013 at 09:06:10PM +0000, Valerie Obenchain wrote:
> Hi Thomas,
>
> On 05/20/2013 10:12 PM, Thomas Girke wrote:
> > I apologize up front to Valerie for posting additional questions
> > related to her excellent read counter. This time my questions are
> > related to the economics of accessing big files as few times as
> > possible to improve workflow performance in large RNA-Seq projects
> > and to subsequently allow deleting all input bam files since BIG
> > DATA is killing us :).
> >
> > (1) What is currently the intended approach to obtain counts for
> > multiple feature types in a single pass-through of one or many bam
> > files? For instance, often we want to count the reads overlapping
> > with many different feature types (not just one), such as
> > exonsByGene, cdsByGene, UTRs, introns and intergenic regions.
> > Computing all those feature counts sequentially in a loop is easy,
> > but can be time consuming and heavy on disk I/O (the most expensive
> > resource in NGS analysis) and internal networks when running many
> > counting processes in parallel. As a partial solution to this, I
> > often organize all feature types in one big GRanges/GRangesList
> > container, perform the counting and then split the resulting count
> > tables accordingly. However, is this really the best way of doing
> > this? A major limitation of this approach is that it does not
> > support feature-overlap-aware counting modes.
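> >
> > Concretely, I mean something along these lines (a rough, untested
> > sketch; 'exonByGene', 'utrByTx' and 'reads' are placeholders):
> >
> >     ## combine two annotations into one GRangesList, count once,
> >     ## then split the rows of the count matrix back apart
> >     combined <- c(exonByGene, utrByTx)
> >     se <- summarizeOverlaps(combined, reads, mode=Union)
> >     grp <- rep(c("exon", "utr"),
> >                c(length(exonByGene), length(utrByTx)))
> >     countList <- split(data.frame(assays(se)$counts), grp)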
>
> I would not recommend combining different annotations into a single
> GRanges/GRangesList. To summarizeOverlaps(), each row of a GRanges or
> each list element of a GRangesList represents a feature. If a read
> hits more than one feature we have a double-hit situation that must be
> resolved. If you're interested in counting both exonsByGene and
> transcriptsByGene you probably want these to be independent counts. By
> combining them into one object they can confound each other and create
> multiple-hit situations where there really aren't any.
>
> Toy example, but let's say 'gr1' comes from exonsBy() and 'gr2' from
> threeUTRsByTranscript().
>
> gr1 <- GRanges(c("chr1", "chr1"), IRanges(c(1, 10), width=10))
> gr2 <- GRanges("chr1", IRanges(25, width=10))
> grl <- GRangesList(gr1, gr2)
> reads <- GAlignments("chr1", pos=18L, cigar="10M",
>                      strand=strand("*"))
>
>
> The read hits gr1 and gr2. If counted separately,
>
> > assays(summarizeOverlaps(grl[1], reads, Union))$counts
> reads
> [1,] 1
> > assays(summarizeOverlaps(grl[2], reads, Union))$counts
> reads
> [1,] 1
>
> Now count as GRangesList,
>
> > assays(summarizeOverlaps(grl, reads, Union))$counts
> reads
> [1,] 0
> [2,] 0
>
>
> Using 'inter.feature=FALSE' will regain the hits with modes Union and
> IntersectionStrict, but not necessarily IntersectionNotEmpty because
> shared regions of the annotation are removed before counting.
>
> > assays(summarizeOverlaps(grl, reads, Union, inter.feature=FALSE))$counts
> reads
> [1,] 1
> [2,] 1
>
> >
> > (2) Similarly, what if I want to generate counts for multiple
> > counting modes in a single call to summarizeOverlaps, such as
> > different overlap modes, or sense and antisense counts which we
> > often want to report when working with strand-specific RNA-Seq data?
> > Are there plans to support this? BTW: currently, I generate
> > antisense read counts by inverting the strand information of the
> > features, which works fine, just not for both ways in one step.
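> >
> > For the antisense counts I do roughly the following (untested
> > sketch; 'exonByGene' and 'reads' are placeholders, and the helper
> > name 'flipStrand' is just illustrative):
> >
> >     ## swap "+" and "-" on a GRanges, leaving "*" untouched,
> >     ## then count against the strand-flipped features
> >     flipStrand <- function(gr) {
> >         strand(gr) <- chartr("+-", "-+", as.character(strand(gr)))
> >         gr
> >     }
> >     antisense <- endoapply(exonByGene, flipStrand)
> >     seAnti <- summarizeOverlaps(antisense, reads, mode=Union)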
>
> We don't have a built-in method to count Bams over multiple
> annotations and multiple count modes. There are several approaches you
> could take. After some discussion this was one we came up with.
>
> ## Count a single file.
> count1bam <-
>     function(file, features, mode, ...,
>              yieldSize=1000000L, reader=readGAlignments,
>              param=ScanBamParam())
> {
>     ## initialize: one integer vector of per-feature counts
>     ## for each annotation in 'features'
>     bf <- open(BamFile(file, yieldSize=yieldSize))
>     counts <- lapply(sapply(features, length), integer)
>
>     ## iterate over chunks of 'yieldSize' records
>     while (length(reads <- reader(bf, param=param)))
>         for (i in seq_along(features))
>             counts[[i]] <-
>                 counts[[i]] + mode(features[[i]], reads, ...)
>     close(bf)
>     counts
> }
>
> ## Count over files in parallel.
> countbams <-
>     function(features, files, ...)
> {
>     counts0 <- mclapply(files, count1bam, features=features, ...)
>     ## reorganize: one count matrix (features x files) per annotation
>     counts <- lapply(names(features),
>                      function(i) sapply(counts0, "[[", i))
>     names(counts) <- names(features)
>     counts
> }
>
> ## In practice.
> library(parallel); options(mc.cores=detectCores())
> library(GenomicRanges)
> library(Rsamtools)
> features <-
>     list(coarse=GRanges("seq1", successiveIRanges(rep(100, 15))),
>          fine=GRanges("seq2", successiveIRanges(rep(20, 45))))
> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
> files <- list(fl, fl) ## usually different files
>
> ## As 'mode' you can directly call Union, IntersectionStrict,
> ## and IntersectionNotEmpty.
> result <- countbams(features, files, mode=Union,
>                     ignore.strand=TRUE, inter.feature=FALSE)
>
> ## 'reader' specifies how the data are read in. For paired-end data
> ## use readGAlignmentsList (qname-sorted) or readGAlignmentPairs.
> result <- countbams(features, files, mode=Union,
>                     reader=readGAlignmentPairs)
>
> This code handles the case of multiple Bam files with multiple
> annotations. To handle combinations of multiple arguments ('mode',
> 'ignore.strand', 'inter.feature', etc.) you could add a Map() in the
> 'while' loop in count1bam. Instead of passing a single value you would
> pass lists the same length as the list of annotations. For example,
>
> modeList <- c(Union, IntersectionStrict)
> ignoreStrandList <- c(TRUE, TRUE)
> interFeatureList <- c(TRUE, FALSE)
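>
> A rough, untested sketch of what that Map() variant might look like,
> assuming 'counts', 'features', 'modeList', 'ignoreStrandList' and
> 'interFeatureList' are parallel lists of the same length:
>
>     ## inside count1bam: the 'while' loop, with the 'for' loop
>     ## replaced by Map() over per-annotation arguments
>     while (length(reads <- reader(bf, param=param)))
>         counts <- Map(function(cnt, feat, mode, igs, intf)
>                           cnt + mode(feat, reads,
>                                      ignore.strand=igs,
>                                      inter.feature=intf),
>                       counts, features, modeList,
>                       ignoreStrandList, interFeatureList)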
>
> >
> > (3) Is it currently possible to return the total numbers of aligned
> > reads in bam files with summarizeOverlaps and store them in the
> > resulting counting object? There are many places from which the
> > numbers of aligned reads can be obtained, but to me the most obvious
> > step to generate and store them would be right here. They are useful
> > parameters for alignment QC'ing and for reporting proper RPKM/FPKM
> > values to biologists.
> >
>
> I've added the output of countBam() to the 'colData' slot of the
> SummarizedExperiment. The 'mapped' counts are computed with countBam()
> and a param where 'isUnmappedQuery=FALSE'. This isn't a great example
> but it does demonstrate the new output.
>
> library(Rsamtools)
> library(pasillaBamSubset)
> library("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
> exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
> se <- summarizeOverlaps(exbygene, BamFile(untreated1_chr4()))
>
> > colData(se)
> DataFrame with 1 row and 3 columns
> records nucleotides mapped
> <integer> <numeric> <integer>
> untreated1_chr4.bam 204355 15326625 204355
>
>
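> For reference, the 'mapped' column corresponds to something like the
> following (a sketch, not taken verbatim from the package code):
>
>     param <- ScanBamParam(flag=scanBamFlag(isUnmappedQuery=FALSE))
>     countBam(BamFile(untreated1_chr4()), param=param)$records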
>
> Code updates are in GenomicRanges 1.13.15 and Rsamtools 1.13.16.
>
> Valerie
>
> > Thomas
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> > http://news.gmane.org/gmane.science.biology.informatics.conductor
> >