Dear Wei,
Thanks. I guess this makes sense, but is it possible this way to
perform
transcript level counting in addition to exonic gene-level counting?
Also, this reminds me of one question related to Rsubread for which I
couldn't find an answer in the documentation (haven't had time to read
your NAR paper though). Is it possible to use Rsubread for aligning
directly against transcriptome sequences where we want to set no
restrictions on multiple mappings of equal quality in order to obtain
transcript read counts? Right I am often doing this with Bowtie2 for
our
own transcriptome assemblies where we don't have a reference genome
and
then use Rsamtools for counting the reads mapping to individual
transcripts. With Rsubread I was not sure if the algorithm can return
any number of alternative hits? If it can then I would love to give it
a try for our transcriptome assembly projects.
Thomas
On Thu, Apr 11, 2013 at 02:32:03AM +0000, Wei Shi wrote:
> Hi Tom,
>
> The featureCounts function in Rsubread package goes some way like
what you suggested. It is different from countOverlaps in that each
read is counted at most once. It is also different from
summarizeOverlaps in that it assigns the read to one of the features
if it overlaps with >1 features.
>
> featureCounts assigns reads to exons first and then summarize exon-
level read counts to gene-level read counts. Therefore, if a read
overlaps with >1 exons within the same gene it will be counted only
once for the gene.
>
> featureCounts includes in-built annotations for human and mouse
genomes (NCBI RefSeq annotation), but you can provide your own
annotation as well. You can also use it for the general-purpose read
assignment. It is not restricted to the read counting for RNA-seq
data.
>
> Cheers,
> Wei
>
> On Apr 11, 2013, at 12:09 PM, Thomas Girke wrote:
>
> > Hi Martin,
> >
> > Yes, inter_feature=TRUE would maintain the current counting
mode(s) that
> > prohibits counting of reads mapping to multiple features. This is
a
> > special case of counting that is very useful for counting exonic
regions
> > of genes. However, one also wants to be able to turn off this
behavior
> > by ignoring inter feature overlaps (just like ignore.strand=T/F).
> > Otherwise we cannot use summarizeOverlaps along with its current
modes
> > for important operations like transcript level counting because
many
> > transcript variants from the same gene will mask each others reads
when
> > inter_feature=TRUE. Providing the option to output the results
from both
> > inter_feature=TRUE and inter_feature=FALSE could be a very
sensible
> > solution and time saver for users working with new genomes/GFFs,
where
> > one cannot trust every nested annotation for various reasons, and
> > inter_feature=TRUE can quickly become a very risky counting
strategy
> > since it tends to erase counts. Every biologist will scream in
your
> > face if the counter tells them that their favorite gene has zero
counts
> > just because of some overlap with some annotation error:).
> >
> > For multiple range features stored in GRangesList objects, I would
> > currently favor making "inter_feature=FALSE" ignore the overlaps
> > occurring among different list components, but not necessarily
those
> > within a list component (e.g. exon ranges of a gene). This way one
> > can benefit from the current infrastructure by restricting its
feature
> > overlap scope to the range sets stored within individual list
components
> > but ignoring those among different list components. Utilities like
reduce
> > and other range modifier functions could handle situations where
one wants
> > to ignore all feature overlaps within and among list components.
However,
> > I am sure there could be other solutions to this.
> >
> > Long story short if inter_feature=TRUE/FALSE could be used in
> > combination with
modes=Union/IntersectionStrict/IntersectionNonEmpty
> > resulting in six different counting flavors, I would be happy and,
I am
> > sure, many other Bioc users as well.
> >
> > I hope this makes sense.
> >
> > Thomas
> >
> >
> >
> > On Wed, Apr 10, 2013 at 11:37:04PM +0000, Martin Morgan wrote:
> >> On 04/10/2013 12:42 PM, Thomas Girke wrote:
> >>> Thanks. Adding an inter-feature unaware mode will be very
helpful and also
> >>> broaden summarizeOverlaps' application spectrum for a lot of use
cases.
> >>
> >> I'm probably being quite dense here, and am mostly an outside
observer. What I
> >> hear you saying is that there are currently three modes -- Union,
> >> IntersectionStrict, IntersectionNonEmpty. These modes are
summarized in the
> >> seven rows of figure 1 of
> >>
> >> vignette("summarizeOverlaps", package="GenomicRanges")
> >>
> >> Let's say there is a flag inter_feature, and when its value is
TRUE then the
> >> current counting schemes are obtained. These modes differ in the
way counting
> >> works for reads illustrated in rows 5, 6, and 7 of the figure.
You'd like a
> >> count scored where a '1' appears in the table below. With
inter_feature=TRUE
> >> reads are 'never counted more than once' ('Hits per read' is <=
1)
> >>
> >>
|----------------------+-----+-----------+-----------+---------------|
> >> | Mode | Row | Feature 1 | Feature 2 | Hits per
read |
> >>
|----------------------+-----+-----------+-----------+---------------|
> >> | Union | 5 | 1 | 0 | 1
|
> >> | | 6 | 0 | 0 | 0
|
> >> | | 7 | 0 | 0 | 0
|
> >> | IntersectionStrict | 5 | 1 | 0 | 1
|
> >> | | 6 | 1 | 0 | 1
|
> >> | | 7 | 0 | 0 | 0
|
> >> | IntersectionNotEmpty | 5 | 1 | 0 | 1
|
> >> | | 6 | 1 | 0 | 1
|
> >> | | 7 | 0 | 0 | 0
|
> >>
|----------------------+-----+-----------+-----------+---------------|
> >>
> >>
> >> You'd like to add 3 more modes, with inter_feature=FALSE. Reads
are sometimes
> >> 'counted twice'
> >>
> >>
|----------------------+-----+-----------+-----------+---------------|
> >> | Mode | Row | Feature 1 | Feature 2 | Hits per
read |
> >>
|----------------------+-----+-----------+-----------+---------------|
> >> | Union | 5 | 1 | 0 |
1 |
> >> | | 6 | 1 | 1 |
2 |
> >> | | 7 | 1 | 1 |
2 |
> >> | IntersectionStrict | 5 | 1 | 0 |
1 |
> >> | | 6 | 1 | 0 |
1 |
> >> | | 7 | 1 | 1 |
2 |
> >> | IntersectionNotEmpty | 5 | 1 | 0 |
1 |
> >> | | 6 | 1 | 1 |
2 |
> >> | | 7 | 1 | 1 |
2 |
> >>
|----------------------+-----+-----------+-----------+---------------|
> >>
> >>
> >> Martin
> >>
> >>>
> >>> Thomas
> >>>
> >>>
> >>> On Wed, Apr 10, 2013 at 04:59:49PM +0000, Valerie Obenchain
wrote:
> >>>> Hi Thomas,
> >>>>
> >>>> On 04/10/2013 09:23 AM, Thomas Girke wrote:
> >>>>> Valerie,
> >>>>>
> >>>>> Please see my inserted comments below.
> >>>>>
> >>>>> On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain
wrote:
> >>>>>> Ah, I see. You'd like to count with one of the existing modes
but have
> >>>>>> the option to pick up counts for these inter-feature reads
(fall
> >>>>>> completely 'within' >1 feature). These inter-feature reads
would be
> >>>>>> double (triple, quadruple, etc.) counted. Essentially they
would add one
> >>>>>> count to each feature they hit. Right?
> >>>>>
> >>>>> Correct. Perhaps let's discuss this with a very common example
of
> >>>>> transcript-level counting rather than counting on the unified
(virtual)
> >>>>> exonic regions of genes. With the current description provided
in the
> >>>>> summarizeOverlaps vignette at
> >>>>>
http://www.bioconductor.org/packages/2.12/bioc/vignettes/Genom
icRanges/inst/doc/summarizeOverlaps.pdf
> >>>>> I don't see how this can be achieved without falling back to
using
> >>>>> countOverlaps without loosing the new counting modes provided
by
> >>>>> summarizeOverlaps?
> >>>>
> >>>> It can't be achieved with the function as is but we could add
an
> >>>> argument to handle this (as you suggested from the start). If
> >>>> 'inter-feature=TRUE' then these counts would be added to the
counts
> >>>> already obtained from using a particular 'mode'. I will move
ahead with
> >>>> implementing this argument.
> >>>>
> >>>>>
> >>>>>>
> >>>>>> One more thought about memory usage. If you are working with
single-end
> >>>>>> reads the summarizeOverlaps,BamFileList-method I mentioned
below should
> >>>>>> work fine. The approach is slightly different with paired-end
reads. Our
> >>>>>> current algorithm for pairing paired-end reads requires the
whole file
> >>>>>> to be read into memory. A different approach is currently
being
> >>>>>> developed but in the meantime you can take the qname-sorted
approach.
> >>>>>> The Bam file will need to be sorted by qname and both the
'yieldSize'
> >>>>>> and 'obeyQname=TRUE' set when creating the
BamFile/BamFileList. An
> >>>>>> example is on ?BamFile.
> >>>>>
> >>>>> Thanks for pointing this out. My fault that I didn't read
through all
> >>>>> the linked documentation. Perhaps it may not be a bad idea to
make the
> >>>>> memory restricted bam read instances the default setting in
the future.
> >>>>> This will definitely help biologists using those utilities
without
> >>>>> crashing their machines on the first attempt.
> >>>>
> >>>> Good suggestion, thanks.
> >>>>
> >>>> Valerie
> >>>>
> >>>>>
> >>>>> Thomas
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> Valerie
> >>>>>>
> >>>>>>
> >>>>>> On 04/09/2013 08:01 PM, Thomas Girke wrote:
> >>>>>>> Thanks for the tip. I guess doing it this way reverses the
counting mode
> >>>>>>> back to countOverlaps, but how can I use at the same time
> >>>>>>> "IntersectionStrict" or any of the other modes provided by
> >>>>>>> summarizeOverlaps if its mode argument is already used and
countOverlaps
> >>>>>>> doesn't accept one?
> >>>>>>>
> >>>>>>> Thomas
> >>>>>>>
> >>>>>>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain
wrote:
> >>>>>>>> Thanks for the example. You're right, none of the modes
will count a
> >>>>>>>> read falling completely within >1 feature.
> >>>>>>>>
> >>>>>>>> You can supply your own function as a 'mode'. Two
requirements must be met:
> >>>>>>>>
> >>>>>>>> (1) The function must have 3 arguments that correspond to
> >>>>>>>> 'features', 'reads' and 'ignore.strand', in that
order.
> >>>>>>>>
> >>>>>>>> (2) The function must return a vector of counts the same
length
> >>>>>>>> as 'features'
> >>>>>>>>
> >>>>>>>> Here is an example using countOverlaps() which I think gets
at the
> >>>>>>>> counting you want.
> >>>>>>>>
> >>>>>>>> counter <- function(x, y, ignore.strand)
> >>>>>>>> countOverlaps(y, x, ignore.strand=ignore.strand)
> >>>>>>>>
> >>>>>>>>> assays(summarizeOverlaps(gr, rd, mode=counter))$counts
> >>>>>>>> [,1]
> >>>>>>>> [1,] 1
> >>>>>>>> [2,] 1
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Valerie
> >>>>>>>>
> >>>>>>>> On 04/09/2013 09:37 AM, Thomas Girke wrote:
> >>>>>>>>> Hi Valerie,
> >>>>>>>>>
> >>>>>>>>> Perhaps let's call this more explicitly an
ignore_inter_feature_overlap option
> >>>>>>>>> that we often need for cases like this:
> >>>>>>>>>
> >>>>>>>>> Read1 ----
> >>>>>>>>> F1 ----------------
> >>>>>>>>> F2 -----------
> >>>>>>>>>
> >>>>>>>>> where we would like to have at least the option to assign
Read1 to both F1 and F2:
> >>>>>>>>>
> >>>>>>>>> F1: 1
> >>>>>>>>> F2: 1
> >>>>>>>>>
> >>>>>>>>> However, summarizeOverlapse doesn't count Read1 at all in
all of its currently
> >>>>>>>>> available modes that I am aware of. This lack of an
ignore_inter_feature_overlap
> >>>>>>>>> option is frequently a problem when working with poorly
annotated genomes (high
> >>>>>>>>> uncertainty about hidden/incorrect feature overlaps) or
various
> >>>>>>>>> RNA/ChIP-Seq _engineering_ projects where I rather take
the risk of ambiguous read
> >>>>>>>>> assignments than not counting at all as shown above.
> >>>>>>>>>
> >>>>>>>>> ## Here is a code example illustrating the same case:
> >>>>>>>>> library(GenomicRanges); library(Rsamtools)
> >>>>>>>>> rd <- GappedAlignments(letters[1], seqnames =
Rle(rep("chr1",1)),
> >>>>>>>>> pos = as.integer(c(500)),
> >>>>>>>>> cigar = rep("100M", 1), strand = strand(rep("+",
1)))
> >>>>>>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001),
strand="+", ID="feat1")
> >>>>>>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101),
strand="+", ID="feat2")
> >>>>>>>>> gr <- c(gr1, gr2)
> >>>>>>>>>
> >>>>>>>>> ## All of the three current modes in summarizeOverlaps
return a count of zero
> >>>>>>>>> ## because of its inter_feature_overlap awareness:
> >>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="Union",
ignore.strand=TRUE))$counts
> >>>>>>>>> assays(summarizeOverlaps(gr, rd,
mode="IntersectionStrict", ignore.strand=TRUE))$counts
> >>>>>>>>> assays(summarizeOverlaps(gr, rd,
mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts
> >>>>>>>>> [,1]
> >>>>>>>>> [1,] 0
> >>>>>>>>> [2,] 0
> >>>>>>>>>
> >>>>>>>>> ## However, countOverlaps handles this correctly, but is
not the best choice when
> >>>>>>>>> ## counting multiple range features.
> >>>>>>>>> countOverlaps(gr, rd)
> >>>>>>>>> [1] 1 1
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thomas
> >>>>>>>>>
> >>>>>>>>>> sessionInfo()
> >>>>>>>>> R version 3.0.0 (2013-04-03)
> >>>>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
> >>>>>>>>>
> >>>>>>>>> locale:
> >>>>>>>>> [1]
en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> >>>>>>>>>
> >>>>>>>>> attached base packages:
> >>>>>>>>> [1] parallel stats graphics grDevices utils
datasets methods
> >>>>>>>>> [8] base
> >>>>>>>>>
> >>>>>>>>> other attached packages:
> >>>>>>>>> [1] Rsamtools_1.12.0 Biostrings_2.28.0
GenomicRanges_1.12.1
> >>>>>>>>> [4] IRanges_1.18.0 BiocGenerics_0.6.0
> >>>>>>>>>
> >>>>>>>>> loaded via a namespace (and not attached):
> >>>>>>>>> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0
zlibbioc_1.6.0
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie
Obenchain wrote:
> >>>>>>>>>> Hi Thomas,
> >>>>>>>>>>
> >>>>>>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote:
> >>>>>>>>>>> Dear Valerie,
> >>>>>>>>>>>
> >>>>>>>>>>> Is there currently any way to run summarizeOverlaps in a
feature-overlap
> >>>>>>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE
setting? Currently,
> >>>>>>>>>>> one can switch back to countOverlaps when feature
overlap unawareness is
> >>>>>>>>>>> the more appropriate counting mode for a biological
question, but then
> >>>>>>>>>>> double counting of reads mapping to multiple-range
features is not
> >>>>>>>>>>> accounted for. It would be really nice to have such a
feature-overlap
> >>>>>>>>>>> unaware option directly in summarizeOverlaps.
> >>>>>>>>>>
> >>>>>>>>>> No, we don't currently have an option to ignore feature-
overlap. It
> >>>>>>>>>> sounds like you want to count with countOverlaps() but
still want the
> >>>>>>>>>> double counting to be resolved. This is essentially what
the other modes
> >>>>>>>>>> are doing so I must be missing something.
> >>>>>>>>>>
> >>>>>>>>>> In this example 2 reads hit feature A, 1 read hits
feature B. With
> >>>>>>>>>> something like ignorefeature0L=TRUE, what results would
you expect to
> >>>>>>>>>> see? Maybe you have another, more descriptive example?
> >>>>>>>>>>
> >>>>>>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3))
> >>>>>>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10,
> >>>>>>>>>> names=c("A", "B")))
> >>>>>>>>>>
> >>>>>>>>>>> countOverlaps(features, reads)
> >>>>>>>>>> [1] 2 1
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Another question relates to the memory usage of
summarizeOverlaps. Has
> >>>>>>>>>>> this been optimized yet? On a typical bam file with
~50-100 million
> >>>>>>>>>>> reads the memory usage of summarizeOverlaps is often
around 10-20GB. To
> >>>>>>>>>>> use the function on a desktop computer or in large-scale
RNA-Seq
> >>>>>>>>>>> projects on a commodity compute cluster, it would be
desirable if every
> >>>>>>>>>>> counting instance would consume not more than 5GB of
RAM.
> >>>>>>>>>>
> >>>>>>>>>> Have you tried the BamFileList-method? There is an
example at the bottom
> >>>>>>>>>> of the ?BamFileList man page using summarizeOverlaps().
As Ryan
> >>>>>>>>>> mentioned, the key is to set the 'yieldSize' parameter
when creating the
> >>>>>>>>>> BamFile. This method also makes use of mclapply().
> >>>>>>>>>>
> >>>>>>>>>> Valerie
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks in advance for your help and suggestions,
> >>>>>>>>>>>
> >>>>>>>>>>> Thomas
> >>>>>>>>>>>
> >>>>>>>>>>> _______________________________________________
> >>>>>>>>>>> Bioconductor mailing list
> >>>>>>>>>>> Bioconductor at r-project.org
> >>>>>>>>>>>
https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>>>>>>>>>> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>>>>>>>>>
> >>>
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at r-project.org
> >>>
https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>
> >>
> >>
> >> --
> >> Computational Biology / Fred Hutchinson Cancer Research Center
> >> 1100 Fairview Ave. N.
> >> PO Box 19024 Seattle, WA 98109
> >>
> >> Location: Arnold Building M1 B861
> >> Phone: (206) 667-2793
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> >
https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
______________________________________________________________________
> The information in this email is confidential and
inte...{{dropped:4}}