Entering edit mode
Hi Karolis,
I hope you don't mind that I've cc'd the Bioc help list for this
reply. The
"arbitrary" select mode is not meant to be random; it's deterministic
and
is just whatever the algorithm finds first (not necessarily according
to
the order in the subject). Not all of the select modes are supported
for
every combination of object types. In your "real world" example, you
have a
GAlignments and a GRangesList. For that particular combination, only
"all"
and "first" are supported.
To solve your problem, you should probably use "all" and then select
randomly from the resulting Hits object. Something like this:
hits <- findOverlaps(bam_aligns, exonRanges)
perm <- sample(length(hits))
features <- rep(NA_integer_, queryLength(hits))
features[queryHits(hits)[perm]] <- subjectHits(hits)[perm]
Then see tabulate() for counting the reads for each feature. You might
also
want to look into summarizeOverlaps() and its custom function mode
feature.
Michael
On Mon, Jun 9, 2014 at 2:39 AM, Karolis Uziela
<karolis.uziela@scilifelab.se> wrote:
> Dear Mr. Michael Lawrence,
>
> I am using findOverlaps function from GenomicRanges to assign reads
to
> custom features in reference genome. I am concerned about the cases
where a
> read overlaps with several features. In this case, I would like the
read to
> be assigned exactly to one feature which is picked at random from
all
> overlapping ones.
>
> According to the manual, this behaviour can be controlled using
"select"
> parameter. If select="first", the value of the findOverlaps should
be a
> vector of query length, containing the index of the first
overlapping
> interval in the subject. Analogically, if select="last", the
function
> should return the indices of last overlapping interval. Finally, if
> select="arbitrary", the function should return the indices of
"arbitrary"
> overlapping interval, but it is not explained clearly what
"arbitrary"
> means. I assumed that "arbitrary" option will return an index of an
> interval picked at random from all overlapping ones, which is the
behaviour
> that I want, however, it does not seem to be the case. From a toy
example
> (see EXAMPLE 1), it seems that select="arbitrary" always returns the
same
> value as select="last". Moreover, in a real world example (see
EXAMPLE 2),
> where query is human genes and subject is reads scanned from a
sample bam
> file (attached), it seems that "arbitrary" option does not work at
all.
>
> Could you, please, explain what was the supposed behaviour of
> select="arbitrary" function and why it does not work in a real-world
> example? Moreover, can you give me advice, on how to achieve the
behaviour
> that I want?
>
> Regards,
> Karolis Uziela
>
> *============= EXAMPLE 1 ==============*
> library(GenomicRanges)
> > query <- IRanges(c(1, 4, 9), c(5, 7, 10))
> > subject <- IRanges(c(2, 2, 10), c(2, 3, 12))
> > findOverlaps(query, subject, select="last")
> [1] 2 NA 3
> > findOverlaps(query, subject, select="first")
> [1] 1 NA 3
> > findOverlaps(query, subject, select="arbitrary")
> [1] 2 NA 3
> > findOverlaps(query, subject, select="arbitrary")
> [1] 2 NA 3
> > findOverlaps(query, subject, select="arbitrary")
> [1] 2 NA 3
>
> *============= EXAMPLE 2 ==============*
> library(GenomicFeatures)
> library(GenomicRanges)
> library(Rsamtools)
> txdb <- makeTranscriptDbFromBiomart(biomart="ensembl",
> dataset="hsapiens_gene_ensembl")
> exonRanges <- exonsBy(txdb, "gene")
> bam_aligns <- readBamGappedAlignments("input1.bam")
> counts <- findOverlaps(bam_aligns, exonRanges, select="arbitrary")
>
> Error in match.arg(select) : 'arg' should be one of âallâ,
âfirstâ
> *============= Session info ==============*
> > sessionInfo()
> R version 2.14.1 (2011-12-22)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Rsamtools_1.6.3 Biostrings_2.22.0
GenomicFeatures_1.6.9
> [4] AnnotationDbi_1.16.19 Biobase_2.14.0 GenomicRanges_1.6.7
> [7] IRanges_1.12.6 BiocInstaller_1.2.1
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.10.0 bitops_1.0-5 BSgenome_1.22.0
DBI_0.2-5
>
> [5] RCurl_1.95-3 RSQLite_0.11.2 rtracklayer_1.14.4
tools_2.14.1
>
> [9] XML_3.95-0.1 zlibbioc_1.0.1
>
>
> --
>
> Karolis Uziela
>
> PhD student
> Science for Life Laboratory
> Box 1031
> 17121 Solna, Sweden
> Mob. +46 729 120 395
>
[[alternative HTML version deleted]]