Question

GenomicRanges select parameter

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.0 years ago

United States

Hi Karolis, I hope you don't mind that I've cc'd the Bioc help list for this reply. The "arbitrary" select mode is not meant to be random; it's deterministic and is just whatever the algorithm finds first (not necessarily according to the order in the subject). Not all of the select modes are supported for every combination of object types. In your "real world" example, you have a GAlignments and a GRangesList. For that particular combination, only "all" and "first" are supported. To solve your problem, you should probably use "all" and then select randomly from the resulting Hits object. Something like this: hits <- findOverlaps(bam_aligns, exonRanges) perm <- sample(length(hits)) features <- rep(NA_integer_, queryLength(hits)) features[queryHits(hits)[perm]] <- subjectHits(hits)[perm] Then see tabulate() for counting the reads for each feature. You might also want to look into summarizeOverlaps() and its custom function mode feature. Michael On Mon, Jun 9, 2014 at 2:39 AM, Karolis Uziela <karolis.uziela@scilifelab.se> wrote: > Dear Mr. Michael Lawrence, > > I am using findOverlaps function from GenomicRanges to assign reads to > custom features in reference genome. I am concerned about the cases where a > read overlaps with several features. In this case, I would like the read to > be assigned exactly to one feature which is picked at random from all > overlapping ones. > > According to the manual, this behaviour can be controlled using "select" > parameter. If select="first", the value of the findOverlaps should be a > vector of query length, containing the index of the first overlapping > interval in the subject. Analogically, if select="last", the function > should return the indices of last overlapping interval. Finally, if > select="arbitrary", the function should return the indices of "arbitrary" > overlapping interval, but it is not explained clearly what "arbitrary" > means. I assumed that "arbitrary" option will return an index of an > interval picked at random from all overlapping ones, which is the behaviour > that I want, however, it does not seem to be the case. From a toy example > (see EXAMPLE 1), it seems that select="arbitrary" always returns the same > value as select="last". Moreover, in a real world example (see EXAMPLE 2), > where query is human genes and subject is reads scanned from a sample bam > file (attached), it seems that "arbitrary" option does not work at all. > > Could you, please, explain what was the supposed behaviour of > select="arbitrary" function and why it does not work in a real-world > example? Moreover, can you give me advice, on how to achieve the behaviour > that I want? > > Regards, > Karolis Uziela > > *============= EXAMPLE 1 ==============* > library(GenomicRanges) > > query <- IRanges(c(1, 4, 9), c(5, 7, 10)) > > subject <- IRanges(c(2, 2, 10), c(2, 3, 12)) > > findOverlaps(query, subject, select="last") > [1] 2 NA 3 > > findOverlaps(query, subject, select="first") > [1] 1 NA 3 > > findOverlaps(query, subject, select="arbitrary") > [1] 2 NA 3 > > findOverlaps(query, subject, select="arbitrary") > [1] 2 NA 3 > > findOverlaps(query, subject, select="arbitrary") > [1] 2 NA 3 > > *============= EXAMPLE 2 ==============* > library(GenomicFeatures) > library(GenomicRanges) > library(Rsamtools) > txdb <- makeTranscriptDbFromBiomart(biomart="ensembl", > dataset="hsapiens_gene_ensembl") > exonRanges <- exonsBy(txdb, "gene") > bam_aligns <- readBamGappedAlignments("input1.bam") > counts <- findOverlaps(bam_aligns, exonRanges, select="arbitrary") > > Error in match.arg(select) : 'arg' should be one of âallâ, âfirstâ > *============= Session info ==============* > > sessionInfo() > R version 2.14.1 (2011-12-22) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Rsamtools_1.6.3 Biostrings_2.22.0 GenomicFeatures_1.6.9 > [4] AnnotationDbi_1.16.19 Biobase_2.14.0 GenomicRanges_1.6.7 > [7] IRanges_1.12.6 BiocInstaller_1.2.1 > > loaded via a namespace (and not attached): > [1] biomaRt_2.10.0 bitops_1.0-5 BSgenome_1.22.0 DBI_0.2-5 > > [5] RCurl_1.95-3 RSQLite_0.11.2 rtracklayer_1.14.4 tools_2.14.1 > > [9] XML_3.95-0.1 zlibbioc_1.0.1 > > > -- > > Karolis Uziela > > PhD student > Science for Life Laboratory > Box 1031 > 17121 Solna, Sweden > Mob. +46 729 120 395 > [[alternative HTML version deleted]]

GenomicRanges ASSIGN GenomicRanges ASSIGN • 1.6k views

ADD COMMENT • link updated 10.5 years ago by Karolis Uziela ▴ 30 • written 10.5 years ago by Michael Lawrence ★ 11k

score 0 · Answer 1 · 2014-06-09

Import your data with readGAlignmentPairs(), then things should work as desired. And you probably always want ignore.strand=TRUE, regardless of whether the reads are paired or not. Unless your data were generated by a strand-specifc protocol. On Mon, Jun 9, 2014 at 8:45 AM, Karolis Uziela <karolis.uziela@scilifelab.se> wrote: > Hi again, > > Actually, I have one more question. Let's assume, I have paired ended > reads and I read them using command: > > bam_aligns <- readGAlignmentsFromBam(bam_file) > > Then, I use your approach to count the reads for each gene, but only > adding the option ignore.strand=TRUE to findOverlaps function: > > hits <- findOverlaps(bam_aligns, exonRanges, ignore.strand=TRUE) > perm <- sample(length(hits)) > features <- rep(NA_integer_, queryLength(hits)) > features[queryHits(hits)[perm]] <- subjectHits(hits)[perm] > counts <- tabulate(features) > > If both of the mates in a paired ended read will hit some gene, will that > gene be double counted or not? If yes, is there a way, to treat the two > mates of a paired-ended read as a single entity and count it only once? > > Karolis > > > > On Mon, Jun 9, 2014 at 6:20 PM, Karolis Uziela < > karolis.uziela@scilifelab.se> wrote: > >> Thanks Michael, that's a very smart solution! This is exactly what I >> wanted! :-) >> >> Regards, >> Karolis >> >> >> On Mon, Jun 9, 2014 at 5:05 PM, Michael Lawrence < >> lawrence.michael@gene.com> wrote: >> >>> Hi Karolis, >>> >>> I hope you don't mind that I've cc'd the Bioc help list for this reply. >>> The "arbitrary" select mode is not meant to be random; it's deterministic >>> and is just whatever the algorithm finds first (not necessarily according >>> to the order in the subject). Not all of the select modes are supported for >>> every combination of object types. In your "real world" example, you have a >>> GAlignments and a GRangesList. For that particular combination, only "all" >>> and "first" are supported. >>> >>> To solve your problem, you should probably use "all" and then select >>> randomly from the resulting Hits object. Something like this: >>> >>> hits <- findOverlaps(bam_aligns, exonRanges) >>> perm <- sample(length(hits)) >>> features <- rep(NA_integer_, queryLength(hits)) >>> features[queryHits(hits)[perm]] <- subjectHits(hits)[perm] >>> >>> Then see tabulate() for counting the reads for each feature. You might >>> also want to look into summarizeOverlaps() and its custom function mode >>> feature. >>> >>> Michael >>> >>> On Mon, Jun 9, 2014 at 2:39 AM, Karolis Uziela < >>> karolis.uziela@scilifelab.se> wrote: >>> >>>> Dear Mr. Michael Lawrence, >>>> >>>> I am using findOverlaps function from GenomicRanges to assign reads to >>>> custom features in reference genome. I am concerned about the cases where a >>>> read overlaps with several features. In this case, I would like the read to >>>> be assigned exactly to one feature which is picked at random from all >>>> overlapping ones. >>>> >>>> According to the manual, this behaviour can be controlled using >>>> "select" parameter. If select="first", the value of the findOverlaps should >>>> be a vector of query length, containing the index of the first overlapping >>>> interval in the subject. Analogically, if select="last", the function >>>> should return the indices of last overlapping interval. Finally, if >>>> select="arbitrary", the function should return the indices of "arbitrary" >>>> overlapping interval, but it is not explained clearly what "arbitrary" >>>> means. I assumed that "arbitrary" option will return an index of an >>>> interval picked at random from all overlapping ones, which is the behaviour >>>> that I want, however, it does not seem to be the case. From a toy example >>>> (see EXAMPLE 1), it seems that select="arbitrary" always returns the same >>>> value as select="last". Moreover, in a real world example (see EXAMPLE 2), >>>> where query is human genes and subject is reads scanned from a sample bam >>>> file (attached), it seems that "arbitrary" option does not work at all. >>>> >>>> Could you, please, explain what was the supposed behaviour of >>>> select="arbitrary" function and why it does not work in a real- world >>>> example? Moreover, can you give me advice, on how to achieve the behaviour >>>> that I want? >>>> >>>> Regards, >>>> Karolis Uziela >>>> >>>> *============= EXAMPLE 1 ==============* >>>> library(GenomicRanges) >>>> > query <- IRanges(c(1, 4, 9), c(5, 7, 10)) >>>> > subject <- IRanges(c(2, 2, 10), c(2, 3, 12)) >>>> > findOverlaps(query, subject, select="last") >>>> [1] 2 NA 3 >>>> > findOverlaps(query, subject, select="first") >>>> [1] 1 NA 3 >>>> > findOverlaps(query, subject, select="arbitrary") >>>> [1] 2 NA 3 >>>> > findOverlaps(query, subject, select="arbitrary") >>>> [1] 2 NA 3 >>>> > findOverlaps(query, subject, select="arbitrary") >>>> [1] 2 NA 3 >>>> >>>> *============= EXAMPLE 2 ==============* >>>> library(GenomicFeatures) >>>> library(GenomicRanges) >>>> library(Rsamtools) >>>> txdb <- makeTranscriptDbFromBiomart(biomart="ensembl", >>>> dataset="hsapiens_gene_ensembl") >>>> exonRanges <- exonsBy(txdb, "gene") >>>> bam_aligns <- readBamGappedAlignments("input1.bam") >>>> counts <- findOverlaps(bam_aligns, exonRanges, select="arbitrary") >>>> >>>> Error in match.arg(select) : 'arg' should be one of âallâ, âfirstâ >>>> *============= Session info ==============* >>>> > sessionInfo() >>>> R version 2.14.1 (2011-12-22) >>>> Platform: x86_64-pc-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>> [7] LC_PAPER=C LC_NAME=C >>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] stats graphics grDevices utils datasets methods base >>>> >>>> >>>> other attached packages: >>>> [1] Rsamtools_1.6.3 Biostrings_2.22.0 GenomicFeatures_1.6.9 >>>> [4] AnnotationDbi_1.16.19 Biobase_2.14.0 GenomicRanges_1.6.7 >>>> [7] IRanges_1.12.6 BiocInstaller_1.2.1 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] biomaRt_2.10.0 bitops_1.0-5 BSgenome_1.22.0 DBI_0.2-5 >>>> >>>> [5] RCurl_1.95-3 RSQLite_0.11.2 rtracklayer_1.14.4 >>>> tools_2.14.1 >>>> [9] XML_3.95-0.1 zlibbioc_1.0.1 >>>> >>>> >>>> -- >>>> >>>> Karolis Uziela >>>> >>>> PhD student >>>> Science for Life Laboratory >>>> Box 1031 >>>> 17121 Solna, Sweden >>>> Mob. +46 729 120 395 >>>> >>> >>> >> >> >> -- >> >> Karolis Uziela >> >> PhD student >> Science for Life Laboratory >> Box 1031 >> 17121 Solna, Sweden >> Mob. +46 729 120 395 >> > > > > -- > > Karolis Uziela > > PhD student > Science for Life Laboratory > Box 1031 > 17121 Solna, Sweden > Mob. +46 729 120 395 > [[alternative HTML version deleted]]

score 0 · Answer 2 · 2014-06-09

Thanks Michael, that's a very smart solution! This is exactly what I wanted! :-) Regards, Karolis On Mon, Jun 9, 2014 at 5:05 PM, Michael Lawrence <lawrence.michael@gene.com> wrote: > Hi Karolis, > > I hope you don't mind that I've cc'd the Bioc help list for this reply. > The "arbitrary" select mode is not meant to be random; it's deterministic > and is just whatever the algorithm finds first (not necessarily according > to the order in the subject). Not all of the select modes are supported for > every combination of object types. In your "real world" example, you have a > GAlignments and a GRangesList. For that particular combination, only "all" > and "first" are supported. > > To solve your problem, you should probably use "all" and then select > randomly from the resulting Hits object. Something like this: > > hits <- findOverlaps(bam_aligns, exonRanges) > perm <- sample(length(hits)) > features <- rep(NA_integer_, queryLength(hits)) > features[queryHits(hits)[perm]] <- subjectHits(hits)[perm] > > Then see tabulate() for counting the reads for each feature. You might > also want to look into summarizeOverlaps() and its custom function mode > feature. > > Michael > > On Mon, Jun 9, 2014 at 2:39 AM, Karolis Uziela < > karolis.uziela@scilifelab.se> wrote: > >> Dear Mr. Michael Lawrence, >> >> I am using findOverlaps function from GenomicRanges to assign reads to >> custom features in reference genome. I am concerned about the cases where a >> read overlaps with several features. In this case, I would like the read to >> be assigned exactly to one feature which is picked at random from all >> overlapping ones. >> >> According to the manual, this behaviour can be controlled using "select" >> parameter. If select="first", the value of the findOverlaps should be a >> vector of query length, containing the index of the first overlapping >> interval in the subject. Analogically, if select="last", the function >> should return the indices of last overlapping interval. Finally, if >> select="arbitrary", the function should return the indices of "arbitrary" >> overlapping interval, but it is not explained clearly what "arbitrary" >> means. I assumed that "arbitrary" option will return an index of an >> interval picked at random from all overlapping ones, which is the behaviour >> that I want, however, it does not seem to be the case. From a toy example >> (see EXAMPLE 1), it seems that select="arbitrary" always returns the same >> value as select="last". Moreover, in a real world example (see EXAMPLE 2), >> where query is human genes and subject is reads scanned from a sample bam >> file (attached), it seems that "arbitrary" option does not work at all. >> >> Could you, please, explain what was the supposed behaviour of >> select="arbitrary" function and why it does not work in a real- world >> example? Moreover, can you give me advice, on how to achieve the behaviour >> that I want? >> >> Regards, >> Karolis Uziela >> >> *============= EXAMPLE 1 ==============* >> library(GenomicRanges) >> > query <- IRanges(c(1, 4, 9), c(5, 7, 10)) >> > subject <- IRanges(c(2, 2, 10), c(2, 3, 12)) >> > findOverlaps(query, subject, select="last") >> [1] 2 NA 3 >> > findOverlaps(query, subject, select="first") >> [1] 1 NA 3 >> > findOverlaps(query, subject, select="arbitrary") >> [1] 2 NA 3 >> > findOverlaps(query, subject, select="arbitrary") >> [1] 2 NA 3 >> > findOverlaps(query, subject, select="arbitrary") >> [1] 2 NA 3 >> >> *============= EXAMPLE 2 ==============* >> library(GenomicFeatures) >> library(GenomicRanges) >> library(Rsamtools) >> txdb <- makeTranscriptDbFromBiomart(biomart="ensembl", >> dataset="hsapiens_gene_ensembl") >> exonRanges <- exonsBy(txdb, "gene") >> bam_aligns <- readBamGappedAlignments("input1.bam") >> counts <- findOverlaps(bam_aligns, exonRanges, select="arbitrary") >> >> Error in match.arg(select) : 'arg' should be one of âallâ, âfirstâ >> *============= Session info ==============* >> > sessionInfo() >> R version 2.14.1 (2011-12-22) >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] Rsamtools_1.6.3 Biostrings_2.22.0 GenomicFeatures_1.6.9 >> [4] AnnotationDbi_1.16.19 Biobase_2.14.0 GenomicRanges_1.6.7 >> [7] IRanges_1.12.6 BiocInstaller_1.2.1 >> >> loaded via a namespace (and not attached): >> [1] biomaRt_2.10.0 bitops_1.0-5 BSgenome_1.22.0 DBI_0.2-5 >> >> [5] RCurl_1.95-3 RSQLite_0.11.2 rtracklayer_1.14.4 >> tools_2.14.1 >> [9] XML_3.95-0.1 zlibbioc_1.0.1 >> >> >> -- >> >> Karolis Uziela >> >> PhD student >> Science for Life Laboratory >> Box 1031 >> 17121 Solna, Sweden >> Mob. +46 729 120 395 >> > > -- Karolis Uziela PhD student Science for Life Laboratory Box 1031 17121 Solna, Sweden Mob. +46 729 120 395 [[alternative HTML version deleted]]