Question: Getting first or last hit from a Hits object
0
gravatar for Peter Hickey
5.8 years ago by
Peter Hickey450
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Peter Hickey450 wrote:
Hi all, Suppose I've created a Hits object by all_hits <- findOverlaps(query, subject, select = 'all', type = 'any') and now I want to get the first (or last) hit for each query. Of course, I could use first_hits <- findOverlaps(query, subject, select = 'first', type = 'any') last_hits <- findOverlaps(query, subject, select = 'last', type = 'any') but I this isn't necessary as all the required information is in the all_hits Hits object. I was unable to find a function to do what I wanted, however, based at the source for findOverlaps-methods.R (actually Hector's branch at https://github.com/hcorrada/GenomicRanges/blob/master/R/findOverlaps- methods.R) I wrote the following simple functions getFirstHit <- function(all_hits){ ans <- rep.int(NA_integer_, queryLength(all_hits)) oo <- IRanges:::orderIntegerPairs(queryHits(all_hits), subjectHits(all_hits), decreasing = TRUE) ans[queryHits(all_hits)[oo]] <- subjectHits(all_hits)[oo] return(ans) } getLastHit <- function(all_hits){ ans <- rep.int(NA_integer_, queryLength(all_hits)) ans[queryHits(all_hits)] <- subjectHits(all_hits) return(ans) } These do what I want and are faster than re-running findOverlaps for my use case (millions of query intervals with tens of thousands of subject intervals) Now to my questions: (1) Did I just overlook built-in "getFirstHit" and "getLastHit" functions? (2) Would others find "getFirstHit" and "getLastHit" useful enough to justify their inclusion in GenomicRanges (or IRanges, wherever it belongs)? (3) I recall some discussion on this mailing list and Rdevel stating that using ":::" to access non-exported functions isn't a great idea for use in package code. If I need to include my own "getFirstHit" would someone be so kind as to point me to the best advice for avoiding calls to ":::" in my code to access IRanges:::orderIntegerPairs(queryHits)? Many thanks, Pete -------------------------------- Peter Hickey, PhD Student/Research Assistant, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Ph: +613 9345 2324 hickey@wehi.edu.au http://www.wehi.edu.au ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:8}}
iranges genomicranges • 734 views
ADD COMMENTlink modified 5.8 years ago by Michael Lawrence11k • written 5.8 years ago by Peter Hickey450
Answer: Getting first or last hit from a Hits object
0
gravatar for Harris A. Jaffee
5.8 years ago by
United States
Harris A. Jaffee590 wrote:
I needed essentially your "last_hits" for bumphunter:::fuzzy.match2 In your notation, it comes out: last_hits_rows <- cumsum(rle(queryHits(all_hits))$lengths) last_hits <- all_hits[last_hits_rows,] Having expended that effort, you could do first_hits_rows <- c(1, 1 + last_hits_rows[-length(query)]) Maybe there's a better way. I have not compared these to anything of Hector's. On Oct 20, 2013, at 4:32 PM, Peter Hickey <hickey at="" wehi.edu.au=""> wrote: > Hi all, > > Suppose I've created a Hits object by > all_hits <- findOverlaps(query, subject, select = 'all', type = 'any') > and now I want to get the first (or last) hit for each query. Of course, I could use > first_hits <- findOverlaps(query, subject, select = 'first', type = 'any') > last_hits <- findOverlaps(query, subject, select = 'last', type = 'any') > but I this isn't necessary as all the required information is in the all_hits Hits object. > > I was unable to find a function to do what I wanted, however, based at the source for findOverlaps-methods.R (actually Hector's branch at https://github.com/hcorrada/GenomicRanges/blob/master/R/findOverlaps- methods.R) I wrote the following simple functions > getFirstHit <- function(all_hits){ > ans <- rep.int(NA_integer_, queryLength(all_hits)) > oo <- IRanges:::orderIntegerPairs(queryHits(all_hits), subjectHits(all_hits), decreasing = TRUE) > ans[queryHits(all_hits)[oo]] <- subjectHits(all_hits)[oo] > return(ans) > } > getLastHit <- function(all_hits){ > ans <- rep.int(NA_integer_, queryLength(all_hits)) > ans[queryHits(all_hits)] <- subjectHits(all_hits) > return(ans) > } > > These do what I want and are faster than re-running findOverlaps for my use case (millions of query intervals with tens of thousands of subject intervals) > > Now to my questions: > (1) Did I just overlook built-in "getFirstHit" and "getLastHit" functions? > (2) Would others find "getFirstHit" and "getLastHit" useful enough to justify their inclusion in GenomicRanges (or IRanges, wherever it belongs)? > (3) I recall some discussion on this mailing list and Rdevel stating that using ":::" to access non-exported functions isn't a great idea for use in package code. If I need to include my own "getFirstHit" would someone be so kind as to point me to the best advice for avoiding calls to ":::" in my code to access IRanges:::orderIntegerPairs(queryHits)? > > Many thanks, > Pete > -------------------------------- > Peter Hickey, > PhD Student/Research Assistant, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Ph: +613 9345 2324 > > hickey at wehi.edu.au > http://www.wehi.edu.au > > > ______________________________________________________________________ > The information in this email is confidential and intend...{{dropped:8}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENTlink written 5.8 years ago by Harris A. Jaffee590
Answer: Getting first or last hit from a Hits object
0
gravatar for Michael Lawrence
5.8 years ago by
United States
Michael Lawrence11k wrote:
Thanks for providing this use case. It's actually documented that the Hits are sorted by query. So using that assumption, an Rle-based approach like that provided by Harris is probably the most efficient. But I would use Rle from IRanges instead of the base "rle" class, because it offers more conveniences; for example, the code is pretty intuitive for this use case: queryHits_rle <- Rle(queryHits(all_hits)) first_hits <- all_hits[start(queryHits_rle)] last_hits <- all_hits[end(queryHits_rle)] Michael On Sun, Oct 20, 2013 at 1:32 PM, Peter Hickey <hickey@wehi.edu.au> wrote: > Hi all, > > Suppose I've created a Hits object by > all_hits <- findOverlaps(query, subject, select = 'all', type = 'any') > and now I want to get the first (or last) hit for each query. Of course, I > could use > first_hits <- findOverlaps(query, subject, select = 'first', type = 'any') > last_hits <- findOverlaps(query, subject, select = 'last', type = 'any') > but I this isn't necessary as all the required information is in the > all_hits Hits object. > > I was unable to find a function to do what I wanted, however, based at the > source for findOverlaps-methods.R (actually Hector's branch at > https://github.com/hcorrada/GenomicRanges/blob/master/R /findOverlaps-methods.R) > I wrote the following simple functions > getFirstHit <- function(all_hits){ > ans <- rep.int(NA_integer_, queryLength(all_hits)) > oo <- IRanges:::orderIntegerPairs(queryHits(all_hits), > subjectHits(all_hits), decreasing = TRUE) > ans[queryHits(all_hits)[oo]] <- subjectHits(all_hits)[oo] > return(ans) > } > getLastHit <- function(all_hits){ > ans <- rep.int(NA_integer_, queryLength(all_hits)) > ans[queryHits(all_hits)] <- subjectHits(all_hits) > return(ans) > } > > These do what I want and are faster than re-running findOverlaps for my > use case (millions of query intervals with tens of thousands of subject > intervals) > > Now to my questions: > (1) Did I just overlook built-in "getFirstHit" and "getLastHit" functions? > (2) Would others find "getFirstHit" and "getLastHit" useful enough to > justify their inclusion in GenomicRanges (or IRanges, wherever it belongs)? > (3) I recall some discussion on this mailing list and Rdevel stating that > using ":::" to access non-exported functions isn't a great idea for use in > package code. If I need to include my own "getFirstHit" would someone be so > kind as to point me to the best advice for avoiding calls to ":::" in my > code to access IRanges:::orderIntegerPairs(queryHits)? > > Many thanks, > Pete > -------------------------------- > Peter Hickey, > PhD Student/Research Assistant, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Ph: +613 9345 2324 > > hickey@wehi.edu.au > http://www.wehi.edu.au > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:13}}
ADD COMMENTlink written 5.8 years ago by Michael Lawrence11k
Thanks to Harris, Chuck and Michael for their solutions. Much appreciated. Pete On 21/10/2013, at 3:25 PM, Michael Lawrence wrote: > Thanks for providing this use case. > > It's actually documented that the Hits are sorted by query. So using that assumption, an Rle-based approach like that provided by Harris is probably the most efficient. But I would use Rle from IRanges instead of the base "rle" class, because it offers more conveniences; for example, the code is pretty intuitive for this use case: > > queryHits_rle <- Rle(queryHits(all_hits)) > first_hits <- all_hits[start(queryHits_rle)] > last_hits <- all_hits[end(queryHits_rle)] > > Michael > > > > On Sun, Oct 20, 2013 at 1:32 PM, Peter Hickey <hickey@wehi.edu.au> wrote: > Hi all, > > Suppose I've created a Hits object by > all_hits <- findOverlaps(query, subject, select = 'all', type = 'any') > and now I want to get the first (or last) hit for each query. Of course, I could use > first_hits <- findOverlaps(query, subject, select = 'first', type = 'any') > last_hits <- findOverlaps(query, subject, select = 'last', type = 'any') > but I this isn't necessary as all the required information is in the all_hits Hits object. > > I was unable to find a function to do what I wanted, however, based at the source for findOverlaps-methods.R (actually Hector's branch at https://github.com/hcorrada/GenomicRanges/blob/master/R/findOverlaps- methods.R) I wrote the following simple functions > getFirstHit <- function(all_hits){ > ans <- rep.int(NA_integer_, queryLength(all_hits)) > oo <- IRanges:::orderIntegerPairs(queryHits(all_hits), subjectHits(all_hits), decreasing = TRUE) > ans[queryHits(all_hits)[oo]] <- subjectHits(all_hits)[oo] > return(ans) > } > getLastHit <- function(all_hits){ > ans <- rep.int(NA_integer_, queryLength(all_hits)) > ans[queryHits(all_hits)] <- subjectHits(all_hits) > return(ans) > } > > These do what I want and are faster than re-running findOverlaps for my use case (millions of query intervals with tens of thousands of subject intervals) > > Now to my questions: > (1) Did I just overlook built-in "getFirstHit" and "getLastHit" functions? > (2) Would others find "getFirstHit" and "getLastHit" useful enough to justify their inclusion in GenomicRanges (or IRanges, wherever it belongs)? > (3) I recall some discussion on this mailing list and Rdevel stating that using ":::" to access non-exported functions isn't a great idea for use in package code. If I need to include my own "getFirstHit" would someone be so kind as to point me to the best advice for avoiding calls to ":::" in my code to access IRanges:::orderIntegerPairs(queryHits)? > > Many thanks, > Pete > -------------------------------- > Peter Hickey, > PhD Student/Research Assistant, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Ph: +613 9345 2324 > > hickey@wehi.edu.au > http://www.wehi.edu.au > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:31}}
ADD REPLYlink written 5.8 years ago by Peter Hickey450
Answer: Getting first or last hit from a Hits object
0
gravatar for Charles Berry
5.8 years ago by
Charles Berry290
United States
Charles Berry290 wrote:
Peter Hickey <hickey at="" ...=""> writes: > > Hi all, > > Suppose I've created a Hits object by > all_hits <- findOverlaps(query, subject, select = 'all', type = 'any') > and now I want to get the first (or last) hit for each query. Of course, > [...] gr1 <- GRanges(rep(letters,each=10), IRanges(start=sample(1e6,260),width=20)) hit1 <- findOverlaps( gr1, gr1, maxgap=1e5) hit1 # show what hit1 is hit1[!duplicated(queryHits(hit1))] # only the first hit for each query value hit1[!duplicated(queryHits(hit1),fromLast=TRUE)] # only last hits HTH, Chuck
ADD COMMENTlink written 5.8 years ago by Charles Berry290
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 160 users visited in the last hour