Determining an overlapping annotation data subset (overlap/overlaps)
@stephen-montgomery-2305
Hello Bioconductor,

Apologies, as this is a fairly rookie bioinformatics-based R question, but I am trying to determine whether there is an R one-liner to extract the subset of a data frame whose rows contain annotation stored in another data frame (for example, extracting the genomic intervals that contain certain features/annotation).

Such that: if I have data frame "A" with columns "id", "start", and "end", and data frame "B" also with "id", "start", and "end", the output is all the rows of A that contain an entry of B, i.e. (B$start, B$end) lies within A$start and A$end.

I have tried my own fairly uninformed variants like this, to no avail:

    A[length(B[B$start <= A$end & B$end >= A$start]) > 0,]

I fear the solution will be trivial, but as yet it has eluded me. :/

Thanks for any help! (Theoretically, I can also see doing this in its own function by creating a vector of counts for each member of "A" and then reporting those that are non-zero, but I was wondering whether there is a more succinct and likely more efficient way.)

Thanks again,
Stephen

Stephen Montgomery, B.A.Sc., Ph.D.
Postdoctoral Researcher, Team 16
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA
alex lam RI ▴ 310
@alex-lam-ri-1491
Hi Stephen,

I don't know if this does what you want (and it isn't a one-liner), but here it is anyway:

    > a <- data.frame(id=1:4, start=seq(10, 40, 10), end=seq(15, 45, 10))
    > b <- data.frame(id=5:8, start=c(11,24,44,55), end=c(14,26,45,57))
    > a   # large sequence features
      id start end
    1  1    10  15
    2  2    20  25
    3  3    30  35
    4  4    40  45
    > b   # smaller sequence features
      id start end
    1  5    11  14
    2  6    24  26
    3  7    44  45
    4  8    55  57
    > bool.matrix <- NULL
    > for(i in 1:nrow(b)) {bool.matrix <- rbind(bool.matrix, b$start[i] >= a$start & b$end[i] <= a$end)}
    > colnames(bool.matrix) <- a$id
    > rownames(bool.matrix) <- b$id
    > bool.matrix
          1     2     3     4
    5  TRUE FALSE FALSE FALSE
    6 FALSE FALSE FALSE FALSE
    7 FALSE FALSE FALSE  TRUE
    8 FALSE FALSE FALSE FALSE

Cheers,
Alex

Alex Lam
Roslin Institute (Edinburgh)
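The loop above can also be written without an explicit loop. What follows is a minimal sketch, not part of the original reply, that uses base R's `outer()` to build the same containment test (here with rows for `a` and columns for `b`) and then keeps the rows of `a` that contain at least one interval of `b`. The names `contains` and `keep` are illustrative only.

```r
# Same toy data as in the reply above.
a <- data.frame(id = 1:4, start = seq(10, 40, 10), end = seq(15, 45, 10))
b <- data.frame(id = 5:8, start = c(11, 24, 44, 55), end = c(14, 26, 45, 57))

# contains[i, j] is TRUE when interval j of b lies inside interval i of a.
contains <- outer(a$start, b$start, "<=") & outer(a$end, b$end, ">=")
dimnames(contains) <- list(a$id, b$id)

# Keep the rows of a that contain at least one interval of b.
keep <- rowSums(contains) > 0
a[keep, ]
```

Note that `outer()` allocates an nrow(a) by nrow(b) matrix, so this is only reasonable when both tables are small enough for that matrix to fit in memory.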
@herve-pages-1542
Seattle, WA, United States
Hi Stephen,

    > A <- data.frame(start=(1:5)*10L, end=(4:8)*10L)
    > A
      start end
    1    10  40
    2    20  50
    3    30  60
    4    40  70
    5    50  80

    > B <- data.frame(start=c(31L, 39L, 80L), end=c(60L, 40L, 84L))
    > B
      start end
    1    31  60
    2    39  40
    3    80  84

You can create a logical vector with one element per row of A; for each A-row it says whether there is any B-row inside it:

    contains_a_Brow <- mapply(function(Astart, Aend) any(Astart <= B$start & B$end <= Aend),
                              A$start, A$end)

Then use this logical vector to subset A:

    A[contains_a_Brow, ]

Cheers,
H.
Herve Pages wrote:
> You can create a logical vector of the length the number of rows in A: for each
> A-row it says if there is any B-row inside:
>
>     contains_a_Brow <- mapply(function(Astart, Aend) any(Astart <= B$start & B$end <= Aend),
>                               A$start, A$end)

This will be TRUE for A-rows that have at least one B-row within their limits. To select the A-rows that _overlap_ with at least one B-row, use:

    contains_a_Brow <- mapply(function(Astart, Aend) any(Astart <= B$end & B$start <= Aend),
                              A$start, A$end)

H.
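To see how the two conditions differ on the example data, here is a small sketch, assuming the same A and B as in the answer above. The helper `any_match` and its `type` argument are hypothetical names introduced only for this illustration; the two tests inside it are the ones given in the thread.

```r
A <- data.frame(start = (1:5) * 10L, end = (4:8) * 10L)
B <- data.frame(start = c(31L, 39L, 80L), end = c(60L, 40L, 84L))

# Hypothetical helper: for each row of A, is there any row of B that is
# contained in it ("within") or that merely overlaps it ("overlap")?
any_match <- function(A, B, type = c("within", "overlap")) {
  type <- match.arg(type)
  vapply(seq_len(nrow(A)), function(i) {
    if (type == "within") {
      any(A$start[i] <= B$start & B$end <= A$end[i])
    } else {
      any(A$start[i] <= B$end & B$start <= A$end[i])
    }
  }, logical(1))
}

within_any  <- any_match(A, B, "within")
overlap_any <- any_match(A, B, "overlap")

A[within_any, ]   # rows 1 to 3: each fully contains some B-row
A[overlap_any, ]  # all five rows: each overlaps some B-row
```

Both tests treat the intervals as closed, so an A-row and a B-row that only share an endpoint (such as A-row 5 ending at 80 and B-row 3 starting at 80) count as overlapping.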