Determining an overlapping annotation data subset (overlap/overlaps)
@stephen-montgomery-2305
Last seen 7.8 years ago
Hello Bioconductor -

Apologies, as this is a fairly rookie bioinformatics-based R question, but I am trying to determine whether there is an R one-liner to extract the subset of a data frame whose intervals contain annotation stored in another data frame (for example, extracting genomic intervals which contain certain features/annotation).

Such that: if I have data frame "A" possessing an "id", "start", and "end", and data frame "B" also possessing an "id", "start", and "end", the output is all the rows of A which contain an entry of B (B$start, B$end) within A$start and A$end.

I have tried my own fairly uninformed variants like this, to no avail:

  A[length(B[B$start <= A$end & B$end >= A$start]) > 0,]

I fear the solution will be trivial but as yet it has eluded me. :/

Thanks for any help! (Theoretically, I can also see doing this in its own function by creating a vector of counts for each member of "A" and then reporting those that are non-zero, but I was wondering if there was a more succinct and likely more efficient way.)

Thanks again,
Stephen

Stephen Montgomery, B.A.Sc., Ph.D.
Postdoctoral Researcher, Team 16
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA
Phone: 44-1223-834244 (ext 7297)
Skype: stephen.b.montgomery
Annotation Annotation • 748 views
alex lam RI ▴ 310
@alex-lam-ri-1491
Last seen 7.8 years ago
Hi Stephen,

Don't know if it does what you want (and it isn't a one-liner), but here it is anyway:

> a<-data.frame(id=1:4, start=seq(10, 40, 10), end=seq(15, 45, 10))
> b<-data.frame(id=5:8, start=c(11,24,44,55), end=c(14,26,45,57))
> a # large sequence features
  id start end
1  1    10  15
2  2    20  25
3  3    30  35
4  4    40  45
> b # smaller sequence features
  id start end
1  5    11  14
2  6    24  26
3  7    44  45
4  8    55  57
> bool.matrix<-NULL
> for(i in 1:nrow(b)) {bool.matrix<-rbind(bool.matrix, b$start[i] >= a$start & b$end[i] <= a$end)}
> colnames(bool.matrix)<-a$id
> rownames(bool.matrix)<-b$id
> bool.matrix
      1     2     3     4
5  TRUE FALSE FALSE FALSE
6 FALSE FALSE FALSE FALSE
7 FALSE FALSE FALSE  TRUE
8 FALSE FALSE FALSE FALSE

Cheers,
Alex

Alex Lam
Roslin Institute (Edinburgh)
Roslin, Midlothian EH25 9PS
Great Britain
Phone +44 131 5274471
Web http://www.roslin.ac.uk
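The matrix in the reply above stops one step short of the subset that was asked for. A minimal sketch of that last step, assuming the same `a` and `b` as in the post and closed (inclusive) intervals; the name `a.hits` is only illustrative and does not appear in the thread:

```r
a <- data.frame(id = 1:4, start = seq(10, 40, 10), end = seq(15, 45, 10))
b <- data.frame(id = 5:8, start = c(11, 24, 44, 55), end = c(14, 26, 45, 57))

# Same construction as in the post: one row per b interval, one column per
# a interval, TRUE where the b interval lies inside the a interval.
bool.matrix <- NULL
for (i in seq_len(nrow(b))) {
  bool.matrix <- rbind(bool.matrix, b$start[i] >= a$start & b$end[i] <= a$end)
}
dimnames(bool.matrix) <- list(b$id, a$id)

# An a-row is kept when its column holds at least one TRUE.
a.hits <- a[colSums(bool.matrix) > 0, ]
a.hits  # the rows of a with id 1 and 4
```

Summing down the columns counts the b intervals contained in each a interval, so any non-zero count marks an a-row to keep.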
@herve-pages-1542
Last seen 21 hours ago
Seattle, WA, United States
Hi Stephen,

> A <- data.frame(start=(1:5)*10L, end=(4:8)*10L)
> A
  start end
1    10  40
2    20  50
3    30  60
4    40  70
5    50  80
> B <- data.frame(start=c(31L, 39L, 80L), end=c(60L, 40L, 84L))
> B
  start end
1    31  60
2    39  40
3    80  84

You can create a logical vector whose length is the number of rows in A; for each A-row it says whether there is any B-row inside:

  contains_a_Brow <- mapply(function(Astart, Aend) any(Astart <= B$start & B$end <= Aend),
                            A$start, A$end)

Then use this logical vector to subset A:

  A[contains_a_Brow, ]

Cheers,
H.
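The "vector of counts" idea from the original question can also be written without an explicit loop. The following is only a sketch of one way to do it, using base R's `outer()` on the same `A` and `B` as above; the names `inside`, `n_inside` and `A_hits` are made up for illustration. It builds a full nrow(A) x nrow(B) logical matrix, so it is probably only sensible for moderately sized inputs.

```r
A <- data.frame(start = (1:5) * 10L, end = (4:8) * 10L)
B <- data.frame(start = c(31L, 39L, 80L), end = c(60L, 40L, 84L))

# inside[i, j] is TRUE when B-row j lies within A-row i (closed intervals).
inside <- outer(A$start, B$start, "<=") & outer(A$end, B$end, ">=")

# Number of B-rows contained in each A-row.
n_inside <- rowSums(inside)
n_inside  # 1 1 2 0 0

# Keep the A-rows with a non-zero count.
A_hits <- A[n_inside > 0, ]
```

For these data the result is the same subset as `A[contains_a_Brow, ]`, namely the first three rows of A.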
Herve Pages wrote:
> contains_a_Brow <- mapply(function(Astart, Aend) any(Astart <= B$start & B$end <= Aend),
>                           A$start, A$end)

This will be TRUE for A-rows that have at least 1 B-row within their limits. For selecting the A-rows that are _overlapping_ with at least 1 B-row, use:

  contains_a_Brow <- mapply(function(Astart, Aend) any(Astart <= B$end & B$start <= Aend),
                            A$start, A$end)

H.
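To see the two tests side by side, here is a small sketch that wraps both comparisons from this thread in one function. The helper `subset_by_B` and its `type` argument are hypothetical (they are not from the thread or from any package), and the intervals are assumed to be closed, as in the `<=` comparisons above.

```r
A <- data.frame(start = (1:5) * 10L, end = (4:8) * 10L)
B <- data.frame(start = c(31L, 39L, 80L), end = c(60L, 40L, 84L))

# Hypothetical helper: keep the rows of A that contain ("within") or
# overlap ("any") at least one row of B.
subset_by_B <- function(A, B, type = c("within", "any")) {
  type <- match.arg(type)
  keep <- vapply(seq_len(nrow(A)), function(i) {
    if (type == "within") {
      # some B interval lies entirely inside A interval i
      any(A$start[i] <= B$start & B$end <= A$end[i])
    } else {
      # some B interval shares at least one position with A interval i
      any(A$start[i] <= B$end & B$start <= A$end[i])
    }
  }, logical(1))
  A[keep, , drop = FALSE]
}

inside  <- subset_by_B(A, B, "within")  # rows 1 to 3 of A
overlap <- subset_by_B(A, B, "any")     # all 5 rows of A
```

With the example data, containment selects the first three A-rows while overlap selects all five, because the B interval 31-60 touches every A interval.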