Hello Bioconductor,

Apologies, as this is a fairly rookie bioinformatics-based R question, but I am trying to determine whether there is an R one-liner to extract the subset of a data frame whose rows contain annotation stored in another data frame (for example, extracting the genomic intervals that contain certain features/annotation).

Such that, if I have:

  data frame "A" with columns "id", "start", and "end"; and
  data frame "B" also with columns "id", "start", and "end";

the output is all the rows of A that contain an entry of B, i.e. (B$start, B$end) lies within A$start and A$end.

I have tried my own fairly uninformed variants like this, to no avail:

  A[length(B[B$start <= A$end & B$end >= A$start]) > 0,]

I fear the solution will be trivial, but as yet it has eluded me. :/

Thanks for any help!

(Theoretically, I can also see doing this in its own function by creating a vector of counts for each member of "A" and then reporting those that are non-zero, but I was wondering if there was a more succinct and likely more efficient way.)

Thanks again,
Stephen
Hi Stephen,

Don't know if it does what you want (and it isn't a one-liner), but here it is anyway:

> a <- data.frame(id=1:4, start=seq(10, 40, 10), end=seq(15, 45, 10))
> b <- data.frame(id=5:8, start=c(11,24,44,55), end=c(14,26,45,57))
> a   # large sequence features
  id start end
1  1    10  15
2  2    20  25
3  3    30  35
4  4    40  45
> b   # smaller sequence features
  id start end
1  5    11  14
2  6    24  26
3  7    44  45
4  8    55  57
> bool.matrix <- NULL
> for(i in 1:nrow(b)) {bool.matrix <- rbind(bool.matrix, b$start[i] >= a$start & b$end[i] <= a$end)}
> colnames(bool.matrix) <- a$id
> rownames(bool.matrix) <- b$id
> bool.matrix
      1     2     3     4
5  TRUE FALSE FALSE FALSE
6 FALSE FALSE FALSE FALSE
7 FALSE FALSE FALSE  TRUE
8 FALSE FALSE FALSE FALSE

Cheers,
Alex
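To get from this matrix to the rows of a that the original question asks for, one possible follow-up (a sketch that is not part of Alex's reply) is to keep the a features whose column contains at least one TRUE. The same matrix can also be built without the loop by using outer(); the name a.hits is only illustrative.

```r
a <- data.frame(id = 1:4, start = seq(10, 40, 10), end = seq(15, 45, 10))
b <- data.frame(id = 5:8, start = c(11, 24, 44, 55), end = c(14, 26, 45, 57))

# Rows are b features, columns are a features, as in bool.matrix above.
# outer() compares every b value with every a value in one call.
bool.matrix <- outer(b$start, a$start, ">=") & outer(b$end, a$end, "<=")
dimnames(bool.matrix) <- list(b$id, a$id)

# Keep the rows of a whose column has at least one TRUE
# (a.hits is an illustrative name, not from the thread).
a.hits <- a[colSums(bool.matrix) > 0, ]
a.hits
```

With the example data this should keep the a features with id 1 and 4, matching the two TRUE cells in the matrix. Note that the full matrix has nrow(b) * nrow(a) cells, so this is only practical for modest table sizes.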
Hi Stephen,

> A <- data.frame(start=(1:5)*10L, end=(4:8)*10L)
> A
  start end
1    10  40
2    20  50
3    30  60
4    40  70
5    50  80
> B <- data.frame(start=c(31L, 39L, 80L), end=c(60L, 40L, 84L))
> B
  start end
1    31  60
2    39  40
3    80  84

You can create a logical vector whose length is the number of rows in A; for each A-row it says whether there is any B-row inside it:

  contains_a_Brow <- mapply(function(Astart, Aend)
                                any(Astart <= B$start & B$end <= Aend),
                            A$start, A$end)

Then use this logical vector to subset A:

  A[contains_a_Brow, ]

Cheers,
H.
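Since the original question asked for a one-liner, the two steps can be folded into a single subsetting expression. The following is only a sketch of that idea on the same example data; the shortened argument names s and e and the result name A.sub are illustrative.

```r
A <- data.frame(start = (1:5) * 10L, end = (4:8) * 10L)
B <- data.frame(start = c(31L, 39L, 80L), end = c(60L, 40L, 84L))

# Same test as above, written inline: keep each A-row that fully
# contains at least one B-row.
A.sub <- A[mapply(function(s, e) any(s <= B$start & B$end <= e), A$start, A$end), ]
A.sub
```

On this data the first three A-rows are kept: (10, 40) and (20, 50) contain the B-row (39, 40), and (30, 60) contains (31, 60). No B-row lies entirely inside (40, 70) or (50, 80).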
Herve Pages wrote:
> Hi Stephen,
>
> > A <- data.frame(start=(1:5)*10L, end=(4:8)*10L)
> > A
>   start end
> 1    10  40
> 2    20  50
> 3    30  60
> 4    40  70
> 5    50  80
>
> > B <- data.frame(start=c(31L, 39L, 80L), end=c(60L, 40L, 84L))
> > B
>   start end
> 1    31  60
> 2    39  40
> 3    80  84
>
> You can create a logical vector of the length the number of rows in A: for each
> A-row it says if there is any B-row inside:
>
> contains_a_Brow <- mapply(function(Astart, Aend) any(Astart <= B$start & B$end <= Aend),
>                           A$start, A$end)

This will be TRUE for A-rows that have at least 1 B-row within their limits.

For selecting the A-rows that are _overlapping_ with at least 1 B-row, use:

  contains_a_Brow <- mapply(function(Astart, Aend)
                                any(Astart <= B$end & B$start <= Aend),
                            A$start, A$end)

H.
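The two conditions differ only in which B column each A boundary is compared against, so they can sit side by side in one small helper. This is a hedged sketch: the function name select_A_rows and its type argument are hypothetical and do not come from the thread or from any package.

```r
A <- data.frame(start = (1:5) * 10L, end = (4:8) * 10L)
B <- data.frame(start = c(31L, 39L, 80L), end = c(60L, 40L, 84L))

# Hypothetical helper (name and arguments are illustrative only).
# type = "within":  keep A-rows that fully contain at least one B-row.
# type = "overlap": keep A-rows that share at least one position with a B-row.
select_A_rows <- function(A, B, type = c("within", "overlap")) {
  type <- match.arg(type)
  hit <- vapply(seq_len(nrow(A)), function(i) {
    if (type == "within") {
      any(A$start[i] <= B$start & B$end <= A$end[i])
    } else {
      any(A$start[i] <= B$end & B$start <= A$end[i])
    }
  }, logical(1))
  A[hit, , drop = FALSE]
}

within.rows  <- select_A_rows(A, B, "within")
overlap.rows <- select_A_rows(A, B, "overlap")
```

On the example data, "within" keeps the first three A-rows, while "overlap" keeps all five, because the B-row (31, 60) overlaps every A interval. Both tests treat the interval ends as inclusive, as in the expressions above.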