Hi,
The easiest solution is to just call findOverlaps()
without worrying about the extra condition, and then to subset the returned Hits object to keep only the hits that satisfy the extra condition. For example, let's say the sample ID is stored in the metadata column sampleID
on the 2 GRanges objects query
and subject
, then you could do something like:
hits <- findOverlaps(query, subject, ...)
hit_is_within_same_sample <-
mcols(query)$sampleID[queryHits(hits)] ==
mcols(subject)$sampleID[subjectHits(hits)]
hits <- hits[hit_is_within_same_sample]
A legitimate concern with this approach is that the call to findOverlaps()
might be too costly because it returns many hits that will be removed. Depending on the input objects, these "to-be-removed" hits could make the Hits object thousands or millions times bigger than necessary and could consume too much memory. Here are 2 ways to avoid this:
1) Split the 2 GRanges objects by sample ID, and loop over all the sample IDs with mapply()
all_sampleIDs <- union(mcols(query)$sampleID, mcols(subject)$sampleID)
## We split 'query' and 'subject' with factors that have the same levels
## so we get 2 GRangesList objects that are *parallel* to each other (and
## to 'all_sampleIDs'):
query_by_sampleID <- split(query, factor(mcols(query)$sampleID,
levels=all_sampleIDs)
subject_by_sampleID <- split(subject, factor(mcols(subject)$sampleID,
levels=all_sample_IDs)
all_hits <- mapply(findOverlaps, query_by_sampleID, subject_by_sampleID)
all_hits
is a list of Hits objects that is parallel to query_by_sampleID
, subject_by_sampleID
, and all_sampleIDs
. Depending on what you want to do downstream, this might or might not be a convenient data structure.
2) Mangle the seqnames of the 2 GRanges objects with the sample IDs
This is probably much more efficient and should return exactly the same result as the easiest solution above. Do the mangling with something like:
query2 <- GRanges(paste(mcols(query)$sampleID, seqnames(query), sep=":"),
ranges(query), strand(query))
subject2 <- GRanges(paste(mcols(subject)$sampleID, seqnames(subject), sep=":"),
ranges(subject), strand(subject))
Then call findOverlaps()
on query2
and subject2
.
Hope this helps,
H.