Question: Findoverlaps insame sample
0
2.5 years ago by
julrodr8030
julrodr8030 wrote:

I have two GR objects with 1000 samples. Each sample, several cnvs (genomic ranges). I would like to find overlaps between the genomic ranges of the two objets, but also with the condition that the overlap region is in the same individual.

 seqnames ranges strand | Sample 154 chr15 [ 22755185,  23228712] 00_017 154 chr15 [ 22876889,  23086929] 00_017 215 chr12 [  8003758,   8126839] 01_006 527 chr10 [135258147, 135379710] 01_025 527 chr10 [135258147, 135267445] 01_025

thank you very much

modified 2.5 years ago • written 2.5 years ago by julrodr8030
2
2.5 years ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi,

The easiest solution is to just call findOverlaps() without worrying about the extra condition, and then to subset the returned Hits object to keep only the hits that satisfy the extra condition. For example, let's say the sample ID is stored in the metadata column sampleID on the 2 GRanges objects query and subject, then you could do something like:

hits <- findOverlaps(query, subject, ...)
hit_is_within_same_sample <-
mcols(query)$sampleID[queryHits(hits)] == mcols(subject)$sampleID[subjectHits(hits)]
hits <- hits[hit_is_within_same_sample]

A legitimate concern with this approach is that the call to findOverlaps() might be too costly because it returns many hits that will be removed. Depending on the input objects, these "to-be-removed" hits could make the Hits object thousands or millions times bigger than necessary and could consume too much memory. Here are 2 ways to avoid this:

1) Split the 2 GRanges objects by sample ID, and loop over all the sample IDs with mapply()

all_sampleIDs <- union(mcols(query)$sampleID, mcols(subject)$sampleID)
## We split 'query' and 'subject' with factors that have the same levels
## so we get 2 GRangesList objects that are *parallel* to each other (and
## to 'all_sampleIDs'):
query_by_sampleID <- split(query, factor(mcols(query)$sampleID, levels=all_sampleIDs) subject_by_sampleID <- split(subject, factor(mcols(subject)$sampleID,
levels=all_sample_IDs)
all_hits <- mapply(findOverlaps, query_by_sampleID, subject_by_sampleID)

all_hits is a list of Hits objects that is parallel to query_by_sampleIDsubject_by_sampleID, and all_sampleIDs. Depending on what you want to do downstream, this might or might not be a convenient data structure.

2) Mangle the seqnames of the 2 GRanges objects with the sample IDs

This is probably much more efficient and should return exactly the same result as the easiest solution above. Do the mangling with something like:

query2 <- GRanges(paste(mcols(query)$sampleID, seqnames(query), sep=":"), ranges(query), strand(query)) subject2 <- GRanges(paste(mcols(subject)$sampleID, seqnames(subject), sep=":"),
ranges(subject), strand(subject))

Then call findOverlaps() on query2 and subject2.

Hope this helps,

H.

0
2.5 years ago by
julrodr8030
julrodr8030 wrote:

Hi,

It works. I've tried the last options.