Question: Findoverlaps insame sample
0
gravatar for julrodr80
2.5 years ago by
julrodr8030
julrodr8030 wrote:

I have two GR objects with 1000 samples. Each sample, several cnvs (genomic ranges). I would like to find overlaps between the genomic ranges of the two objets, but also with the condition that the overlap region is in the same individual. 

seqnames   ranges strand | Sample
  <Rle> <IRanges> <character>
154 chr15 [ 22755185,  23228712] 00_017
154 chr15 [ 22876889,  23086929] 00_017
215 chr12 [  8003758,   8126839] 01_006
527 chr10 [135258147, 135379710] 01_025
527 chr10 [135258147, 135267445] 01_025

 thank you very much 

ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by julrodr8030
Answer: Findoverlaps insame sample
2
gravatar for Hervé Pagès
2.5 years ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi,

The easiest solution is to just call findOverlaps() without worrying about the extra condition, and then to subset the returned Hits object to keep only the hits that satisfy the extra condition. For example, let's say the sample ID is stored in the metadata column sampleID on the 2 GRanges objects query and subject, then you could do something like:

hits <- findOverlaps(query, subject, ...)
hit_is_within_same_sample <-
    mcols(query)$sampleID[queryHits(hits)] ==
    mcols(subject)$sampleID[subjectHits(hits)]
hits <- hits[hit_is_within_same_sample]

A legitimate concern with this approach is that the call to findOverlaps() might be too costly because it returns many hits that will be removed. Depending on the input objects, these "to-be-removed" hits could make the Hits object thousands or millions times bigger than necessary and could consume too much memory. Here are 2 ways to avoid this:

1) Split the 2 GRanges objects by sample ID, and loop over all the sample IDs with mapply()

all_sampleIDs <- union(mcols(query)$sampleID, mcols(subject)$sampleID)
## We split 'query' and 'subject' with factors that have the same levels
## so we get 2 GRangesList objects that are *parallel* to each other (and
## to 'all_sampleIDs'):
query_by_sampleID <- split(query, factor(mcols(query)$sampleID,
                                         levels=all_sampleIDs)
subject_by_sampleID <- split(subject, factor(mcols(subject)$sampleID,
                                             levels=all_sample_IDs)
all_hits <- mapply(findOverlaps, query_by_sampleID, subject_by_sampleID)

all_hits is a list of Hits objects that is parallel to query_by_sampleIDsubject_by_sampleID, and all_sampleIDs. Depending on what you want to do downstream, this might or might not be a convenient data structure.

2) Mangle the seqnames of the 2 GRanges objects with the sample IDs

This is probably much more efficient and should return exactly the same result as the easiest solution above. Do the mangling with something like:

query2 <- GRanges(paste(mcols(query)$sampleID, seqnames(query), sep=":"),
                  ranges(query), strand(query))
subject2 <- GRanges(paste(mcols(subject)$sampleID, seqnames(subject), sep=":"),
                    ranges(subject), strand(subject))

Then call findOverlaps() on query2 and subject2.

Hope this helps,

H.

ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by Hervé Pagès ♦♦ 14k
Answer: Findoverlaps insame sample
0
gravatar for julrodr80
2.5 years ago by
julrodr8030
julrodr8030 wrote:

Hi, 

It works. I've tried the last options.

ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by julrodr8030
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 167 users visited in the last hour