Question

Findoverlaps insame sample

0

Entering edit mode

julrodr80 ▴ 30

@julrodr80-11991

Last seen 7.5 years ago

I have two GR objects with 1000 samples. Each sample, several cnvs (genomic ranges). I would like to find overlaps between the genomic ranges of the two objets, but also with the condition that the overlap region is in the same individual.

seqnames		ranges strand \|	Sample
	<Rle>	<IRanges>	<character>
154	chr15	[ 22755185, 23228712]	00_017
154	chr15	[ 22876889, 23086929]	00_017
215	chr12	[ 8003758, 8126839]	01_006
527	chr10	[135258147, 135379710]	01_025
527	chr10	[135258147, 135267445]	01_025

thank you very much

genomicranges granges findoverlaps • 1.3k views

ADD COMMENT • link 8.0 years ago julrodr80 ▴ 30

0

Entering edit mode

julrodr80 ▴ 30

@julrodr80-11991

Last seen 7.5 years ago

Hi,

It works. I've tried the last options.

ADD COMMENT • link 8.0 years ago julrodr80 ▴ 30

score 2 · Accepted Answer · 2017-05-09

Hi,

The easiest solution is to just call findOverlaps() without worrying about the extra condition, and then to subset the returned Hits object to keep only the hits that satisfy the extra condition. For example, let's say the sample ID is stored in the metadata column sampleID on the 2 GRanges objects query and subject, then you could do something like:

hits <- findOverlaps(query, subject, ...)
hit_is_within_same_sample <-
    mcols(query)$sampleID[queryHits(hits)] ==
    mcols(subject)$sampleID[subjectHits(hits)]
hits <- hits[hit_is_within_same_sample]

A legitimate concern with this approach is that the call to findOverlaps() might be too costly because it returns many hits that will be removed. Depending on the input objects, these "to-be-removed" hits could make the Hits object thousands or millions times bigger than necessary and could consume too much memory. Here are 2 ways to avoid this:

1) Split the 2 GRanges objects by sample ID, and loop over all the sample IDs with mapply()

all_sampleIDs <- union(mcols(query)$sampleID, mcols(subject)$sampleID)
## We split 'query' and 'subject' with factors that have the same levels
## so we get 2 GRangesList objects that are *parallel* to each other (and
## to 'all_sampleIDs'):
query_by_sampleID <- split(query, factor(mcols(query)$sampleID,
                                         levels=all_sampleIDs)
subject_by_sampleID <- split(subject, factor(mcols(subject)$sampleID,
                                             levels=all_sample_IDs)
all_hits <- mapply(findOverlaps, query_by_sampleID, subject_by_sampleID)

all_hits is a list of Hits objects that is parallel to query_by_sampleID, subject_by_sampleID, and all_sampleIDs. Depending on what you want to do downstream, this might or might not be a convenient data structure.

2) Mangle the seqnames of the 2 GRanges objects with the sample IDs

This is probably much more efficient and should return exactly the same result as the easiest solution above. Do the mangling with something like:

query2 <- GRanges(paste(mcols(query)$sampleID, seqnames(query), sep=":"),
                  ranges(query), strand(query))
subject2 <- GRanges(paste(mcols(subject)$sampleID, seqnames(subject), sep=":"),
                    ranges(subject), strand(subject))

Then call findOverlaps() on query2 and subject2.

Hope this helps,

H.