Question

Genomic Ranges findOverlaps by sample

gaiusjaugustus • 0

University of Arizona

I have two very similar (but not identical) GRanges objects, each covering up to 33 samples. I want to find the overlaps between them and combine the ranges in a particular way, separately for each sample. I know I could do this with a for loop:

    subsets <- 1:33

    for (i in subsets) {
        subset <- df[df$subset == i, ]
        # ...do tasks...
    }

However, I assume there must be a better way, perhaps with data.table, though that isn't a requirement. Some guidance on where to start would be helpful.

**Tasks**

The tasks I'm doing include:

- Create GRanges objects
- Find overlaps between df1 and df2
- Use overlaps to combine segments

# Example

The example below is for context. Everything works fine if I use the for loop structure above, but I'm trying to work out how to do this per File, rather than for the entire df, without writing a for loop over each File.

df1

    File   Chromosome      Min      Max    CN.State
    C_28        1            1       100        1
    C_28        1            150     200        1
    A_1         1            20       25        3
    A_1         1            150     200        3
    
    df1 <- data.frame(File=c("C_28","C_28","A_1","A_1"),
                      Chromosome=rep(1, 4),
                      Min=c(1, 150, 20, 150),
                      Max=c(100, 200, 25, 200),
                      CN.State=c(1,1,3,3))

df2

    File Chromosome Min Max CN.State
    C_28          1   1 210        1
    A_1           1  15 250        3
    
    df2 <- data.frame(File=c("C_28","A_1"),
                      Chromosome=rep(1, 2),
                      Min=c(1, 15),
                      Max=c(210, 250),
                      CN.State=c(1,3))

## Simplified Tasks

**Make Genomic Ranges Objects**

    library(GenomicRanges)

    # Note: df1 and df2 are overwritten with their GRanges equivalents
    df1 <- makeGRangesFromDataFrame(df1, keep.extra.columns = TRUE, seqnames.field = "Chromosome", start.field = "Min", end.field = "Max")
    df2 <- makeGRangesFromDataFrame(df2, keep.extra.columns = TRUE, seqnames.field = "Chromosome", start.field = "Min", end.field = "Max")

**Find overlaps & combine**

    hits <- findOverlaps(df1, df2)
    ranges(df1)[queryHits(hits)] <- ranges(df2)[subjectHits(hits)]

Written 7.7 years ago by gaiusjaugustus; updated 7.7 years ago by Michael Lawrence

score 2 · Accepted Answer · 2016-08-26

Michael Lawrence ★ 11k

United States

You don't need to use a for() loop, but you will need to iterate over the samples. You can just split the GRanges and loop in parallel over the two lists, like:

    ans <- mapply(function(a, b) {
        hits <- findOverlaps(a, b)
        ranges(a)[queryHits(hits)] <- ranges(b)[subjectHits(hits)]
        a
    }, split(df1, ~File), split(df2, ~File))
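
For readers who want to try the idea end to end, here is a minimal sketch using the question's example data. It is an illustration under a few assumptions, not necessarily how the answer's author would write it: the names `gr1`, `gr2`, `by_file1`, `by_file2` and `files` are made up here, the chromosome is written as the string `"1"`, the split uses the `File` column directly instead of a formula, and `Map()` is swapped in for `mapply()` so the result is always a plain list.

```r
library(GenomicRanges)

# Example data from the question (chromosome given as a character string)
df1 <- data.frame(File = c("C_28", "C_28", "A_1", "A_1"),
                  Chromosome = rep("1", 4),
                  Min = c(1, 150, 20, 150),
                  Max = c(100, 200, 25, 200),
                  CN.State = c(1, 1, 3, 3))
df2 <- data.frame(File = c("C_28", "A_1"),
                  Chromosome = rep("1", 2),
                  Min = c(1, 15),
                  Max = c(210, 250),
                  CN.State = c(1, 3))

gr1 <- makeGRangesFromDataFrame(df1, keep.extra.columns = TRUE,
                                seqnames.field = "Chromosome",
                                start.field = "Min", end.field = "Max")
gr2 <- makeGRangesFromDataFrame(df2, keep.extra.columns = TRUE,
                                seqnames.field = "Chromosome",
                                start.field = "Min", end.field = "Max")

# Split each object into one GRanges per sample, keyed by File
by_file1 <- split(gr1, gr1$File)
by_file2 <- split(gr2, gr2$File)

# Assumption: only samples present in both objects are processed,
# and both lists are put in the same order before pairing them up
files <- intersect(names(by_file1), names(by_file2))

ans <- Map(function(a, b) {
    hits <- findOverlaps(a, b)
    ranges(a)[queryHits(hits)] <- ranges(b)[subjectHits(hits)]
    a
}, as.list(by_file1[files]), as.list(by_file2[files]))
```

`ans` is a named list with one GRanges per sample. Matching the two lists by name guards against the samples being split into different orders or one object lacking a sample.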

As an aside, this would be easier if findOverlaps,GRangesList,GRangesList operated within elements. Could use GenomicRangesList, or maybe make a pfindOverlaps()?
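
Until something like that exists, one possible workaround (a sketch, not an API provided by GenomicRanges) is to run `findOverlaps()` once on the unsplit objects and then drop any hit whose query and subject come from different samples. The objects `gr1`, `gr2`, `same_file` and `merged` below are illustrative names built from the question's example data.

```r
library(GenomicRanges)

# Example data from the question, built directly as GRanges
gr1 <- GRanges(seqnames = rep("1", 4),
               ranges = IRanges(start = c(1, 150, 20, 150),
                                end = c(100, 200, 25, 200)),
               File = c("C_28", "C_28", "A_1", "A_1"),
               CN.State = c(1, 1, 3, 3))
gr2 <- GRanges(seqnames = rep("1", 2),
               ranges = IRanges(start = c(1, 15), end = c(210, 250)),
               File = c("C_28", "A_1"),
               CN.State = c(1, 3))

# Overlaps across all samples at once
hits <- findOverlaps(gr1, gr2)

# Keep only the hits where both ranges belong to the same File
same_file <- gr1$File[queryHits(hits)] == gr2$File[subjectHits(hits)]
hits <- hits[same_file]

# Replace each overlapping range in gr1 with its same-sample partner
merged <- gr1
ranges(merged)[queryHits(hits)] <- ranges(gr2)[subjectHits(hits)]
```

This avoids iterating over samples, at the cost of computing cross-sample hits that are then discarded. If a range in `gr1` overlaps several same-sample ranges in `gr2`, the last matching hit determines its new range, so this form is best suited to data like the example where each query has one partner.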

Reply by Michael Lawrence, 7.7 years ago

I actually just realized I could use reduce() on the combined regions to do what I want to do, except that it won't keep my extra columns. I tried translating your solution into

    mapply(reduce, split(CombinedRegions, ~File))

but this doesn't work. When I try just split(CombinedRegions, ~File), that doesn't work either, and the error makes me think it's because it is a GRanges object. If you could offer a solution with this, that'd be great.

The error: Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘splitAsList’ for signature ‘"GRanges", "formula"’

Reply by gaiusjaugustus, 7.6 years ago

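Splitting by formula probably only works in the devel version. Splitting by the column itself should work in release, and reduce() can then be applied to the resulting list directly. I think you want something like:

    reduce(split(CombinedRegions, CombinedRegions$File))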
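
As a rough sketch of how the result could be flattened back into a single GRanges with the sample label restored: the `CombinedRegions` data below is hypothetical (the thread never shows it), and `reduced` and `flat` are names chosen for illustration.

```r
library(GenomicRanges)

# Hypothetical stand-in for CombinedRegions: ranges from both example
# tables stacked together, with a File column identifying the sample
CombinedRegions <- GRanges(
    seqnames = rep("1", 6),
    ranges = IRanges(start = c(1, 150, 1, 20, 150, 15),
                     end = c(100, 200, 210, 25, 200, 250)),
    File = c("C_28", "C_28", "C_28", "A_1", "A_1", "A_1"))

# reduce() on a GRangesList merges overlapping ranges within each element
reduced <- reduce(split(CombinedRegions, CombinedRegions$File))

# Flatten to one GRanges and rebuild File from the list names
flat <- unlist(reduced, use.names = FALSE)
flat$File <- rep(names(reduced), elementNROWS(reduced))
```

Only `File` is recovered this way, because it is the split key. Other metadata columns such as `CN.State` are still dropped by `reduce()`, and putting them back would need a rule for ranges that were merged from rows with different values.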

Reply by Michael Lawrence, 7.6 years ago