Genomic Ranges findOverlaps by sample
1
0
Entering edit mode
@gaiusjaugustus-10041
Last seen 5.6 years ago
University of Arizona

I have 2 (very similar but not identical) genomic ranges objects, each with up to 33 samples, that I am trying to find overlaps and combine in a particular way.  I'm trying to do this separately for each sample.  I know I could do this with a for loop:

    subsets <- c(1:33)

    for (i in subsets){
         subset <- df[df$subset == i,]
         ...do tasks...
    }

However, I assume there must be a better way??  Perhaps with data.table, though this isn't a requirement. Some guidance on where to start would be helpful.


**Tasks**

The tasks I'm doing include:

 - Create GRanges objects
 - find Overlaps between df1 & 2
 - Use overlaps to combine segments


#Example:  
The below is for context.  Everything below works fine if I use the forloop structure above, but I'm just trying to wrap my head around how to do this for each File, instead of for the entire df, without doing a for loop for each File.

df1

    File   Chromosome      Min      Max    CN.State
    C_28        1            1       100        1
    C_28        1            150     200        1
    A_1         1            20       25        3
    A_1         1            150     200        3
    
    df1 <- data.frame(File=c("C_28","C_28","A_1","A_1"), 
    +                      Chromosome=rep(1, 4),
    +                      Min=c(1, 150, 20, 150),
    +                      Max=c(100, 200, 25, 200),
    +                      CN.State=c(1,1,3,3))

df2

    File Chromosome Min Max CN.State
    C_28          1   1 210        1
    A_1           1  15 250        3
    
    df2 <- data.frame(File=c("C_28","A_1"), 
    +                      Chromosome=rep(1, 2),
    +                      Min=c(1, 15),
    +                      Max=c(210, 250),
    +                      CN.State=c(1,3))

##Simplified Tasks

**Make Genomic Ranges Objects**

    df1 <- makeGRangesFromDataFrame(df1, keep.extra.columns = TRUE, seqnames.field="Chromosome", start.field="Min", end.field = "Max")
    df2 <- makeGRangesFromDataFrame(df2, keep.extra.columns = TRUE, seqnames.field = "Chromosome", start.field = "Min", end.field = "Max")

**Find overlaps & combine**

    hits <- findOverlaps(df1, df2)
    ranges(df1)[queryHits(hits)] <- ranges(df2)[subjectHits(hits)]
genomicranges • 1.7k views
ADD COMMENT
2
Entering edit mode
@michael-lawrence-3846
Last seen 2.4 years ago
United States

You don't need to use a for() loop, but you will need to iterate over the samples. You can just split the GRanges and loop in parallel over the two lists, like:

ans <- mapply(function(a, b) {
    hits <- findOverlaps(a, b)
    ranges(a)[queryHits(hits)] <- ranges(b)[subjectHits(hits)]
    a
}, split(df1, ~File), split(df2, ~File))

 

ADD COMMENT
0
Entering edit mode

As an aside, this would be easier if findOverlaps,GRangesList,GRangesList operated within elements. Could use GenomicRangesList, or maybe make a pfindOverlaps()?

ADD REPLY
0
Entering edit mode

I actually just realized I could use reduce() on the combined regions to do what I want to do, except that it won't keep my extra columns.  I tried translating your solution into 

mapply(reduce, split(CombinedRegions, ~File))

but this doesn't work.  When I try just split(CombinedRegions, ~File), that doesn't work either, and the error makes me think it's because it is a GRanges object.  If you could offer a solution with this, that'd be great.

 

The error: Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘splitAsList’ for signature ‘"GRanges", "formula"’

ADD REPLY
0
Entering edit mode

The splitting by formula probably only works in devel. I think you want something like:

reduce(split(CombinedRegions, CombinedRegions$File))

 

ADD REPLY

Login before adding your answer.

Traffic: 899 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6