Search
Question: Genomic Ranges findOverlaps by sample
0
2.1 years ago by
University of Arizona
gaiusjaugustus0 wrote:

I have 2 (very similar but not identical) genomic ranges objects, each with up to 33 samples, that I am trying to find overlaps and combine in a particular way.  I'm trying to do this separately for each sample.  I know I could do this with a for loop:

    subsets <- c(1:33)

for (i in subsets){
subset <- df[df$subset == i,] ...do tasks... } However, I assume there must be a better way?? Perhaps with data.table, though this isn't a requirement. Some guidance on where to start would be helpful. **Tasks** The tasks I'm doing include: - Create GRanges objects - find Overlaps between df1 & 2 - Use overlaps to combine segments #Example: The below is for context. Everything below works fine if I use the forloop structure above, but I'm just trying to wrap my head around how to do this for each File, instead of for the entire df, without doing a for loop for each File. df1  File Chromosome Min Max CN.State C_28 1 1 100 1 C_28 1 150 200 1 A_1 1 20 25 3 A_1 1 150 200 3 df1 <- data.frame(File=c("C_28","C_28","A_1","A_1"), + Chromosome=rep(1, 4), + Min=c(1, 150, 20, 150), + Max=c(100, 200, 25, 200), + CN.State=c(1,1,3,3)) df2  File Chromosome Min Max CN.State C_28 1 1 210 1 A_1 1 15 250 3 df2 <- data.frame(File=c("C_28","A_1"), + Chromosome=rep(1, 2), + Min=c(1, 15), + Max=c(210, 250), + CN.State=c(1,3)) ##Simplified Tasks **Make Genomic Ranges Objects**  df1 <- makeGRangesFromDataFrame(df1, keep.extra.columns = TRUE, seqnames.field="Chromosome", start.field="Min", end.field = "Max") df2 <- makeGRangesFromDataFrame(df2, keep.extra.columns = TRUE, seqnames.field = "Chromosome", start.field = "Min", end.field = "Max") **Find overlaps & combine**  hits <- findOverlaps(df1, df2) ranges(df1)[queryHits(hits)] <- ranges(df2)[subjectHits(hits)] ADD COMMENTlink modified 2.1 years ago by Michael Lawrence10k • written 2.1 years ago by gaiusjaugustus0 2 2.1 years ago by United States Michael Lawrence10k wrote: You don't need to use a for() loop, but you will need to iterate over the samples. You can just split the GRanges and loop in parallel over the two lists, like: ans <- mapply(function(a, b) { hits <- findOverlaps(a, b) ranges(a)[queryHits(hits)] <- ranges(b)[subjectHits(hits)] a }, split(df1, ~File), split(df2, ~File)) ADD COMMENTlink written 2.1 years ago by Michael Lawrence10k As an aside, this would be easier if findOverlaps,GRangesList,GRangesList operated within elements. Could use GenomicRangesList, or maybe make a pfindOverlaps()? ADD REPLYlink written 2.1 years ago by Michael Lawrence10k I actually just realized I could use reduce() on the combined regions to do what I want to do, except that it won't keep my extra columns. I tried translating your solution into mapply(reduce, split(CombinedRegions, ~File)) but this doesn't work. When I try just split(CombinedRegions, ~File), that doesn't work either, and the error makes me think it's because it is a GRanges object. If you could offer a solution with this, that'd be great. The error: Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘splitAsList’ for signature ‘"GRanges", "formula"’ ADD REPLYlink written 2.1 years ago by gaiusjaugustus0 The splitting by formula probably only works in devel. I think you want something like: reduce(split(CombinedRegions, CombinedRegions$File))