Question

Any way to proceed duplicate removal for list of GRanges object ?

0

Entering edit mode

Jurat Shahidin ▴ 80

@jurat-shahidin-9488

Last seen 5.8 years ago

Chicago, IL, USA

Hi :

I have list of GRanges that needed to apply very specific duplicate removal . I have reason for using specific conditional duplicate removal for my data. However, duplicate removal condition for each individual GRanges is different. I want to do complete duplicate removal for first list element; for second list element, I need to search the row that appear more than twice (freq >2), and only keep one row; for third list element, search over the row that appear more than three times (freq>3), and keep two or three rows. I am trying to get more programmatic, dynamic solution for this data manipulation task. How can I make this happen easily ? Any way to accomplish this task more efficiently respect to my specific output ? Any idea please ?

Edit

(thanks for @Martin' edit on my reproducible data).

mini example :

grl <- GRangesList(
    bar= GRanges(seqnames = Rle("chr1",14),
        IRanges(
            c(9,19,34,54,70,82,136,9,34,70,136,9,82,136),
            c(14,21,39,61,73,87,153,14,39,73,153,14,87,153)),
        score=c(48,6,9,8,4,15,38,48,9,4,38,48,15,38)),
    cat = GRanges(seqnames = Rle("chr10",16),
        IRanges(
            c(7,21,21,72,142,7,16,21,45,72,100,114,142,16,72,114),
            c(10,34,34,78,147,10,17,34,51,78,103,124,147,17,78,124)),
        score=c(53,14,14,20,4,53,20,14,11,20,7,32,4,20,20,32)),
    foo= GRanges(seqnames = Rle("chr11",16),
        IRanges(
            c(12,12,12,58,58,58,118,12,12,44,58,102,118,12,58,118),
            c(36,36,36,92,92,92,139,36,36,49,92,109,139,36,92,139)),
        score=c(48,48,48,12,12,12,5,48,48,12,12,11,5,48,12,5))
)

Note that in cat, I am going to look up the rows that appear three times, and keep that rows only once; if row appear twice, I don't do duplicate removal on that. in foo, I am going to check the rows that appear more than three times, and keep two or three same rows instead. This is what I am trying to make very specific duplicate removal for each GRange. How can I get my output ?

This is my desired output :

grl_expected <- GRangesList(
    bar= GRanges(seqnames = Rle("chr1",7),
        IRanges(
            c(9,19,34,54,70,82,136),
            c(14,21,39,61,73,87,153)),
        score=c(48,6,9,8,4,15,38)),
    cat= GRanges(seqnames = Rle("chr10",12),
        IRanges(
            c(7,21,72,142,7,16,45,100,114,142,16,114),
            c(10,34,78,147,10,17,51,103,124,147,17,124)),
        score=c(53,14,20,4,53,20,11,7,32,4,20,32)),
    foo= GRanges(seqnames = Rle("chr11",11),
        IRanges(
            c(12,12,12,44,58,58,58,118,102,118,118),
            c(36,36,36,49,92,92,92,139,109,139,139)),
        score=c(48,48,48,17,12,12,12,5,11,5,5))
)

can any one point me out how to make this happen ? Any idea ?

Best regards :

Jurat

r grangeslist duplicate • 5.8k views

ADD COMMENT • link updated 9.0 years ago by Martin Morgan 25k • written 9.0 years ago by Jurat Shahidin ▴ 80

0

Entering edit mode

I edited your question to avoid the unnecessary step of constructing a list of GRanges, and removed information about 'strand', which doesn't seem relevant to your question. Please edit your question further so that the seqnames in the result are consistent with the seqnames in the original. Please also remove any elements that do not provide additional insight into your problem. For instance, are all 14 ranges in the first element of your input necessary to illustrate the operation you are trying to perform, or would perhaps just 4 elements be enough?

ADD REPLY • link 9.0 years ago Martin Morgan 25k

0

Entering edit mode

Dear Martin :

Thanks for your editing on my question. I am sure the structure of my data (very much simulated from my original problem, and we have reason for conditionally keeping or removing repeated rows in special way). Could you give me possible idea of making this happen ? Thanks a lot.

Best regards :

Jurat

ADD REPLY • link 9.0 years ago Jurat Shahidin ▴ 80

0

Entering edit mode

Also, in this example, it seems the 'starts' are enough to determine duplication (ie whenever the 'start' is the same, so is the end, the seqname, and the score?) Is that necessarily true? If so, I imagine, an efficient way would be to create, for each ranges, to count the number of previous instances of a start with counted <- ave(start(bar), start(bar), FUN=seq_along) and then you could subset your ranges with bar[counted<=nbar] . You could run this in a loop (or lapply-like invocation), where you'd regenerate the object being counted (and the relevant n). If the 'starts' being equal isn't enough to determine overall uniqueness, then you could 'paste' together enough information to determine whether rows are duplicated, and do something similar.

ADD REPLY • link 9.0 years ago Gavin Kelly ▴ 690

0

Entering edit mode

@gavinpaulkelly I think better to evaluate by each row to determine the duplication pattern, and it can be done.

ADD REPLY • link 9.0 years ago Jurat Shahidin ▴ 80

0

Entering edit mode

wcstcyx ▴ 30

@wcstcyx-11636

Last seen 7.9 years ago

China/Beijing/AMSS,CAS

Hi Jurat,

You need two columns to achieve your demand. dupTimes and idInSubgroup can help you.

idInSubgroup <- function(gr) {
  f <- as.factor(gr)
  grl <- endoapply(split(gr, f), function(gr) {
    mcols(gr)$dupTimes <- length(gr)
    mcols(gr)$idInSubgroup <- seq_along(gr)
    gr})
  gr <- unsplit(grl, f)
  gr
}
grl_ann <- endoapply(grl, idInSubgroup)
bar_res <- grl_ann$bar[with(mcols(grl_ann$bar), idInSubgroup == 1)]
mcols(bar_res)[c("dupTimes", "idInSubgroup")] <- NULL
bar_res
cat_res <- grl_ann$cat[with(mcols(grl_ann$cat), dupTimes <= 2 | (dupTimes > 2 & idInSubgroup == 1))]
mcols(cat_res)[c("dupTimes", "idInSubgroup")] <- NULL
cat_res
foo_res <- grl_ann$foo[with(mcols(grl_ann$foo), dupTimes <= 3 | (dupTimes > 3 & idInSubgroup <= 3))]
mcols(foo_res)[c("dupTimes", "idInSubgroup")] <- NULL
foo_res
grl_res <- GRangesList(bar = bar_res, cat = cat_res, foo = foo_res)
grl_expected <- GRangesList(
  bar= GRanges(seqnames = Rle("chr1",7),
               IRanges(
                 c(9,19,34,54,70,82,136),
                 c(14,21,39,61,73,87,153)),
               score=c(48,6,9,8,4,15,38)),
  cat= GRanges(seqnames = Rle("chr10",12),
               IRanges(
                 c(7,21,72,142,7,16,45,100,114,142,16,114),
                 c(10,34,78,147,10,17,51,103,124,147,17,124)),
               score=c(53,14,20,4,53,20,11,7,32,4,20,32)),
  foo= GRanges(seqnames = Rle("chr11",11),
               IRanges(
                 c(12,12,12,58,58,58,118,44,102,118,118), # modified to be consistent to your original one
                 c(36,36,36,92,92,92,139,49,109,139,139)), # modified to be consistent to your original one
               score=c(48,48,48,12,12,12,5,12,11,5,5)) # modified to be consistent to your original one
)
identical(grl_res, grl_expected) #  TRUE

You can run the code by yourself. Hope you like it :)

Can

ADD COMMENT • link 9.0 years ago wcstcyx ▴ 30

0

Entering edit mode

Hi Can:

using endoapply is interesting in your helper function, but it's lengthy. I mean consider grl as an input list, but you take each GRanges object outside the original list. perhaps make your answer more programmatic and more efficient is needed.

ADD REPLY • link 9.0 years ago Jurat Shahidin ▴ 80

0

Entering edit mode

Hi Jurat,

You can write a lazy fun to do that. I have wrote that for you.

lazyFun <- function(grl) {
  idInSubgroup <- function(gr) {
    f <- as.factor(gr)
    grl <- endoapply(split(gr, f), function(gr) {
      mcols(gr)$dupTimes <- length(gr)
      mcols(gr)$idInSubgroup <- seq_along(gr)
      gr})
    gr <- unsplit(grl, f)
    gr
  }
  grl_ann <- endoapply(grl, idInSubgroup)
  bar_res <- grl_ann$bar[with(mcols(grl_ann$bar), idInSubgroup == 1)]
  cat_res <- grl_ann$cat[with(mcols(grl_ann$cat), dupTimes <= 2 | (dupTimes > 2 & idInSubgroup == 1))]
  foo_res <- grl_ann$foo[with(mcols(grl_ann$foo), dupTimes <= 3 | (dupTimes > 3 & idInSubgroup <= 3))]
  grl_res <- GRangesList(bar = bar_res, cat = cat_res, foo = foo_res)
  res <- endoapply(grl_res, function(x) {
    mcols(x)[c("dupTimes", "idInSubgroup")] <- NULL
    x})
  res
}
grl_res <- lazyFun(grl) # input is the original grl
identical(grl_res, grl_expected) # TRUE

Try it.:)

Can

ADD REPLY • link 9.0 years ago wcstcyx ▴ 30

0

Entering edit mode

@Can you just repeat again your answer, can you delete one of them please? it makes this post hard to read by others. my goal is, to solve this problem as simple and efficient as possible.

ADD REPLY • link 9.0 years ago Jurat Shahidin ▴ 80

0

Entering edit mode

Hi Jurat,

I do not have the ability to delete one of my answer. Maybe the administrators can help us do that.

I have made the things happen. Please let me know why lazyFun() is not efficient? In speed or in memory or in number of lines of the code?

You have three different tasks. So each one need at least one line to achieve that. So the code can be a little bit long.

You said "I think better to evaluate by each row to determine the duplication pattern, and it can be done." However, for R, to use 'lapply' series is faster than do something one by one. So in speed, I think lazyFun() is not that bad.

In memory, it just make one copy of grl in lazyFun(). If you want no copy, maybe data.table() can satisfy you.

Look forward to your reply. :)

Can

ADD REPLY • link 9.0 years ago wcstcyx ▴ 30

0

Entering edit mode

@Can Hi Can, you can edit your answer by deleting grl, grl_expect, only keep the code that what you've done. I liked your idea of solving this problem, thank you. But, making your code's body smaller, smarter is always helpful for others to learn your approach. Yes, data.table package provides some very useful function to solve this sort of problem, I'll look after it and find some clean solution. Cheers :)

ADD REPLY • link 9.0 years ago Jurat Shahidin ▴ 80

0

Entering edit mode

Thanks. I deleted grl, but grl_expect in my answer seems more appropriate than in your post. I suggest you change foo in your grl_expected to be

foo= GRanges(seqnames = Rle("chr11",11),
               IRanges(
                 c(12,12,12,58,58,58,118,44,102,118,118),
                 c(36,36,36,92,92,92,139,49,109,139,139)),
               score=c(48,48,48,12,12,12,5,12,11,5,5))

ADD REPLY • link 9.0 years ago wcstcyx ▴ 30

score 3 · Accepted Answer · 2017-01-05

3

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 11 months ago

United States

Following wcstcyx's approach, the following code generates dup (how many times the range appears) and idx (index-in-subgroup), formulates a list of conditions for keeping elements, and subsets.

m <- match(grl, unique(grl))
idx <- lapply(m, function(m) {
    split(m, m) <- lapply(split(m, m), seq_along)
    m
})
dup <- lapply(m, function(elt) tabulate(elt)[elt])
keep <- list(
    dup[[1]] <= 1 | (dup[[1]] > 1 & idx[[1]] == 1),
    dup[[2]] <= 2 | (dup[[2]] > 2 & idx[[2]] == 1),
    dup[[3]] <= 3 | (dup[[3]] > 3 & idx[[3]] <= 3))
grl[keep]

keep can be written as

keep1 <- List(
    idx[[1]] <= 1,
    dup[[2]] <= 2 | idx[[2]] == 1,
    idx[[3]] <= 3)

ADD COMMENT • link 9.0 years ago Martin Morgan 25k

0

Entering edit mode

@Martin Dear Martin Thanks for your insightful answer. Regarding the part of getting index list keep, Any idea to make this part more programmatic and prettier ?

ADD REPLY • link 9.0 years ago Jurat Shahidin ▴ 80

0

Entering edit mode

The logic applied to different list elements is not consistent so there is no way to make it programmatic -- if the logic were consistent, the three indexes would be symmetric and would collapse into something that could be handled more cleanly.

ADD REPLY • link 9.0 years ago Martin Morgan 25k

0

Entering edit mode

Dear Martin :

I figured out issue. Thanks a lot for your idea on this problem.

Best regards :

Jurat

ADD REPLY • link 9.0 years ago Jurat Shahidin ▴ 80

0

Entering edit mode

This is a different question, about package development, and so belongs on the bioc-devel mailing list. But why use data.table, it seems to make your code complicated. If there is a performance bottleneck in the above code I guess it would be the split()<- iteration but that too is a different question.

ADD REPLY • link 9.0 years ago Martin Morgan 25k