Question

Finding duplicated entries in GRangesList

Marco • 0

Hi, I am trying to remove duplicated entries from a GRangesList. I am fairly sure I used to do this with duplicated() or unique(), but neither works now. Could someone advise on alternatives?

Here is an example that does not produce the expected result:

library(GenomicRanges)

# Create a sample GRangesList with some duplicated GRanges objects
gr1 <- GRanges(seqnames = "chr1", ranges = IRanges(c(1,30), c(10,50)), strand = "+")
gr2 <- GRanges(seqnames = "chr2", ranges = IRanges(c(20,60), c(30,100)), strand = "-")
gr3 <- GRanges(seqnames = "chr1", ranges = IRanges(c(1,20), c(10,30)), strand = "+") 

my_grl <- GRangesList(gr1, gr2, gr3, gr1) # Duplicating gr1

duplicated(my_grl)
#LogicalList of length 4
#[[1]] FALSE FALSE
#[[2]] FALSE FALSE
#[[3]] FALSE FALSE
#[[4]] FALSE FALSE

length(my_grl) == length(unique(my_grl))
#[1] TRUE

Thanks

GenomicRanges • 188 views

updated 20 hours ago by Kevin Blighe ★ 4.0k • written 3 months ago by Marco • 0

score 0 · Answer 1 · 2025-11-20

The duplicated and unique functions operate element-wise on GRangesList objects. This means that they apply the respective function to each individual GRanges element within the list. In your example, there are no duplicate ranges within any individual GRanges element, which explains why duplicated returns all FALSE values and why unique does not reduce the length of the GRangesList.
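
To see the element-wise behaviour directly, here is a minimal sketch in which one GRanges contains the same range twice. The object names (gr_dup, grl_dup, dup_within, uniq_within) are made up for illustration and are not part of your data.

```r
library(GenomicRanges)

# One GRanges in which the range chr1:1-10:+ appears twice
gr_dup <- GRanges(seqnames = "chr1", ranges = IRanges(c(1, 1, 30), c(10, 10, 50)), strand = "+")
grl_dup <- GRangesList(gr_dup)

# duplicated() flags the repeated range inside the element
dup_within <- duplicated(grl_dup)[[1]]
dup_within
# [1] FALSE  TRUE FALSE

# unique() drops that range but the list still has one element
uniq_within <- unique(grl_dup)
length(uniq_within)
# [1] 1
length(uniq_within[[1]])
# [1] 2
```

So both functions look for repeats inside each list element, never between elements.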

To identify and remove duplicate GRanges elements across the entire GRangesList, you must compare the GRanges objects themselves. One approach is to generate a unique key for each GRanges element based on its contents and then use duplicated on those keys.

Here is an example using your data:

library(GenomicRanges)

# Your sample data
gr1 <- GRanges(seqnames = "chr1", ranges = IRanges(c(1,30), c(10,50)), strand = "+")
gr2 <- GRanges(seqnames = "chr2", ranges = IRanges(c(20,60), c(30,100)), strand = "-")
gr3 <- GRanges(seqnames = "chr1", ranges = IRanges(c(1,20), c(10,30)), strand = "+") 

my_grl <- GRangesList(gr1, gr2, gr3, gr1)

# Function to create a unique key for each GRanges
gr_key <- function(gr) {
  paste(seqnames(gr), ":", start(gr), "-", end(gr), ":", strand(gr), sep = "", collapse = ";")
}

# Generate keys
keys <- sapply(my_grl, gr_key)

# Identify duplicates
duplicated(keys)
# [1] FALSE FALSE FALSE  TRUE

# Remove duplicates
unique_grl <- my_grl[!duplicated(keys)]

length(unique_grl)
# [1] 3

This method assumes that the order of ranges within each GRanges matters for equality. If you wish to ignore order, sort each GRanges first using sort(gr) inside the key function. If your GRanges have metadata columns, include them in the key if they should affect uniqueness; otherwise, set mcols(gr) <- NULL before generating keys.
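
As a rough sketch of that variant, the key function below drops the metadata columns and sorts the ranges before building the key. The name gr_key_unordered, the toy objects gr_a, gr_b and gr_c, and the score column are invented for illustration; adapt them to your own data.

```r
library(GenomicRanges)

# Toy data: gr_b holds the same ranges as gr_a, in reverse order and with different scores
gr_a <- GRanges(seqnames = "chr1", ranges = IRanges(c(1, 30), c(10, 50)), strand = "+")
gr_a$score <- c(1, 2)
gr_b <- gr_a[2:1]
gr_b$score <- c(9, 8)
gr_c <- GRanges(seqnames = "chr2", ranges = IRanges(20, 30), strand = "-")
gr_c$score <- 3

grl <- GRangesList(gr_a, gr_b, gr_c)

# Key that ignores metadata columns and the order of ranges
gr_key_unordered <- function(gr) {
  mcols(gr) <- NULL   # metadata should not affect uniqueness here
  gr <- sort(gr)      # order of ranges should not matter
  paste(seqnames(gr), ":", start(gr), "-", end(gr), ":", strand(gr), sep = "", collapse = ";")
}

keys_unordered <- sapply(grl, gr_key_unordered)

# The second element is now treated as a duplicate of the first
dedup_grl <- grl[!duplicated(keys_unordered)]
length(dedup_grl)
# [1] 2
```

Note that the first occurrence is the one kept, so its range order and metadata are what end up in the result.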

Kevin