Finding duplicated entries in GRangesList
Marco • 0
@dc86b518
Last seen 9 weeks ago
United States

Hi, I am trying to remove duplicated entries from a GRangesList. I am fairly sure I used to do this with duplicated() or unique(), but neither now gives the result I expect. Could someone advise on alternatives?

Here is an example that does not produce the expected result:

library(GenomicRanges)

# Create a sample GRangesList with some duplicated GRanges objects
gr1 <- GRanges(seqnames = "chr1", ranges = IRanges(c(1,30), c(10,50)), strand = "+")
gr2 <- GRanges(seqnames = "chr2", ranges = IRanges(c(20,60), c(30,100)), strand = "-")
gr3 <- GRanges(seqnames = "chr1", ranges = IRanges(c(1,20), c(10,30)), strand = "+") 

my_grl <- GRangesList(gr1, gr2, gr3, gr1) # Duplicating gr1

duplicated(my_grl)
#LogicalList of length 4
#[[1]] FALSE FALSE
#[[2]] FALSE FALSE
#[[3]] FALSE FALSE
#[[4]] FALSE FALSE

length(my_grl) == length(unique(my_grl))
#[1] TRUE

Thanks

GenomicRanges • 188 views
Kevin Blighe ★ 4.0k
@kevin
Last seen 8 hours ago
The Cave, 181 Longwood Avenue, Boston, …

The duplicated and unique functions operate element-wise on GRangesList objects. This means that they apply the respective function to each individual GRanges element within the list. In your example, there are no duplicate ranges within any individual GRanges element, which explains why duplicated returns all FALSE values and why unique does not reduce the length of the GRangesList.
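
To make the element-wise behaviour concrete, here is a small sketch with made-up data (the objects gr_dup, gr_other and grl are purely illustrative and not from your post). One list element contains the same range twice, so duplicated flags it inside that element, and unique shortens that element while leaving the length of the list unchanged:

```r
library(GenomicRanges)

# Illustrative data: the first element holds the range chr1:1-10:+ twice
gr_dup   <- GRanges("chr1", IRanges(c(1, 1, 30), c(10, 10, 50)), strand = "+")
gr_other <- GRanges("chr2", IRanges(20, 30), strand = "-")
grl <- GRangesList(gr_dup, gr_other)

# One logical vector per list element
dup_within <- duplicated(grl)
dup_within[[1]]
# [1] FALSE  TRUE FALSE

# unique() drops the repeated range inside the first element,
# but the list still has two elements
unique_within <- unique(grl)
length(unique_within)
# [1] 2
lengths(unique_within)
# [1] 2 1
```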

To identify and remove duplicate GRanges elements across the entire GRangesList, you must compare the GRanges objects themselves. One approach is to generate a unique key for each GRanges element based on its contents and then use duplicated on those keys.

Here is an example using your data:

library(GenomicRanges)

# Your sample data
gr1 <- GRanges(seqnames = "chr1", ranges = IRanges(c(1,30), c(10,50)), strand = "+")
gr2 <- GRanges(seqnames = "chr2", ranges = IRanges(c(20,60), c(30,100)), strand = "-")
gr3 <- GRanges(seqnames = "chr1", ranges = IRanges(c(1,20), c(10,30)), strand = "+") 

my_grl <- GRangesList(gr1, gr2, gr3, gr1)

# Function to create a unique key for each GRanges
gr_key <- function(gr) {
  paste(seqnames(gr), ":", start(gr), "-", end(gr), ":", strand(gr), sep = "", collapse = ";")
}

# Generate keys
keys <- sapply(my_grl, gr_key)

# Identify duplicates
duplicated(keys)
# [1] FALSE FALSE FALSE  TRUE

# Remove duplicates
unique_grl <- my_grl[!duplicated(keys)]

length(unique_grl)
# [1] 3

This method treats the order of ranges within each GRanges as significant: two elements holding the same ranges in a different order get different keys. If you wish to ignore order, sort each GRanges first using sort(gr) inside the key function. Note that the key is built only from seqnames, start, end, and strand, so metadata columns are ignored; if they should affect uniqueness, add the relevant mcols(gr) values to the key.
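
As a rough sketch of the order-insensitive variant, assuming you want two elements to count as duplicates whenever they contain the same ranges in any order: gr_key_sorted is a hypothetical helper name, and gr_a, gr_b, and the score column are made-up test data.

```r
library(GenomicRanges)

# Hypothetical helper: same key as before, but built from the sorted ranges,
# so the original order within each GRanges no longer matters
gr_key_sorted <- function(gr) {
  gr <- sort(gr)
  paste0(as.character(seqnames(gr)), ":", start(gr), "-", end(gr), ":",
         as.character(strand(gr)), collapse = ";")
}

# Illustrative data: same ranges in a different order, different metadata
gr_a <- GRanges("chr1", IRanges(c(1, 30), c(10, 50)), strand = "+", score = c(1, 2))
gr_b <- GRanges("chr1", IRanges(c(30, 1), c(50, 10)), strand = "+", score = c(5, 6))
grl_ab <- GRangesList(gr_a, gr_b)

keys_sorted <- sapply(grl_ab, gr_key_sorted)
duplicated(keys_sorted)
# [1] FALSE  TRUE

unique_ab <- grl_ab[!duplicated(keys_sorted)]
length(unique_ab)
# [1] 1
```

Because the key uses only coordinates and strand, the differing score values do not prevent gr_b from being treated as a duplicate of gr_a.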

Kevin
