Unique GPos taking into account mcols
Entering edit mode
maltethodberg ▴ 140
Last seen 4 weeks ago

Is there a smart way of finding unique elements of a GPos while taking mcols into account?

unique() only uses seqnames/pos, and not mcols:

# Four unique positions when taking score-column into account:
gp1 <- GPos(c("chr1:10", "chr1:10", "chr1:11", "chr1:11", "chr1:11"))
score(gp1) <- c("A", "G", "A", "T", "T")

# Unique doesn't see the score column

After a bit of experimentation, first coercing to a DataFrame seems to work, but seems a bit hacky: It gives me an warning saying In .local(x, row.names, optional, ...) : 'optional' argument was ignored and I was unable to locate any documentation for it.

# Coerce to DataFrame
DF <- as(gp1, "DataFrame")

# Find unique rows

# Extract the GPos

My use case is a very long GPos (> 30 million) with Ref/Alt alleles.

GenomicRanges unique GPos • 180 views
Entering edit mode

Hi maltethodberg

We currently do not have a method that does this. Perhaps a duplicated method is appropriate but I am not the developer of GenomicRanges.

Hervé can provide more input when he gets back from his break.

You could try using the duplicated.data.frame approach:


gp1 <- GPos(c("chr1:10", "chr1:10", "chr1:11", "chr1:11", "chr1:11"))
score(gp1) <- c("A", "G", "A", "T", "T")

dups <- duplicated(as.data.frame(gp1))

#> StitchedGPos object with 4 positions and 1 metadata column:
#>       seqnames       pos strand |       score
#>          <Rle> <integer>  <Rle> | <character>
#>   [1]     chr1        10      * |           A
#>   [2]     chr1        10      * |           G
#>   [3]     chr1        11      * |           A
#>   [4]     chr1        11      * |           T
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

The other option which may not be what you're looking for but allows you to keep all the data and the actual unique(gp1) is:


gp1 <- GPos(c("chr1:10", "chr1:10", "chr1:11", "chr1:11", "chr1:11"))
score(gp1) <- c("A", "G", "A", "T", "T")

ugp <- unique(gp1)

scores <- IRanges::CharacterList(split(mcols(gp1)$score, as.character(gp1)))

mcols(ugp)$score <- unique(scores)

#> StitchedGPos object with 2 positions and 1 metadata column:
#>       seqnames       pos strand |           score
#>          <Rle> <integer>  <Rle> | <CharacterList>
#>   [1]     chr1        10      * |             A,G
#>   [2]     chr1        11      * |             A,T
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

It is a bit more complicated and may not be as efficient...

Best regards,


Entering edit mode

After a bit of investigation into existing Bioconductor packages, it seems the VariantAnnotation package calls some un-documented functions from S4Vectors that does roughly this, namely duplicatedIntegerQuads and matchIntegerQuads! Of course, they rely on the columns being integers/factors.

Entering edit mode

Hi maltethodberg,

Sorry for the late answer.

Coercing first to a DataFrame is probably the easiest way to go. Not sure how efficient that will be though since I'm not familiar with the implementation details of the unique() method for DataFrame objects.

If you want to explore a solution based on S4Vectors::duplicatedIntegerQuads/S4Vectors::matchIntegerQuads, note that any column col can be replaced with match(col, col) (or selfmatch(col)), which is an integer vector that will always behave like the original column from a unique()/duplicated() perspective.

Furthermore, a solution based on S4Vectors::duplicatedIntegerQuads/S4Vectors::matchIntegerQuads (possibly combined with some selfmatch(col) calls) is probably going to be the most efficient. It will also scale to an arbitrary number of metadata columns if you collapse groups of 4 metadata columns into a single metadata column with S4Vectors::selfmatchIntegerQuads(col1, col2, col3, col4), and repeat the collapsing process until you end up with a single metadata column. This approach requires a little bit of non-trivial work though.

Finally please note that unique() and duplicated() on a GPos object use the strand() in addition to seqnames() and pos(), like they do on a GRanges object or on any GenomicRanges derivative.



Login before adding your answer.

Traffic: 333 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6