Question

Unique GPos taking into account mcols

0

Entering edit mode

maltethodberg ▴ 180

@maltethodberg-9690

Last seen 4 weeks ago

Denmark

Is there a smart way of finding unique elements of a GPos while taking mcols into account?

unique() only uses seqnames/pos, and not mcols:

# Four unique positions when taking score-column into account:
gp1 <- GPos(c("chr1:10", "chr1:10", "chr1:11", "chr1:11", "chr1:11"))
score(gp1) <- c("A", "G", "A", "T", "T")

# Unique doesn't see the score column
unique(gp1)

After a bit of experimentation, first coercing to a DataFrame seems to work, but seems a bit hacky: It gives me an warning saying In .local(x, row.names, optional, ...) : 'optional' argument was ignored and I was unable to locate any documentation for it.

# Coerce to DataFrame
DF <- as(gp1, "DataFrame")

# Find unique rows
unique(DF)

# Extract the GPos
unique(DF)$X

My use case is a very long GPos (> 30 million) with Ref/Alt alleles.

GenomicRanges unique GPos • 1.2k views

ADD COMMENT • link 2.8 years ago • updated 2.7 years ago maltethodberg ▴ 180

1

Entering edit mode

Hi maltethodberg

We currently do not have a method that does this. Perhaps a duplicated method is appropriate but I am not the developer of GenomicRanges.

Hervé can provide more input when he gets back from his break.

You could try using the duplicated.data.frame approach:

suppressPackageStartupMessages({
    library(GenomicRanges)
})

gp1 <- GPos(c("chr1:10", "chr1:10", "chr1:11", "chr1:11", "chr1:11"))
score(gp1) <- c("A", "G", "A", "T", "T")

dups <- duplicated(as.data.frame(gp1))

gp1[!dups]
#> StitchedGPos object with 4 positions and 1 metadata column:
#>       seqnames       pos strand |       score
#>          <Rle> <integer>  <Rle> | <character>
#>   [1]     chr1        10      * |           A
#>   [2]     chr1        10      * |           G
#>   [3]     chr1        11      * |           A
#>   [4]     chr1        11      * |           T
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

The other option which may not be what you're looking for but allows you to keep all the data and the actual unique(gp1) is:

suppressPackageStartupMessages({
    library(GenomicRanges)
})

gp1 <- GPos(c("chr1:10", "chr1:10", "chr1:11", "chr1:11", "chr1:11"))
score(gp1) <- c("A", "G", "A", "T", "T")

ugp <- unique(gp1)

scores <- IRanges::CharacterList(split(mcols(gp1)$score, as.character(gp1)))

mcols(ugp)$score <- unique(scores)

ugp
#> StitchedGPos object with 2 positions and 1 metadata column:
#>       seqnames       pos strand |           score
#>          <Rle> <integer>  <Rle> | <CharacterList>
#>   [1]     chr1        10      * |             A,G
#>   [2]     chr1        11      * |             A,T
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

It is a bit more complicated and may not be as efficient...

Best regards,

Marcel

ADD REPLY • link 2.8 years ago Marcel Ramos 700

0

Entering edit mode

After a bit of investigation into existing Bioconductor packages, it seems the VariantAnnotation package calls some un-documented functions from S4Vectors that does roughly this, namely duplicatedIntegerQuads and matchIntegerQuads! Of course, they rely on the columns being integers/factors.

ADD REPLY • link 2.8 years ago maltethodberg ▴ 180

1

Entering edit mode

Hi maltethodberg,

Sorry for the late answer.

Coercing first to a DataFrame is probably the easiest way to go. Not sure how efficient that will be though since I'm not familiar with the implementation details of the unique() method for DataFrame objects.

If you want to explore a solution based on S4Vectors::duplicatedIntegerQuads/S4Vectors::matchIntegerQuads, note that any column col can be replaced with match(col, col) (or selfmatch(col)), which is an integer vector that will always behave like the original column from a unique()/duplicated() perspective.

Furthermore, a solution based on S4Vectors::duplicatedIntegerQuads/S4Vectors::matchIntegerQuads (possibly combined with some selfmatch(col) calls) is probably going to be the most efficient. It will also scale to an arbitrary number of metadata columns if you collapse groups of 4 metadata columns into a single metadata column with S4Vectors::selfmatchIntegerQuads(col1, col2, col3, col4), and repeat the collapsing process until you end up with a single metadata column. This approach requires a little bit of non-trivial work though.

Finally please note that unique() and duplicated() on a GPos object use the strand() in addition to seqnames() and pos(), like they do on a GRanges object or on any GenomicRanges derivative.

H.

ADD REPLY • link 2.7 years ago Hervé Pagès 16k

0

Entering edit mode

Yes, I managed to find those functions via the VariantAnnotation package. The "Quad" versions are very handy for matching seqnames/pos/effect allele/other allele quadruplets!

ADD REPLY • link 2.7 years ago maltethodberg ▴ 180