Entering edit mode
Is there a smart way of finding unique elements of a GPos
while taking mcols
into account?
unique()
only uses seqnames
/pos
, and not mcols
:
# Four unique positions when taking score-column into account:
gp1 <- GPos(c("chr1:10", "chr1:10", "chr1:11", "chr1:11", "chr1:11"))
score(gp1) <- c("A", "G", "A", "T", "T")
# Unique doesn't see the score column
unique(gp1)
After a bit of experimentation, first coercing to a DataFrame
seems to work, but seems a bit hacky: It gives me an warning saying In .local(x, row.names, optional, ...) : 'optional' argument was ignored
and I was unable to locate any documentation for it.
# Coerce to DataFrame
DF <- as(gp1, "DataFrame")
# Find unique rows
unique(DF)
# Extract the GPos
unique(DF)$X
My use case is a very long GPos (> 30 million) with Ref/Alt alleles.
Hi maltethodberg
We currently do not have a method that does this. Perhaps a
duplicated
method is appropriate but I am not the developer ofGenomicRanges
.Hervé can provide more input when he gets back from his break.
You could try using the
duplicated.data.frame
approach:The other option which may not be what you're looking for but allows you to keep all the data and the actual
unique(gp1)
is:It is a bit more complicated and may not be as efficient...
Best regards,
Marcel
After a bit of investigation into existing Bioconductor packages, it seems the VariantAnnotation package calls some un-documented functions from S4Vectors that does roughly this, namely duplicatedIntegerQuads and matchIntegerQuads! Of course, they rely on the columns being integers/factors.
Hi maltethodberg,
Sorry for the late answer.
Coercing first to a DataFrame is probably the easiest way to go. Not sure how efficient that will be though since I'm not familiar with the implementation details of the
unique()
method for DataFrame objects.If you want to explore a solution based on
S4Vectors::duplicatedIntegerQuads
/S4Vectors::matchIntegerQuads
, note that any columncol
can be replaced withmatch(col, col)
(orselfmatch(col)
), which is an integer vector that will always behave like the original column from aunique()
/duplicated()
perspective.Furthermore, a solution based on
S4Vectors::duplicatedIntegerQuads
/S4Vectors::matchIntegerQuads
(possibly combined with someselfmatch(col)
calls) is probably going to be the most efficient. It will also scale to an arbitrary number of metadata columns if you collapse groups of 4 metadata columns into a single metadata column withS4Vectors::selfmatchIntegerQuads(col1, col2, col3, col4)
, and repeat the collapsing process until you end up with a single metadata column. This approach requires a little bit of non-trivial work though.Finally please note that
unique()
andduplicated()
on a GPos object use thestrand()
in addition toseqnames()
andpos()
, like they do on a GRanges object or on any GenomicRanges derivative.H.
Yes, I managed to find those functions via the VariantAnnotation package. The "Quad" versions are very handy for matching seqnames/pos/effect allele/other allele quadruplets!