Entering edit mode
What follows is some random comments on the use of GRanges for storing
data in the elementMetadata slot. My use case is a lot of very small
ranges (>10M ranges of width 1), which may be a special border case
(but very important to me :). I am particular interested in fast
subsetting like subsetByOverlaps.
Comments:
* I find column subsetting on the elementMetadata to be really slow,
slower than it need to be I would say (I suspect some validity
checking, see comment below on row subsetting).
* There is no easy shortcut to get a specific metadata column (gr[,
"score"] returns a GRange). I would find it natural that gr$score
would do that. This is important in order to get at the data stores
in the GRange
* while there is a as.data.frame, as(from, "data.frame") does not work
and I would also appreciate an as(from, "GRanges") method with from
being a data.frame. I have a quick function at the bottom of my
email, for doing this.
* Since start, end, width are reserved column names for GRanges, it
would be nice if the GRanges constructor was extended to allow
GRanges(seqnames = "chr1", start = 1, end = 10)
instead of now where we need
GRanges(seqnames = "chr1", ranges = IRanges(start = 1, end = 10))
(ok, this is pretty minor, but it is convenient)
* There is no sort method implemented for GRanges; it returns an error
* is.unsorted(xx) seems to always return FALSE with a warning when xx
is an IRange or a GRange
*It seems to me that row subsetting, specifically subsetByOverlaps is
a bit too slow. The find overlaps part is blazingly fast (when ranges
is a single IRange); it seems like the initialization and subsetting
part takes "more time than necessary". This is based on comparing a
(long) GRanges with around 8 elementMetadata columns to another
approach where instead of using elementMetadata I use two elements of
a list: a GRanges without elementMetadata and where the
elementMetadata is stored in a separate data.frame - I then use
findOverlaps to get me the indices (when I compare using a GRanges to
using a GRanges and separate matrix, I use the "position" GRanges to
get my indexes, which I use to subset the matrix with).
Specifically here gr is a GRanges with length 25M (no elementMetadata)
and mat is a 25M x 8 matrix
grIn <- gr
elementMetadata(grIn) <- mat
## The "classic" approach:
system.time({
grOut1 <- subsetByOverlaps(grIn, GRanges(seqnames = "chr1", ranges =
IRanges(start = 1, end = 10^7)))
}) # 4.7 secs
## Keeping gr and mat separate:
system.time({
qh <- queryHits(findOverlaps(grIn, GRanges(seqnames = "chr1",
ranges = IRanges(start = 1, end = 10^7))))
grOut2 <- list(gr[qh], mat = mat[qh,])
}) # 0.9 secs
## Doing the above, but forming a GRange after the subsetting adds
just a small overhead
system.time({
qh <- queryHits(findOverlaps(grIn, GRanges(seqnames = "chr1",
ranges = IRanges(start = 1, end = 10^7))))
grOut3 = gr[qh]
elementMetadata(grOut3) <- mat[qh,]
}) # 1.1 secs
* For really large query GRanges timing results (not shown) seems to
indicate that the line
query[!is.na(findOverlaps(query, subject, maxgap = maxgap,
minoverlap = minoverlap, type = type, select = "first"))]
in subsetByOverlaps(query = "GRanges", subject = "GRanges") could be
made faster by changing it to
query[queryHits(findOverlaps(query, subject, maxgap = maxgap,
minoverlap = minoverlap, type = type))]
(my timing is for about a query length of 30M and a subject length of
1)
* I find it counter-intuitive that the following code doesn't work:
R> example(GRanges)
R> seqnames(gr) = "chr1"
or perhaps more intuitive
R> gr1 = gr[seqnames(gr) == "Chrom1"]
R> seqnames(gr1) = "chr1"
In general I don't like that seqnames is a factor. This is fine when
I just consider a single GRange, but when I put together multiple
GRanges from different sources, I sometimes get into problems.
Specifically, in human, it is not always true that the various contigs
(like "chr10_random") are part of the data sources. I would have
preferred it to be a character and then have seqlengths essentially be
NA in case a character does not appear in the names(seqlengths). I
know that I have complained about this a long time ago and it is
probably too late to change, but still :)
Kasper
data.frame2GRanges <- function(object, keepColumns = TRUE) {
stopifnot(class(object) == "data.frame")
stopifnot(all(c("chr", "start", "end") %in% names(object)))
if("strand" %in% names(object)) {
if(is.numeric(object$strand)) {
strand <- ifelse(object$strand == 1, "+", "*")
strand[object$strand == -1] <- "-"
object$strand <- strand
}
gr <- GRanges(seqnames = object$chr,
ranges = IRanges(start = object$start, end =
object$end),
strand = object$strand)
} else {
gr <- GRanges(seqnames = object$chr,
ranges = IRanges(start = object$start, end =
object$end))
}
if(keepColumns) {
dt <- as(object[, setdiff(names(object), c("chr", "start",
"end", "strand"))],
"DataFrame")
elementMetadata(gr) <- dt
}
names(gr) <- rownames(object)
gr
}