some random comments on GRanges

0

Entering edit mode

Kasper Daniel Hansen ★ 6.5k

@kasper-daniel-hansen-2979

Last seen 2.2 years ago

United States

What follows is some random comments on the use of GRanges for storing data in the elementMetadata slot. My use case is a lot of very small ranges (>10M ranges of width 1), which may be a special border case (but very important to me :). I am particular interested in fast subsetting like subsetByOverlaps. Comments: * I find column subsetting on the elementMetadata to be really slow, slower than it need to be I would say (I suspect some validity checking, see comment below on row subsetting). * There is no easy shortcut to get a specific metadata column (gr[, "score"] returns a GRange). I would find it natural that gr$score would do that. This is important in order to get at the data stores in the GRange * while there is a as.data.frame, as(from, "data.frame") does not work and I would also appreciate an as(from, "GRanges") method with from being a data.frame. I have a quick function at the bottom of my email, for doing this. * Since start, end, width are reserved column names for GRanges, it would be nice if the GRanges constructor was extended to allow GRanges(seqnames = "chr1", start = 1, end = 10) instead of now where we need GRanges(seqnames = "chr1", ranges = IRanges(start = 1, end = 10)) (ok, this is pretty minor, but it is convenient) * There is no sort method implemented for GRanges; it returns an error * is.unsorted(xx) seems to always return FALSE with a warning when xx is an IRange or a GRange *It seems to me that row subsetting, specifically subsetByOverlaps is a bit too slow. The find overlaps part is blazingly fast (when ranges is a single IRange); it seems like the initialization and subsetting part takes "more time than necessary". This is based on comparing a (long) GRanges with around 8 elementMetadata columns to another approach where instead of using elementMetadata I use two elements of a list: a GRanges without elementMetadata and where the elementMetadata is stored in a separate data.frame - I then use findOverlaps to get me the indices (when I compare using a GRanges to using a GRanges and separate matrix, I use the "position" GRanges to get my indexes, which I use to subset the matrix with). Specifically here gr is a GRanges with length 25M (no elementMetadata) and mat is a 25M x 8 matrix grIn <- gr elementMetadata(grIn) <- mat ## The "classic" approach: system.time({ grOut1 <- subsetByOverlaps(grIn, GRanges(seqnames = "chr1", ranges = IRanges(start = 1, end = 10^7))) }) # 4.7 secs ## Keeping gr and mat separate: system.time({ qh <- queryHits(findOverlaps(grIn, GRanges(seqnames = "chr1", ranges = IRanges(start = 1, end = 10^7)))) grOut2 <- list(gr[qh], mat = mat[qh,]) }) # 0.9 secs ## Doing the above, but forming a GRange after the subsetting adds just a small overhead system.time({ qh <- queryHits(findOverlaps(grIn, GRanges(seqnames = "chr1", ranges = IRanges(start = 1, end = 10^7)))) grOut3 = gr[qh] elementMetadata(grOut3) <- mat[qh,] }) # 1.1 secs * For really large query GRanges timing results (not shown) seems to indicate that the line query[!is.na(findOverlaps(query, subject, maxgap = maxgap, minoverlap = minoverlap, type = type, select = "first"))] in subsetByOverlaps(query = "GRanges", subject = "GRanges") could be made faster by changing it to query[queryHits(findOverlaps(query, subject, maxgap = maxgap, minoverlap = minoverlap, type = type))] (my timing is for about a query length of 30M and a subject length of 1) * I find it counter-intuitive that the following code doesn't work: R> example(GRanges) R> seqnames(gr) = "chr1" or perhaps more intuitive R> gr1 = gr[seqnames(gr) == "Chrom1"] R> seqnames(gr1) = "chr1" In general I don't like that seqnames is a factor. This is fine when I just consider a single GRange, but when I put together multiple GRanges from different sources, I sometimes get into problems. Specifically, in human, it is not always true that the various contigs (like "chr10_random") are part of the data sources. I would have preferred it to be a character and then have seqlengths essentially be NA in case a character does not appear in the names(seqlengths). I know that I have complained about this a long time ago and it is probably too late to change, but still :) Kasper data.frame2GRanges <- function(object, keepColumns = TRUE) { stopifnot(class(object) == "data.frame") stopifnot(all(c("chr", "start", "end") %in% names(object))) if("strand" %in% names(object)) { if(is.numeric(object$strand)) { strand <- ifelse(object$strand == 1, "+", "*") strand[object$strand == -1] <- "-" object$strand <- strand } gr <- GRanges(seqnames = object$chr, ranges = IRanges(start = object$start, end = object$end), strand = object$strand) } else { gr <- GRanges(seqnames = object$chr, ranges = IRanges(start = object$start, end = object$end)) } if(keepColumns) { dt <- as(object[, setdiff(names(object), c("chr", "start", "end", "strand"))], "DataFrame") elementMetadata(gr) <- dt } names(gr) <- rownames(object) gr }

• 1.9k views

ADD COMMENT • link updated 15.0 years ago by Michael Lawrence ★ 11k • written 15.0 years ago by Kasper Daniel Hansen ★ 6.5k

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.7 years ago

United States

On Mon, Sep 13, 2010 at 6:16 AM, Kasper Daniel Hansen < kasperdanielhansen@gmail.com> wrote: > What follows is some random comments on the use of GRanges for storing > data in the elementMetadata slot. My use case is a lot of very small > ranges (>10M ranges of width 1), which may be a special border case > (but very important to me :). I am particular interested in fast > subsetting like subsetByOverlaps. > > I'm not the primary developer of this package, but I have some comments: > Comments: > > * I find column subsetting on the elementMetadata to be really slow, > slower than it need to be I would say (I suspect some validity > checking, see comment below on row subsetting). > > > * There is no easy shortcut to get a specific metadata column (gr[, > "score"] returns a GRange). I would find it natural that gr$score > would do that. This is important in order to get at the data stores > in the GRange > > Others, including me, have suggested this. It's a design choice by the guys in Seattle... > * while there is a as.data.frame, as(from, "data.frame") does not work > and I would also appreciate an as(from, "GRanges") method with from > being a data.frame. I have a quick function at the bottom of my > email, for doing this. > > I'm not such a big fan of as(from, "GRanges"), since it's going from something that is loosely defined to something with more constraints. We did not support this in RangedData either. Usually, it's not so tough to just use the constructor, and it makes the translation more clear. I could be convinced otherwise, though. > * Since start, end, width are reserved column names for GRanges, it > would be nice if the GRanges constructor was extended to allow > GRanges(seqnames = "chr1", start = 1, end = 10) > instead of now where we need > GRanges(seqnames = "chr1", ranges = IRanges(start = 1, end = 10)) > (ok, this is pretty minor, but it is convenient) > > * There is no sort method implemented for GRanges; it returns an error > > * is.unsorted(xx) seems to always return FALSE with a warning when xx > is an IRange or a GRange > > *It seems to me that row subsetting, specifically subsetByOverlaps is > a bit too slow. The find overlaps part is blazingly fast (when ranges > is a single IRange); it seems like the initialization and subsetting > part takes "more time than necessary". This is based on comparing a > (long) GRanges with around 8 elementMetadata columns to another > approach where instead of using elementMetadata I use two elements of > a list: a GRanges without elementMetadata and where the > elementMetadata is stored in a separate data.frame - I then use > findOverlaps to get me the indices (when I compare using a GRanges to > using a GRanges and separate matrix, I use the "position" GRanges to > get my indexes, which I use to subset the matrix with). > > Specifically here gr is a GRanges with length 25M (no elementMetadata) > and mat is a 25M x 8 matrix > > grIn <- gr > elementMetadata(grIn) <- mat > > ## The "classic" approach: > system.time({ > grOut1 <- subsetByOverlaps(grIn, GRanges(seqnames = "chr1", ranges = > IRanges(start = 1, end = 10^7))) > }) # 4.7 secs > > ## Keeping gr and mat separate: > system.time({ > qh <- queryHits(findOverlaps(grIn, GRanges(seqnames = "chr1", > ranges = IRanges(start = 1, end = 10^7)))) > grOut2 <- list(gr[qh], mat = mat[qh,]) > }) # 0.9 secs > > ## Doing the above, but forming a GRange after the subsetting adds > just a small overhead > system.time({ > qh <- queryHits(findOverlaps(grIn, GRanges(seqnames = "chr1", > ranges = IRanges(start = 1, end = 10^7)))) > grOut3 = gr[qh] > elementMetadata(grOut3) <- mat[qh,] > }) # 1.1 secs > > > * For really large query GRanges timing results (not shown) seems to > indicate that the line > query[!is.na(findOverlaps(query, subject, maxgap = maxgap, > minoverlap = minoverlap, type = type, select = "first"))] > in subsetByOverlaps(query = "GRanges", subject = "GRanges") could be > made faster by changing it to > query[queryHits(findOverlaps(query, subject, maxgap = maxgap, > minoverlap = minoverlap, type = type))] > (my timing is for about a query length of 30M and a subject length of 1) > > Right, select = "first" will be slow for GRanges, since at the low- level it does a findOverlaps(select="all"), and then filters for "first". It is not clear to me why it always does "all" here. Convenience of implementation? The filtering is necessary though. Using queryHits() and select = "all" could cause a query to become repeated if it hits multiple subjects. Obviously not a problem when you have a subject of length 1. In that case though, findOverlaps, etc is overkill. * I find it counter-intuitive that the following code doesn't work: > R> example(GRanges) > R> seqnames(gr) = "chr1" > or perhaps more intuitive > R> gr1 = gr[seqnames(gr) == "Chrom1"] > R> seqnames(gr1) = "chr1" > In general I don't like that seqnames is a factor. This is fine when > I just consider a single GRange, but when I put together multiple > GRanges from different sources, I sometimes get into problems. > Specifically, in human, it is not always true that the various contigs > (like "chr10_random") are part of the data sources. I would have > preferred it to be a character and then have seqlengths essentially be > NA in case a character does not appear in the names(seqlengths). I > know that I have complained about this a long time ago and it is > probably too late to change, but still :) > > > Kasper > > data.frame2GRanges <- function(object, keepColumns = TRUE) { > stopifnot(class(object) == "data.frame") > stopifnot(all(c("chr", "start", "end") %in% names(object))) > if("strand" %in% names(object)) { > if(is.numeric(object$strand)) { > strand <- ifelse(object$strand == 1, "+", "*") > strand[object$strand == -1] <- "-" > object$strand <- strand > } > gr <- GRanges(seqnames = object$chr, > ranges = IRanges(start = object$start, end = > object$end), > strand = object$strand) > } else { > gr <- GRanges(seqnames = object$chr, > ranges = IRanges(start = object$start, end = > object$end)) > } > if(keepColumns) { > dt <- as(object[, setdiff(names(object), c("chr", "start", > "end", "strand"))], > "DataFrame") > elementMetadata(gr) <- dt > } > names(gr) <- rownames(object) > gr > } > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 15.0 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.7 years ago

United States

On Mon, Sep 13, 2010 at 6:16 AM, Kasper Daniel Hansen < kasperdanielhansen@gmail.com> wrote: > What follows is some random comments on the use of GRanges for storing > data in the elementMetadata slot. My use case is a lot of very small > ranges (>10M ranges of width 1), which may be a special border case > (but very important to me :). I am particular interested in fast > subsetting like subsetByOverlaps. > > Comments: > > * I find column subsetting on the elementMetadata to be really slow, > slower than it need to be I would say (I suspect some validity > checking, see comment below on row subsetting). > > * There is no easy shortcut to get a specific metadata column (gr[, > "score"] returns a GRange). I would find it natural that gr$score > would do that. This is important in order to get at the data stores > in the GRange > > * while there is a as.data.frame, as(from, "data.frame") does not work > and I would also appreciate an as(from, "GRanges") method with from > being a data.frame. I have a quick function at the bottom of my > email, for doing this. > > * Since start, end, width are reserved column names for GRanges, it > would be nice if the GRanges constructor was extended to allow > GRanges(seqnames = "chr1", start = 1, end = 10) > instead of now where we need > GRanges(seqnames = "chr1", ranges = IRanges(start = 1, end = 10)) > (ok, this is pretty minor, but it is convenient) > > * There is no sort method implemented for GRanges; it returns an error > > * is.unsorted(xx) seems to always return FALSE with a warning when xx > is an IRange or a GRange > > *It seems to me that row subsetting, specifically subsetByOverlaps is > a bit too slow. The find overlaps part is blazingly fast (when ranges > is a single IRange); it seems like the initialization and subsetting > part takes "more time than necessary". This is based on comparing a > (long) GRanges with around 8 elementMetadata columns to another > approach where instead of using elementMetadata I use two elements of > a list: a GRanges without elementMetadata and where the > elementMetadata is stored in a separate data.frame - I then use > findOverlaps to get me the indices (when I compare using a GRanges to > using a GRanges and separate matrix, I use the "position" GRanges to > get my indexes, which I use to subset the matrix with). > > Specifically here gr is a GRanges with length 25M (no elementMetadata) > and mat is a 25M x 8 matrix > > grIn <- gr > elementMetadata(grIn) <- mat > > ## The "classic" approach: > system.time({ > grOut1 <- subsetByOverlaps(grIn, GRanges(seqnames = "chr1", ranges = > IRanges(start = 1, end = 10^7))) > }) # 4.7 secs > > ## Keeping gr and mat separate: > system.time({ > qh <- queryHits(findOverlaps(grIn, GRanges(seqnames = "chr1", > ranges = IRanges(start = 1, end = 10^7)))) > grOut2 <- list(gr[qh], mat = mat[qh,]) > }) # 0.9 secs > > ## Doing the above, but forming a GRange after the subsetting adds > just a small overhead > system.time({ > qh <- queryHits(findOverlaps(grIn, GRanges(seqnames = "chr1", > ranges = IRanges(start = 1, end = 10^7)))) > grOut3 = gr[qh] > elementMetadata(grOut3) <- mat[qh,] > }) # 1.1 secs > > > * For really large query GRanges timing results (not shown) seems to > indicate that the line > query[!is.na(findOverlaps(query, subject, maxgap = maxgap, > minoverlap = minoverlap, type = type, select = "first"))] > in subsetByOverlaps(query = "GRanges", subject = "GRanges") could be > made faster by changing it to > query[queryHits(findOverlaps(query, subject, maxgap = maxgap, > minoverlap = minoverlap, type = type))] > (my timing is for about a query length of 30M and a subject length of 1) > > * I find it counter-intuitive that the following code doesn't work: > R> example(GRanges) > R> seqnames(gr) = "chr1" > or perhaps more intuitive > R> gr1 = gr[seqnames(gr) == "Chrom1"] > R> seqnames(gr1) = "chr1" > In general I don't like that seqnames is a factor. This is fine when > I just consider a single GRange, but when I put together multiple > GRanges from different sources, I sometimes get into problems. > Specifically, in human, it is not always true that the various contigs > (like "chr10_random") are part of the data sources. I would have > preferred it to be a character and then have seqlengths essentially be > NA in case a character does not appear in the names(seqlengths). I > know that I have complained about this a long time ago and it is > probably too late to change, but still :) > > Just saw this. Conceptually, a factor seems appropriate for sequence names. It does cause major headaches though. For example pintersect() requires the sequences names to be setequal(). But trying to fix the levels in seqnames is difficult, due to the validity constraint that the seqlengths need the same entries as seqnames levels. If there was someway to fix that without having to hack things using low-level slot accessors, it would be great. Michael > > Kasper > > data.frame2GRanges <- function(object, keepColumns = TRUE) { > stopifnot(class(object) == "data.frame") > stopifnot(all(c("chr", "start", "end") %in% names(object))) > if("strand" %in% names(object)) { > if(is.numeric(object$strand)) { > strand <- ifelse(object$strand == 1, "+", "*") > strand[object$strand == -1] <- "-" > object$strand <- strand > } > gr <- GRanges(seqnames = object$chr, > ranges = IRanges(start = object$start, end = > object$end), > strand = object$strand) > } else { > gr <- GRanges(seqnames = object$chr, > ranges = IRanges(start = object$start, end = > object$end)) > } > if(keepColumns) { > dt <- as(object[, setdiff(names(object), c("chr", "start", > "end", "strand"))], > "DataFrame") > elementMetadata(gr) <- dt > } > names(gr) <- rownames(object) > gr > } > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 15.0 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Kasper -- Mostly in agreement with what you say, with some caveats below... On 09/13/2010 11:57 AM, Michael Lawrence wrote: > On Mon, Sep 13, 2010 at 6:16 AM, Kasper Daniel Hansen < > kasperdanielhansen at gmail.com> wrote: > >> What follows is some random comments on the use of GRanges for storing >> data in the elementMetadata slot. My use case is a lot of very small >> ranges (>10M ranges of width 1), which may be a special border case >> (but very important to me :). I am particular interested in fast >> subsetting like subsetByOverlaps. >> >> Comments: >> >> * I find column subsetting on the elementMetadata to be really slow, >> slower than it need to be I would say (I suspect some validity >> checking, see comment below on row subsetting). >> * There is no easy shortcut to get a specific metadata column (gr[, >> "score"] returns a GRange). I would find it natural that gr$score >> would do that. This is important in order to get at the data stores >> in the GRange To be sure we're on the same page, values(gr)$score get's what you want, right? In an earlier iteration on this theme Michael mentioned eSet. eSet didn't strongly shape the discussion of GRanges, but since both pheno- and feature-Data can have annotations I would rather that it had become phenoData(eset)$x, featureData(eset)$y for symmetry and consistent interface. Not sure that this is really relevant to GRanges. > >> * while there is a as.data.frame, as(from, "data.frame") does not work >> and I would also appreciate an as(from, "GRanges") method with from >> being a data.frame. I have a quick function at the bottom of my >> email, for doing this. Not so sure about this though, for reasons Michael mentions. >> * Since start, end, width are reserved column names for GRanges, it >> would be nice if the GRanges constructor was extended to allow >> GRanges(seqnames = "chr1", start = 1, end = 10) >> instead of now where we need >> GRanges(seqnames = "chr1", ranges = IRanges(start = 1, end = 10)) >> (ok, this is pretty minor, but it is convenient) Not too keen on this, either, as it (or supporting both, anyway) makes for a complicated interface; glad it's a pretty minor issue ;) Martin >> * There is no sort method implemented for GRanges; it returns an error >> >> * is.unsorted(xx) seems to always return FALSE with a warning when xx >> is an IRange or a GRange >> *It seems to me that row subsetting, specifically subsetByOverlaps is >> a bit too slow. The find overlaps part is blazingly fast (when ranges >> is a single IRange); it seems like the initialization and subsetting >> part takes "more time than necessary". This is based on comparing a >> (long) GRanges with around 8 elementMetadata columns to another >> approach where instead of using elementMetadata I use two elements of >> a list: a GRanges without elementMetadata and where the >> elementMetadata is stored in a separate data.frame - I then use >> findOverlaps to get me the indices (when I compare using a GRanges to >> using a GRanges and separate matrix, I use the "position" GRanges to >> get my indexes, which I use to subset the matrix with). >> >> Specifically here gr is a GRanges with length 25M (no elementMetadata) >> and mat is a 25M x 8 matrix >> >> grIn <- gr >> elementMetadata(grIn) <- mat >> >> ## The "classic" approach: >> system.time({ >> grOut1 <- subsetByOverlaps(grIn, GRanges(seqnames = "chr1", ranges = >> IRanges(start = 1, end = 10^7))) >> }) # 4.7 secs >> >> ## Keeping gr and mat separate: >> system.time({ >> qh <- queryHits(findOverlaps(grIn, GRanges(seqnames = "chr1", >> ranges = IRanges(start = 1, end = 10^7)))) >> grOut2 <- list(gr[qh], mat = mat[qh,]) >> }) # 0.9 secs >> >> ## Doing the above, but forming a GRange after the subsetting adds >> just a small overhead >> system.time({ >> qh <- queryHits(findOverlaps(grIn, GRanges(seqnames = "chr1", >> ranges = IRanges(start = 1, end = 10^7)))) >> grOut3 = gr[qh] >> elementMetadata(grOut3) <- mat[qh,] >> }) # 1.1 secs >> >> >> * For really large query GRanges timing results (not shown) seems to >> indicate that the line >> query[!is.na(findOverlaps(query, subject, maxgap = maxgap, >> minoverlap = minoverlap, type = type, select = "first"))] >> in subsetByOverlaps(query = "GRanges", subject = "GRanges") could be >> made faster by changing it to >> query[queryHits(findOverlaps(query, subject, maxgap = maxgap, >> minoverlap = minoverlap, type = type))] >> (my timing is for about a query length of 30M and a subject length of 1) >> * I find it counter-intuitive that the following code doesn't work: >> R> example(GRanges) >> R> seqnames(gr) = "chr1" >> or perhaps more intuitive >> R> gr1 = gr[seqnames(gr) == "Chrom1"] >> R> seqnames(gr1) = "chr1" >> In general I don't like that seqnames is a factor. This is fine when >> I just consider a single GRange, but when I put together multiple >> GRanges from different sources, I sometimes get into problems. >> Specifically, in human, it is not always true that the various contigs >> (like "chr10_random") are part of the data sources. I would have >> preferred it to be a character and then have seqlengths essentially be >> NA in case a character does not appear in the names(seqlengths). I >> know that I have complained about this a long time ago and it is >> probably too late to change, but still :) >> >> > Just saw this. > > Conceptually, a factor seems appropriate for sequence names. It does cause > major headaches though. For example pintersect() requires the sequences > names to be setequal(). But trying to fix the levels in seqnames is > difficult, due to the validity constraint that the seqlengths need the same > entries as seqnames levels. If there was someway to fix that without having > to hack things using low-level slot accessors, it would be great. > > Michael > > > >> >> Kasper >> >> data.frame2GRanges <- function(object, keepColumns = TRUE) { >> stopifnot(class(object) == "data.frame") >> stopifnot(all(c("chr", "start", "end") %in% names(object))) >> if("strand" %in% names(object)) { >> if(is.numeric(object$strand)) { >> strand <- ifelse(object$strand == 1, "+", "*") >> strand[object$strand == -1] <- "-" >> object$strand <- strand >> } >> gr <- GRanges(seqnames = object$chr, >> ranges = IRanges(start = object$start, end = >> object$end), >> strand = object$strand) >> } else { >> gr <- GRanges(seqnames = object$chr, >> ranges = IRanges(start = object$start, end = >> object$end)) >> } >> if(keepColumns) { >> dt <- as(object[, setdiff(names(object), c("chr", "start", >> "end", "strand"))], >> "DataFrame") >> elementMetadata(gr) <- dt >> } >> names(gr) <- rownames(object) >> gr >> } >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 15.0 years ago Martin Morgan 25k

Login before adding your answer.