subsetting SummarizedExperiments - a proposal (or a hack?)

0

Entering edit mode

Malcolm Cook ★ 1.6k

@malcolm-cook-6293

Last seen 15 days ago

United States

Martin, Google says: 'Your search - "subset.SummarizedExperiment" - did not match any documents.' So I offer the below for your consideration, along with a few examples and a quick benchmark comparison with using the more explicit `[` extraction operator. subset.SummarizedExperiment<-function(x ,rowSubset=TRUE ,colSubset=TRUE ,assaySubset=TRUE ,drop=FALSE ,provenanceTrack=FALSE ,...) { ## PURPOSE: implement subsetting of SummarizedExperiments, ## allowing expressions in terms of: ## ## + the GRanges (and its mcols meta-data) held in rowData ## + the experimental meta-data held in the colData DataFrame ## + the names of the assays x<-x[ ##eval(as.expression(substitute(row)),mcols(rowData(x)),.GlobalEnv) eval(as.expression(substitute(rowSubset)),as.data.frame(rowDa ta(x)),.GlobalEnv) ,eval(as.expression(substitute(colSubset)),colData(x),.GlobalEnv) ,drop=drop,...] if (! identical(TRUE,assaySubset)) assays(x)<-assays(x)[assaySubset] if(provenanceTrack) { exptData(x)$rowSubset<-c(exptData(x)$rowSubset,as.character(su bstitute(rowSubset))) exptData(x)$colSubset<-c(exptData(x)$colSubset,as.character(su bstitute(colSubset))) } x } attr(subset,'ex')<-function() { example(SummarizedExperiment) assays(se1)$a2<-assays(se1)$counts*2 assays(se1)$a3<-assays(se1)$counts*3 benchmark(replications=100 ,se1.ss1<-se1[start(rowData(se1))==344757,se1$Treatment=='ChIP'] ,se1.ss2<-subset(se1,start==344757,Treatment=='ChIP') ) stopifnot(identical(assays(se1.ss1),assays(se1.ss2))) se1.ss3<-subset(se1,strand=='+',Treatment=='ChIP',c('a2','a3')) stopifnot(identical(c('a2','a3'),names(assays(se1.ss3)))) } The rbenchmarks shows the cost of overhead in this contrived minimal example. YMMV: test replications elapsed relative user.self sys.self user.child sys.child 3 c("a2", "a3") 100 0.001 1 0.001 0.000 0 0 1 se1.ss1 <- se1[start(rowData(se1)) == 344757, se1$Treatment == "ChIP"] 100 3.113 3113 3.111 0.001 0 0 2 se1.ss2 <- subset(se1, start == 344757, Treatment == "ChIP") 100 3.486 3486 3.486 0.000 0 0 Is all this in the spirit of things as you see it? It sure makes my use of SE "scan" (erhm, like poetry;) if you like, I also have castLong and castWide functions to reshape a (subsetted) SE as a data.table for use in ggplot and friends. You like? Cheers, ~Malcolm

• 1.7k views

ADD COMMENT • link updated 10.2 years ago by Martin Morgan 25k • written 10.2 years ago by Malcolm Cook ★ 1.6k

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

Hi, I think these ideas are pretty cool, as I often find myself doing both of the things you mention: * rather verbose subsetting/slicing/dicing of SEs * casting my SEs into data.tables to do "stuff" (not just plotting) I dig the idea of defining `subset` on SEs (and even GRanges alone), but I suspect there will be an issue with the semantics, ie. considering a SE (or whatever) as a 2d object, and what the "rows" and "columns" of them actually mean (I recall a "subset GRanges mojo" thread I might have raised that kicked the same bees nest some time (a few years(?)) ago) ... Anyway -- yeah, would be curious to see what your cast* functions look like ;-) -steve -- Steve Lianoglou Computational Biologist Genentech

ADD COMMENT • link 10.2 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

On 02/05/2014 02:25 PM, Steve Lianoglou wrote: > Hi, > > I think these ideas are pretty cool, as I often find myself doing both > of the things you mention: > > * rather verbose subsetting/slicing/dicing of SEs actually, for Malcolm's example there's just 1 more character se1[start(se1) == 697568, se1$Treatment == 'ChIP'] subset(se1, start == 697568, Treatment == 'ChIP') but I get the point. > * casting my SEs into data.tables to do "stuff" (not just plotting) > > I dig the idea of defining `subset` on SEs (and even GRanges alone), > but I suspect there will be an issue with the semantics, ie. > considering a SE (or whatever) as a 2d object, and what the "rows" and > "columns" of them actually mean (I recall a "subset GRanges mojo" > thread I might have raised that kicked the same bees nest some time (a > few years(?)) ago) ... As the SummarizedExperiment author I do think of it as matrix-like, e.g., with dim(), dimnames(), and two-dimensional sub-setting. And SummarizedExperiment supports the more list-like GRanges functions such as subsetByOverlaps, %over%, etc. that transparently use the rowData(). Martin > > Anyway -- yeah, would be curious to see what your cast* functions look like ;-) > > -steve > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 10.2 years ago Martin Morgan 25k

0

Entering edit mode

On 02/05/2014 02:25 PM, Steve Lianoglou wrote: > Hi, > > I think these ideas are pretty cool, as I often find myself doing both > of the things you mention: > > * rather verbose subsetting/slicing/dicing of SEs actually, for Malcolm's example there's just 1 more character se1[start(se1) == 697568, se1$Treatment == 'ChIP'] subset(se1, start == 697568, Treatment == 'ChIP') but I get the point. > * casting my SEs into data.tables to do "stuff" (not just plotting) > > I dig the idea of defining `subset` on SEs (and even GRanges alone), > but I suspect there will be an issue with the semantics, ie. > considering a SE (or whatever) as a 2d object, and what the "rows" and > "columns" of them actually mean (I recall a "subset GRanges mojo" > thread I might have raised that kicked the same bees nest some time (a > few years(?)) ago) ... As the SummarizedExperiment author I do think of it as matrix-like, e.g., with dim(), dimnames(), and two-dimensional sub-setting. And SummarizedExperiment supports the more list-like GRanges functions such as subsetByOverlaps, %over%, etc. that transparently use the rowData(). Martin > > Anyway -- yeah, would be curious to see what your cast* functions look like > > -steve > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 10.2 years ago Martin Morgan 25k

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 12 days ago

United States

Hi Malcom -- On 02/05/2014 02:11 PM, Cook, Malcolm wrote: > Martin, > > Google says: 'Your search - "subset.SummarizedExperiment" - did not match any documents.' If it had existed, it would have been an S4 method revealed by library(GenomicAlignments) # library(GenomicRanges) in release showMethods("subset") method?"subset,SummarizedExperiment" ?"subset,SummarizedExperiment-method" with tab completion available on the last two after getting through "subset". Google would have found "subset,SummarizedExperiment-method" (if it existed) but Google and tab completion with ? would have failed (though the latter in perhaps a suggestive way by showing alternate tab completions) when a relevant method is defined on a contained class. But yes, a method does not exist. > > So I offer the below for your consideration, along with a few examples and a quick benchmark comparison with using the more explicit `[` extraction operator. > > > > subset.SummarizedExperiment<-function(x > ,rowSubset=TRUE > ,colSubset=TRUE > ,assaySubset=TRUE > ,drop=FALSE > ,provenanceTrack=FALSE > ,...) { > ## PURPOSE: implement subsetting of SummarizedExperiments, > ## allowing expressions in terms of: > ## > ## + the GRanges (and its mcols meta-data) held in rowData > ## + the experimental meta-data held in the colData DataFrame > ## + the names of the assays > x<-x[ > ##eval(as.expression(substitute(row)),mcols(rowData(x)),.GlobalEnv) > eval(as.expression(substitute(rowSubset)),as.data.frame(ro wData(x)),.GlobalEnv) > ,eval(as.expression(substitute(colSubset)),colData(x),.GlobalEnv) > ,drop=drop,...] > if (! identical(TRUE,assaySubset)) assays(x)<-assays(x)[assaySubset] > if(provenanceTrack) { > exptData(x)$rowSubset<-c(exptData(x)$rowSubset,as.character (substitute(rowSubset))) > exptData(x)$colSubset<-c(exptData(x)$colSubset,as.character (substitute(colSubset))) > } > x > } > attr(subset,'ex')<-function() { > example(SummarizedExperiment) > assays(se1)$a2<-assays(se1)$counts*2 > assays(se1)$a3<-assays(se1)$counts*3 > benchmark(replications=100 > ,se1.ss1<-se1[start(rowData(se1))==344757,se1$Treatment=='ChIP'] > ,se1.ss2<-subset(se1,start==344757,Treatment=='ChIP') > ) > stopifnot(identical(assays(se1.ss1),assays(se1.ss2))) > se1.ss3<-subset(se1,strand=='+',Treatment=='ChIP',c('a2','a3')) > stopifnot(identical(c('a2','a3'),names(assays(se1.ss3)))) > } > > > The rbenchmarks shows the cost of overhead in this contrived minimal example. YMMV: > > test replications elapsed relative user.self sys.self user.child sys.child > 3 c("a2", "a3") 100 0.001 1 0.001 0.000 0 0 > 1 se1.ss1 <- se1[start(rowData(se1)) == 344757, se1$Treatment == "ChIP"] 100 3.113 3113 3.111 0.001 0 0 > 2 se1.ss2 <- subset(se1, start == 344757, Treatment == "ChIP") 100 3.486 3486 3.486 0.000 0 0 > > Is all this in the spirit of things as you see it? It sure makes my use of SE "scan" (erhm, like poetry;) My own problem with subset() is that the context of the call is hard to get correct. So your subset with this f <- function(x) { id <- 697568 subset(x, start==id) } fails > f(se1) Error in x[eval(as.expression(substitute(rowSubset)), as.data.frame(rowData(x)), : error in evaluating the argument 'i' in selecting a method for function '[': Error in eval(expr, envir, enclos) : object 'id' not found and this fails (does not select the value of id set in f()) silently > id <- 387456 > start(f(se1)) [1] 387456 Other R idioms are similarly dangerous even for data.frame using base::subset.data.frame as illustrated here http://stackoverflow.com/questions/9860090/in-r-why-is-better-than- subset in the high-voted answer. But maybe I shouldn't stand between users, guns, and feet? A couple of small SummarizedExperiment things in your code ... start(se1) == start(rowData(se1)) and likewise for other functions in the GRanges 'API'. Also it turns out that it is expensive in the (current) implementation of SummarizedExperiment to associated dimnames with returned assay() or assays(), so it is much faster to assays(x) <- assays(x, withDimnames=FALSE)[assaySubset] (dimnames get stored on the rowData and colData rather than redundantly on (each of the) assays, although this is probably a false (space) economy). The combination of GRanges API and expensive assay() / assays() can help save one from unintended inefficiencies, e.g., dimnames(se1) rather than dimnames(assay(se1)). Martin > > if you like, I also have castLong and castWide functions to reshape a (subsetted) SE as a data.table for use in ggplot and friends. > > You like? > > Cheers, > > ~Malcolm > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 10.2 years ago Martin Morgan 25k

0

Entering edit mode

Martin, Excellent! I have been warned. Dragons lurking. Suppose I replied... ... wait for it.... "Enter defmacro - stage right" (greek chorus chanting "oh, no, harken the days of yore, LISP resounds") Erhm... it _is_ a little late. But, seriously. Thanks for the subset clues on SO! And, seriously, I _may_ try a defmacro rewrite. Unless...? Steve, I'm just getting going with SEs and expect them to be on my road map for a while. holds a better formatted version of my proposed subset.SummarizedExperiments - sorry for the earlier copy/paste madness. Martin's subset admonitions notwithstanding, I _will_ include a copy of my cast(Wide|Long) functions for you enjoyment. Or my ridicule. Time will tell. Always learning.... Martin, I am coming to agree with your remark "Ranges required? Not really, but a bit of a hack" appearing in http://www.bioconductor.org/help/course- materials/2012/BiocEurope2012/SummarizedExperiment.pdf Indeed, lets imagine that rowData were _not_ a GRanges... rather, if the dimensional meta-data were of uniform mode (I'm thinking data.table), then symmetry would easily allow for recursion on the number of dimensions. Instead of `colData(se)` and `rowData(se)` we would have `dimData(se)[[i]]` with dimnames<-function(x) {lapply(dimData(x),names)}. Hmmm. I'm liking this. Are you? What would lack is a means to add some sort of useful genomic index to a data.table. Something like BLASTgres' BioIndexing on genomic `location` (a custom data type) defined on any set of columns from a table - http://www.cs.ucla.edu/~stott/bioinf/BLASTgres.pdf - In postgres the implementation depends upon user defined database types and indexes, and indexing schemes. Do you see profit here? Or is this the stuff of dreams? G'night, ~ malcolm_cook at stowers.org ________________________________________ From: Martin Morgan [mtmorgan@fhcrc.org] Sent: Wednesday, February 05, 2014 7:54 PM To: Cook, Malcolm Cc: bioconductor at r-project.org Subject: Re: subsetting SummarizedExperiments - a proposal (or a hack?) Hi Malcom -- On 02/05/2014 02:11 PM, Cook, Malcolm wrote: > Martin, > > Google says: 'Your search - "subset.SummarizedExperiment" - did not match any documents.' If it had existed, it would have been an S4 method revealed by library(GenomicAlignments) # library(GenomicRanges) in release showMethods("subset") method?"subset,SummarizedExperiment" ?"subset,SummarizedExperiment-method" with tab completion available on the last two after getting through "subset". Google would have found "subset,SummarizedExperiment-method" (if it existed) but Google and tab completion with ? would have failed (though the latter in perhaps a suggestive way by showing alternate tab completions) when a relevant method is defined on a contained class. But yes, a method does not exist. > > So I offer the below for your consideration, along with a few examples and a quick benchmark comparison with using the more explicit `[` extraction operator. > > > > subset.SummarizedExperiment<-function(x > ,rowSubset=TRUE > ,colSubset=TRUE > ,assaySubset=TRUE > ,drop=FALSE > ,provenanceTrack=FALSE > ,...) { > ## PURPOSE: implement subsetting of SummarizedExperiments, > ## allowing expressions in terms of: > ## > ## + the GRanges (and its mcols meta-data) held in rowData > ## + the experimental meta-data held in the colData DataFrame > ## + the names of the assays > x<-x[ > ##eval(as.expression(substitute(row)),mcols(rowData(x)),.GlobalEnv) > eval(as.expression(substitute(rowSubset)),as.data.frame(ro wData(x)),.GlobalEnv) > ,eval(as.expression(substitute(colSubset)),colData(x),.GlobalEnv) > ,drop=drop,...] > if (! identical(TRUE,assaySubset)) assays(x)<-assays(x)[assaySubset] > if(provenanceTrack) { > exptData(x)$rowSubset<-c(exptData(x)$rowSubset,as.character (substitute(rowSubset))) > exptData(x)$colSubset<-c(exptData(x)$colSubset,as.character (substitute(colSubset))) > } > x > } > attr(subset,'ex')<-function() { > example(SummarizedExperiment) > assays(se1)$a2<-assays(se1)$counts*2 > assays(se1)$a3<-assays(se1)$counts*3 > benchmark(replications=100 > ,se1.ss1<-se1[start(rowData(se1))==344757,se1$Treatment=='ChIP'] > ,se1.ss2<-subset(se1,start==344757,Treatment=='ChIP') > ) > stopifnot(identical(assays(se1.ss1),assays(se1.ss2))) > se1.ss3<-subset(se1,strand=='+',Treatment=='ChIP',c('a2','a3')) > stopifnot(identical(c('a2','a3'),names(assays(se1.ss3)))) > } > > > The rbenchmarks shows the cost of overhead in this contrived minimal example. YMMV: > > test replications elapsed relative user.self sys.self user.child sys.child > 3 c("a2", "a3") 100 0.001 1 0.001 0.000 0 0 > 1 se1.ss1 <- se1[start(rowData(se1)) == 344757, se1$Treatment == "ChIP"] 100 3.113 3113 3.111 0.001 0 0 > 2 se1.ss2 <- subset(se1, start == 344757, Treatment == "ChIP") 100 3.486 3486 3.486 0.000 0 0 > > Is all this in the spirit of things as you see it? It sure makes my use of SE "scan" (erhm, like poetry;) My own problem with subset() is that the context of the call is hard to get correct. So your subset with this f <- function(x) { id <- 697568 subset(x, start==id) } fails > f(se1) Error in x[eval(as.expression(substitute(rowSubset)), as.data.frame(rowData(x)), : error in evaluating the argument 'i' in selecting a method for function '[': Error in eval(expr, envir, enclos) : object 'id' not found and this fails (does not select the value of id set in f()) silently > id <- 387456 > start(f(se1)) [1] 387456 Other R idioms are similarly dangerous even for data.frame using base::subset.data.frame as illustrated here http://stackoverflow.com/questions/9860090/in-r-why-is-better-than- subset in the high-voted answer. But maybe I shouldn't stand between users, guns, and feet? A couple of small SummarizedExperiment things in your code ... start(se1) == start(rowData(se1)) and likewise for other functions in the GRanges 'API'. Also it turns out that it is expensive in the (current) implementation of SummarizedExperiment to associated dimnames with returned assay() or assays(), so it is much faster to assays(x) <- assays(x, withDimnames=FALSE)[assaySubset] (dimnames get stored on the rowData and colData rather than redundantly on (each of the) assays, although this is probably a false (space) economy). The combination of GRanges API and expensive assay() / assays() can help save one from unintended inefficiencies, e.g., dimnames(se1) rather than dimnames(assay(se1)). Martin > > if you like, I also have castLong and castWide functions to reshape a (subsetted) SE as a data.table for use in ggplot and friends. > > You like? > > Cheers, > > ~Malcolm > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 10.2 years ago Malcolm Cook ★ 1.6k

0

Entering edit mode

On Wed, Feb 5, 2014 at 5:54 PM, Martin Morgan <mtmorgan@fhcrc.org> wrote: > Hi Malcom -- > > On 02/05/2014 02:11 PM, Cook, Malcolm wrote: > > Martin, > > > > Google says: 'Your search - "subset.SummarizedExperiment" - did not > match any documents.' > > If it had existed, it would have been an S4 method revealed by > > library(GenomicAlignments) # library(GenomicRanges) in release > showMethods("subset") > method?"subset,SummarizedExperiment" > ?"subset,SummarizedExperiment-method" > > with tab completion available on the last two after getting through > "subset". Google would have found "subset,SummarizedExperiment- method" (if > it existed) but Google and tab completion with ? would have failed (though > the latter in perhaps a suggestive way by showing alternate tab > completions) when a relevant method is defined on a contained class. But > yes, a method does not exist. > > > > > So I offer the below for your consideration, along with a few examples > and a quick benchmark comparison with using the more explicit `[` > extraction operator. > > > > > > > > subset.SummarizedExperiment<-function(x > > ,rowSubset=TRUE > > ,colSubset=TRUE > > ,assaySubset=TRUE > > ,drop=FALSE > > ,provenanceTrack=FALSE > > ,...) { > > ## PURPOSE: implement subsetting of SummarizedExperiments, > > ## allowing expressions in terms of: > > ## > > ## + the GRanges (and its mcols meta-data) held in rowData > > ## + the experimental meta-data held in the colData DataFrame > > ## + the names of the assays > > x<-x[ > > > ##eval(as.expression(substitute(row)),mcols(rowData(x)),.GlobalEnv) > > > eval(as.expression(substitute(rowSubset)),as.data.frame(rowData(x)), .GlobalEnv) > > > ,eval(as.expression(substitute(colSubset)),colData(x),.GlobalEnv) > > ,drop=drop,...] > > if (! identical(TRUE,assaySubset)) assays(x)<-assays(x)[assaySubset] > > if(provenanceTrack) { > > > exptData(x)$rowSubset<-c(exptData(x)$rowSubset,as.character(substit ute(rowSubset))) > > > exptData(x)$colSubset<-c(exptData(x)$colSubset,as.character(substit ute(colSubset))) > > } > > x > > } > > attr(subset,'ex')<-function() { > > example(SummarizedExperiment) > > assays(se1)$a2<-assays(se1)$counts*2 > > assays(se1)$a3<-assays(se1)$counts*3 > > benchmark(replications=100 > > > ,se1.ss1<-se1[start(rowData(se1))==344757,se1$Treatment=='ChIP'] > > ,se1.ss2<-subset(se1,start==344757,Treatment=='ChIP') > > ) > > stopifnot(identical(assays(se1.ss1),assays(se1.ss2))) > > se1.ss3<-subset(se1,strand=='+',Treatment=='ChIP',c('a2','a3')) > > stopifnot(identical(c('a2','a3'),names(assays(se1.ss3)))) > > } > > > > > > The rbenchmarks shows the cost of overhead in this contrived minimal > example. YMMV: > > > > > test replications elapsed relative user.self sys.self user.child sys.child > > 3 c("a2", > "a3") 100 0.001 1 0.001 0.000 0 0 > > 1 se1.ss1 <- se1[start(rowData(se1)) == 344757, se1$Treatment == > "ChIP"] 100 3.113 3113 3.111 0.001 0 > 0 > > 2 se1.ss2 <- subset(se1, start == 344757, Treatment == > "ChIP") 100 3.486 3486 3.486 0.000 0 > 0 > > > > Is all this in the spirit of things as you see it? It sure makes my > use of SE "scan" (erhm, like poetry;) > > My own problem with subset() is that the context of the call is hard to > get correct. So your subset with this > > f <- function(x) { > id <- 697568 > subset(x, start==id) > } > > fails > > > f(se1) > Error in x[eval(as.expression(substitute(rowSubset)), > as.data.frame(rowData(x)), : > error in evaluating the argument 'i' in selecting a method for function > '[': Error in eval(expr, envir, enclos) : object 'id' not found > > and this fails (does not select the value of id set in f()) silently > > > id <- 387456 > > start(f(se1)) > [1] 387456 > > I think I might have solved this problem with a simple solution. The promise knows the expression and the environment in which to evaluate it. Since we have no problem getting the expression with substitute(), why not get the environment, too? Just added this to IRanges and it seems to work: SEXP top_prenv(SEXP nm, SEXP env) { SEXP promise = findVar(nm, env); while(TYPEOF(promise) == PROMSXP) { env = PRENV(promise); promise = PREXPR(promise); } return env; } Let's see if we can break it. > Other R idioms are similarly dangerous even for data.frame using > base::subset.data.frame as illustrated here > > > http://stackoverflow.com/questions/9860090/in-r-why-is-better-than- subset > > in the high-voted answer. > > But maybe I shouldn't stand between users, guns, and feet? > > A couple of small SummarizedExperiment things in your code ... > > start(se1) == start(rowData(se1)) > > and likewise for other functions in the GRanges 'API'. Also it turns out > that it is expensive in the (current) implementation of > SummarizedExperiment to associated dimnames with returned assay() or > assays(), so it is much faster to > > assays(x) <- assays(x, withDimnames=FALSE)[assaySubset] > > (dimnames get stored on the rowData and colData rather than redundantly on > (each of the) assays, although this is probably a false (space) economy). > The combination of GRanges API and expensive assay() / assays() can help > save one from unintended inefficiencies, e.g., dimnames(se1) rather than > dimnames(assay(se1)). > > Martin > > > > > if you like, I also have castLong and castWide functions to reshape a > (subsetted) SE as a data.table for use in ggplot and friends. > > > > You like? > > > > Cheers, > > > > ~Malcolm > > > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 10.2 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Michael, Just to say, "Hooray" on your response to this thread, and especially more recent activity, where I see: 1) your discussion on R-devel re 'getting environment from "top" promise: http://r.789695.n4.nabble.com/getting-environment-from-quot- top-quot-promise-td4685138.html#none 2) your commit last night in service of allowing interfaces to GRanges (change log copied below) What nice syntactic sugar you are serving us. And what lurking dragons and self-inflicted gunshot wounds to the foot your are saving me. I look forward to the next release! Did I say "Hooray!" Hooray! ~Malcolm Revision: 86351 Author: m.lawrence Date: a day ago Paths: Modified /trunk/madman/Rpacks/IRanges/R/Vector-class.R<https: hedgehog.fhcrc.="" org="" bioconductor="" trunk="" madman="" rpacks="" iranges="" r="" vector-class.r=""> Modified /trunk/madman/Rpacks/IRanges/NAMESPACE<https: hedgehog.fhcrc.org="" bio="" conductor="" trunk="" madman="" rpacks="" iranges="" namespace=""> Modified /trunk/madman/Rpacks/IRanges/R/List-class.R<https: hedgehog.fhcrc.or="" g="" bioconductor="" trunk="" madman="" rpacks="" iranges="" r="" list-class.r=""> Modified /trunk/madman/Rpacks/IRanges/DESCRIPTION<https: hedgehog.fhcrc.org="" b="" ioconductor="" trunk="" madman="" rpacks="" iranges="" description=""> generalize eval() to Vectors, add as.env,Vector that forms an environment by querying the Vector for fixedColumnNames() and the mcols(). fixedColumnNames() returns an empty vector of names by default, but e.g. GenomicRanges can use it to provide start, end, etc. Change subset() and with() to call eval directly on the Vector. Now we can do subset(gr, seqnames=="chr2" & score > 15). From: Michael Lawrence [mailto:lawrence.michael@gene.com] Sent: Saturday, February 08, 2014 7:55 AM To: Martin Morgan Cc: Cook, Malcolm; bioconductor@r-project.org Subject: Re: [BioC] subsetting SummarizedExperiments - a proposal (or a hack?) On Wed, Feb 5, 2014 at 5:54 PM, Martin Morgan <mtmorgan@fhcrc.org<mailto:mtmorgan@fhcrc.org>> wrote: Hi Malcom -- On 02/05/2014 02:11 PM, Cook, Malcolm wrote: > Martin, > > Google says: 'Your search - "subset.SummarizedExperiment" - did not match any documents.' If it had existed, it would have been an S4 method revealed by library(GenomicAlignments) # library(GenomicRanges) in release showMethods("subset") method?"subset,SummarizedExperiment" ?"subset,SummarizedExperiment-method" with tab completion available on the last two after getting through "subset". Google would have found "subset,SummarizedExperiment-method" (if it existed) but Google and tab completion with ? would have failed (though the latter in perhaps a suggestive way by showing alternate tab completions) when a relevant method is defined on a contained class. But yes, a method does not exist. > > So I offer the below for your consideration, along with a few examples and a quick benchmark comparison with using the more explicit `[` extraction operator. > > > > subset.SummarizedExperiment<-function(x > ,rowSubset=TRUE > ,colSubset=TRUE > ,assaySubset=TRUE > ,drop=FALSE > ,provenanceTrack=FALSE > ,...) { > ## PURPOSE: implement subsetting of SummarizedExperiments, > ## allowing expressions in terms of: > ## > ## + the GRanges (and its mcols meta-data) held in rowData > ## + the experimental meta-data held in the colData DataFrame > ## + the names of the assays > x<-x[ > ##eval(as.expression(substitute(row)),mcols(rowData(x)),.GlobalEnv) > eval(as.expression(substitute(rowSubset)),as.data.frame(ro wData(x)),.GlobalEnv) > ,eval(as.expression(substitute(colSubset)),colData(x),.GlobalEnv) > ,drop=drop,...] > if (! identical(TRUE,assaySubset)) assays(x)<-assays(x)[assaySubset] > if(provenanceTrack) { > exptData(x)$rowSubset<-c(exptData(x)$rowSubset,as.character (substitute(rowSubset))) > exptData(x)$colSubset<-c(exptData(x)$colSubset,as.character (substitute(colSubset))) > } > x > } > attr(subset,'ex')<-function() { > example(SummarizedExperiment) > assays(se1)$a2<-assays(se1)$counts*2 > assays(se1)$a3<-assays(se1)$counts*3 > benchmark(replications=100 > ,se1.ss1<-se1[start(rowData(se1))==344757,se1$Treatment=='ChIP'] > ,se1.ss2<-subset(se1,start==344757,Treatment=='ChIP') > ) > stopifnot(identical(assays(se1.ss1),assays(se1.ss2))) > se1.ss3<-subset(se1,strand=='+',Treatment=='ChIP',c('a2','a3')) > stopifnot(identical(c('a2','a3'),names(assays(se1.ss3)))) > } > > > The rbenchmarks shows the cost of overhead in this contrived minimal example. YMMV: > > test replications elapsed relative user.self sys.self user.child sys.child > 3 c("a2", "a3") 100 0.001 1 0.001 0.000 0 0 > 1 se1.ss1 <- se1[start(rowData(se1)) == 344757, se1$Treatment == "ChIP"] 100 3.113 3113 3.111 0.001 0 0 > 2 se1.ss2 <- subset(se1, start == 344757, Treatment == "ChIP") 100 3.486 3486 3.486 0.000 0 0 > > Is all this in the spirit of things as you see it? It sure makes my use of SE "scan" (erhm, like poetry;) My own problem with subset() is that the context of the call is hard to get correct. So your subset with this f <- function(x) { id <- 697568 subset(x, start==id) } fails > f(se1) Error in x[eval(as.expression(substitute(rowSubset)), as.data.frame(rowData(x)), : error in evaluating the argument 'i' in selecting a method for function '[': Error in eval(expr, envir, enclos) : object 'id' not found and this fails (does not select the value of id set in f()) silently > id <- 387456 > start(f(se1)) [1] 387456 I think I might have solved this problem with a simple solution. The promise knows the expression and the environment in which to evaluate it. Since we have no problem getting the expression with substitute(), why not get the environment, too? Just added this to IRanges and it seems to work: SEXP top_prenv(SEXP nm, SEXP env) { SEXP promise = findVar(nm, env); while(TYPEOF(promise) == PROMSXP) { env = PRENV(promise); promise = PREXPR(promise); } return env; } Let's see if we can break it. Other R idioms are similarly dangerous even for data.frame using base::subset.data.frame as illustrated here http://stackoverflow.com/questions/9860090/in-r-why-is-better-than- subset in the high-voted answer. But maybe I shouldn't stand between users, guns, and feet? A couple of small SummarizedExperiment things in your code ... start(se1) == start(rowData(se1)) and likewise for other functions in the GRanges 'API'. Also it turns out that it is expensive in the (current) implementation of SummarizedExperiment to associated dimnames with returned assay() or assays(), so it is much faster to assays(x) <- assays(x, withDimnames=FALSE)[assaySubset] (dimnames get stored on the rowData and colData rather than redundantly on (each of the) assays, although this is probably a false (space) economy). The combination of GRanges API and expensive assay() / assays() can help save one from unintended inefficiencies, e.g., dimnames(se1) rather than dimnames(assay(se1)). Martin > > if you like, I also have castLong and castWide functions to reshape a (subsetted) SE as a data.table for use in ggplot and friends. > > You like? > > Cheers, > > ~Malcolm > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793<tel:%28206%29%20667-2793> _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 10.2 years ago Malcolm Cook ★ 1.6k

Login before adding your answer.