Approaches to building Rle sequences from ranges

0

Entering edit mode

Vince S. Buffalo ▴ 40

@vince-s-buffalo-6592

Last seen 10.6 years ago

Hi All, I'm thinking there might be a clever way to do something that I'm not aware of. The setup is that I frequently find myself using using Views and viewMeans, viewSums, etc. to calculate summary statistics by tiles on sequences. I have a GRanges object with 1-width ranges (but this should apply more generally), and metadata columns have some measurement (GC content, pairwise diversity, some quality metric, etc). I need to go from a quantitative variable tied to specific ranges to an Rle sequence across an entire chromosome to use Views/viewMeans (e.g. the binnedAverages example in the How to) . Right now I approach this with something like: data <- Rle(NA, length=seqlengths(txdb)['chr1']) data[start(my_rngs)] <- my_rngs$variable # simple, since my features are are 1-width # or more generally: data2 <- Rle(NA, length=seqlengths(txdb)['chr1']) data2[ranges(my_rngs)] <- my_rngs$variable identical(as.vector(data), as.vector(data2)) # returns TRUE (contingent on all widths = 1) Then, I can convert these to Views on a set of bins/tiles created with tileGenome, and use the viewMean, viewSums, etc. functions (removing NAs). So my question is â are there better methods for creating sequence- length Rle from metadata columns? Or another way of saying this is taking some metadata column corresponding to ranges and mapping it to coordinate space (maybe in one call)? It seems like if seqlengths is set in the GRanges object, there's sufficient information to go directly from a GRanges metadata column to an Rle vector (and I might be missing a more obvious solution). My example assumed a single chromosome, but an approach that knows to handle multiple sequences through RleLists seems like it would be helpful. thanks, Vince PS: My apologies if you've received this message twice, I had to resend after it appears that I sent it to the wrong list. -- Vince Buffalo Ross-Ibarra Lab www.rilab.org) Plant Sciences, UC Davis [[alternative HTML version deleted]]

GO convert GO convert • 2.0k views

ADD COMMENT • link updated 10.8 years ago by Jeff Johnston ▴ 90 • written 10.8 years ago by Vince S. Buffalo ▴ 40

0

Entering edit mode

Jeff Johnston ▴ 90

@jeff-johnston-6497

Last seen 7.3 years ago

United States

For GRanges with a metadata column, you can do: coverage(granges, weight=?variable?) This will produce an RleList where the value at each coordinate is the sum of the ?variable? metadata column of all overlapping ranges. I think this will work for your use case if your ranges do not overlap. -Jeff On Jun 6, 2014, at 2:59 PM, Vince S. Buffalo <vsbuffalo at="" ucdavis.edu=""> wrote: > Hi All, > > I'm thinking there might be a clever way to do something that I'm not aware > of. The setup is that I frequently find myself using using Views and > viewMeans, viewSums, etc. to calculate summary statistics by tiles on > sequences. I have a GRanges object with 1-width ranges (but this should > apply more generally), and metadata columns have some measurement (GC > content, pairwise diversity, some quality metric, etc). I need to go from a > quantitative variable tied to specific ranges to an Rle sequence across an > entire chromosome to use Views/viewMeans (e.g. the binnedAverages example > in the How to) . Right now I approach this with something like: > > data <- Rle(NA, length=seqlengths(txdb)['chr1']) > data[start(my_rngs)] <- my_rngs$variable # simple, since my features are > are 1-width > > # or more generally: > data2 <- Rle(NA, length=seqlengths(txdb)['chr1']) > data2[ranges(my_rngs)] <- my_rngs$variable > identical(as.vector(data), as.vector(data2)) # returns TRUE (contingent on > all widths = 1) > > Then, I can convert these to Views on a set of bins/tiles created with > tileGenome, and use the viewMean, viewSums, etc. functions (removing NAs). > > So my question is ? are there better methods for creating sequence- length > Rle from metadata columns? Or another way of saying this is taking some > metadata column corresponding to ranges and mapping it to coordinate space > (maybe in one call)? It seems like if seqlengths is set in the GRanges > object, there's sufficient information to go directly from a GRanges > metadata column to an Rle vector (and I might be missing a more obvious > solution). My example assumed a single chromosome, but an approach that > knows to handle multiple sequences through RleLists seems like it would be > helpful. > > thanks, > Vince > > PS: My apologies if you've received this message twice, I had to resend > after it appears that I sent it to the wrong list. > > -- > Vince Buffalo > Ross-Ibarra Lab www.rilab.org) > Plant Sciences, UC Davis > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 10.8 years ago Jeff Johnston ▴ 90

0

Entering edit mode

Note that if your statistic is per-base (like the mentioned GC content), use XInteger and XIntegerViews for efficiency. But it looks like Vince has gaps, in general at least. On Fri, Jun 6, 2014 at 1:11 PM, Johnston, Jeffrey <jjj@stowers.org> wrote: > For GRanges with a metadata column, you can do: > > coverage(granges, weight=âvariableâ) > > This will produce an RleList where the value at each coordinate is the sum > of the âvariableâ metadata column of all overlapping ranges. I think this > will work for your use case if your ranges do not overlap. > > -Jeff > > On Jun 6, 2014, at 2:59 PM, Vince S. Buffalo <vsbuffalo@ucdavis.edu> > wrote: > > > Hi All, > > > > I'm thinking there might be a clever way to do something that I'm not > aware > > of. The setup is that I frequently find myself using using Views and > > viewMeans, viewSums, etc. to calculate summary statistics by tiles on > > sequences. I have a GRanges object with 1-width ranges (but this should > > apply more generally), and metadata columns have some measurement (GC > > content, pairwise diversity, some quality metric, etc). I need to go > from a > > quantitative variable tied to specific ranges to an Rle sequence across > an > > entire chromosome to use Views/viewMeans (e.g. the binnedAverages example > > in the How to) . Right now I approach this with something like: > > > > data <- Rle(NA, length=seqlengths(txdb)['chr1']) > > data[start(my_rngs)] <- my_rngs$variable # simple, since my features are > > are 1-width > > > > # or more generally: > > data2 <- Rle(NA, length=seqlengths(txdb)['chr1']) > > data2[ranges(my_rngs)] <- my_rngs$variable > > identical(as.vector(data), as.vector(data2)) # returns TRUE (contingent > on > > all widths = 1) > > > > Then, I can convert these to Views on a set of bins/tiles created with > > tileGenome, and use the viewMean, viewSums, etc. functions (removing > NAs). > > > > So my question is â are there better methods for creating sequence-length > > Rle from metadata columns? Or another way of saying this is taking some > > metadata column corresponding to ranges and mapping it to coordinate > space > > (maybe in one call)? It seems like if seqlengths is set in the GRanges > > object, there's sufficient information to go directly from a GRanges > > metadata column to an Rle vector (and I might be missing a more obvious > > solution). My example assumed a single chromosome, but an approach that > > knows to handle multiple sequences through RleLists seems like it would > be > > helpful. > > > > thanks, > > Vince > > > > PS: My apologies if you've received this message twice, I had to resend > > after it appears that I sent it to the wrong list. > > > > -- > > Vince Buffalo > > Ross-Ibarra Lab www.rilab.org) > > Plant Sciences, UC Davis > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 10.8 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Terrific! I knew I might be missing a more obvious solution â I didn't know that coverage's weight argument could take a metadata column. Thanks Jeffrey and Michael! Vince On Fri, Jun 6, 2014 at 1:29 PM, Michael Lawrence <lawrence.michael@gene.com> wrote: > Note that if your statistic is per-base (like the mentioned GC content), > use XInteger and XIntegerViews for efficiency. But it looks like Vince has > gaps, in general at least. > > > > > > > > On Fri, Jun 6, 2014 at 1:11 PM, Johnston, Jeffrey <jjj@stowers.org> wrote: > >> For GRanges with a metadata column, you can do: >> >> coverage(granges, weight=âvariableâ) >> >> This will produce an RleList where the value at each coordinate is the >> sum of the âvariableâ metadata column of all overlapping ranges. I think >> this will work for your use case if your ranges do not overlap. >> >> -Jeff >> >> On Jun 6, 2014, at 2:59 PM, Vince S. Buffalo <vsbuffalo@ucdavis.edu> >> wrote: >> >> > Hi All, >> > >> > I'm thinking there might be a clever way to do something that I'm not >> aware >> > of. The setup is that I frequently find myself using using Views and >> > viewMeans, viewSums, etc. to calculate summary statistics by tiles on >> > sequences. I have a GRanges object with 1-width ranges (but this should >> > apply more generally), and metadata columns have some measurement (GC >> > content, pairwise diversity, some quality metric, etc). I need to go >> from a >> > quantitative variable tied to specific ranges to an Rle sequence across >> an >> > entire chromosome to use Views/viewMeans (e.g. the binnedAverages >> example >> > in the How to) . Right now I approach this with something like: >> > >> > data <- Rle(NA, length=seqlengths(txdb)['chr1']) >> > data[start(my_rngs)] <- my_rngs$variable # simple, since my features are >> > are 1-width >> > >> > # or more generally: >> > data2 <- Rle(NA, length=seqlengths(txdb)['chr1']) >> > data2[ranges(my_rngs)] <- my_rngs$variable >> > identical(as.vector(data), as.vector(data2)) # returns TRUE (contingent >> on >> > all widths = 1) >> > >> > Then, I can convert these to Views on a set of bins/tiles created with >> > tileGenome, and use the viewMean, viewSums, etc. functions (removing >> NAs). >> > >> > So my question is â are there better methods for creating >> sequence-length >> > Rle from metadata columns? Or another way of saying this is taking some >> > metadata column corresponding to ranges and mapping it to coordinate >> space >> > (maybe in one call)? It seems like if seqlengths is set in the GRanges >> > object, there's sufficient information to go directly from a GRanges >> > metadata column to an Rle vector (and I might be missing a more obvious >> > solution). My example assumed a single chromosome, but an approach that >> > knows to handle multiple sequences through RleLists seems like it would >> be >> > helpful. >> > >> > thanks, >> > Vince >> > >> > PS: My apologies if you've received this message twice, I had to resend >> > after it appears that I sent it to the wrong list. >> > >> > -- >> > Vince Buffalo >> > Ross-Ibarra Lab www.rilab.org) >> > Plant Sciences, UC Davis >> > >> > [[alternative HTML version deleted]] >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor@r-project.org >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- Vince Buffalo Ross-Ibarra Lab www.rilab.org) Plant Sciences, UC Davis [[alternative HTML version deleted]]

ADD REPLY • link 10.8 years ago Vince S. Buffalo ▴ 40

0

Entering edit mode

I've thought about this a bit more: coverage works, but by default is assumes that no ranges means coverage=0. Would it be possible to have an argument for coverage() that allows for no coverage regions to be given some arbitrary value like NA? (1) for metadata column that can take positive and negative values, there's no way to distinguish no coverage versus truly 0, (2) missingness along a sequence is an important variable too. Maybe this topic needs to be shifted to bioc-devel (or I'm missing another piece). Vince On Fri, Jun 6, 2014 at 2:24 PM, Vince S. Buffalo <vsbuffalo@ucdavis.edu> wrote: > Terrific! I knew I might be missing a more obvious solution â I didn't > know that coverage's weight argument could take a metadata column. Thanks > Jeffrey and Michael! > > Vince > > > On Fri, Jun 6, 2014 at 1:29 PM, Michael Lawrence < > lawrence.michael@gene.com> wrote: > >> Note that if your statistic is per-base (like the mentioned GC content), >> use XInteger and XIntegerViews for efficiency. But it looks like Vince has >> gaps, in general at least. >> >> >> >> >> >> >> >> On Fri, Jun 6, 2014 at 1:11 PM, Johnston, Jeffrey <jjj@stowers.org> >> wrote: >> >>> For GRanges with a metadata column, you can do: >>> >>> coverage(granges, weight=âvariableâ) >>> >>> This will produce an RleList where the value at each coordinate is the >>> sum of the âvariableâ metadata column of all overlapping ranges. I think >>> this will work for your use case if your ranges do not overlap. >>> >>> -Jeff >>> >>> On Jun 6, 2014, at 2:59 PM, Vince S. Buffalo <vsbuffalo@ucdavis.edu> >>> wrote: >>> >>> > Hi All, >>> > >>> > I'm thinking there might be a clever way to do something that I'm not >>> aware >>> > of. The setup is that I frequently find myself using using Views and >>> > viewMeans, viewSums, etc. to calculate summary statistics by tiles on >>> > sequences. I have a GRanges object with 1-width ranges (but this should >>> > apply more generally), and metadata columns have some measurement (GC >>> > content, pairwise diversity, some quality metric, etc). I need to go >>> from a >>> > quantitative variable tied to specific ranges to an Rle sequence >>> across an >>> > entire chromosome to use Views/viewMeans (e.g. the binnedAverages >>> example >>> > in the How to) . Right now I approach this with something like: >>> > >>> > data <- Rle(NA, length=seqlengths(txdb)['chr1']) >>> > data[start(my_rngs)] <- my_rngs$variable # simple, since my features >>> are >>> > are 1-width >>> > >>> > # or more generally: >>> > data2 <- Rle(NA, length=seqlengths(txdb)['chr1']) >>> > data2[ranges(my_rngs)] <- my_rngs$variable >>> > identical(as.vector(data), as.vector(data2)) # returns TRUE >>> (contingent on >>> > all widths = 1) >>> > >>> > Then, I can convert these to Views on a set of bins/tiles created with >>> > tileGenome, and use the viewMean, viewSums, etc. functions (removing >>> NAs). >>> > >>> > So my question is â are there better methods for creating >>> sequence-length >>> > Rle from metadata columns? Or another way of saying this is taking some >>> > metadata column corresponding to ranges and mapping it to coordinate >>> space >>> > (maybe in one call)? It seems like if seqlengths is set in the GRanges >>> > object, there's sufficient information to go directly from a GRanges >>> > metadata column to an Rle vector (and I might be missing a more obvious >>> > solution). My example assumed a single chromosome, but an approach that >>> > knows to handle multiple sequences through RleLists seems like it >>> would be >>> > helpful. >>> > >>> > thanks, >>> > Vince >>> > >>> > PS: My apologies if you've received this message twice, I had to resend >>> > after it appears that I sent it to the wrong list. >>> > >>> > -- >>> > Vince Buffalo >>> > Ross-Ibarra Lab www.rilab.org) >>> > Plant Sciences, UC Davis >>> > >>> > [[alternative HTML version deleted]] >>> > >>> > _______________________________________________ >>> > Bioconductor mailing list >>> > Bioconductor@r-project.org >>> > https://stat.ethz.ch/mailman/listinfo/bioconductor >>> > Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > > > -- > Vince Buffalo > Ross-Ibarra Lab www.rilab.org) > Plant Sciences, UC Davis > -- Vince Buffalo Ross-Ibarra Lab www.rilab.org) Plant Sciences, UC Davis [[alternative HTML version deleted]]

ADD REPLY • link 10.8 years ago Vince S. Buffalo ▴ 40

0

Entering edit mode

This has been requested in the past; not sure what happened, but initializing to NA somehow goes against the semantic of "coverage". Maybe there needs to be a new verb here? It's confusing enough as it is. On Fri, Jun 6, 2014 at 3:59 PM, Vince S. Buffalo <vsbuffalo@ucdavis.edu> wrote: > I've thought about this a bit more: coverage works, but by default is > assumes that no ranges means coverage=0. Would it be possible to have an > argument for coverage() that allows for no coverage regions to be given > some arbitrary value like NA? (1) for metadata column that can take > positive and negative values, there's no way to distinguish no coverage > versus truly 0, (2) missingness along a sequence is an important variable > too. Maybe this topic needs to be shifted to bioc-devel (or I'm missing > another piece). > > Vince > > > On Fri, Jun 6, 2014 at 2:24 PM, Vince S. Buffalo <vsbuffalo@ucdavis.edu> > wrote: > >> Terrific! I knew I might be missing a more obvious solution â I didn't >> know that coverage's weight argument could take a metadata column. Thanks >> Jeffrey and Michael! >> >> Vince >> >> >> On Fri, Jun 6, 2014 at 1:29 PM, Michael Lawrence < >> lawrence.michael@gene.com> wrote: >> >>> Note that if your statistic is per-base (like the mentioned GC content), >>> use XInteger and XIntegerViews for efficiency. But it looks like Vince has >>> gaps, in general at least. >>> >>> >>> >>> >>> >>> >>> >>> On Fri, Jun 6, 2014 at 1:11 PM, Johnston, Jeffrey <jjj@stowers.org> >>> wrote: >>> >>>> For GRanges with a metadata column, you can do: >>>> >>>> coverage(granges, weight=âvariableâ) >>>> >>>> This will produce an RleList where the value at each coordinate is the >>>> sum of the âvariableâ metadata column of all overlapping ranges. I think >>>> this will work for your use case if your ranges do not overlap. >>>> >>>> -Jeff >>>> >>>> On Jun 6, 2014, at 2:59 PM, Vince S. Buffalo <vsbuffalo@ucdavis.edu> >>>> wrote: >>>> >>>> > Hi All, >>>> > >>>> > I'm thinking there might be a clever way to do something that I'm not >>>> aware >>>> > of. The setup is that I frequently find myself using using Views and >>>> > viewMeans, viewSums, etc. to calculate summary statistics by tiles on >>>> > sequences. I have a GRanges object with 1-width ranges (but this >>>> should >>>> > apply more generally), and metadata columns have some measurement (GC >>>> > content, pairwise diversity, some quality metric, etc). I need to go >>>> from a >>>> > quantitative variable tied to specific ranges to an Rle sequence >>>> across an >>>> > entire chromosome to use Views/viewMeans (e.g. the binnedAverages >>>> example >>>> > in the How to) . Right now I approach this with something like: >>>> > >>>> > data <- Rle(NA, length=seqlengths(txdb)['chr1']) >>>> > data[start(my_rngs)] <- my_rngs$variable # simple, since my features >>>> are >>>> > are 1-width >>>> > >>>> > # or more generally: >>>> > data2 <- Rle(NA, length=seqlengths(txdb)['chr1']) >>>> > data2[ranges(my_rngs)] <- my_rngs$variable >>>> > identical(as.vector(data), as.vector(data2)) # returns TRUE >>>> (contingent on >>>> > all widths = 1) >>>> > >>>> > Then, I can convert these to Views on a set of bins/tiles created with >>>> > tileGenome, and use the viewMean, viewSums, etc. functions (removing >>>> NAs). >>>> > >>>> > So my question is â are there better methods for creating >>>> sequence-length >>>> > Rle from metadata columns? Or another way of saying this is taking >>>> some >>>> > metadata column corresponding to ranges and mapping it to coordinate >>>> space >>>> > (maybe in one call)? It seems like if seqlengths is set in the GRanges >>>> > object, there's sufficient information to go directly from a GRanges >>>> > metadata column to an Rle vector (and I might be missing a more >>>> obvious >>>> > solution). My example assumed a single chromosome, but an approach >>>> that >>>> > knows to handle multiple sequences through RleLists seems like it >>>> would be >>>> > helpful. >>>> > >>>> > thanks, >>>> > Vince >>>> > >>>> > PS: My apologies if you've received this message twice, I had to >>>> resend >>>> > after it appears that I sent it to the wrong list. >>>> > >>>> > -- >>>> > Vince Buffalo >>>> > Ross-Ibarra Lab www.rilab.org) >>>> > Plant Sciences, UC Davis >>>> > >>>> > [[alternative HTML version deleted]] >>>> > >>>> > _______________________________________________ >>>> > Bioconductor mailing list >>>> > Bioconductor@r-project.org >>>> > https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> > Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor@r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> >> >> >> -- >> Vince Buffalo >> Ross-Ibarra Lab www.rilab.org) >> Plant Sciences, UC Davis >> > > > > -- > Vince Buffalo > Ross-Ibarra Lab www.rilab.org) > Plant Sciences, UC Davis > [[alternative HTML version deleted]]

ADD REPLY • link 10.8 years ago Michael Lawrence ★ 11k

0

Entering edit mode

I do like the idea of of a new verb. coverage() also doesn't guarantee the Rle sequence will always be the same length as the underlying biological sequence. I think some basics characteristics might include: - guarantees that Rle is the same length as seqlengths, error out if it's not set. - takes an optional function argument on how to handle overlaps, e.g. summarize the metadata column values of overlapping ranges to a single value in the Rle. Examples include mean, median, custom function, etc. - default value to use for no coverage I see the primary use of such a function as one that takes quantitative values in metadata columns in possibly overlapping ranges, and creating a Rle sequence equal that can then be used with Views and view functions to summarize metadata by tile more easily. Vince On Fri, Jun 6, 2014 at 9:19 PM, Michael Lawrence <lawrence.michael@gene.com> wrote: > This has been requested in the past; not sure what happened, but > initializing to NA somehow goes against the semantic of "coverage". Maybe > there needs to be a new verb here? It's confusing enough as it is. > > > On Fri, Jun 6, 2014 at 3:59 PM, Vince S. Buffalo <vsbuffalo@ucdavis.edu> > wrote: > >> I've thought about this a bit more: coverage works, but by default is >> assumes that no ranges means coverage=0. Would it be possible to have an >> argument for coverage() that allows for no coverage regions to be given >> some arbitrary value like NA? (1) for metadata column that can take >> positive and negative values, there's no way to distinguish no coverage >> versus truly 0, (2) missingness along a sequence is an important variable >> too. Maybe this topic needs to be shifted to bioc-devel (or I'm missing >> another piece). >> >> Vince >> >> >> On Fri, Jun 6, 2014 at 2:24 PM, Vince S. Buffalo <vsbuffalo@ucdavis.edu> >> wrote: >> >>> Terrific! I knew I might be missing a more obvious solution â I didn't >>> know that coverage's weight argument could take a metadata column. Thanks >>> Jeffrey and Michael! >>> >>> Vince >>> >>> >>> On Fri, Jun 6, 2014 at 1:29 PM, Michael Lawrence < >>> lawrence.michael@gene.com> wrote: >>> >>>> Note that if your statistic is per-base (like the mentioned GC >>>> content), use XInteger and XIntegerViews for efficiency. But it looks like >>>> Vince has gaps, in general at least. >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> On Fri, Jun 6, 2014 at 1:11 PM, Johnston, Jeffrey <jjj@stowers.org> >>>> wrote: >>>> >>>>> For GRanges with a metadata column, you can do: >>>>> >>>>> coverage(granges, weight=âvariableâ) >>>>> >>>>> This will produce an RleList where the value at each coordinate is the >>>>> sum of the âvariableâ metadata column of all overlapping ranges. I think >>>>> this will work for your use case if your ranges do not overlap. >>>>> >>>>> -Jeff >>>>> >>>>> On Jun 6, 2014, at 2:59 PM, Vince S. Buffalo <vsbuffalo@ucdavis.edu> >>>>> wrote: >>>>> >>>>> > Hi All, >>>>> > >>>>> > I'm thinking there might be a clever way to do something that I'm >>>>> not aware >>>>> > of. The setup is that I frequently find myself using using Views and >>>>> > viewMeans, viewSums, etc. to calculate summary statistics by tiles on >>>>> > sequences. I have a GRanges object with 1-width ranges (but this >>>>> should >>>>> > apply more generally), and metadata columns have some measurement (GC >>>>> > content, pairwise diversity, some quality metric, etc). I need to go >>>>> from a >>>>> > quantitative variable tied to specific ranges to an Rle sequence >>>>> across an >>>>> > entire chromosome to use Views/viewMeans (e.g. the binnedAverages >>>>> example >>>>> > in the How to) . Right now I approach this with something like: >>>>> > >>>>> > data <- Rle(NA, length=seqlengths(txdb)['chr1']) >>>>> > data[start(my_rngs)] <- my_rngs$variable # simple, since my features >>>>> are >>>>> > are 1-width >>>>> > >>>>> > # or more generally: >>>>> > data2 <- Rle(NA, length=seqlengths(txdb)['chr1']) >>>>> > data2[ranges(my_rngs)] <- my_rngs$variable >>>>> > identical(as.vector(data), as.vector(data2)) # returns TRUE >>>>> (contingent on >>>>> > all widths = 1) >>>>> > >>>>> > Then, I can convert these to Views on a set of bins/tiles created >>>>> with >>>>> > tileGenome, and use the viewMean, viewSums, etc. functions (removing >>>>> NAs). >>>>> > >>>>> > So my question is â are there better methods for creating >>>>> sequence-length >>>>> > Rle from metadata columns? Or another way of saying this is taking >>>>> some >>>>> > metadata column corresponding to ranges and mapping it to coordinate >>>>> space >>>>> > (maybe in one call)? It seems like if seqlengths is set in the >>>>> GRanges >>>>> > object, there's sufficient information to go directly from a GRanges >>>>> > metadata column to an Rle vector (and I might be missing a more >>>>> obvious >>>>> > solution). My example assumed a single chromosome, but an approach >>>>> that >>>>> > knows to handle multiple sequences through RleLists seems like it >>>>> would be >>>>> > helpful. >>>>> > >>>>> > thanks, >>>>> > Vince >>>>> > >>>>> > PS: My apologies if you've received this message twice, I had to >>>>> resend >>>>> > after it appears that I sent it to the wrong list. >>>>> > >>>>> > -- >>>>> > Vince Buffalo >>>>> > Ross-Ibarra Lab www.rilab.org) >>>>> > Plant Sciences, UC Davis >>>>> > >>>>> > [[alternative HTML version deleted]] >>>>> > >>>>> > _______________________________________________ >>>>> > Bioconductor mailing list >>>>> > Bioconductor@r-project.org >>>>> > https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> > Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor@r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>> >>>> >>> >>> >>> -- >>> Vince Buffalo >>> Ross-Ibarra Lab www.rilab.org) >>> Plant Sciences, UC Davis >>> >> >> >> >> -- >> Vince Buffalo >> Ross-Ibarra Lab www.rilab.org) >> Plant Sciences, UC Davis >> > > -- Vince Buffalo Ross-Ibarra Lab www.rilab.org) Plant Sciences, UC Davis [[alternative HTML version deleted]]

ADD REPLY • link 10.8 years ago Vince S. Buffalo ▴ 40

0

Entering edit mode

On Sat, Jun 7, 2014 at 9:46 AM, Vince S. Buffalo <vsbuffalo@ucdavis.edu> wrote: > I do like the idea of of a new verb. coverage() also doesn't guarantee the > Rle sequence will always be the same length as the underlying biological > sequence. I think some basics characteristics might include: > > - guarantees that Rle is the same length as seqlengths, error out if it's > not set. > It currently does take into account the seqlengths. I assume you're suggesting that it should throw an error if the seqlengths are not set. That's probably a good idea. > - takes an optional function argument on how to handle overlaps, e.g. > summarize the metadata column values of overlapping ranges to a single > value in the Rle. Examples include mean, median, custom function, etc. > - default value to use for no coverage > > Potentially useful. Would be good to hear some specific use cases though. I see the primary use of such a function as one that takes quantitative > values in metadata columns in possibly overlapping ranges, and creating a > Rle sequence equal that can then be used with Views and view functions to > summarize metadata by tile more easily. > > Vince > > > On Fri, Jun 6, 2014 at 9:19 PM, Michael Lawrence < > lawrence.michael@gene.com> wrote: > >> This has been requested in the past; not sure what happened, but >> initializing to NA somehow goes against the semantic of "coverage". Maybe >> there needs to be a new verb here? It's confusing enough as it is. >> >> >> On Fri, Jun 6, 2014 at 3:59 PM, Vince S. Buffalo <vsbuffalo@ucdavis.edu> >> wrote: >> >>> I've thought about this a bit more: coverage works, but by default is >>> assumes that no ranges means coverage=0. Would it be possible to have an >>> argument for coverage() that allows for no coverage regions to be given >>> some arbitrary value like NA? (1) for metadata column that can take >>> positive and negative values, there's no way to distinguish no coverage >>> versus truly 0, (2) missingness along a sequence is an important variable >>> too. Maybe this topic needs to be shifted to bioc-devel (or I'm missing >>> another piece). >>> >>> Vince >>> >>> >>> On Fri, Jun 6, 2014 at 2:24 PM, Vince S. Buffalo <vsbuffalo@ucdavis.edu> >>> wrote: >>> >>>> Terrific! I knew I might be missing a more obvious solution â I didn't >>>> know that coverage's weight argument could take a metadata column. Thanks >>>> Jeffrey and Michael! >>>> >>>> Vince >>>> >>>> >>>> On Fri, Jun 6, 2014 at 1:29 PM, Michael Lawrence < >>>> lawrence.michael@gene.com> wrote: >>>> >>>>> Note that if your statistic is per-base (like the mentioned GC >>>>> content), use XInteger and XIntegerViews for efficiency. But it looks like >>>>> Vince has gaps, in general at least. >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Fri, Jun 6, 2014 at 1:11 PM, Johnston, Jeffrey <jjj@stowers.org> >>>>> wrote: >>>>> >>>>>> For GRanges with a metadata column, you can do: >>>>>> >>>>>> coverage(granges, weight=âvariableâ) >>>>>> >>>>>> This will produce an RleList where the value at each coordinate is >>>>>> the sum of the âvariableâ metadata column of all overlapping ranges. I >>>>>> think this will work for your use case if your ranges do not overlap. >>>>>> >>>>>> -Jeff >>>>>> >>>>>> On Jun 6, 2014, at 2:59 PM, Vince S. Buffalo <vsbuffalo@ucdavis.edu> >>>>>> wrote: >>>>>> >>>>>> > Hi All, >>>>>> > >>>>>> > I'm thinking there might be a clever way to do something that I'm >>>>>> not aware >>>>>> > of. The setup is that I frequently find myself using using Views and >>>>>> > viewMeans, viewSums, etc. to calculate summary statistics by tiles >>>>>> on >>>>>> > sequences. I have a GRanges object with 1-width ranges (but this >>>>>> should >>>>>> > apply more generally), and metadata columns have some measurement >>>>>> (GC >>>>>> > content, pairwise diversity, some quality metric, etc). I need to >>>>>> go from a >>>>>> > quantitative variable tied to specific ranges to an Rle sequence >>>>>> across an >>>>>> > entire chromosome to use Views/viewMeans (e.g. the binnedAverages >>>>>> example >>>>>> > in the How to) . Right now I approach this with something like: >>>>>> > >>>>>> > data <- Rle(NA, length=seqlengths(txdb)['chr1']) >>>>>> > data[start(my_rngs)] <- my_rngs$variable # simple, since my >>>>>> features are >>>>>> > are 1-width >>>>>> > >>>>>> > # or more generally: >>>>>> > data2 <- Rle(NA, length=seqlengths(txdb)['chr1']) >>>>>> > data2[ranges(my_rngs)] <- my_rngs$variable >>>>>> > identical(as.vector(data), as.vector(data2)) # returns TRUE >>>>>> (contingent on >>>>>> > all widths = 1) >>>>>> > >>>>>> > Then, I can convert these to Views on a set of bins/tiles created >>>>>> with >>>>>> > tileGenome, and use the viewMean, viewSums, etc. functions >>>>>> (removing NAs). >>>>>> > >>>>>> > So my question is â are there better methods for creating >>>>>> sequence-length >>>>>> > Rle from metadata columns? Or another way of saying this is taking >>>>>> some >>>>>> > metadata column corresponding to ranges and mapping it to >>>>>> coordinate space >>>>>> > (maybe in one call)? It seems like if seqlengths is set in the >>>>>> GRanges >>>>>> > object, there's sufficient information to go directly from a GRanges >>>>>> > metadata column to an Rle vector (and I might be missing a more >>>>>> obvious >>>>>> > solution). My example assumed a single chromosome, but an approach >>>>>> that >>>>>> > knows to handle multiple sequences through RleLists seems like it >>>>>> would be >>>>>> > helpful. >>>>>> > >>>>>> > thanks, >>>>>> > Vince >>>>>> > >>>>>> > PS: My apologies if you've received this message twice, I had to >>>>>> resend >>>>>> > after it appears that I sent it to the wrong list. >>>>>> > >>>>>> > -- >>>>>> > Vince Buffalo >>>>>> > Ross-Ibarra Lab www.rilab.org) >>>>>> > Plant Sciences, UC Davis >>>>>> > >>>>>> > [[alternative HTML version deleted]] >>>>>> > >>>>>> > _______________________________________________ >>>>>> > Bioconductor mailing list >>>>>> > Bioconductor@r-project.org >>>>>> > https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> > Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor@r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>> >>>>> >>>> >>>> >>>> -- >>>> Vince Buffalo >>>> Ross-Ibarra Lab www.rilab.org) >>>> Plant Sciences, UC Davis >>>> >>> >>> >>> >>> -- >>> Vince Buffalo >>> Ross-Ibarra Lab www.rilab.org) >>> Plant Sciences, UC Davis >>> >> >> > > > -- > Vince Buffalo > Ross-Ibarra Lab www.rilab.org) > Plant Sciences, UC Davis > [[alternative HTML version deleted]]

ADD REPLY • link 10.8 years ago Michael Lawrence ★ 11k

Login before adding your answer.