Granges averaging

0

Entering edit mode

Yannick Wurm ▴ 40

@yannick-wurm-5519

Last seen 3.8 years ago

United Kingdom

Hello, my Granges object "bigGranges" contains too much data. THat makes plotting extremely slow. Sometimes I want an averaged overview. I use the following code below to obtain it. Subsequently I can use ggbio for plotting. But I feel there must be an elegant one-line manner of doing this? Thanks & kind regards, yannick ## code is more legible on averageGranges <- GRanges() interval_length <- 10000 for (scaffold_index in 1:nrow(myscaffolds)) { scaffold <- myscaffolds$id[scaffold_index] scaffold_length <- myscaffolds$to[scaffold_index] num_intervals <- scaffold_length / interval_length for (interval_index in 0:(num_intervals-1) ) { grangesInterval <- GRanges(scaffold, IRanges(interval_index * interval_length, min( scaffold_length, (interval_index+1)* interval_length ))) subsetGranges <- subsetByOverlaps( bigGranges , grangesInterval) grangesInterval$averagemyValue <- mean(subsetGranges$myValue) averageGranges <- append(averageGranges, grangesInterval) } } # then I use averageGranges -------------------------------------- Yannick Wurm http://yannick.poulet.org Ants, Genomes & Evolution ? y.wurm at qmul.ac.uk ? skype:yannickwurm ? +44 207 882 3049 5.03A Fogg ? School of Biological & Chemical Sciences ? Queen Mary, University of London ? Mile End Road ? E1 4NS London ? UK Easy custom BLAST interface: http://www.sequenceserver.com

genomes genomes • 1.5k views

ADD COMMENT • link updated 13.2 years ago by Hervé Pagès 16k • written 13.2 years ago by Yannick Wurm ▴ 40

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 2 days ago

Seattle, WA, United States

Hi Yannick, On 10/02/2012 10:07 AM, Yannick Wurm wrote: > Hello, > > my Granges object "bigGranges" contains too much data. THat makes plotting extremely slow. Sometimes I want an averaged overview. I use the following code below to obtain it. Subsequently I can use ggbio for plotting. > But I feel there must be an elegant one-line manner of doing this? The most elegant manner would be that ggbio takes care of this for you ;-) The code you provide below is not self-contained (we don't know what bigGranges and myscaffolds are). Also you don't explain what it's supposed to do and/or what the output is supposed to look like. Anyway, using 2 big nested "for" loops and growing the averageGranges object at each iteration is in general very inefficient. The "R way" is to take advantage of the fact that most arithmetic operations in R are vectorized. Also the IRanges/GenomicRanges packages provide a lot of arithmetic and other operations to let you work on RleViews, RleViewsList, IRangesList, GRanges etc... objects in a vectorized fashion. But most importantly, the kind of average you compute seems questionable to me. Right now, if a bin is for example the [500, 599] interval, and if your bigGranges object has 2 ranges that fall within that bin e.g. start end myValue 510 510 2.8 500 549 3.8 then you will assign the non-weighted myValue average to the bin (i.e. 3.3). Is it really what you want? Or, given that the 1st range doesn't count a lot, and the 2nd range covers half of the bin, wouldn't you want the average to be close to 1.9? Another (somewhat related) issue I see is that if a range in bigGranges overlaps with 2 bins then you're counting it twice (because you're using subsetByOverlaps()). What about computing the weighted coverage of bigGranges (using the myValue metadata column for the weights), and then the mean of this coverage for a set of bins you specifiy? You would not only get rid of the above issues, but it can also be implemented in a very efficient way. Here is what it would look like: ### First define the bins. The bins would typically be stored in a ### GRanges or IRangesList object. Here we choose the latter: mybins <- lapply(seqlengths(bigGranges), function(chrom_len) IRanges(breakInChunks(chrom_len, 200L))) mybins <- IRangesList(mybins) ### Then compute the coverage of your 'bigGranges', weighted by ### the values in its "myValue" metadata column: cvg <- coverage(bigGranges, weight="myValue") ### Then put views (corresponding to the bins) on top of the ### RleList object returned by coverage(). The result is stored ### in a RleViewsList object: viewslist <- RleViewsList( lapply(names(cvg), function(chrom) Views(cvg[[chrom]], mybins[[chrom]]))) names(viewslist) <- names(cvg) ### Then call viewMeans() on 'viewslist': bin_means <- viewMeans(viewslist) ### 'bin_means' is a NumericList object. It as the same "shape" as ### 'mybins' i.e. both are list-like objects of the same length and ### names, and the lengths of 'bin_means[[i]]' and 'mybins[[i]]' are ### the same for any valid i. ### Finally we construct 'averageGranges' from 'mybins' and 'bin_means': averageGranges <- as(mybins, "GRanges") mcols(averageGranges)$averagemyValue <- unlist(bin_means) Hope this helps, H. > > Thanks & kind regards, > yannick > > > > ## code is more legible on > > > averageGranges <- GRanges() > interval_length <- 10000 > > for (scaffold_index in 1:nrow(myscaffolds)) { > scaffold <- myscaffolds$id[scaffold_index] > scaffold_length <- myscaffolds$to[scaffold_index] > > num_intervals <- scaffold_length / interval_length > > for (interval_index in 0:(num_intervals-1) ) { > grangesInterval <- GRanges(scaffold, IRanges(interval_index * interval_length, > min( scaffold_length, > (interval_index+1)* interval_length ))) > subsetGranges <- subsetByOverlaps( bigGranges , grangesInterval) > > grangesInterval$averagemyValue <- mean(subsetGranges$myValue) > averageGranges <- append(averageGranges, grangesInterval) > } > } > > # then I use averageGranges > > -------------------------------------- > Yannick Wurm http://yannick.poulet.org > Ants, Genomes & Evolution ? y.wurm at qmul.ac.uk ? skype:yannickwurm ? +44 207 882 3049 > 5.03A Fogg ? School of Biological & Chemical Sciences ? Queen Mary, University of London ? Mile End Road ? E1 4NS London ? UK > Easy custom BLAST interface: http://www.sequenceserver.com > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 13.2 years ago Hervé Pagès 16k

0

Entering edit mode

On Wed, Oct 3, 2012 at 3:17 PM, HervÃ© PagÃ¨s <hpages@fhcrc.org> wrote: > Hi Yannick, > > > On 10/02/2012 10:07 AM, Yannick Wurm wrote: > >> Hello, >> >> my Granges object "bigGranges" contains too much data. THat makes >> plotting extremely slow. Sometimes I want an averaged overview. I use the >> following code below to obtain it. Subsequently I can use ggbio for >> plotting. >> But I feel there must be an elegant one-line manner of doing this? >> > > The most elegant manner would be that ggbio takes care of this for > you ;-) > sorry Yannick, I missed this discussion related to ggbio, I have something in ggbio you could try. Please update ggbio, and check ?stat_aggregate, ignore the "##doesn't work" comment in the example, just run those commented line, it should work, aggregate is This smooth and aggregate over your GRanges first by window argument you specified, and show as bar, boxplot, point.... , 'type' have customized method for min/max/mean/...,and customized Fun is supported too. This is kind of experimental, for some reason HervÃ© mentioned below, trying to figure out the best practice. now it use aggregate method inside and some other utils in GenomicRanges package. cheers Tengfei > The code you provide below is not self-contained (we don't know what > bigGranges and myscaffolds are). Also you don't explain what it's > supposed to do and/or what the output is supposed to look like. > > Anyway, using 2 big nested "for" loops and growing the averageGranges > object at each iteration is in general very inefficient. The "R way" is > to take advantage of the fact that most arithmetic operations in R are > vectorized. Also the IRanges/GenomicRanges packages provide a lot of > arithmetic and other operations to let you work on RleViews, > RleViewsList, IRangesList, GRanges etc... objects in a vectorized > fashion. > > But most importantly, the kind of average you compute seems questionable > to me. Right now, if a bin is for example the [500, 599] interval, and > if your bigGranges object has 2 ranges that fall within that bin e.g. > start end myValue > 510 510 2.8 > 500 549 3.8 > then you will assign the non-weighted myValue average to the bin (i.e. > 3.3). Is it really what you want? Or, given that the 1st range doesn't > count a lot, and the 2nd range covers half of the bin, wouldn't you > want the average to be close to 1.9? > > Another (somewhat related) issue I see is that if a range in bigGranges > overlaps with 2 bins then you're counting it twice (because you're using > subsetByOverlaps()). > > What about computing the weighted coverage of bigGranges (using the > myValue metadata column for the weights), and then the mean of this > coverage for a set of bins you specifiy? You would not only get rid > of the above issues, but it can also be implemented in a very > efficient way. > > Here is what it would look like: > > ### First define the bins. The bins would typically be stored in a > ### GRanges or IRangesList object. Here we choose the latter: > mybins <- lapply(seqlengths(bigGranges), > function(chrom_len) > IRanges(breakInChunks(chrom_**len, 200L))) > mybins <- IRangesList(mybins) > > ### Then compute the coverage of your 'bigGranges', weighted by > ### the values in its "myValue" metadata column: > cvg <- coverage(bigGranges, weight="myValue") > > ### Then put views (corresponding to the bins) on top of the > ### RleList object returned by coverage(). The result is stored > ### in a RleViewsList object: > viewslist <- RleViewsList( > lapply(names(cvg), > function(chrom) > Views(cvg[[chrom]], mybins[[chrom]]))) > names(viewslist) <- names(cvg) > > ### Then call viewMeans() on 'viewslist': > bin_means <- viewMeans(viewslist) > > ### 'bin_means' is a NumericList object. It as the same "shape" as > ### 'mybins' i.e. both are list-like objects of the same length and > ### names, and the lengths of 'bin_means[[i]]' and 'mybins[[i]]' are > ### the same for any valid i. > ### Finally we construct 'averageGranges' from 'mybins' and 'bin_means': > averageGranges <- as(mybins, "GRanges") > mcols(averageGranges)$**averagemyValue <- unlist(bin_means) > > Hope this helps, > H. > > > >> Thanks & kind regards, >> yannick >> >> >> >> ## code is more legible on https://gist.github.com/**c5a553bd012876 effb48<https: gist.github.com="" c5a553bd012876effb48=""> >> >> >> averageGranges <- GRanges() >> interval_length <- 10000 >> >> for (scaffold_index in 1:nrow(myscaffolds)) { >> scaffold <- myscaffolds$id[scaffold_index] >> scaffold_length <- myscaffolds$to[scaffold_index] >> >> num_intervals <- scaffold_length / interval_length >> >> for (interval_index in 0:(num_intervals-1) ) { >> grangesInterval <- GRanges(scaffold, IRanges(interval_index * >> interval_length, >> min( scaffold_length, >> >> (interval_index+1)* interval_length ))) >> subsetGranges <- subsetByOverlaps( bigGranges , grangesInterval) >> >> grangesInterval$averagemyValue <- mean(subsetGranges$myValue) >> averageGranges <- append(averageGranges, grangesInterval) >> } >> } >> >> # then I use averageGranges >> >> ------------------------------**-------- >> Yannick Wurm http://yannick.poulet.org >> Ants, Genomes & Evolution â y.wurm@qmul.ac.uk â skype:yannickwurm â +44 >> 207 882 3049 >> 5.03A Fogg â School of Biological & Chemical Sciences â Queen Mary, >> University of London â Mile End Road â E1 4NS London â UK >> Easy custom BLAST interface: http://www.sequenceserver.com >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> >> > -- > HervÃ© PagÃ¨s > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > -- Tengfei Yin MCDB PhD student 1620 Howe Hall, 2274, Iowa State University Ames, IA,50011-2274 [[alternative HTML version deleted]]

ADD REPLY • link 13.2 years ago Tengfei Yin ▴ 420

Login before adding your answer.