Turning a GRanges Metadata Column into Rle List

0

Entering edit mode

Dario Strbenac ★ 1.6k

@dario-strbenac-5916

Last seen 5 weeks ago

Australia

Anything neater than this available ? g <- GRanges(c("chr1", "chr2"), IRanges(c(10, 50), c(15, 55)), scores = c(20, 10)) seqlengths(g) <- c(100, 100) chrRanges <- split(g, seqnames(g)) lapply(chrRanges, function(x) { scoreVect <- rep(NA, seqlengths(x)[1]) invisible(mapply(function(st, end, score) scoreVect[st:end] <<- score, start(x), end(x), values(x)[,1])) Rle(scoreVect) }) Then, code to define views on these Rle and calculate view summaries. -------------------------------------- Dario Strbenac PhD Student University of Sydney Camperdown NSW 2050 Australia

• 4.5k views

ADD COMMENT • link updated 13.1 years ago by Aaron Statham ▴ 40 • written 13.1 years ago by Dario Strbenac ★ 1.6k

0

Entering edit mode

Aaron Statham ▴ 40

@aaron-statham-4434

Last seen 11.4 years ago

On 7 January 2013 16:00, Dario Strbenac <d.strbenac@garvan.org.au> wrote: > Anything neater than this available ? > > g <- GRanges(c("chr1", "chr2"), IRanges(c(10, 50), c(15, 55)), scores = > c(20, 10)) > seqlengths(g) <- c(100, 100) > chrRanges <- split(g, seqnames(g)) > lapply(chrRanges, function(x) > { > scoreVect <- rep(NA, seqlengths(x)[1]) > invisible(mapply(function(st, end, score) scoreVect[st:end] <<- score, > start(x), end(x), values(x)[,1])) > Rle(scoreVect) > }) > > Then, code to define views on these Rle and calculate view summaries. > > Hey Dario, If I understand what you want to do correctly (your goal is not obvious) this one liner should work for integer/numeric metadata columns, and even handles overlapping scores gracefully. #define the ranges/scores g <- GRanges(c("chr1", "chr2"), IRanges(c(10, 50), c(15, 55)), scores = c(20, 10)) seqlengths(g) <- c(100, 100) #Rles of scores coverage(g, weight=values(g)$scores) Cheers, Aaron > -------------------------------------- > Dario Strbenac > PhD Student > University of Sydney > Camperdown NSW 2050 > Australia > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Aaron Statham Postgraduate Scholar, Cancer Epigenetics Garvan Institute of Medical Research Tel: (02) 9295 8393 384 Victoria St Darlinghurst 2010 Fax: (02) 9295 8316 NSW Australia email: a.statham@garvan.org.au [[alternative HTML version deleted]]

ADD COMMENT • link 13.1 years ago Aaron Statham ▴ 40

0

Entering edit mode

> coverage(g, weight=values(g)$scores) That is close to what I was after, although the ranges which aren't present in the GRanges object effectively have missing scores and need to be NA because 0 is a valid score for the ranges which are present. So, if I had a view defined that only partially overlapped with a range in the example, I would want any summary to give NA, rather than taking 0 into the calculation.

ADD REPLY • link 13.1 years ago Dario Strbenac ★ 1.6k

0

Entering edit mode

One hack would be to add 1 to all of the scores, replace the zeros in the coverage() result with NAs and subtract 1. I've hit something like this a couple times in the past. The coverage method for GRanges could gain a default value argument. Michael On Sun, Jan 6, 2013 at 11:00 PM, Dario Strbenac <d.strbenac@garvan.org.au>wrote: > > coverage(g, weight=values(g)$scores) > > That is close to what I was after, although the ranges which aren't > present in the GRanges object effectively have missing scores and need to > be NA because 0 is a valid score for the ranges which are present. So, if I > had a view defined that only partially overlapped with a range in the > example, I would want any summary to give NA, rather than taking 0 into the > calculation. > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 13.1 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Something like coverage(foo, bar, ..., NA.value=-1) ? --t On Jan 7, 2013, at 8:59 AM, Michael Lawrence <lawrence.michael at="" gene.com=""> wrote: > One hack would be to add 1 to all of the scores, replace the zeros in the > coverage() result with NAs and subtract 1. I've hit something like this a > couple times in the past. The coverage method for GRanges could gain a > default value argument. > > Michael > > > > On Sun, Jan 6, 2013 at 11:00 PM, Dario Strbenac <d.strbenac at="" garvan.org.au="">wrote: > >>> coverage(g, weight=values(g)$scores) >> >> That is close to what I was after, although the ranges which aren't >> present in the GRanges object effectively have missing scores and need to >> be NA because 0 is a valid score for the ranges which are present. So, if I >> had a view defined that only partially overlapped with a range in the >> example, I would want any summary to give NA, rather than taking 0 into the >> calculation. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 13.1 years ago Tim Triche ★ 4.2k

0

Entering edit mode

Here's a quick workaround (would be more elegant if weights in coverage() could be NA, but that would probably break other functionality) g <- GRanges(c("chr1", "chr2"), IRanges(c(10, 50), c(15, 55)), scores = c(20, 10)) seqlengths(g) <- c(100, 100) mask=-1 #placeholder for NA #Define the ranges we want to be NA gInv <- setdiff(GRanges(seqnames(g), IRanges(1, seqlengths(g))), g) values(gInv)$scores <- mask g <- c(g, gInv) endoapply(coverage(g, weight="scores"), function(x) { #replace mask with NA runValue(x)[runValue(x==mask)] <- NA x }) Aaron On 8 January 2013 03:59, Michael Lawrence <lawrence.michael@gene.com> wrote: > One hack would be to add 1 to all of the scores, replace the zeros in the > coverage() result with NAs and subtract 1. I've hit something like this a > couple times in the past. The coverage method for GRanges could gain a > default value argument. > > Michael > > > > On Sun, Jan 6, 2013 at 11:00 PM, Dario Strbenac <d.strbenac@garvan.org.au>wrote: > >> > coverage(g, weight=values(g)$scores) >> >> That is close to what I was after, although the ranges which aren't >> present in the GRanges object effectively have missing scores and need to >> be NA because 0 is a valid score for the ranges which are present. So, if I >> had a view defined that only partially overlapped with a range in the >> example, I would want any summary to give NA, rather than taking 0 into the >> calculation. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- Aaron Statham Postgraduate Scholar, Cancer Epigenetics Garvan Institute of Medical Research Tel: (02) 9295 8393 384 Victoria St Darlinghurst 2010 Fax: (02) 9295 8316 NSW Australia email: a.statham@garvan.org.au [[alternative HTML version deleted]]

ADD REPLY • link 13.1 years ago Aaron Statham ▴ 40

0

Entering edit mode

I was thinking you could just do (assuming score is always >= 0, and that the ranges do not overlap, which seems to be the case from the initial code): gr$score <- gr$score + 1L cov <- coverage(gr, weight = "score") cov[cov == 0L] <- NA cov <- cov - 1L Note that it should not be necessary to use endoapply or any other apply function, as it is all implicit in the List API. Not tested; just an illustration. Michael On Mon, Jan 7, 2013 at 2:55 PM, Aaron Statham <a.statham@garvan.org.au>wrote: > Here's a quick workaround (would be more elegant if weights in coverage() > could be NA, but that would probably break other functionality) > > g <- GRanges(c("chr1", "chr2"), IRanges(c(10, 50), c(15, 55)), scores > = c(20, 10)) > seqlengths(g) <- c(100, 100) > > mask=-1 #placeholder for NA > #Define the ranges we want to be NA > gInv <- setdiff(GRanges(seqnames(g), IRanges(1, seqlengths(g))), g) > values(gInv)$scores <- mask > g <- c(g, gInv) > > endoapply(coverage(g, weight="scores"), function(x) { > #replace mask with NA > runValue(x)[runValue(x==mask)] <- NA > x > }) > > Aaron > > > On 8 January 2013 03:59, Michael Lawrence <lawrence.michael@gene.com>wrote: > >> One hack would be to add 1 to all of the scores, replace the zeros in the >> coverage() result with NAs and subtract 1. I've hit something like this a >> couple times in the past. The coverage method for GRanges could gain a >> default value argument. >> >> Michael >> >> >> >> On Sun, Jan 6, 2013 at 11:00 PM, Dario Strbenac <d.strbenac@garvan.org.au>> > wrote: >> >>> > coverage(g, weight=values(g)$scores) >>> >>> That is close to what I was after, although the ranges which aren't >>> present in the GRanges object effectively have missing scores and need to >>> be NA because 0 is a valid score for the ranges which are present. So, if I >>> had a view defined that only partially overlapped with a range in the >>> example, I would want any summary to give NA, rather than taking 0 into the >>> calculation. >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > > > -- > Aaron Statham > Postgraduate Scholar, Cancer Epigenetics > Garvan Institute of Medical Research Tel: (02) 9295 8393 > 384 Victoria St Darlinghurst 2010 Fax: (02) 9295 8316 > NSW Australia email: a.statham@garvan.org.au > [[alternative HTML version deleted]]

ADD REPLY • link 13.1 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Dario, Alternatively you could call coverage() a 2nd time with no weights to find the regions with no coverage, and set them to NA: cvg <- coverage(gr, weight="scores") cvg[coverage(gr) == 0] <- NA H. On 01/07/2013 04:00 PM, Michael Lawrence wrote: > I was thinking you could just do (assuming score is always >= 0, and that > the ranges do not overlap, which seems to be the case from the initial > code): > > gr$score <- gr$score + 1L > cov <- coverage(gr, weight = "score") > cov[cov == 0L] <- NA > cov <- cov - 1L > > Note that it should not be necessary to use endoapply or any other apply > function, as it is all implicit in the List API. > > Not tested; just an illustration. > > Michael > > > On Mon, Jan 7, 2013 at 2:55 PM, Aaron Statham <a.statham at="" garvan.org.au="">wrote: > >> Here's a quick workaround (would be more elegant if weights in coverage() >> could be NA, but that would probably break other functionality) >> >> g <- GRanges(c("chr1", "chr2"), IRanges(c(10, 50), c(15, 55)), scores >> = c(20, 10)) >> seqlengths(g) <- c(100, 100) >> >> mask=-1 #placeholder for NA >> #Define the ranges we want to be NA >> gInv <- setdiff(GRanges(seqnames(g), IRanges(1, seqlengths(g))), g) >> values(gInv)$scores <- mask >> g <- c(g, gInv) >> >> endoapply(coverage(g, weight="scores"), function(x) { >> #replace mask with NA >> runValue(x)[runValue(x==mask)] <- NA >> x >> }) >> >> Aaron >> >> >> On 8 January 2013 03:59, Michael Lawrence <lawrence.michael at="" gene.com="">wrote: >> >>> One hack would be to add 1 to all of the scores, replace the zeros in the >>> coverage() result with NAs and subtract 1. I've hit something like this a >>> couple times in the past. The coverage method for GRanges could gain a >>> default value argument. >>> >>> Michael >>> >>> >>> >>> On Sun, Jan 6, 2013 at 11:00 PM, Dario Strbenac <d.strbenac at="" garvan.org.au="">>>> wrote: >>> >>>>> coverage(g, weight=values(g)$scores) >>>> >>>> That is close to what I was after, although the ranges which aren't >>>> present in the GRanges object effectively have missing scores and need to >>>> be NA because 0 is a valid score for the ranges which are present. So, if I >>>> had a view defined that only partially overlapped with a range in the >>>> example, I would want any summary to give NA, rather than taking 0 into the >>>> calculation. >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> >> >> >> -- >> Aaron Statham >> Postgraduate Scholar, Cancer Epigenetics >> Garvan Institute of Medical Research Tel: (02) 9295 8393 >> 384 Victoria St Darlinghurst 2010 Fax: (02) 9295 8316 >> NSW Australia email: a.statham at garvan.org.au >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.1 years ago Hervé Pagès 16k

0

Entering edit mode

> I was thinking you could just do (assuming score is > always >= 0, and that the ranges do not overlap, > which seems to be the case from the initial code): In another scenario, what if I have data on multiple cell lines, such as one of the ENCODE integrated tracks, and I would like to find the maximum value at each base within a specified genomic region ? In this case, the ranges in the supertrack would overlap often. I would imagine making a separate coverage RleList for each cell line, to avoid complications with overlapping ranges. Then, it would be nice to have a pmax function in the API that could take a number of RleList objects, each of the same length, and return one RleList. Are there any plans for pmin and pmax of this style ?

ADD REPLY • link 13.0 years ago Dario Strbenac ★ 1.6k

0

Entering edit mode

On 01/14/2013 10:00 PM, Dario Strbenac wrote: >> I was thinking you could just do (assuming score is always >= 0, and that >> the ranges do not overlap, which seems to be the case from the initial >> code): > > In another scenario, what if I have data on multiple cell lines, such as one > of the ENCODE integrated tracks, and I would like to find the maximum value > at each base within a specified genomic region ? In this case, the ranges in > the supertrack would overlap often. > > I would imagine making a separate coverage RleList for each cell line, to > avoid complications with overlapping ranges. Then, it would be nice to have a > pmax function in the API that could take a number of RleList objects, each of > the same length, and return one RleList. Are there any plans for pmin and > pmax of this style ? I think there is one? library(IRanges) showMethods(class="RleList", where=search()) > r1 = RleList(a=Rle(c(0, 0, 1, 0)), b=Rle(c(1, 1, 0, 0))) > r2 = RleList(a=Rle(c(0, 0, 2, 2)), b=Rle(c(1, 2, 1, 1))) > pmax(r1, r2) SimpleRleList of length 2 $a numeric-Rle of length 4 with 2 runs Lengths: 2 2 Values : 0 2 $b numeric-Rle of length 4 with 3 runs Lengths: 1 1 2 Values : 1 2 1 > > _______________________________________________ Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 13.0 years ago Martin Morgan 25k

0

Entering edit mode

If you want the maximum of the overlap you can use something like this too: ir<-IRanges(start=you.intervals[,"start"],end=you.intervals[,"end"]) ir.encode.data<-IRanges(start=encode.data[,"start"],end=encode.data[," end"]) over<-findOverlaps(ir,ir.encode.data) query<-queryHits(over) subject<-subjectHits(over) tapply(as.numeric(encode.data[subject,"score"]),query,function(x) max(x,na.rm=TRUE)) slight modification if have different chrs .... This works for getting GERP scores in indels etc Dr Paul Leo Senior Bioinformatician The University of Queensland Diamantina Institute --------------------------------------------------------------------- TRI, level , 37 Kent Street, Woolloongabba QLD 4102 Tel: +61 7 3443 7072 Mob: 041 303 8691 Fax: +61 7 3443 6966 -----Original Message----- From: Martin Morgan <mtmorgan@fhcrc.org> To: D.Strbenac at garvan.org.au Cc: bioconductor at r-project.org Subject: Re: [BioC] Turning a GRanges Metadata Column into Rle List Date: Mon, 14 Jan 2013 22:17:17 -0800 On 01/14/2013 10:00 PM, Dario Strbenac wrote: >> I was thinking you could just do (assuming score is always >= 0, and that >> the ranges do not overlap, which seems to be the case from the initial >> code): > > In another scenario, what if I have data on multiple cell lines, such as one > of the ENCODE integrated tracks, and I would like to find the maximum value > at each base within a specified genomic region ? In this case, the ranges in > the supertrack would overlap often. > > I would imagine making a separate coverage RleList for each cell line, to > avoid complications with overlapping ranges. Then, it would be nice to have a > pmax function in the API that could take a number of RleList objects, each of > the same length, and return one RleList. Are there any plans for pmin and > pmax of this style ? I think there is one? library(IRanges) showMethods(class="RleList", where=search()) > r1 = RleList(a=Rle(c(0, 0, 1, 0)), b=Rle(c(1, 1, 0, 0))) > r2 = RleList(a=Rle(c(0, 0, 2, 2)), b=Rle(c(1, 2, 1, 1))) > pmax(r1, r2) SimpleRleList of length 2 $a numeric-Rle of length 4 with 2 runs Lengths: 2 2 Values : 0 2 $b numeric-Rle of length 4 with 3 runs Lengths: 1 1 2 Values : 1 2 1 > > _______________________________________________ Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 13.0 years ago Paul Leo ▴ 970

0

Entering edit mode

Actually this may be better. than my last post: But making the Rle object of scores using "coverage and weights" (is the only way I could think of). In won't work in cases where the intervals in ir.encode.data are not unique (that is the coverage values must always be one else you get coverage*score instead of just score when you use the weights) Is there a better way to make an Rle of non-contigious scores ??? ### assume encode.data is table with columns start,end,score ### assume you.intervals is table with columns start,end # ir<-IRanges(start=you.intervals[,"start"],end=you.intervals[,"end"]) ir.encode.data<-IRanges(start=encode.data[,"start"],end=encode.data[," end"]) cov<-coverage(ir.encode.data,weight=as.numeric(ir.encode.data[,"score" ])) ## DANGER HERE myViews<-Views(cov,start=you.intervals[,"start"],end=you.intervals[,"e nd"] ) viewMaxs(myViews) if ir.encode.data and ir are RleLists use regionViews <- RleViewsList(rleList = cov, rangesList =ir ) instesd of the Views I'm not sure which solution is best but the above should use less memory? Dr Paul Leo Senior Bioinformatician The University of Queensland Diamantina Institute --------------------------------------------------------------------- TRI, level , 37 Kent Street, Woolloongabba QLD 4102 Tel: +61 7 3443 7072 Mob: 041 303 8691 Fax: +61 7 3443 6966 -----Original Message----- From: Paul Leo <p.leo@uq.edu.au> Reply-to: <p.leo at="" uq.edu.au=""> To: Martin Morgan <mtmorgan at="" fhcrc.org=""> Cc: bioconductor at r-project.org Subject: Re: [BioC] Turning a GRanges Metadata Column into Rle List Date: Tue, 15 Jan 2013 16:38:33 +1000 If you want the maximum of the overlap you can use something like this too: ir<-IRanges(start=you.intervals[,"start"],end=you.intervals[,"end"]) ir.encode.data<-IRanges(start=encode.data[,"start"],end=encode.data[," end"]) over<-findOverlaps(ir,ir.encode.data) query<-queryHits(over) subject<-subjectHits(over) tapply(as.numeric(encode.data[subject,"score"]),query,function(x) max(x,na.rm=TRUE)) slight modification if have different chrs .... This works for getting GERP scores in indels etc Dr Paul Leo Senior Bioinformatician The University of Queensland Diamantina Institute --------------------------------------------------------------------- TRI, level , 37 Kent Street, Woolloongabba QLD 4102 Tel: +61 7 3443 7072 Mob: 041 303 8691 Fax: +61 7 3443 6966 -----Original Message----- From: Martin Morgan <mtmorgan@fhcrc.org> To: D.Strbenac at garvan.org.au Cc: bioconductor at r-project.org Subject: Re: [BioC] Turning a GRanges Metadata Column into Rle List Date: Mon, 14 Jan 2013 22:17:17 -0800 On 01/14/2013 10:00 PM, Dario Strbenac wrote: >> I was thinking you could just do (assuming score is always >= 0, and that >> the ranges do not overlap, which seems to be the case from the initial >> code): > > In another scenario, what if I have data on multiple cell lines, such as one > of the ENCODE integrated tracks, and I would like to find the maximum value > at each base within a specified genomic region ? In this case, the ranges in > the supertrack would overlap often. > > I would imagine making a separate coverage RleList for each cell line, to > avoid complications with overlapping ranges. Then, it would be nice to have a > pmax function in the API that could take a number of RleList objects, each of > the same length, and return one RleList. Are there any plans for pmin and > pmax of this style ? I think there is one? library(IRanges) showMethods(class="RleList", where=search()) > r1 = RleList(a=Rle(c(0, 0, 1, 0)), b=Rle(c(1, 1, 0, 0))) > r2 = RleList(a=Rle(c(0, 0, 2, 2)), b=Rle(c(1, 2, 1, 1))) > pmax(r1, r2) SimpleRleList of length 2 $a numeric-Rle of length 4 with 2 runs Lengths: 2 2 Values : 0 2 $b numeric-Rle of length 4 with 3 runs Lengths: 1 1 2 Values : 1 2 1 > > _______________________________________________ Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 13.0 years ago Paul Leo ▴ 970

Login before adding your answer.