Search
Question: help with reduction operation using IRanges/GRanges
0
6.1 years ago by
Abhishek Pratap400 wrote:
Hi Guys I am trying to run a reduce operation on my genomic interval data. Sample Data and expected output pasted here : http://pastebin.com/cccZby2t Rows 1-5 from input have merged since the tiling distance between them is < 2 bp. Ideally I would like to define that distance. Rows 5-6 are merged separately as their distance is more than 2 bp from the above rows. I hope tiling is the right word above. In essence what I want to do is define a max merging width and merge all the intervals which are within this user defined threshold . The condition of tiling distance should be satisfied by both start and end. Hope this is clear. Thanks! -Abhi [[alternative HTML version deleted]]
modified 6.1 years ago • written 6.1 years ago by Abhishek Pratap400
0
6.1 years ago by
United States
Kasper Daniel Hansen6.3k wrote:
Use the min.gapwidth argument to reduce, as explained in the help page. Kasper On Tue, Apr 24, 2012 at 2:47 PM, Abhishek Pratap <apratap at="" lbl.gov=""> wrote: > Hi Guys > > I am trying to run a reduce operation on my genomic interval data. > > Sample Data and expected output pasted here : http://pastebin.com/cccZby2t > > > Rows 1-5 from input have merged since the tiling distance between them is < > 2 bp. Ideally I would like to define that distance. > Rows 5-6 are merged separately as their distance is more than 2 bp from the > above rows. > > I hope tiling is the right word above. In essence what I want to do is > define a max merging width and merge all the intervals which are within > this user defined threshold . The condition of tiling distance should be > satisfied by both start and end. > > Hope this is clear. > > Thanks! > -Abhi > > ? ? ? ?[[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
Hi Kasper Thanks for a quick reply. I tried the following and I believe I would need a max.gapwidth parameter. ir <- IRanges(c(10,10,11,9,10,11), width=c(190,190,190,190,180,180)) reduce(ir,min.gapwidth=1) IRanges of length 1 start end width [1] 9 200 192 Although not sure why this one merges everything reduce(ir,min.gapwidth=11) IRanges of length 1 start end width [1] 9 200 192 -Abhi On Tue, Apr 24, 2012 at 2:54 PM, Kasper Daniel Hansen < kasperdanielhansen@gmail.com> wrote: > Use the min.gapwidth argument to reduce, as explained in the help page. > > Kasper > > On Tue, Apr 24, 2012 at 2:47 PM, Abhishek Pratap <apratap@lbl.gov> wrote: > > Hi Guys > > > > I am trying to run a reduce operation on my genomic interval data. > > > > Sample Data and expected output pasted here : > http://pastebin.com/cccZby2t > > > > > > Rows 1-5 from input have merged since the tiling distance between them > is < > > 2 bp. Ideally I would like to define that distance. > > Rows 5-6 are merged separately as their distance is more than 2 bp from > the > > above rows. > > > > I hope tiling is the right word above. In essence what I want to do is > > define a max merging width and merge all the intervals which are within > > this user defined threshold . The condition of tiling distance should be > > satisfied by both start and end. > > > > Hope this is clear. > > > > Thanks! > > -Abhi > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
> ir <- IRanges(start = c(1,5), width = 3) > ir IRanges of length 2 start end width [1] 1 3 3 [2] 5 7 3 > reduce(ir) IRanges of length 2 start end width [1] 1 3 3 [2] 5 7 3 > reduce(ir, min.gapwidth = 2) IRanges of length 1 start end width [1] 1 7 7 As far as I understand, this is exactly what you want. If you have further questions, I suggest you read the help page. Kasper On Tue, Apr 24, 2012 at 3:18 PM, Abhishek Pratap <apratap at="" lbl.gov=""> wrote: > Hi Kasper > > Thanks for a quick reply. I tried the following and I believe I would need a > max.gapwidth parameter. > > ir <- IRanges(c(10,10,11,9,10,11), width=c(190,190,190,190,180,180)) > > reduce(ir,min.gapwidth=1) > IRanges of length 1 > ? ? start end width > [1] ? ? 9 200 ? 192 > > > Although not sure why this one merges everything > > ?reduce(ir,min.gapwidth=11) > IRanges of length 1 > ? ? start end width > [1] ? ? 9 200 ? 192 > > -Abhi > > > > > > On Tue, Apr 24, 2012 at 2:54 PM, Kasper Daniel Hansen > <kasperdanielhansen at="" gmail.com=""> wrote: >> >> Use the min.gapwidth argument to reduce, as explained in the help page. >> >> Kasper >> >> On Tue, Apr 24, 2012 at 2:47 PM, Abhishek Pratap <apratap at="" lbl.gov=""> wrote: >> > Hi Guys >> > >> > I am trying to run a reduce operation on my genomic interval data. >> > >> > Sample Data and expected output pasted here : >> > http://pastebin.com/cccZby2t >> > >> > >> > Rows 1-5 from input have merged since the tiling distance between them >> > is < >> > 2 bp. Ideally I would like to define that distance. >> > Rows 5-6 are merged separately as their distance is more than 2 bp from >> > the >> > above rows. >> > >> > I hope tiling is the right word above. In essence what I want to do is >> > define a max merging width and merge all the intervals which are within >> > this user defined threshold . The condition of tiling distance should be >> > satisfied by both start and end. >> > >> > Hope this is clear. >> > >> > Thanks! >> > -Abhi >> > >> > ? ? ? ?[[alternative HTML version deleted]] >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor at r-project.org >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> > http://news.gmane.org/gmane.science.biology.informatics.conductor > >
ADD REPLYlink written 6.1 years ago by Kasper Daniel Hansen6.3k
Hi, On Tue, Apr 24, 2012 at 6:43 PM, Kasper Daniel Hansen <kasperdanielhansen at="" gmail.com=""> wrote: >> ir <- IRanges(start = c(1,5), width = 3) >> ir > IRanges of length 2 > ? ?start end width > [1] ? ? 1 ? 3 ? ? 3 > [2] ? ? 5 ? 7 ? ? 3 >> reduce(ir) > IRanges of length 2 > ? ?start end width > [1] ? ? 1 ? 3 ? ? 3 > [2] ? ? 5 ? 7 ? ? 3 >> reduce(ir, min.gapwidth = 2) > IRanges of length 1 > ? ?start end width > [1] ? ? 1 ? 7 ? ? 7 > > As far as I understand, this is exactly what you want. ?If you have > further questions, I suggest you read the help page. This actually isn't what he wants -- but what he wants isn't answerable w/ reduce. All of the OP's ranges overlap eachother -- I think he's trying to split his ranges into two groups because one group has ends that are very close to each other and the other's are a bit different. Note that the last two rows, the ends are a greater distance away from (but still *within*) the ends of the first group of ranges he's trying to group. It's almost trying to cluster groups of reads together, but again -- not really. Not sure why you might want to do this, but ... just to say that reduce isn't the answer here. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
Hi Steve That's precisely correct. Thanks for rewording it. I am trying to merge all the rows which have a tiling distance of 1 bp or less on both start and end. If a range internal doesnt agree with that then it should not be collapsed. I agree it is somewhat related to clustering but not exactly it. I will mention the data link again in case anyone else want to have look without digging deep in the thread. http://pastebin.com/cccZby2t Thanks for your help guys, -Abhi On Tue, Apr 24, 2012 at 3:54 PM, Steve Lianoglou < mailinglist.honeypot@gmail.com> wrote: > Hi, > > On Tue, Apr 24, 2012 at 6:43 PM, Kasper Daniel Hansen > <kasperdanielhansen@gmail.com> wrote: > >> ir <- IRanges(start = c(1,5), width = 3) > >> ir > > IRanges of length 2 > > start end width > > [1] 1 3 3 > > [2] 5 7 3 > >> reduce(ir) > > IRanges of length 2 > > start end width > > [1] 1 3 3 > > [2] 5 7 3 > >> reduce(ir, min.gapwidth = 2) > > IRanges of length 1 > > start end width > > [1] 1 7 7 > > > > As far as I understand, this is exactly what you want. If you have > > further questions, I suggest you read the help page. > > This actually isn't what he wants -- but what he wants isn't > answerable w/ reduce. > > All of the OP's ranges overlap eachother -- I think he's trying to > split his ranges into two groups because one group has ends that are > very close to each other and the other's are a bit different. Note > that the last two rows, the ends are a greater distance away from (but > still *within*) the ends of the first group of ranges he's trying to > group. > > It's almost trying to cluster groups of reads together, but again -- not > really. > > Not sure why you might want to do this, but ... just to say that > reduce isn't the answer here. > > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > [[alternative HTML version deleted]]
Quick thought. What if you think about doing two reduce-like operations on your data, but you would resize them to width=1, fix="start" and do some findOverlaps mojo with those coords. Then do the same but play with width=1 transformations web you fix at the end. I'm writing from my phone so can't really tease that out, but perhaps you get the direction I'm going? On Tuesday, April 24, 2012, Abhishek Pratap wrote: > Hi Steve > > That's precisely correct. Thanks for rewording it. > > I am trying to merge all the rows which have a tiling distance of 1 bp or > less on both start and end. If a range internal doesnt agree with that > then it should not be collapsed. > > I agree it is somewhat related to clustering but not exactly it. > > I will mention the data link again in case anyone else want to have look > without digging deep in the thread. > http://pastebin.com/cccZby2t > > Thanks for your help guys, > -Abhi > > On Tue, Apr 24, 2012 at 3:54 PM, Steve Lianoglou < > mailinglist.honeypot@gmail.com <javascript:_e({}, 'cvml',=""> 'mailinglist.honeypot@gmail.com');>> wrote: > >> Hi, >> >> On Tue, Apr 24, 2012 at 6:43 PM, Kasper Daniel Hansen >> <kasperdanielhansen@gmail.com <javascript:_e({},="" 'cvml',="">> 'kasperdanielhansen@gmail.com');>> wrote: >> >> ir <- IRanges(start = c(1,5), width = 3) >> >> ir >> > IRanges of length 2 >> > start end width >> > [1] 1 3 3 >> > [2] 5 7 3 >> >> reduce(ir) >> > IRanges of length 2 >> > start end width >> > [1] 1 3 3 >> > [2] 5 7 3 >> >> reduce(ir, min.gapwidth = 2) >> > IRanges of length 1 >> > start end width >> > [1] 1 7 7 >> > >> > As far as I understand, this is exactly what you want. If you have >> > further questions, I suggest you read the help page. >> >> This actually isn't what he wants -- but what he wants isn't >> answerable w/ reduce. >> >> All of the OP's ranges overlap eachother -- I think he's trying to >> split his ranges into two groups because one group has ends that are >> very close to each other and the other's are a bit different. Note >> that the last two rows, the ends are a greater distance away from (but >> still *within*) the ends of the first group of ranges he's trying to >> group. >> >> It's almost trying to cluster groups of reads together, but again -- not >> really. >> >> Not sure why you might want to do this, but ... just to say that >> reduce isn't the answer here. >> >> -steve >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> > > -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact [[alternative HTML version deleted]]
I think Steve is off to a good start. Maybe use flank(x, 1, both = TRUE) for the starts, and then flank(x, 1, start = FALSE, both = TRUE) for the ends. Then do findOverlaps(starts) and findOverlaps(ends). Use intersect(x, y) on the results to get the pairs that match at both the start and end. Then probably need RBGL and connComp function to find the connected components of that graph. Give each connected component a different seqname in a GRanges and call reduce. Maybe. Michael On Tue, Apr 24, 2012 at 4:41 PM, Steve Lianoglou < mailinglist.honeypot@gmail.com> wrote: > Quick thought. > > What if you think about doing two reduce-like operations on your data, but > you would resize them to width=1, fix="start" and do some findOverlaps mojo > with those coords. Then do the same but play with width=1 transformations > web you fix at the end. > > I'm writing from my phone so can't really tease that out, but perhaps you > get the direction I'm going? > > > > On Tuesday, April 24, 2012, Abhishek Pratap wrote: > > > Hi Steve > > > > That's precisely correct. Thanks for rewording it. > > > > I am trying to merge all the rows which have a tiling distance of 1 bp or > > less on both start and end. If a range internal doesnt agree with that > > then it should not be collapsed. > > > > I agree it is somewhat related to clustering but not exactly it. > > > > I will mention the data link again in case anyone else want to have look > > without digging deep in the thread. > > http://pastebin.com/cccZby2t > > > > Thanks for your help guys, > > -Abhi > > > > On Tue, Apr 24, 2012 at 3:54 PM, Steve Lianoglou < > > mailinglist.honeypot@gmail.com <javascript:_e({}, 'cvml',=""> > 'mailinglist.honeypot@gmail.com');>> wrote: > > > >> Hi, > >> > >> On Tue, Apr 24, 2012 at 6:43 PM, Kasper Daniel Hansen > >> <kasperdanielhansen@gmail.com <javascript:_e({},="" 'cvml',=""> >> 'kasperdanielhansen@gmail.com');>> wrote: > >> >> ir <- IRanges(start = c(1,5), width = 3) > >> >> ir > >> > IRanges of length 2 > >> > start end width > >> > [1] 1 3 3 > >> > [2] 5 7 3 > >> >> reduce(ir) > >> > IRanges of length 2 > >> > start end width > >> > [1] 1 3 3 > >> > [2] 5 7 3 > >> >> reduce(ir, min.gapwidth = 2) > >> > IRanges of length 1 > >> > start end width > >> > [1] 1 7 7 > >> > > >> > As far as I understand, this is exactly what you want. If you have > >> > further questions, I suggest you read the help page. > >> > >> This actually isn't what he wants -- but what he wants isn't > >> answerable w/ reduce. > >> > >> All of the OP's ranges overlap eachother -- I think he's trying to > >> split his ranges into two groups because one group has ends that are > >> very close to each other and the other's are a bit different. Note > >> that the last two rows, the ends are a greater distance away from (but > >> still *within*) the ends of the first group of ranges he's trying to > >> group. > >> > >> It's almost trying to cluster groups of reads together, but again -- not > >> really. > >> > >> Not sure why you might want to do this, but ... just to say that > >> reduce isn't the answer here. > >> > >> -steve > >> > >> -- > >> Steve Lianoglou > >> Graduate Student: Computational Systems Biology > >> | Memorial Sloan-Kettering Cancer Center > >> | Weill Medical College of Cornell University > >> Contact Info: http://cbio.mskcc.org/~lianos/contact > >> > > > > > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
Hi Micheal and Steve Thanks for giving me the pointers. Here is what I did(basically steps suggested by you guys). I think I might need some hand holding. Not sure how to find the components common in two findoverlaps output. ir <- IRanges(c(10,10,11,9,10,11), width=c(190,190,190,190,180,180)) start = flank(ir,1,both=TRUE) end=flank(ir,1,start=FALSE,both=TRUE) start_overlaps = findOverlaps(start) end_overlaps = findOverlaps(end) If I do the following I get an error. intersect(start_overlaps,end_overlaps) Error in as.vector(y) : no method for coercing this S4 class to a vector And the following, I belv gives me all the indexes. intersect(as.matrix(findOverlaps(start)),as.matrix(findOverlaps(end))) [1] 1 2 3 4 5 6 As Micheal said I still need to find the connected components, could you please help me with that. Unfortunately I dont have much experience here. Appreciate your time. best, -Abhi On Tue, Apr 24, 2012 at 6:02 PM, Michael Lawrence <lawrence.michael@gene.com> wrote: > I think Steve is off to a good start. Maybe use flank(x, 1, both = TRUE) > for the starts, and then flank(x, 1, start = FALSE, both = TRUE) for the > ends. > > Then do findOverlaps(starts) and findOverlaps(ends). Use intersect(x, y) > on the results to get the pairs that match at both the start and end. > > Then probably need RBGL and connComp function to find the connected > components of that graph. Give each connected component a different seqname > in a GRanges and call reduce. > > Maybe. > > Michael > > On Tue, Apr 24, 2012 at 4:41 PM, Steve Lianoglou < > mailinglist.honeypot@gmail.com> wrote: > >> Quick thought. >> >> What if you think about doing two reduce-like operations on your data, but >> you would resize them to width=1, fix="start" and do some findOverlaps >> mojo >> with those coords. Then do the same but play with width=1 transformations >> web you fix at the end. >> >> I'm writing from my phone so can't really tease that out, but perhaps you >> get the direction I'm going? >> >> >> >> On Tuesday, April 24, 2012, Abhishek Pratap wrote: >> >> > Hi Steve >> > >> > That's precisely correct. Thanks for rewording it. >> > >> > I am trying to merge all the rows which have a tiling distance of 1 bp >> or >> > less on both start and end. If a range internal doesnt agree with that >> > then it should not be collapsed. >> > >> > I agree it is somewhat related to clustering but not exactly it. >> > >> > I will mention the data link again in case anyone else want to have >> look >> > without digging deep in the thread. >> > http://pastebin.com/cccZby2t >> > >> > Thanks for your help guys, >> > -Abhi >> > >> > On Tue, Apr 24, 2012 at 3:54 PM, Steve Lianoglou < >> > mailinglist.honeypot@gmail.com <javascript:_e({}, 'cvml',="">> >> > 'mailinglist.honeypot@gmail.com');>> wrote: >> > >> >> Hi, >> >> >> >> On Tue, Apr 24, 2012 at 6:43 PM, Kasper Daniel Hansen >> >> <kasperdanielhansen@gmail.com <javascript:_e({},="" 'cvml',="">> >> 'kasperdanielhansen@gmail.com');>> wrote: >> >> >> ir <- IRanges(start = c(1,5), width = 3) >> >> >> ir >> >> > IRanges of length 2 >> >> > start end width >> >> > [1] 1 3 3 >> >> > [2] 5 7 3 >> >> >> reduce(ir) >> >> > IRanges of length 2 >> >> > start end width >> >> > [1] 1 3 3 >> >> > [2] 5 7 3 >> >> >> reduce(ir, min.gapwidth = 2) >> >> > IRanges of length 1 >> >> > start end width >> >> > [1] 1 7 7 >> >> > >> >> > As far as I understand, this is exactly what you want. If you have >> >> > further questions, I suggest you read the help page. >> >> >> >> This actually isn't what he wants -- but what he wants isn't >> >> answerable w/ reduce. >> >> >> >> All of the OP's ranges overlap eachother -- I think he's trying to >> >> split his ranges into two groups because one group has ends that are >> >> very close to each other and the other's are a bit different. Note >> >> that the last two rows, the ends are a greater distance away from (but >> >> still *within*) the ends of the first group of ranges he's trying to >> >> group. >> >> >> >> It's almost trying to cluster groups of reads together, but again -- >> not >> >> really. >> >> >> >> Not sure why you might want to do this, but ... just to say that >> >> reduce isn't the answer here. >> >> >> >> -steve >> >> >> >> -- >> >> Steve Lianoglou >> >> Graduate Student: Computational Systems Biology >> >> | Memorial Sloan-Kettering Cancer Center >> >> | Weill Medical College of Cornell University >> >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> >> >> > >> > >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]]
Hi Abhishek, A sessionInfo() would help here. The intersect functionality on Hits objects was added last release. Michael On Wed, Apr 25, 2012 at 11:00 AM, Abhishek Pratap <apratap@lbl.gov> wrote: > Hi Micheal and Steve > > Thanks for giving me the pointers. Here is what I did(basically steps > suggested by you guys). I think I might need some hand holding. Not sure > how to find the components common in two findoverlaps output. > > > ir <- IRanges(c(10,10,11,9,10,11), width=c(190,190,190,190,180,180)) > start = flank(ir,1,both=TRUE) > end=flank(ir,1,start=FALSE,both=TRUE) > start_overlaps = findOverlaps(start) > end_overlaps = findOverlaps(end) > > > If I do the following I get an error. > intersect(start_overlaps,end_overlaps) > Error in as.vector(y) : no method for coercing this S4 class to a vector > > And the following, I belv gives me all the indexes. > intersect(as.matrix(findOverlaps(start)),as.matrix(findOverlaps(end))) > [1] 1 2 3 4 5 6 > > As Micheal said I still need to find the connected components, could you > please help me with that. Unfortunately I dont have much experience here. > Appreciate your time. > > > best, > -Abhi > > > > > > On Tue, Apr 24, 2012 at 6:02 PM, Michael Lawrence < > lawrence.michael@gene.com> wrote: > >> I think Steve is off to a good start. Maybe use flank(x, 1, both = TRUE) >> for the starts, and then flank(x, 1, start = FALSE, both = TRUE) for the >> ends. >> >> Then do findOverlaps(starts) and findOverlaps(ends). Use intersect(x, y) >> on the results to get the pairs that match at both the start and end. >> >> Then probably need RBGL and connComp function to find the connected >> components of that graph. Give each connected component a different seqname >> in a GRanges and call reduce. >> >> Maybe. >> >> Michael >> >> On Tue, Apr 24, 2012 at 4:41 PM, Steve Lianoglou < >> mailinglist.honeypot@gmail.com> wrote: >> >>> Quick thought. >>> >>> What if you think about doing two reduce-like operations on your data, >>> but >>> you would resize them to width=1, fix="start" and do some findOverlaps >>> mojo >>> with those coords. Then do the same but play with width=1 transformations >>> web you fix at the end. >>> >>> I'm writing from my phone so can't really tease that out, but perhaps you >>> get the direction I'm going? >>> >>> >>> >>> On Tuesday, April 24, 2012, Abhishek Pratap wrote: >>> >>> > Hi Steve >>> > >>> > That's precisely correct. Thanks for rewording it. >>> > >>> > I am trying to merge all the rows which have a tiling distance of 1 bp >>> or >>> > less on both start and end. If a range internal doesnt agree with that >>> > then it should not be collapsed. >>> > >>> > I agree it is somewhat related to clustering but not exactly it. >>> > >>> > I will mention the data link again in case anyone else want to have >>> look >>> > without digging deep in the thread. >>> > http://pastebin.com/cccZby2t >>> > >>> > Thanks for your help guys, >>> > -Abhi >>> > >>> > On Tue, Apr 24, 2012 at 3:54 PM, Steve Lianoglou < >>> > mailinglist.honeypot@gmail.com <javascript:_e({}, 'cvml',="">>> >>> > 'mailinglist.honeypot@gmail.com');>> wrote: >>> > >>> >> Hi, >>> >> >>> >> On Tue, Apr 24, 2012 at 6:43 PM, Kasper Daniel Hansen >>> >> <kasperdanielhansen@gmail.com <javascript:_e({},="" 'cvml',="">>> >> 'kasperdanielhansen@gmail.com');>> wrote: >>> >> >> ir <- IRanges(start = c(1,5), width = 3) >>> >> >> ir >>> >> > IRanges of length 2 >>> >> > start end width >>> >> > [1] 1 3 3 >>> >> > [2] 5 7 3 >>> >> >> reduce(ir) >>> >> > IRanges of length 2 >>> >> > start end width >>> >> > [1] 1 3 3 >>> >> > [2] 5 7 3 >>> >> >> reduce(ir, min.gapwidth = 2) >>> >> > IRanges of length 1 >>> >> > start end width >>> >> > [1] 1 7 7 >>> >> > >>> >> > As far as I understand, this is exactly what you want. If you have >>> >> > further questions, I suggest you read the help page. >>> >> >>> >> This actually isn't what he wants -- but what he wants isn't >>> >> answerable w/ reduce. >>> >> >>> >> All of the OP's ranges overlap eachother -- I think he's trying to >>> >> split his ranges into two groups because one group has ends that are >>> >> very close to each other and the other's are a bit different. Note >>> >> that the last two rows, the ends are a greater distance away from (but >>> >> still *within*) the ends of the first group of ranges he's trying to >>> >> group. >>> >> >>> >> It's almost trying to cluster groups of reads together, but again -- >>> not >>> >> really. >>> >> >>> >> Not sure why you might want to do this, but ... just to say that >>> >> reduce isn't the answer here. >>> >> >>> >> -steve >>> >> >>> >> -- >>> >> Steve Lianoglou >>> >> Graduate Student: Computational Systems Biology >>> >> | Memorial Sloan-Kettering Cancer Center >>> >> | Weill Medical College of Cornell University >>> >> Contact Info: http://cbio.mskcc.org/~lianos/contact >>> >> >>> > >>> > >>> >>> -- >>> Steve Lianoglou >>> Graduate Student: Computational Systems Biology >>> | Memorial Sloan-Kettering Cancer Center >>> | Weill Medical College of Cornell University >>> Contact Info: http://cbio.mskcc.org/~lianos/contact >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > [[alternative HTML version deleted]]
Hi Michael SessionInfo copied below. My versions could be one older to current one. I am still wondering how I can get this information in a format that can be digested by connectedComp or something similar. I think we are close to a solution. Thanks! -Abhi > sessionInfo() R version 2.14.1 (2011-12-22) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicRanges_1.6.7 IRanges_1.12.6 loaded via a namespace (and not attached): [1] tcltk_2.14.1 tools_2.14.1 On Wed, Apr 25, 2012 at 2:16 PM, Michael Lawrence <lawrence.michael@gene.com> wrote: > Hi Abhishek, > > A sessionInfo() would help here. The intersect functionality on Hits > objects was added last release. > > Michael > > > On Wed, Apr 25, 2012 at 11:00 AM, Abhishek Pratap <apratap@lbl.gov> wrote: > >> Hi Micheal and Steve >> >> Thanks for giving me the pointers. Here is what I did(basically steps >> suggested by you guys). I think I might need some hand holding. Not sure >> how to find the components common in two findoverlaps output. >> >> >> ir <- IRanges(c(10,10,11,9,10,11), width=c(190,190,190,190,180,180)) >> start = flank(ir,1,both=TRUE) >> end=flank(ir,1,start=FALSE,both=TRUE) >> start_overlaps = findOverlaps(start) >> end_overlaps = findOverlaps(end) >> >> >> If I do the following I get an error. >> intersect(start_overlaps,end_overlaps) >> Error in as.vector(y) : no method for coercing this S4 class to a vector >> >> And the following, I belv gives me all the indexes. >> intersect(as.matrix(findOverlaps(start)),as.matrix(findOverlaps(end))) >> [1] 1 2 3 4 5 6 >> >> As Micheal said I still need to find the connected components, could you >> please help me with that. Unfortunately I dont have much experience here. >> Appreciate your time. >> >> >> best, >> -Abhi >> >> >> >> >> >> On Tue, Apr 24, 2012 at 6:02 PM, Michael Lawrence < >> lawrence.michael@gene.com> wrote: >> >>> I think Steve is off to a good start. Maybe use flank(x, 1, both = TRUE) >>> for the starts, and then flank(x, 1, start = FALSE, both = TRUE) for the >>> ends. >>> >>> Then do findOverlaps(starts) and findOverlaps(ends). Use intersect(x, y) >>> on the results to get the pairs that match at both the start and end. >>> >>> Then probably need RBGL and connComp function to find the connected >>> components of that graph. Give each connected component a different seqname >>> in a GRanges and call reduce. >>> >>> Maybe. >>> >>> Michael >>> >>> On Tue, Apr 24, 2012 at 4:41 PM, Steve Lianoglou < >>> mailinglist.honeypot@gmail.com> wrote: >>> >>>> Quick thought. >>>> >>>> What if you think about doing two reduce-like operations on your data, >>>> but >>>> you would resize them to width=1, fix="start" and do some findOverlaps >>>> mojo >>>> with those coords. Then do the same but play with width=1 >>>> transformations >>>> web you fix at the end. >>>> >>>> I'm writing from my phone so can't really tease that out, but perhaps >>>> you >>>> get the direction I'm going? >>>> >>>> >>>> >>>> On Tuesday, April 24, 2012, Abhishek Pratap wrote: >>>> >>>> > Hi Steve >>>> > >>>> > That's precisely correct. Thanks for rewording it. >>>> > >>>> > I am trying to merge all the rows which have a tiling distance of 1 >>>> bp or >>>> > less on both start and end. If a range internal doesnt agree with >>>> that >>>> > then it should not be collapsed. >>>> > >>>> > I agree it is somewhat related to clustering but not exactly it. >>>> > >>>> > I will mention the data link again in case anyone else want to have >>>> look >>>> > without digging deep in the thread. >>>> > http://pastebin.com/cccZby2t >>>> > >>>> > Thanks for your help guys, >>>> > -Abhi >>>> > >>>> > On Tue, Apr 24, 2012 at 3:54 PM, Steve Lianoglou < >>>> > mailinglist.honeypot@gmail.com <javascript:_e({}, 'cvml',="">>>> >>>> > 'mailinglist.honeypot@gmail.com');>> wrote: >>>> > >>>> >> Hi, >>>> >> >>>> >> On Tue, Apr 24, 2012 at 6:43 PM, Kasper Daniel Hansen >>>> >> <kasperdanielhansen@gmail.com <javascript:_e({},="" 'cvml',="">>>> >> 'kasperdanielhansen@gmail.com');>> wrote: >>>> >> >> ir <- IRanges(start = c(1,5), width = 3) >>>> >> >> ir >>>> >> > IRanges of length 2 >>>> >> > start end width >>>> >> > [1] 1 3 3 >>>> >> > [2] 5 7 3 >>>> >> >> reduce(ir) >>>> >> > IRanges of length 2 >>>> >> > start end width >>>> >> > [1] 1 3 3 >>>> >> > [2] 5 7 3 >>>> >> >> reduce(ir, min.gapwidth = 2) >>>> >> > IRanges of length 1 >>>> >> > start end width >>>> >> > [1] 1 7 7 >>>> >> > >>>> >> > As far as I understand, this is exactly what you want. If you have >>>> >> > further questions, I suggest you read the help page. >>>> >> >>>> >> This actually isn't what he wants -- but what he wants isn't >>>> >> answerable w/ reduce. >>>> >> >>>> >> All of the OP's ranges overlap eachother -- I think he's trying to >>>> >> split his ranges into two groups because one group has ends that are >>>> >> very close to each other and the other's are a bit different. Note >>>> >> that the last two rows, the ends are a greater distance away from >>>> (but >>>> >> still *within*) the ends of the first group of ranges he's trying to >>>> >> group. >>>> >> >>>> >> It's almost trying to cluster groups of reads together, but again -- >>>> not >>>> >> really. >>>> >> >>>> >> Not sure why you might want to do this, but ... just to say that >>>> >> reduce isn't the answer here. >>>> >> >>>> >> -steve >>>> >> >>>> >> -- >>>> >> Steve Lianoglou >>>> >> Graduate Student: Computational Systems Biology >>>> >> | Memorial Sloan-Kettering Cancer Center >>>> >> | Weill Medical College of Cornell University >>>> >> Contact Info: http://cbio.mskcc.org/~lianos/contact >>>> >> >>>> > >>>> > >>>> >>>> -- >>>> Steve Lianoglou >>>> Graduate Student: Computational Systems Biology >>>> | Memorial Sloan-Kettering Cancer Center >>>> | Weill Medical College of Cornell University >>>> Contact Info: http://cbio.mskcc.org/~lianos/contact >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor@r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> >> > [[alternative HTML version deleted]]
Hi, On Wed, Apr 25, 2012 at 5:21 PM, Abhishek Pratap <apratap at="" lbl.gov=""> wrote: > Hi Michael > > SessionInfo copied below. My versions could be one older to current one.? I > am still wondering how I can get this information in a format that can be > digested by connectedComp or something similar. I think we are close to a > solution. Step 1: Upgrade R ;-) It's not necessary for the approach I'm going to suggest, but it'll probably make it easier for Michael to help you w/ his solution, which is probably going to be more robust than the duct-tape-and-elmer's-glue snippet I'm going to try: R> library(GenomicRanges) R> ir <- IRanges(c(10,10,11,9,10,11), width=c(190,190,190,190,180,180)) R> starts <- reduce(resize(ir, width=1, fix='start'), min.gapwidth=2) R> ends <- reduce(resize(ir, width=1, fix='end'), min.gapwidth=2) R> sc <- countOverlaps(ir, starts) R> ec <- countOverlaps(ir, ends) ... and ... good morning: R> split(ir, (paste(sc,ec,sep=":"))) CompressedIRangesList of length 2 $1:1 IRanges of length 2 start end width [1] 10 189 180 [2] 11 190 180$1:2 IRanges of length 4 start end width [1] 10 199 190 [2] 10 199 190 [3] 11 200 190 [4] 9 198 190 HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
Thanks Steve. Legit solution and it works for me too. Based on my partial understanding of methods I have no idea how this will scale for a million points,I have in the actual data(may be it will) but I will let you know. @Michael : I updated my installation and I am able to run the intersect step on the findOverlaps() output from start and end. I guess now I need to convert the common hits to a graph object and call connComp on it. Any way I could convert hits matrix to a adjacency matrix to create a graph or maybe there is another slick way to find the connected points. ir <- IRanges(c(10,10,11,9,10,11), width=c(190,190,190,190,180,180)) start <- flank(ir,1,both=TRUE) end <- flank(ir,1,start=FALSE,both=TRUE) start_overlaps <- findOverlaps(start) end_overlaps <- findOverlaps(end) good_hits <- intersect(start_overlaps,end_overlaps) Thanks! -Abhi On Wed, Apr 25, 2012 at 3:08 PM, Steve Lianoglou < mailinglist.honeypot@gmail.com> wrote: > Hi, > > On Wed, Apr 25, 2012 at 5:21 PM, Abhishek Pratap <apratap@lbl.gov> wrote: > > Hi Michael > > > > SessionInfo copied below. My versions could be one older to current > one. I > > am still wondering how I can get this information in a format that can be > > digested by connectedComp or something similar. I think we are close to a > > solution. > > Step 1: Upgrade R ;-) > > It's not necessary for the approach I'm going to suggest, but it'll > probably make it easier for Michael to help you w/ his solution, which > is probably going to be more robust than the > duct-tape-and-elmer's-glue snippet I'm going to try: > > R> library(GenomicRanges) > R> ir <- IRanges(c(10,10,11,9,10,11), width=c(190,190,190,190,180,180)) > R> starts <- reduce(resize(ir, width=1, fix='start'), min.gapwidth=2) > R> ends <- reduce(resize(ir, width=1, fix='end'), min.gapwidth=2) > R> sc <- countOverlaps(ir, starts) > R> ec <- countOverlaps(ir, ends) > > ... and ... good morning: > > R> split(ir, (paste(sc,ec,sep=":"))) > CompressedIRangesList of length 2 > $1:1 > IRanges of length 2 > start end width > [1] 10 189 180 > [2] 11 190 180 > >$1:2 > IRanges of length 4 > start end width > [1] 10 199 190 > [2] 10 199 190 > [3] 11 200 190 > [4] 9 198 190 > > HTH, > -steve > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > [[alternative HTML version deleted]]
Making an adjacency matrix from a Hits object would be something like: am <- matrix(0, queryLength(hits), subjectLength(hits)) am[as.matrix(hits)] <- 1 Michael On Wed, Apr 25, 2012 at 3:40 PM, Abhishek Pratap <apratap@lbl.gov> wrote: > Thanks Steve. Legit solution and it works for me too. Based on my partial > understanding of methods I have no idea how this will scale for a million > points,I have in the actual data(may be it will) but I will let you know. > > @Michael : I updated my installation and I am able to run the intersect > step on the findOverlaps() output from start and end. > I guess now I need to convert the common hits to a graph object and call > connComp on it. Any way I could convert hits matrix to a adjacency matrix > to create a graph or maybe there is another slick way to find the connected > points. > > ir <- IRanges(c(10,10,11,9,10,11), width=c(190,190,190,190,180,180)) > start <- flank(ir,1,both=TRUE) > end <- flank(ir,1,start=FALSE,both=TRUE) > start_overlaps <- findOverlaps(start) > end_overlaps <- findOverlaps(end) > good_hits <- intersect(start_overlaps,end_overlaps) > > > Thanks! > -Abhi > > > On Wed, Apr 25, 2012 at 3:08 PM, Steve Lianoglou < > mailinglist.honeypot@gmail.com> wrote: > >> Hi, >> >> On Wed, Apr 25, 2012 at 5:21 PM, Abhishek Pratap <apratap@lbl.gov> wrote: >> > Hi Michael >> > >> > SessionInfo copied below. My versions could be one older to current >> one. I >> > am still wondering how I can get this information in a format that can >> be >> > digested by connectedComp or something similar. I think we are close to >> a >> > solution. >> >> Step 1: Upgrade R ;-) >> >> It's not necessary for the approach I'm going to suggest, but it'll >> probably make it easier for Michael to help you w/ his solution, which >> is probably going to be more robust than the >> duct-tape-and-elmer's-glue snippet I'm going to try: >> >> R> library(GenomicRanges) >> R> ir <- IRanges(c(10,10,11,9,10,11), width=c(190,190,190,190,180,180)) >> R> starts <- reduce(resize(ir, width=1, fix='start'), min.gapwidth=2) >> R> ends <- reduce(resize(ir, width=1, fix='end'), min.gapwidth=2) >> R> sc <- countOverlaps(ir, starts) >> R> ec <- countOverlaps(ir, ends) >> >> ... and ... good morning: >> >> R> split(ir, (paste(sc,ec,sep=":"))) >> CompressedIRangesList of length 2 >> $1:1 >> IRanges of length 2 >> start end width >> [1] 10 189 180 >> [2] 11 190 180 >> >>$1:2 >> IRanges of length 4 >> start end width >> [1] 10 199 190 >> [2] 10 199 190 >> [3] 11 200 190 >> [4] 9 198 190 >> >> HTH, >> -steve >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> > > [[alternative HTML version deleted]]
Given my misunderstanding earlier about this task, I should probably be careful about speaking up. But clustering is not going to work on millions of reads. The distance matrix is going to be too big, I think. Depends on the size of the problem. Kasper On Wed, Apr 25, 2012 at 7:14 PM, Michael Lawrence <lawrence.michael at="" gene.com=""> wrote: > Making an adjacency matrix from a Hits object would be something like: > > am <- matrix(0, queryLength(hits), subjectLength(hits)) > am[as.matrix(hits)] <- 1 > > Michael > > On Wed, Apr 25, 2012 at 3:40 PM, Abhishek Pratap <apratap at="" lbl.gov=""> wrote: > >> Thanks Steve. Legit solution and it works for me too. Based on my partial >> understanding of methods I have no idea how this will scale for a million >> points,I have in the actual data(may be it will) but I will let you know. >> >> @Michael : I updated my installation and I am able to run the intersect >> step on the findOverlaps() output from start and end. >> I guess now I need to convert the common hits to a graph object and call >> connComp on it. Any way I could convert hits matrix to a adjacency matrix >> to create a graph or maybe there is another slick way to find the connected >> points. >> >> ir <- IRanges(c(10,10,11,9,10,11), width=c(190,190,190,190,180,180)) >> start ?<- flank(ir,1,both=TRUE) >> end <- flank(ir,1,start=FALSE,both=TRUE) >> start_overlaps <- findOverlaps(start) >> end_overlaps <- findOverlaps(end) >> good_hits <- intersect(start_overlaps,end_overlaps) >> >> >> Thanks! >> -Abhi >> >> >> On Wed, Apr 25, 2012 at 3:08 PM, Steve Lianoglou < >> mailinglist.honeypot at gmail.com> wrote: >> >>> Hi, >>> >>> On Wed, Apr 25, 2012 at 5:21 PM, Abhishek Pratap <apratap at="" lbl.gov=""> wrote: >>> > Hi Michael >>> > >>> > SessionInfo copied below. My versions could be one older to current >>> one. ?I >>> > am still wondering how I can get this information in a format that can >>> be >>> > digested by connectedComp or something similar. I think we are close to >>> a >>> > solution. >>> >>> Step 1: Upgrade R ;-) >>> >>> It's not necessary for the approach I'm going to suggest, but it'll >>> probably make it easier for Michael to help you w/ his solution, which >>> is probably going to be more robust than the >>> duct-tape-and-elmer's-glue snippet I'm going to try: >>> >>> R> library(GenomicRanges) >>> R> ir <- IRanges(c(10,10,11,9,10,11), width=c(190,190,190,190,180,180)) >>> R> starts <- reduce(resize(ir, width=1, fix='start'), min.gapwidth=2) >>> R> ends <- reduce(resize(ir, width=1, fix='end'), min.gapwidth=2) >>> R> sc <- countOverlaps(ir, starts) >>> R> ec <- countOverlaps(ir, ends) >>> >>> ... and ... good morning: >>> >>> R> split(ir, (paste(sc,ec,sep=":"))) >>> CompressedIRangesList of length 2 >>> $1:1 >>> IRanges of length 2 >>> ? ?start end width >>> [1] ? ?10 189 ? 180 >>> [2] ? ?11 190 ? 180 >>> >>>$1:2 >>> IRanges of length 4 >>> ? ?start end width >>> [1] ? ?10 199 ? 190 >>> [2] ? ?10 199 ? 190 >>> [3] ? ?11 200 ? 190 >>> [4] ? ? 9 198 ? 190 >>> >>> HTH, >>> -steve >>> -- >>> Steve Lianoglou >>> Graduate Student: Computational Systems Biology >>> ?| Memorial Sloan-Kettering Cancer Center >>> ?| Weill Medical College of Cornell University >>> Contact Info: http://cbio.mskcc.org/~lianos/contact >>> >> >> > > ? ? ? ?[[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLYlink written 6.1 years ago by Kasper Daniel Hansen6.3k
Hi list, On 04/24/2012 03:54 PM, Steve Lianoglou wrote: > Hi, > > On Tue, Apr 24, 2012 at 6:43 PM, Kasper Daniel Hansen > <kasperdanielhansen at="" gmail.com=""> wrote: >>> ir<- IRanges(start = c(1,5), width = 3) >>> ir >> IRanges of length 2 >> start end width >> [1] 1 3 3 >> [2] 5 7 3 >>> reduce(ir) >> IRanges of length 2 >> start end width >> [1] 1 3 3 >> [2] 5 7 3 >>> reduce(ir, min.gapwidth = 2) >> IRanges of length 1 >> start end width >> [1] 1 7 7 >> >> As far as I understand, this is exactly what you want. If you have >> further questions, I suggest you read the help page. > > This actually isn't what he wants -- but what he wants isn't > answerable w/ reduce. > > All of the OP's ranges overlap eachother -- I think he's trying to > split his ranges into two groups because one group has ends that are > very close to each other and the other's are a bit different. Note > that the last two rows, the ends are a greater distance away from (but > still *within*) the ends of the first group of ranges he's trying to > group. > > It's almost trying to cluster groups of reads together, but again -- not really. The clustering approach looks good to me because it allows reusing existing clustering tools like as.dist(), hclust() and cutree(). It's just a matter of defining the "right" distance between 2 ranges. Here it seems that a suitable distance would be the distance between the 2 starts + the distance between the 2 ends: D(range1, range2) = abs(start(range1) - start(range2)) + abs(end(range1) - end(range2)) Another good candidate would be the euclidian distance if ranges are seen as points in the 2D space but it's a little bit more expensive to compute and I would expect it to give very similar results to D() for clustering. The code is very simple: rangesDist <- function(x) { lx <- length(x) m <- sapply(seq_len(lx), function(i) abs(start(x)[i] - start(x)) + abs(end(x)[i] - end(x)) ) as.dist(m) } d <- rangesDist(ir) hc <- hclust(d) plot(hc) ## Then use cutree() to assign a group id to each original range. ## For example if you want to split in 2 groups: > cutree(hc, k=2) [1] 1 1 1 1 2 2 Would be nice if the "cutting" could be specified in terms of minimum distance between groups but I'm not familiar enough with the existing clustering tools to know how to do that... Cheers, H. > > Not sure why you might want to do this, but ... just to say that > reduce isn't the answer here. > > -steve > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLYlink written 6.1 years ago by Hervé Pagès ♦♦ 13k
0
6.1 years ago by
Abhishek Pratap400 wrote:
Forgot to copy everyone on my last email and also just for completeness I do think that the previous error is likely due to the memory. So reading some more about the memory limits in ?"Memory-limits" I am wondering if any tweaks can be made for me to create a big adjacency matrix. -A On Thu, Apr 26, 2012 at 2:24 PM, Abhishek Pratap <apratap@lbl.gov> wrote: > Hi Michael > > I finally got thru this and got expected results on the test data set. > > On full data till the intersect step it runs well but when I am trying to > create an adjacency matrix in order to create a graph I am getting an error > by the matrix creation object. > > Error in matrix(0, queryLength(good_hits), subjectLength(good_hits)) : > too many elements specified > > queryLength(good_hits) = 900,000 > > I am wondering if there is an efficient way to construct a graph to find > the connected components for a large set of points. > > > > Thanks! > -Abhi > > > > On Wed, Apr 25, 2012 at 4:14 PM, Michael Lawrence < > lawrence.michael@gene.com> wrote: > >> Making an adjacency matrix from a Hits object would be something like: >> >> am <- matrix(0, queryLength(hits), subjectLength(hits)) >> am[as.matrix(hits)] <- 1 >> >> Michael >> >> >> On Wed, Apr 25, 2012 at 3:40 PM, Abhishek Pratap <apratap@lbl.gov> wrote: >> >>> Thanks Steve. Legit solution and it works for me too. Based on my >>> partial understanding of methods I have no idea how this will scale for a >>> million points,I have in the actual data(may be it will) but I will let you >>> know. >>> >>> @Michael : I updated my installation and I am able to run the intersect >>> step on the findOverlaps() output from start and end. >>> I guess now I need to convert the common hits to a graph object and call >>> connComp on it. Any way I could convert hits matrix to a adjacency matrix >>> to create a graph or maybe there is another slick way to find the connected >>> points. >>> >>> ir <- IRanges(c(10,10,11,9,10,11), width=c(190,190,190,190,180,180)) >>> start <- flank(ir,1,both=TRUE) >>> end <- flank(ir,1,start=FALSE,both=TRUE) >>> start_overlaps <- findOverlaps(start) >>> end_overlaps <- findOverlaps(end) >>> good_hits <- intersect(start_overlaps,end_overlaps) >>> >>> >>> Thanks! >>> -Abhi >>> >>> >>> On Wed, Apr 25, 2012 at 3:08 PM, Steve Lianoglou < >>> mailinglist.honeypot@gmail.com> wrote: >>> >>>> Hi, >>>> >>>> On Wed, Apr 25, 2012 at 5:21 PM, Abhishek Pratap <apratap@lbl.gov> >>>> wrote: >>>> > Hi Michael >>>> > >>>> > SessionInfo copied below. My versions could be one older to current >>>> one. I >>>> > am still wondering how I can get this information in a format that >>>> can be >>>> > digested by connectedComp or something similar. I think we are close >>>> to a >>>> > solution. >>>> >>>> Step 1: Upgrade R ;-) >>>> >>>> It's not necessary for the approach I'm going to suggest, but it'll >>>> probably make it easier for Michael to help you w/ his solution, which >>>> is probably going to be more robust than the >>>> duct-tape-and-elmer's-glue snippet I'm going to try: >>>> >>>> R> library(GenomicRanges) >>>> R> ir <- IRanges(c(10,10,11,9,10,11), width=c(190,190,190,190,180,180)) >>>> R> starts <- reduce(resize(ir, width=1, fix='start'), min.gapwidth=2) >>>> R> ends <- reduce(resize(ir, width=1, fix='end'), min.gapwidth=2) >>>> R> sc <- countOverlaps(ir, starts) >>>> R> ec <- countOverlaps(ir, ends) >>>> >>>> ... and ... good morning: >>>> >>>> R> split(ir, (paste(sc,ec,sep=":"))) >>>> CompressedIRangesList of length 2 >>>> $1:1 >>>> IRanges of length 2 >>>> start end width >>>> [1] 10 189 180 >>>> [2] 11 190 180 >>>> >>>>$1:2 >>>> IRanges of length 4 >>>> start end width >>>> [1] 10 199 190 >>>> [2] 10 199 190 >>>> [3] 11 200 190 >>>> [4] 9 198 190 >>>> >>>> HTH, >>>> -steve >>>> -- >>>> Steve Lianoglou >>>> Graduate Student: Computational Systems Biology >>>> | Memorial Sloan-Kettering Cancer Center >>>> | Weill Medical College of Cornell University >>>> Contact Info: http://cbio.mskcc.org/~lianos/contact >>>> >>> >>> >> > [[alternative HTML version deleted]]
I think you're better off using the NEL (Nodes Edge List) graph representation. See the documentation in the graph package. Michael On Thu, Apr 26, 2012 at 2:52 PM, Abhishek Pratap <apratap@lbl.gov> wrote: > Forgot to copy everyone on my last email and also just for completeness I > do think that the previous error is likely due to the memory. So reading > some more about the memory limits in ?"Memory-limits" I am wondering if > any tweaks can be made for me to create a big adjacency matrix. > > -A > > > On Thu, Apr 26, 2012 at 2:24 PM, Abhishek Pratap <apratap@lbl.gov> wrote: > >> Hi Michael >> >> I finally got thru this and got expected results on the test data set. >> >> On full data till the intersect step it runs well but when I am trying to >> create an adjacency matrix in order to create a graph I am getting an error >> by the matrix creation object. >> >> Error in matrix(0, queryLength(good_hits), subjectLength(good_hits)) : >> too many elements specified >> >> queryLength(good_hits) = 900,000 >> >> I am wondering if there is an efficient way to construct a graph to find >> the connected components for a large set of points. >> >> >> >> Thanks! >> -Abhi >> >> >> >> On Wed, Apr 25, 2012 at 4:14 PM, Michael Lawrence < >> lawrence.michael@gene.com> wrote: >> >>> Making an adjacency matrix from a Hits object would be something like: >>> >>> am <- matrix(0, queryLength(hits), subjectLength(hits)) >>> am[as.matrix(hits)] <- 1 >>> >>> Michael >>> >>> >>> On Wed, Apr 25, 2012 at 3:40 PM, Abhishek Pratap <apratap@lbl.gov>wrote: >>> >>>> Thanks Steve. Legit solution and it works for me too. Based on my >>>> partial understanding of methods I have no idea how this will scale for a >>>> million points,I have in the actual data(may be it will) but I will let you >>>> know. >>>> >>>> @Michael : I updated my installation and I am able to run the intersect >>>> step on the findOverlaps() output from start and end. >>>> I guess now I need to convert the common hits to a graph object and >>>> call connComp on it. Any way I could convert hits matrix to a adjacency >>>> matrix to create a graph or maybe there is another slick way to find the >>>> connected points. >>>> >>>> ir <- IRanges(c(10,10,11,9,10,11), width=c(190,190,190,190,180,180)) >>>> start <- flank(ir,1,both=TRUE) >>>> end <- flank(ir,1,start=FALSE,both=TRUE) >>>> start_overlaps <- findOverlaps(start) >>>> end_overlaps <- findOverlaps(end) >>>> good_hits <- intersect(start_overlaps,end_overlaps) >>>> >>>> >>>> Thanks! >>>> -Abhi >>>> >>>> >>>> On Wed, Apr 25, 2012 at 3:08 PM, Steve Lianoglou < >>>> mailinglist.honeypot@gmail.com> wrote: >>>> >>>>> Hi, >>>>> >>>>> On Wed, Apr 25, 2012 at 5:21 PM, Abhishek Pratap <apratap@lbl.gov> >>>>> wrote: >>>>> > Hi Michael >>>>> > >>>>> > SessionInfo copied below. My versions could be one older to current >>>>> one. I >>>>> > am still wondering how I can get this information in a format that >>>>> can be >>>>> > digested by connectedComp or something similar. I think we are close >>>>> to a >>>>> > solution. >>>>> >>>>> Step 1: Upgrade R ;-) >>>>> >>>>> It's not necessary for the approach I'm going to suggest, but it'll >>>>> probably make it easier for Michael to help you w/ his solution, which >>>>> is probably going to be more robust than the >>>>> duct-tape-and-elmer's-glue snippet I'm going to try: >>>>> >>>>> R> library(GenomicRanges) >>>>> R> ir <- IRanges(c(10,10,11,9,10,11), width=c(190,190,190,190,180,180)) >>>>> R> starts <- reduce(resize(ir, width=1, fix='start'), min.gapwidth=2) >>>>> R> ends <- reduce(resize(ir, width=1, fix='end'), min.gapwidth=2) >>>>> R> sc <- countOverlaps(ir, starts) >>>>> R> ec <- countOverlaps(ir, ends) >>>>> >>>>> ... and ... good morning: >>>>> >>>>> R> split(ir, (paste(sc,ec,sep=":"))) >>>>> CompressedIRangesList of length 2 >>>>> $1:1 >>>>> IRanges of length 2 >>>>> start end width >>>>> [1] 10 189 180 >>>>> [2] 11 190 180 >>>>> >>>>>$1:2 >>>>> IRanges of length 4 >>>>> start end width >>>>> [1] 10 199 190 >>>>> [2] 10 199 190 >>>>> [3] 11 200 190 >>>>> [4] 9 198 190 >>>>> >>>>> HTH, >>>>> -steve >>>>> -- >>>>> Steve Lianoglou >>>>> Graduate Student: Computational Systems Biology >>>>> | Memorial Sloan-Kettering Cancer Center >>>>> | Weill Medical College of Cornell University >>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact >>>>> >>>> >>>> >>> >> > [[alternative HTML version deleted]]
can you use an adjacency list (i.e. a graphNEL type structure) instead of an adjacency matrix for this purpose? the former can be orders of magnitude smaller than the latter and the method of constructiong/adding weights might be quite similar. On Thu, Apr 26, 2012 at 2:52 PM, Abhishek Pratap <apratap@lbl.gov> wrote: > Forgot to copy everyone on my last email and also just for completeness I > do think that the previous error is likely due to the memory. So reading > some more about the memory limits in ?"Memory-limits" I am wondering if > any tweaks can be made for me to create a big adjacency matrix. > > -A > > On Thu, Apr 26, 2012 at 2:24 PM, Abhishek Pratap <apratap@lbl.gov> wrote: > > > Hi Michael > > > > I finally got thru this and got expected results on the test data set. > > > > On full data till the intersect step it runs well but when I am trying to > > create an adjacency matrix in order to create a graph I am getting an > error > > by the matrix creation object. > > > > Error in matrix(0, queryLength(good_hits), subjectLength(good_hits)) : > > too many elements specified > > > > queryLength(good_hits) = 900,000 > > > > I am wondering if there is an efficient way to construct a graph to find > > the connected components for a large set of points. > > > > > > > > Thanks! > > -Abhi > > > > > > > > On Wed, Apr 25, 2012 at 4:14 PM, Michael Lawrence < > > lawrence.michael@gene.com> wrote: > > > >> Making an adjacency matrix from a Hits object would be something like: > >> > >> am <- matrix(0, queryLength(hits), subjectLength(hits)) > >> am[as.matrix(hits)] <- 1 > >> > >> Michael > >> > >> > >> On Wed, Apr 25, 2012 at 3:40 PM, Abhishek Pratap <apratap@lbl.gov> > wrote: > >> > >>> Thanks Steve. Legit solution and it works for me too. Based on my > >>> partial understanding of methods I have no idea how this will scale > for a > >>> million points,I have in the actual data(may be it will) but I will > let you > >>> know. > >>> > >>> @Michael : I updated my installation and I am able to run the intersect > >>> step on the findOverlaps() output from start and end. > >>> I guess now I need to convert the common hits to a graph object and > call > >>> connComp on it. Any way I could convert hits matrix to a adjacency > matrix > >>> to create a graph or maybe there is another slick way to find the > connected > >>> points. > >>> > >>> ir <- IRanges(c(10,10,11,9,10,11), width=c(190,190,190,190,180,180)) > >>> start <- flank(ir,1,both=TRUE) > >>> end <- flank(ir,1,start=FALSE,both=TRUE) > >>> start_overlaps <- findOverlaps(start) > >>> end_overlaps <- findOverlaps(end) > >>> good_hits <- intersect(start_overlaps,end_overlaps) > >>> > >>> > >>> Thanks! > >>> -Abhi > >>> > >>> > >>> On Wed, Apr 25, 2012 at 3:08 PM, Steve Lianoglou < > >>> mailinglist.honeypot@gmail.com> wrote: > >>> > >>>> Hi, > >>>> > >>>> On Wed, Apr 25, 2012 at 5:21 PM, Abhishek Pratap <apratap@lbl.gov> > >>>> wrote: > >>>> > Hi Michael > >>>> > > >>>> > SessionInfo copied below. My versions could be one older to current > >>>> one. I > >>>> > am still wondering how I can get this information in a format that > >>>> can be > >>>> > digested by connectedComp or something similar. I think we are close > >>>> to a > >>>> > solution. > >>>> > >>>> Step 1: Upgrade R ;-) > >>>> > >>>> It's not necessary for the approach I'm going to suggest, but it'll > >>>> probably make it easier for Michael to help you w/ his solution, which > >>>> is probably going to be more robust than the > >>>> duct-tape-and-elmer's-glue snippet I'm going to try: > >>>> > >>>> R> library(GenomicRanges) > >>>> R> ir <- IRanges(c(10,10,11,9,10,11), > width=c(190,190,190,190,180,180)) > >>>> R> starts <- reduce(resize(ir, width=1, fix='start'), min.gapwidth=2) > >>>> R> ends <- reduce(resize(ir, width=1, fix='end'), min.gapwidth=2) > >>>> R> sc <- countOverlaps(ir, starts) > >>>> R> ec <- countOverlaps(ir, ends) > >>>> > >>>> ... and ... good morning: > >>>> > >>>> R> split(ir, (paste(sc,ec,sep=":"))) > >>>> CompressedIRangesList of length 2 > >>>> $1:1 > >>>> IRanges of length 2 > >>>> start end width > >>>> [1] 10 189 180 > >>>> [2] 11 190 180 > >>>> > >>>>$1:2 > >>>> IRanges of length 4 > >>>> start end width > >>>> [1] 10 199 190 > >>>> [2] 10 199 190 > >>>> [3] 11 200 190 > >>>> [4] 9 198 190 > >>>> > >>>> HTH, > >>>> -steve > >>>> -- > >>>> Steve Lianoglou > >>>> Graduate Student: Computational Systems Biology > >>>> | Memorial Sloan-Kettering Cancer Center > >>>> | Weill Medical College of Cornell University > >>>> Contact Info: http://cbio.mskcc.org/~lianos/contact > >>>> > >>> > >>> > >> > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]
Thanks Tim and Michael. This is great ..for a moment I thought I might be stuck with this big data set due to memory. So I am now trying to make a graphNEL class object and having trouble with creating a edgeList. I have made two unsucessfull attempts copied below. I am copying the full test code for completeness. I belv I need to supply graphNEL constructor uniq nodes names and a named list of node to edges and getting stuck with the later part. Appreciate you kind help so far, -Abhi ------------------ Code below: ------------------ ##test data start=c(10,10,11,9,10,11,10,10,10) end=c(200,200,201,199,190,191,149,150,153) ir <- IRanges(start,end) #flank starts starts <- flank(ir,overlap,start=TRUE,both=TRUE) #flank ends ends <- flank(ir,overlap,start=FALSE,both=TRUE) ##find the overlap regions starts_overlaps <- findOverlaps(starts) ends_overlaps <- findOverlaps(ends) #common poins that are contiguous at both start and end good_hits <- intersect(starts_overlaps,ends_overlaps) ##Attempt 1 #nodes nodes = ( as.character( queryHits( good_hits ) ) ) #nodes = unique( as.character( queryHits( good_hits ) ) ) probably I need this #creating a edge list edl <- vector( "list", length=length(nodes) ) names(edl) = nodes #filing the edge list with edges for each node for ( i in 1:length(nodes)){ edl[[i]] <- list(edges=edges[[i]]) } >>>Error gr <- new ("graphNEL", nodes=nodes, edgeL=edl, edgemode="directed") Node names may not be duplicated Error in validObject(.Object) : invalid class "graphNEL" object: FALSE ##Attempt 2 edges = ( as.character( subjectHits( good_hits ) ) ) tapply(edges,nodes,function(x) (edges=unique(x)) ) On Thu, Apr 26, 2012 at 3:37 PM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > can you use an adjacency list (i.e. a graphNEL type structure) instead of > an adjacency matrix for this purpose? > > the former can be orders of magnitude smaller than the latter and the > method of constructiong/adding weights might be quite similar. > > > > On Thu, Apr 26, 2012 at 2:52 PM, Abhishek Pratap <apratap@lbl.gov> wrote: > >> Forgot to copy everyone on my last email and also just for completeness I >> do think that the previous error is likely due to the memory. So reading >> some more about the memory limits in ?"Memory-limits" I am wondering if >> any tweaks can be made for me to create a big adjacency matrix. >> >> -A >> >> On Thu, Apr 26, 2012 at 2:24 PM, Abhishek Pratap <apratap@lbl.gov> wrote: >> >> > Hi Michael >> > >> > I finally got thru this and got expected results on the test data set. >> > >> > On full data till the intersect step it runs well but when I am trying >> to >> > create an adjacency matrix in order to create a graph I am getting an >> error >> > by the matrix creation object. >> > >> > Error in matrix(0, queryLength(good_hits), subjectLength(good_hits)) : >> > too many elements specified >> > >> > queryLength(good_hits) = 900,000 >> > >> > I am wondering if there is an efficient way to construct a graph to find >> > the connected components for a large set of points. >> > >> > >> > >> > Thanks! >> > -Abhi >> > >> > >> > >> > On Wed, Apr 25, 2012 at 4:14 PM, Michael Lawrence < >> > lawrence.michael@gene.com> wrote: >> > >> >> Making an adjacency matrix from a Hits object would be something like: >> >> >> >> am <- matrix(0, queryLength(hits), subjectLength(hits)) >> >> am[as.matrix(hits)] <- 1 >> >> >> >> Michael >> >> >> >> >> >> On Wed, Apr 25, 2012 at 3:40 PM, Abhishek Pratap <apratap@lbl.gov> >> wrote: >> >> >> >>> Thanks Steve. Legit solution and it works for me too. Based on my >> >>> partial understanding of methods I have no idea how this will scale >> for a >> >>> million points,I have in the actual data(may be it will) but I will >> let you >> >>> know. >> >>> >> >>> @Michael : I updated my installation and I am able to run the >> intersect >> >>> step on the findOverlaps() output from start and end. >> >>> I guess now I need to convert the common hits to a graph object and >> call >> >>> connComp on it. Any way I could convert hits matrix to a adjacency >> matrix >> >>> to create a graph or maybe there is another slick way to find the >> connected >> >>> points. >> >>> >> >>> ir <- IRanges(c(10,10,11,9,10,11), width=c(190,190,190,190,180,180)) >> >>> start <- flank(ir,1,both=TRUE) >> >>> end <- flank(ir,1,start=FALSE,both=TRUE) >> >>> start_overlaps <- findOverlaps(start) >> >>> end_overlaps <- findOverlaps(end) >> >>> good_hits <- intersect(start_overlaps,end_overlaps) >> >>> >> >>> >> >>> Thanks! >> >>> -Abhi >> >>> >> >>> >> >>> On Wed, Apr 25, 2012 at 3:08 PM, Steve Lianoglou < >> >>> mailinglist.honeypot@gmail.com> wrote: >> >>> >> >>>> Hi, >> >>>> >> >>>> On Wed, Apr 25, 2012 at 5:21 PM, Abhishek Pratap <apratap@lbl.gov> >> >>>> wrote: >> >>>> > Hi Michael >> >>>> > >> >>>> > SessionInfo copied below. My versions could be one older to current >> >>>> one. I >> >>>> > am still wondering how I can get this information in a format that >> >>>> can be >> >>>> > digested by connectedComp or something similar. I think we are >> close >> >>>> to a >> >>>> > solution. >> >>>> >> >>>> Step 1: Upgrade R ;-) >> >>>> >> >>>> It's not necessary for the approach I'm going to suggest, but it'll >> >>>> probably make it easier for Michael to help you w/ his solution, >> which >> >>>> is probably going to be more robust than the >> >>>> duct-tape-and-elmer's-glue snippet I'm going to try: >> >>>> >> >>>> R> library(GenomicRanges) >> >>>> R> ir <- IRanges(c(10,10,11,9,10,11), >> width=c(190,190,190,190,180,180)) >> >>>> R> starts <- reduce(resize(ir, width=1, fix='start'), min.gapwidth=2) >> >>>> R> ends <- reduce(resize(ir, width=1, fix='end'), min.gapwidth=2) >> >>>> R> sc <- countOverlaps(ir, starts) >> >>>> R> ec <- countOverlaps(ir, ends) >> >>>> >> >>>> ... and ... good morning: >> >>>> >> >>>> R> split(ir, (paste(sc,ec,sep=":"))) >> >>>> CompressedIRangesList of length 2 >> >>>> $1:1 >> >>>> IRanges of length 2 >> >>>> start end width >> >>>> [1] 10 189 180 >> >>>> [2] 11 190 180 >> >>>> >> >>>>$1:2 >> >>>> IRanges of length 4 >> >>>> start end width >> >>>> [1] 10 199 190 >> >>>> [2] 10 199 190 >> >>>> [3] 11 200 190 >> >>>> [4] 9 198 190 >> >>>> >> >>>> HTH, >> >>>> -steve >> >>>> -- >> >>>> Steve Lianoglou >> >>>> Graduate Student: Computational Systems Biology >> >>>> | Memorial Sloan-Kettering Cancer Center >> >>>> | Weill Medical College of Cornell University >> >>>> Contact Info: http://cbio.mskcc.org/~lianos/contact >> >>>> >> >>> >> >>> >> >> >> > >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > > -- > *A model is a lie that helps you see the truth.* > * > * > Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > > [[alternative HTML version deleted]]
I think it's just something like: edgeL <- split(queryHits(good_hits), subjectHits(good_hits)) nodes <- names(edgeL) gr <- new ("graphNEL", nodes = nodes, edgeL=edgeL, edgemode="directed") Michael On Thu, Apr 26, 2012 at 5:34 PM, Abhishek Pratap <apratap@lbl.gov> wrote: > Thanks Tim and Michael. This is great ..for a moment I thought I might be > stuck with this big data set due to memory. > > So I am now trying to make a graphNEL class object and having trouble with > creating a edgeList. I have made two unsucessfull attempts copied below. I > am copying the full test code for completeness. I belv I need to supply > graphNEL constructor uniq nodes names and a named list of node to edges and > getting stuck with the later part. > > > Appreciate you kind help so far, > -Abhi > > ------------------ > Code below: > ------------------ > > ##test data > start=c(10,10,11,9,10,11,10,10,10) > end=c(200,200,201,199,190,191,149,150,153) > ir <- IRanges(start,end) > > #flank starts > starts <- flank(ir,overlap,start=TRUE,both=TRUE) > > #flank ends > ends <- flank(ir,overlap,start=FALSE,both=TRUE) > > > ##find the overlap regions > starts_overlaps <- findOverlaps(starts) > ends_overlaps <- findOverlaps(ends) > > #common poins that are contiguous at both start and end > good_hits <- intersect(starts_overlaps,ends_overlaps) > > > ##Attempt 1 > > #nodes > nodes = ( as.character( queryHits( good_hits ) ) ) > #nodes = unique( as.character( queryHits( good_hits ) ) ) probably I need > this > > #creating a edge list > edl <- vector( "list", length=length(nodes) ) > names(edl) = nodes > > #filing the edge list with edges for each node > for ( i in 1:length(nodes)){ > edl[[i]] <- list(edges=edges[[i]]) > } > > >>>Error > gr <- new ("graphNEL", nodes=nodes, edgeL=edl, edgemode="directed") > Node names may not be duplicated > Error in validObject(.Object) : invalid class "graphNEL" object: FALSE > > > > ##Attempt 2 > edges = ( as.character( subjectHits( good_hits ) ) ) > tapply(edges,nodes,function(x) (edges=unique(x)) ) > > > > > > > > > > > On Thu, Apr 26, 2012 at 3:37 PM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > >> can you use an adjacency list (i.e. a graphNEL type structure) instead of >> an adjacency matrix for this purpose? >> >> the former can be orders of magnitude smaller than the latter and the >> method of constructiong/adding weights might be quite similar. >> >> >> >> On Thu, Apr 26, 2012 at 2:52 PM, Abhishek Pratap <apratap@lbl.gov> wrote: >> >>> Forgot to copy everyone on my last email and also just for completeness I >>> do think that the previous error is likely due to the memory. So reading >>> some more about the memory limits in ?"Memory-limits" I am wondering if >>> any tweaks can be made for me to create a big adjacency matrix. >>> >>> -A >>> >>> On Thu, Apr 26, 2012 at 2:24 PM, Abhishek Pratap <apratap@lbl.gov> >>> wrote: >>> >>> > Hi Michael >>> > >>> > I finally got thru this and got expected results on the test data set. >>> > >>> > On full data till the intersect step it runs well but when I am trying >>> to >>> > create an adjacency matrix in order to create a graph I am getting an >>> error >>> > by the matrix creation object. >>> > >>> > Error in matrix(0, queryLength(good_hits), subjectLength(good_hits)) : >>> > too many elements specified >>> > >>> > queryLength(good_hits) = 900,000 >>> > >>> > I am wondering if there is an efficient way to construct a graph to >>> find >>> > the connected components for a large set of points. >>> > >>> > >>> > >>> > Thanks! >>> > -Abhi >>> > >>> > >>> > >>> > On Wed, Apr 25, 2012 at 4:14 PM, Michael Lawrence < >>> > lawrence.michael@gene.com> wrote: >>> > >>> >> Making an adjacency matrix from a Hits object would be something like: >>> >> >>> >> am <- matrix(0, queryLength(hits), subjectLength(hits)) >>> >> am[as.matrix(hits)] <- 1 >>> >> >>> >> Michael >>> >> >>> >> >>> >> On Wed, Apr 25, 2012 at 3:40 PM, Abhishek Pratap <apratap@lbl.gov> >>> wrote: >>> >> >>> >>> Thanks Steve. Legit solution and it works for me too. Based on my >>> >>> partial understanding of methods I have no idea how this will scale >>> for a >>> >>> million points,I have in the actual data(may be it will) but I will >>> let you >>> >>> know. >>> >>> >>> >>> @Michael : I updated my installation and I am able to run the >>> intersect >>> >>> step on the findOverlaps() output from start and end. >>> >>> I guess now I need to convert the common hits to a graph object and >>> call >>> >>> connComp on it. Any way I could convert hits matrix to a adjacency >>> matrix >>> >>> to create a graph or maybe there is another slick way to find the >>> connected >>> >>> points. >>> >>> >>> >>> ir <- IRanges(c(10,10,11,9,10,11), width=c(190,190,190,190,180,180)) >>> >>> start <- flank(ir,1,both=TRUE) >>> >>> end <- flank(ir,1,start=FALSE,both=TRUE) >>> >>> start_overlaps <- findOverlaps(start) >>> >>> end_overlaps <- findOverlaps(end) >>> >>> good_hits <- intersect(start_overlaps,end_overlaps) >>> >>> >>> >>> >>> >>> Thanks! >>> >>> -Abhi >>> >>> >>> >>> >>> >>> On Wed, Apr 25, 2012 at 3:08 PM, Steve Lianoglou < >>> >>> mailinglist.honeypot@gmail.com> wrote: >>> >>> >>> >>>> Hi, >>> >>>> >>> >>>> On Wed, Apr 25, 2012 at 5:21 PM, Abhishek Pratap <apratap@lbl.gov> >>> >>>> wrote: >>> >>>> > Hi Michael >>> >>>> > >>> >>>> > SessionInfo copied below. My versions could be one older to >>> current >>> >>>> one. I >>> >>>> > am still wondering how I can get this information in a format that >>> >>>> can be >>> >>>> > digested by connectedComp or something similar. I think we are >>> close >>> >>>> to a >>> >>>> > solution. >>> >>>> >>> >>>> Step 1: Upgrade R ;-) >>> >>>> >>> >>>> It's not necessary for the approach I'm going to suggest, but it'll >>> >>>> probably make it easier for Michael to help you w/ his solution, >>> which >>> >>>> is probably going to be more robust than the >>> >>>> duct-tape-and-elmer's-glue snippet I'm going to try: >>> >>>> >>> >>>> R> library(GenomicRanges) >>> >>>> R> ir <- IRanges(c(10,10,11,9,10,11), >>> width=c(190,190,190,190,180,180)) >>> >>>> R> starts <- reduce(resize(ir, width=1, fix='start'), >>> min.gapwidth=2) >>> >>>> R> ends <- reduce(resize(ir, width=1, fix='end'), min.gapwidth=2) >>> >>>> R> sc <- countOverlaps(ir, starts) >>> >>>> R> ec <- countOverlaps(ir, ends) >>> >>>> >>> >>>> ... and ... good morning: >>> >>>> >>> >>>> R> split(ir, (paste(sc,ec,sep=":"))) >>> >>>> CompressedIRangesList of length 2 >>> >>>> $1:1 >>> >>>> IRanges of length 2 >>> >>>> start end width >>> >>>> [1] 10 189 180 >>> >>>> [2] 11 190 180 >>> >>>> >>> >>>>$1:2 >>> >>>> IRanges of length 4 >>> >>>> start end width >>> >>>> [1] 10 199 190 >>> >>>> [2] 10 199 190 >>> >>>> [3] 11 200 190 >>> >>>> [4] 9 198 190 >>> >>>> >>> >>>> HTH, >>> >>>> -steve >>> >>>> -- >>> >>>> Steve Lianoglou >>> >>>> Graduate Student: Computational Systems Biology >>> >>>> | Memorial Sloan-Kettering Cancer Center >>> >>>> | Weill Medical College of Cornell University >>> >>>> Contact Info: http://cbio.mskcc.org/~lianos/contact >>> >>>> >>> >>> >>> >>> >>> >> >>> > >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> >> >> -- >> *A model is a lie that helps you see the truth.* >> * >> * >> Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >> >> > [[alternative HTML version deleted]]