Restriction enzyme digestion
2
0
Entering edit mode
Anh Tran ▴ 90
@anh-tran-2795
Last seen 9.7 years ago
Hi all, I'm new to this mailing list and have a very simple question about digestion with restriction enzyme for the whole genome. I'm using package BSgenome.Hsapiens.UCSC.hg18 to find the cut site of enzyme. Is there a faster way (ie. pre-built package) to get the fragment sequence and it's location in UCSC genome browser format (Chr#: start bp - end bp). I'm planning on finding the restriction site location, then find the substring between two consecutive cut sites. Is there any faster way to do this? Here's the code that I'm thinking of: library(BSgenome.Hsapiens.UCSC.hg18) c1<-Hsapiens$chr1 m<-matchPatterns('CTAG', c1) #And the for loop reslt<-NULL for (i in 1:(length(m)-1)) { reslt<-rbind(reslt, cbind(start(m[i])+1, start(m[i+1])-3, getSeq(Hsapiens, 'chr1', start(m[i])+2, start(m[i+1])-2))) } It seems like this code is no where near perfect. Please give me some input. Thanks -- Regards, Anh Tran [[alternative HTML version deleted]]
BSgenome BSgenome BSgenome BSgenome • 2.2k views
ADD COMMENT
0
Entering edit mode
Kort, Eric ▴ 220
@kort-eric-1483
Last seen 6.2 years ago
United States
Anh Tran writes: > Hi all, > > I'm new to this mailing list and have a very simple question > about digestion with restriction enzyme for the whole genome. > > I'm using package BSgenome.Hsapiens.UCSC.hg18 to find the cut > site of enzyme. Is there a faster way (ie. pre-built package) > to get the fragment sequence and it's location in UCSC genome > browser format (Chr#: start bp - end bp). > > I'm planning on finding the restriction site location, then > find the substring between two consecutive cut sites. > getSeq accepts vectors of start and stop sites, so you can try this: library(Biostrings) library(BSgenome.Hsapiens.UCSC.hg18) c1<-Hsapiens$chr1 m<-matchPattern('CTAG', c1) starts <- m at start[1:length(m at start)-1] ends <- m at start[2:length(m at start)] seqs <- getSeq(Hsapiens, 'chr1', starts, ends) Which works great for a "few" sequences, but stalls on my windows box with this many sequences due to memory constraints. While Biostrings is fantastic (for example, the matchPattern above takes only a few seconds), it does come at a cost (IMHO) in terms of memory overhead (the authors know this, and explicitly state that retrieving this many sequences at once is not likely to work). But you might be able to loop through in chunks and rbind together. An alternative may be to use biomaRt's getSequence in mysql mode (requires mySql libraries to be installed and the RMySQL package). I am not sure if you can send multiple sites, though (and I can't test it at the moment due to firewall issues), but you could try: mart <- useMart("ensembl",dataset="hsapiens_gene_ensembl", mysql=TRUE) getSequence(chromosome="1", start=starts, end=ends, mart=ensembl) <driftingofftopic> An unpublished alternative is to use a new library I have concocted which will suck sequences directly out of a 2bit file. You have to download the human genome .2bit file (800Mb) from here: ftp://hgdownload.cse.ucsc.edu/gbdb/hg18/hg18.2bit And then, if you had my library installed (which I can send to you if you want it), you could do this: library(r2bit) chr <- rep("chr1", length(starts)) seq <- fetch2bit("/path/to/hg18.2bit", chr, starts, ends) On my box, this takes 35 seconds to retrieve the 623,302 sequences (as character strings, not at Biostrings objects) from my local copy of hg18.2bit. Personally, as we move into the massively high throughput sequencing era, I wonder if this isn't a good alternative (i.e., coupling biomaRt's ability to quickly figure out what sequences are of interest, and then retrieving those sequences quickly out of a local compressed file) </driftingofftopic> Cheers, Eric > Is there any faster way to do this? > > Here's the code that I'm thinking of: > > library(BSgenome.Hsapiens.UCSC.hg18) > c1<-Hsapiens$chr1 > m<-matchPatterns('CTAG', c1) > > #And the for loop > reslt<-NULL > for (i in 1:(length(m)-1)) { > reslt<-rbind(reslt, cbind(start(m[i])+1, start(m[i+1])-3, > getSeq(Hsapiens, 'chr1', start(m[i])+2, start(m[i+1])-2))) } > > It seems like this code is no where near perfect. Please give > me some input. > > Thanks > > -- > Regards, > Anh Tran > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > This email message, including any attachments, is for th...{{dropped:6}}
ADD COMMENT
0
Entering edit mode
Hi Eric, Thanks so much for your fast reply. Your library is great. It's exactly what I need actually (2bit file). I want to digest the genome and import the fragments into MySQL for reference.Other analysis will be done through RMySQL and it's base index. If you can send me your library, it would be great. Thanks from UCLA, Neurology Research Lab. Anh Tran On Mon, May 12, 2008 at 1:15 PM, Kort, Eric <eric.kort@vai.org> wrote: > Anh Tran writes: > > > Hi all, > > > > I'm new to this mailing list and have a very simple question > > about digestion with restriction enzyme for the whole genome. > > > > I'm using package BSgenome.Hsapiens.UCSC.hg18 to find the cut > > site of enzyme. Is there a faster way (ie. pre-built package) > > to get the fragment sequence and it's location in UCSC genome > > browser format (Chr#: start bp - end bp). > > > > > I'm planning on finding the restriction site location, then > > find the substring between two consecutive cut sites. > > > > getSeq accepts vectors of start and stop sites, so you can try this: > > library(Biostrings) > library(BSgenome.Hsapiens.UCSC.hg18) > c1<-Hsapiens$chr1 > m<-matchPattern('CTAG', c1) > starts <- m@start[1:length(m@start)-1] > ends <- m@start[2:length(m@start)] > seqs <- getSeq(Hsapiens, 'chr1', starts, ends) > > Which works great for a "few" sequences, but stalls on my windows box with > this many sequences due to memory constraints. While Biostrings is > fantastic (for example, the matchPattern above takes only a few seconds), it > does come at a cost (IMHO) in terms of memory overhead (the authors know > this, and explicitly state that retrieving this many sequences at once is > not likely to work). But you might be able to loop through in chunks and > rbind together. > > An alternative may be to use biomaRt's getSequence in mysql mode (requires > mySql libraries to be installed and the RMySQL package). I am not sure if > you can send multiple sites, though (and I can't test it at the moment due > to firewall issues), but you could try: > > mart <- useMart("ensembl",dataset="hsapiens_gene_ensembl", mysql=TRUE) > getSequence(chromosome="1", start=starts, end=ends, mart=ensembl) > > <driftingofftopic> > > An unpublished alternative is to use a new library I have concocted which > will suck sequences directly out of a 2bit file. You have to download the > human genome .2bit file (800Mb) from here: > > ftp://hgdownload.cse.ucsc.edu/gbdb/hg18/hg18.2bit > > And then, if you had my library installed (which I can send to you if you > want it), you could do this: > > library(r2bit) > chr <- rep("chr1", length(starts)) > seq <- fetch2bit("/path/to/hg18.2bit", chr, starts, ends) > > On my box, this takes 35 seconds to retrieve the 623,302 sequences (as > character strings, not at Biostrings objects) from my local copy of > hg18.2bit. > > Personally, as we move into the massively high throughput sequencing era, > I wonder if this isn't a good alternative (i.e., coupling biomaRt's ability > to quickly figure out what sequences are of interest, and then retrieving > those sequences quickly out of a local compressed file) > > </driftingofftopic> > > Cheers, > Eric > > > > Is there any faster way to do this? > > > > Here's the code that I'm thinking of: > > > > library(BSgenome.Hsapiens.UCSC.hg18) > > c1<-Hsapiens$chr1 > > m<-matchPatterns('CTAG', c1) > > > > #And the for loop > > reslt<-NULL > > for (i in 1:(length(m)-1)) { > > reslt<-rbind(reslt, cbind(start(m[i])+1, start(m[i+1])-3, > > getSeq(Hsapiens, 'chr1', start(m[i])+2, start(m[i+1])-2))) } > > > > It seems like this code is no where near perfect. Please give > > me some input. > > > > Thanks > > > > -- > > Regards, > > Anh Tran > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@stat.math.ethz.ch > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > This email message, including any attachments, is for ...{{dropped:16}}
ADD REPLY
0
Entering edit mode
@herve-pages-1542
Last seen 15 hours ago
Seattle, WA, United States
Hi Anh, Quoting Anh Tran <popophobia at="" gmail.com="">: > Hi all, > > I'm new to this mailing list and have a very simple question about digestion > with restriction enzyme for the whole genome. > > I'm using package BSgenome.Hsapiens.UCSC.hg18 to find the cut site of > enzyme. Is there a faster way (ie. pre-built package) to get the fragment > sequence and it's location in UCSC genome browser format (Chr#: start bp - > end bp). > > I'm planning on finding the restriction site location, then find the > substring between two consecutive cut sites. > > Is there any faster way to do this? > > Here's the code that I'm thinking of: > > library(BSgenome.Hsapiens.UCSC.hg18) > c1<-Hsapiens$chr1 > m<-matchPatterns('CTAG', c1) > > #And the for loop > reslt<-NULL > for (i in 1:(length(m)-1)) { > reslt<-rbind(reslt, cbind(start(m[i])+1, start(m[i+1])-3, > getSeq(Hsapiens, 'chr1', start(m[i])+2, start(m[i+1])-2))) > } > > It seems like this code is no where near perfect. Please give me some input. Yes looping on an 623303-element vector at the R level will take a long time. To "invert" the views of an XStringViews object i.e. to make the current gaps between the views become the new views and the current views become the new gaps, just do: mgaps <- gaps(m) This is very fast! (< 0.3 sec. on my machine) Then if you really want this as a standard character vector, do: mgaps2 <- as.character(mgaps) but, unfortunately this is still very slow at the moment because the main loop in the "as.character" method for XStringViews objects is still written in R and not in C (we need to change this ASAP). You could also use the getSeq() function in a vectorized way if you really wanted to use getSeq(). Then your code would look like: reslt_start <- end(m)[-length(m)] + 1L reslt_end <- start(m)[-1] - 1L reslt <- getSeq(Hsapiens, 'chr1', start(m)[i]+2, start(m)[i+1]-2)) at least in theory... but I realized that there are a few problems with the current getSeq() implementation that I need to address too, sorry! Note that you'll need the latest available version of Biostrings in order to use the gaps() function. Cheers, H. > > Thanks > > -- > Regards, > Anh Tran > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD COMMENT
0
Entering edit mode
Quoting hpages at fhcrc.org: [...] > You could also use the getSeq() function in a vectorized way if you > really wanted to use getSeq(). Then your code would look like: > > reslt_start <- end(m)[-length(m)] + 1L > reslt_end <- start(m)[-1] - 1L > reslt <- getSeq(Hsapiens, 'chr1', start(m)[i]+2, start(m)[i+1]-2)) I meant this: reslt_start <- end(m)[-length(m)] + 1L reslt_end <- start(m)[-1] - 1L reslt <- getSeq(Hsapiens, 'chr1', start=reslt_start, end=reslt_end) sorry... (but as I said in my previous post, getSeq() needs some improvements and fixes) Cheers, H.
ADD REPLY
0
Entering edit mode
Hi Anh, I fixed a few things in Biostrings: - The "as.character" method for XStringViews (and XStringSet) objects is much faster now (about 250x faster than yesterday for your use case): library(BSgenome.Hsapiens.UCSC.hg18) c1 <- Hsapiens$chr1 m <- matchPattern('CTAG', c1) mgaps <- gaps(m) mgaps2 <- as.character(mgaps) # now takes about 5 sec. instead of 20 minutes - As a consequence, the getSeq() function is much faster too (it uses as.character() internally): mgaps3_start <- end(m)[-length(m)] + 1L mgaps3_end <- start(m)[-1] - 1L ## The empty gaps must be dropped because the 'start' and 'end' args of getSeq() ## must verify 'start' <= 'end' nonemptygaps <- mgaps3_start <= mgaps3_end mgaps3_start <- mgaps3_start[nonemptygaps] mgaps3_end <- mgaps3_end[nonemptygaps] mgaps3 <- getSeq(Hsapiens, 'chr1', start=mgaps3_start, end=mgaps3_end) identical(mgaps3, mgaps2) # TRUE This will be available in BioC 2.2 (Biostrings 2.8.5) tonight around midnight (Seattle time) for R-2.7 users and in BioC 2.3 (Biostrings 2.9.8) around noon (Seattle time). As always, install with biocLite(). Please let me know by posting here if you run into other problems. Thanks! H. hpages at fhcrc.org wrote: > Hi Anh, > > Quoting Anh Tran <popophobia at="" gmail.com="">: > >> Hi all, >> >> I'm new to this mailing list and have a very simple question about >> digestion >> with restriction enzyme for the whole genome. >> >> I'm using package BSgenome.Hsapiens.UCSC.hg18 to find the cut site of >> enzyme. Is there a faster way (ie. pre-built package) to get the fragment >> sequence and it's location in UCSC genome browser format (Chr#: start >> bp - >> end bp). >> >> I'm planning on finding the restriction site location, then find the >> substring between two consecutive cut sites. >> >> Is there any faster way to do this? >> >> Here's the code that I'm thinking of: >> >> library(BSgenome.Hsapiens.UCSC.hg18) >> c1<-Hsapiens$chr1 >> m<-matchPatterns('CTAG', c1) >> >> #And the for loop >> reslt<-NULL >> for (i in 1:(length(m)-1)) { >> reslt<-rbind(reslt, cbind(start(m[i])+1, start(m[i+1])-3, >> getSeq(Hsapiens, 'chr1', start(m[i])+2, start(m[i+1])-2))) >> } >> >> It seems like this code is no where near perfect. Please give me some >> input. > > Yes looping on an 623303-element vector at the R level will take a long > time. To "invert" the views of an XStringViews object i.e. to make the > current gaps between the views become the new views and the current views > become the new gaps, just do: > > mgaps <- gaps(m) > > This is very fast! (< 0.3 sec. on my machine) > > Then if you really want this as a standard character vector, do: > > mgaps2 <- as.character(mgaps) > > but, unfortunately this is still very slow at the moment because the > main loop in the "as.character" method for XStringViews objects is still > written in R and not in C (we need to change this ASAP). > > You could also use the getSeq() function in a vectorized way if you > really wanted to use getSeq(). Then your code would look like: > > reslt_start <- end(m)[-length(m)] + 1L > reslt_end <- start(m)[-1] - 1L > reslt <- getSeq(Hsapiens, 'chr1', start(m)[i]+2, start(m)[i+1]-2)) > > at least in theory... but I realized that there are a few problems > with the current getSeq() implementation that I need to address too, > sorry! > > Note that you'll need the latest available version of Biostrings in order > to use the gaps() function. > > Cheers, > H. > > >> >> Thanks >> >> -- >> Regards, >> Anh Tran >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Thank you very much. I'll check it out tonight. In the mean time, I'll have to set up R 2.7.0. Thanks again. On Tue, May 13, 2008 at 10:05 AM, Herve Pages <hpages@fhcrc.org> wrote: > Hi Anh, > > I fixed a few things in Biostrings: > > - The "as.character" method for XStringViews (and XStringSet) objects is > much > faster now (about 250x faster than yesterday for your use case): > > library(BSgenome.Hsapiens.UCSC.hg18) > c1 <- Hsapiens$chr1 > m <- matchPattern('CTAG', c1) > mgaps <- gaps(m) > mgaps2 <- as.character(mgaps) # now takes about 5 sec. instead of 20 > minutes > > - As a consequence, the getSeq() function is much faster too (it uses > as.character() > internally): > > mgaps3_start <- end(m)[-length(m)] + 1L > mgaps3_end <- start(m)[-1] - 1L > ## The empty gaps must be dropped because the 'start' and 'end' args of > getSeq() > ## must verify 'start' <= 'end' > nonemptygaps <- mgaps3_start <= mgaps3_end > mgaps3_start <- mgaps3_start[nonemptygaps] > mgaps3_end <- mgaps3_end[nonemptygaps] > mgaps3 <- getSeq(Hsapiens, 'chr1', start=mgaps3_start, end=mgaps3_end) > > identical(mgaps3, mgaps2) # TRUE > > This will be available in BioC 2.2 (Biostrings 2.8.5) tonight around > midnight > (Seattle time) for R-2.7 users and in BioC 2.3 (Biostrings 2.9.8) around > noon > (Seattle time). As always, install with biocLite(). > > Please let me know by posting here if you run into other problems. Thanks! > > H. > > > > hpages@fhcrc.org wrote: > > > Hi Anh, > > > > Quoting Anh Tran <popophobia@gmail.com>: > > > > Hi all, > > > > > > I'm new to this mailing list and have a very simple question about > > > digestion > > > with restriction enzyme for the whole genome. > > > > > > I'm using package BSgenome.Hsapiens.UCSC.hg18 to find the cut site of > > > enzyme. Is there a faster way (ie. pre-built package) to get the > > > fragment > > > sequence and it's location in UCSC genome browser format (Chr#: start > > > bp - > > > end bp). > > > > > > I'm planning on finding the restriction site location, then find the > > > substring between two consecutive cut sites. > > > > > > Is there any faster way to do this? > > > > > > Here's the code that I'm thinking of: > > > > > > library(BSgenome.Hsapiens.UCSC.hg18) > > > c1<-Hsapiens$chr1 > > > m<-matchPatterns('CTAG', c1) > > > > > > #And the for loop > > > reslt<-NULL > > > for (i in 1:(length(m)-1)) { > > > reslt<-rbind(reslt, cbind(start(m[i])+1, start(m[i+1])-3, > > > getSeq(Hsapiens, 'chr1', start(m[i])+2, start(m[i+1])-2))) > > > } > > > > > > It seems like this code is no where near perfect. Please give me some > > > input. > > > > > > > Yes looping on an 623303-element vector at the R level will take a long > > time. To "invert" the views of an XStringViews object i.e. to make the > > current gaps between the views become the new views and the current > > views > > become the new gaps, just do: > > > > mgaps <- gaps(m) > > > > This is very fast! (< 0.3 sec. on my machine) > > > > Then if you really want this as a standard character vector, do: > > > > mgaps2 <- as.character(mgaps) > > > > but, unfortunately this is still very slow at the moment because the > > main loop in the "as.character" method for XStringViews objects is still > > written in R and not in C (we need to change this ASAP). > > > > You could also use the getSeq() function in a vectorized way if you > > really wanted to use getSeq(). Then your code would look like: > > > > reslt_start <- end(m)[-length(m)] + 1L > > reslt_end <- start(m)[-1] - 1L > > reslt <- getSeq(Hsapiens, 'chr1', start(m)[i]+2, start(m)[i+1]-2)) > > > > at least in theory... but I realized that there are a few problems > > with the current getSeq() implementation that I need to address too, > > sorry! > > > > Note that you'll need the latest available version of Biostrings in > > order > > to use the gaps() function. > > > > Cheers, > > H. > > > > > > > > > Thanks > > > > > > -- > > > Regards, > > > Anh Tran > > > > > > [[alternative HTML version deleted]] > > > > > > _______________________________________________ > > > Bioconductor mailing list > > > Bioconductor@stat.math.ethz.ch > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > Search the archives: > > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@stat.math.ethz.ch > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- Regards, Anh Tran [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Hi, I updated my system to R 2.7.0 but still cannot see the update of the Biostrings package (2.8.5). I'm using MacOSX and will test the Windows box. I'm installing using the script from bioconductor: source("http://bioconductor.org/biocLite.R") biocLite() Please let me know if there's something I should be paying attention to. Thanks. On Tue, May 13, 2008 at 11:23 AM, Anh Tran <popophobia@gmail.com> wrote: > Thank you very much. > > I'll check it out tonight. In the mean time, I'll have to set up R 2.7.0. > > Thanks again. > > > On Tue, May 13, 2008 at 10:05 AM, Herve Pages <hpages@fhcrc.org> wrote: > >> Hi Anh, >> >> I fixed a few things in Biostrings: >> >> - The "as.character" method for XStringViews (and XStringSet) objects is >> much >> faster now (about 250x faster than yesterday for your use case): >> >> library(BSgenome.Hsapiens.UCSC.hg18) >> c1 <- Hsapiens$chr1 >> m <- matchPattern('CTAG', c1) >> mgaps <- gaps(m) >> mgaps2 <- as.character(mgaps) # now takes about 5 sec. instead of 20 >> minutes >> >> - As a consequence, the getSeq() function is much faster too (it uses >> as.character() >> internally): >> >> mgaps3_start <- end(m)[-length(m)] + 1L >> mgaps3_end <- start(m)[-1] - 1L >> ## The empty gaps must be dropped because the 'start' and 'end' args of >> getSeq() >> ## must verify 'start' <= 'end' >> nonemptygaps <- mgaps3_start <= mgaps3_end >> mgaps3_start <- mgaps3_start[nonemptygaps] >> mgaps3_end <- mgaps3_end[nonemptygaps] >> mgaps3 <- getSeq(Hsapiens, 'chr1', start=mgaps3_start, end=mgaps3_end) >> >> identical(mgaps3, mgaps2) # TRUE >> >> This will be available in BioC 2.2 (Biostrings 2.8.5) tonight around >> midnight >> (Seattle time) for R-2.7 users and in BioC 2.3 (Biostrings 2.9.8) around >> noon >> (Seattle time). As always, install with biocLite(). >> >> Please let me know by posting here if you run into other problems. Thanks! >> >> H. >> >> >> >> hpages@fhcrc.org wrote: >> >>> Hi Anh, >>> >>> Quoting Anh Tran <popophobia@gmail.com>: >>> >>> Hi all, >>>> >>>> I'm new to this mailing list and have a very simple question about >>>> digestion >>>> with restriction enzyme for the whole genome. >>>> >>>> I'm using package BSgenome.Hsapiens.UCSC.hg18 to find the cut site of >>>> enzyme. Is there a faster way (ie. pre-built package) to get the >>>> fragment >>>> sequence and it's location in UCSC genome browser format (Chr#: start bp >>>> - >>>> end bp). >>>> >>>> I'm planning on finding the restriction site location, then find the >>>> substring between two consecutive cut sites. >>>> >>>> Is there any faster way to do this? >>>> >>>> Here's the code that I'm thinking of: >>>> >>>> library(BSgenome.Hsapiens.UCSC.hg18) >>>> c1<-Hsapiens$chr1 >>>> m<-matchPatterns('CTAG', c1) >>>> >>>> #And the for loop >>>> reslt<-NULL >>>> for (i in 1:(length(m)-1)) { >>>> reslt<-rbind(reslt, cbind(start(m[i])+1, start(m[i+1])-3, >>>> getSeq(Hsapiens, 'chr1', start(m[i])+2, start(m[i+1])-2))) >>>> } >>>> >>>> It seems like this code is no where near perfect. Please give me some >>>> input. >>>> >>> >>> Yes looping on an 623303-element vector at the R level will take a long >>> time. To "invert" the views of an XStringViews object i.e. to make the >>> current gaps between the views become the new views and the current views >>> become the new gaps, just do: >>> >>> mgaps <- gaps(m) >>> >>> This is very fast! (< 0.3 sec. on my machine) >>> >>> Then if you really want this as a standard character vector, do: >>> >>> mgaps2 <- as.character(mgaps) >>> >>> but, unfortunately this is still very slow at the moment because the >>> main loop in the "as.character" method for XStringViews objects is still >>> written in R and not in C (we need to change this ASAP). >>> >>> You could also use the getSeq() function in a vectorized way if you >>> really wanted to use getSeq(). Then your code would look like: >>> >>> reslt_start <- end(m)[-length(m)] + 1L >>> reslt_end <- start(m)[-1] - 1L >>> reslt <- getSeq(Hsapiens, 'chr1', start(m)[i]+2, start(m)[i+1]-2)) >>> >>> at least in theory... but I realized that there are a few problems >>> with the current getSeq() implementation that I need to address too, >>> sorry! >>> >>> Note that you'll need the latest available version of Biostrings in order >>> to use the gaps() function. >>> >>> Cheers, >>> H. >>> >>> >>> >>>> Thanks >>>> >>>> -- >>>> Regards, >>>> Anh Tran >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor@stat.math.ethz.ch >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > > > -- > Regards, > Anh Tran -- Regards, Anh Tran [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Hi Anh, Biostrings 2.8.6 (latest release version) was pushed to the public repository just a few minutes ago: http://bioconductor.org/packages/2.2/bioc/html/Biostrings.html Can you please try again and let me know if there are further problems. Thanks! H. Anh Tran wrote: > Hi, > I updated my system to R 2.7.0 but still cannot see the update of the > Biostrings package (2.8.5). I'm using MacOSX and will test the Windows box. > > I'm installing using the script from bioconductor: > > source("http://bioconductor.org/biocLite.R") > biocLite() > > Please let me know if there's something I should be paying attention to. > > Thanks. > > On Tue, May 13, 2008 at 11:23 AM, Anh Tran <popophobia at="" gmail.com=""> <mailto:popophobia at="" gmail.com="">> wrote: > > Thank you very much. > > I'll check it out tonight. In the mean time, I'll have to set up R > 2.7.0. <http: 2.7.0.=""> > > Thanks again. > > > On Tue, May 13, 2008 at 10:05 AM, Herve Pages <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> wrote: > > Hi Anh, > > I fixed a few things in Biostrings: > > - The "as.character" method for XStringViews (and XStringSet) > objects is much > faster now (about 250x faster than yesterday for your use case): > > > library(BSgenome.Hsapiens.UCSC.hg18) > c1 <- Hsapiens$chr1 > m <- matchPattern('CTAG', c1) > mgaps <- gaps(m) > mgaps2 <- as.character(mgaps) # now takes about 5 sec. > instead of 20 minutes > > - As a consequence, the getSeq() function is much faster too (it > uses as.character() > internally): > > mgaps3_start <- end(m)[-length(m)] + 1L > mgaps3_end <- start(m)[-1] - 1L > ## The empty gaps must be dropped because the 'start' and > 'end' args of getSeq() > ## must verify 'start' <= 'end' > nonemptygaps <- mgaps3_start <= mgaps3_end > mgaps3_start <- mgaps3_start[nonemptygaps] > mgaps3_end <- mgaps3_end[nonemptygaps] > mgaps3 <- getSeq(Hsapiens, 'chr1', start=mgaps3_start, > end=mgaps3_end) > > identical(mgaps3, mgaps2) # TRUE > > This will be available in BioC 2.2 (Biostrings 2.8.5) tonight > around midnight > (Seattle time) for R-2.7 users and in BioC 2.3 (Biostrings > 2.9.8) around noon > (Seattle time). As always, install with biocLite(). > > Please let me know by posting here if you run into other > problems. Thanks! > > H. > > > > hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> wrote: > > Hi Anh, > > Quoting Anh Tran <popophobia at="" gmail.com=""> <mailto:popophobia at="" gmail.com="">>: > > Hi all, > > I'm new to this mailing list and have a very simple > question about digestion > with restriction enzyme for the whole genome. > > I'm using package BSgenome.Hsapiens.UCSC.hg18 to find > the cut site of > enzyme. Is there a faster way (ie. pre-built package) to > get the fragment > sequence and it's location in UCSC genome browser format > (Chr#: start bp - > end bp). > > I'm planning on finding the restriction site location, > then find the > substring between two consecutive cut sites. > > Is there any faster way to do this? > > Here's the code that I'm thinking of: > > library(BSgenome.Hsapiens.UCSC.hg18) > c1<-Hsapiens$chr1 > m<-matchPatterns('CTAG', c1) > > #And the for loop > reslt<-NULL > for (i in 1:(length(m)-1)) { > reslt<-rbind(reslt, cbind(start(m[i])+1, start(m[i+1])-3, > getSeq(Hsapiens, 'chr1', start(m[i])+2, start(m[i+1])-2))) > } > > It seems like this code is no where near perfect. Please > give me some input. > > > Yes looping on an 623303-element vector at the R level will > take a long > time. To "invert" the views of an XStringViews object i.e. > to make the > current gaps between the views become the new views and the > current views > become the new gaps, just do: > > mgaps <- gaps(m) > > This is very fast! (< 0.3 sec. on my machine) > > Then if you really want this as a standard character vector, do: > > mgaps2 <- as.character(mgaps) > > but, unfortunately this is still very slow at the moment > because the > main loop in the "as.character" method for XStringViews > objects is still > written in R and not in C (we need to change this ASAP). > > You could also use the getSeq() function in a vectorized way > if you > really wanted to use getSeq(). Then your code would look like: > > reslt_start <- end(m)[-length(m)] + 1L > reslt_end <- start(m)[-1] - 1L > reslt <- getSeq(Hsapiens, 'chr1', start(m)[i]+2, > start(m)[i+1]-2)) > > at least in theory... but I realized that there are a few > problems > with the current getSeq() implementation that I need to > address too, > sorry! > > Note that you'll need the latest available version of > Biostrings in order > to use the gaps() function. > > Cheers, > H. > > > > Thanks > > -- > Regards, > Anh Tran > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > <mailto:bioconductor at="" stat.math.ethz.ch=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > <mailto:bioconductor at="" stat.math.ethz.ch=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > -- > Regards, > Anh Tran > > > > > -- > Regards, > Anh Tran
ADD REPLY
0
Entering edit mode
I've got it. Thank you for your support. Anh Tran On Wed, May 14, 2008 at 12:05 PM, Herve Pages <hpages@fhcrc.org> wrote: > Hi Anh, > > Biostrings 2.8.6 (latest release version) was pushed to the public > repository just a > few minutes ago: > > http://bioconductor.org/packages/2.2/bioc/html/Biostrings.html > > Can you please try again and let me know if there are further problems. > > Thanks! > H. > > > Anh Tran wrote: > >> Hi, >> I updated my system to R 2.7.0 but still cannot see the update of the >> Biostrings package (2.8.5). I'm using MacOSX and will test the Windows box. >> >> I'm installing using the script from bioconductor: >> >> source("http://bioconductor.org/biocLite.R") >> biocLite() >> >> Please let me know if there's something I should be paying attention to. >> >> Thanks. >> >> On Tue, May 13, 2008 at 11:23 AM, Anh Tran <popophobia@gmail.com <mailto:="">> popophobia@gmail.com>> wrote: >> >> Thank you very much. >> >> I'll check it out tonight. In the mean time, I'll have to set up R >> 2.7.0. <http: 2.7.0.=""> >> >> Thanks again. >> >> >> On Tue, May 13, 2008 at 10:05 AM, Herve Pages <hpages@fhcrc.org>> <mailto:hpages@fhcrc.org>> wrote: >> >> Hi Anh, >> >> I fixed a few things in Biostrings: >> >> - The "as.character" method for XStringViews (and XStringSet) >> objects is much >> faster now (about 250x faster than yesterday for your use case): >> >> >> library(BSgenome.Hsapiens.UCSC.hg18) >> c1 <- Hsapiens$chr1 >> m <- matchPattern('CTAG', c1) >> mgaps <- gaps(m) >> mgaps2 <- as.character(mgaps) # now takes about 5 sec. >> instead of 20 minutes >> >> - As a consequence, the getSeq() function is much faster too (it >> uses as.character() >> internally): >> >> mgaps3_start <- end(m)[-length(m)] + 1L >> mgaps3_end <- start(m)[-1] - 1L >> ## The empty gaps must be dropped because the 'start' and >> 'end' args of getSeq() >> ## must verify 'start' <= 'end' >> nonemptygaps <- mgaps3_start <= mgaps3_end >> mgaps3_start <- mgaps3_start[nonemptygaps] >> mgaps3_end <- mgaps3_end[nonemptygaps] >> mgaps3 <- getSeq(Hsapiens, 'chr1', start=mgaps3_start, >> end=mgaps3_end) >> >> identical(mgaps3, mgaps2) # TRUE >> >> This will be available in BioC 2.2 (Biostrings 2.8.5) tonight >> around midnight >> (Seattle time) for R-2.7 users and in BioC 2.3 (Biostrings >> 2.9.8) around noon >> (Seattle time). As always, install with biocLite(). >> >> Please let me know by posting here if you run into other >> problems. Thanks! >> >> H. >> >> >> >> hpages@fhcrc.org <mailto:hpages@fhcrc.org> wrote: >> >> Hi Anh, >> >> Quoting Anh Tran <popophobia@gmail.com>> <mailto:popophobia@gmail.com>>: >> >> >> Hi all, >> >> I'm new to this mailing list and have a very simple >> question about digestion >> with restriction enzyme for the whole genome. >> >> I'm using package BSgenome.Hsapiens.UCSC.hg18 to find >> the cut site of >> enzyme. Is there a faster way (ie. pre-built package) to >> get the fragment >> sequence and it's location in UCSC genome browser format >> (Chr#: start bp - >> end bp). >> >> I'm planning on finding the restriction site location, >> then find the >> substring between two consecutive cut sites. >> >> Is there any faster way to do this? >> >> Here's the code that I'm thinking of: >> >> library(BSgenome.Hsapiens.UCSC.hg18) >> c1<-Hsapiens$chr1 >> m<-matchPatterns('CTAG', c1) >> >> #And the for loop >> reslt<-NULL >> for (i in 1:(length(m)-1)) { >> reslt<-rbind(reslt, cbind(start(m[i])+1, start(m[i+1])-3, >> getSeq(Hsapiens, 'chr1', start(m[i])+2, start(m[i+1])-2))) >> } >> >> It seems like this code is no where near perfect. Please >> give me some input. >> >> >> Yes looping on an 623303-element vector at the R level will >> take a long >> time. To "invert" the views of an XStringViews object i.e. >> to make the >> current gaps between the views become the new views and the >> current views >> become the new gaps, just do: >> >> mgaps <- gaps(m) >> >> This is very fast! (< 0.3 sec. on my machine) >> >> Then if you really want this as a standard character vector, >> do: >> >> mgaps2 <- as.character(mgaps) >> >> but, unfortunately this is still very slow at the moment >> because the >> main loop in the "as.character" method for XStringViews >> objects is still >> written in R and not in C (we need to change this ASAP). >> >> You could also use the getSeq() function in a vectorized way >> if you >> really wanted to use getSeq(). Then your code would look like: >> >> reslt_start <- end(m)[-length(m)] + 1L >> reslt_end <- start(m)[-1] - 1L >> reslt <- getSeq(Hsapiens, 'chr1', start(m)[i]+2, >> start(m)[i+1]-2)) >> >> at least in theory... but I realized that there are a few >> problems >> with the current getSeq() implementation that I need to >> address too, >> sorry! >> >> Note that you'll need the latest available version of >> Biostrings in order >> to use the gaps() function. >> >> Cheers, >> H. >> >> >> >> Thanks >> >> -- Regards, >> Anh Tran >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@stat.math.ethz.ch >> <mailto:bioconductor@stat.math.ethz.ch> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@stat.math.ethz.ch >> <mailto:bioconductor@stat.math.ethz.ch> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> >> >> >> -- Regards, >> Anh Tran >> >> >> >> -- >> Regards, >> Anh Tran >> > > -- Regards, Anh Tran [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 887 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6