question regarding BSGenome/Biostrings

0

Entering edit mode

Elizabeth Purdom ▴ 210

@elizabeth-purdom-2486

Last seen 2.4 years ago

USA/ Berkeley/UC Berkeley

Hello, I am trying to run code like that in the pdf documentation of BSGenome to match patterns to all the chromosomes of a genome. I have copied the function 'runOneStrandAnalysis' from the documentation and have changed runAnalysis2 to be the human genome (BSgenome.Hsapiens.UCSC.hg18) as follows: runAnalysis2 <- function(dict0, outfile="") { seqnames <- seqnames(Hsapiens) runOneStrandAnalysis(dict0, Celegans, seqnames, "+", outfile=outfile, append=FALSE) runOneStrandAnalysis(dict0, Celegans, seqnames, "-", outfile=outfile, append=TRUE) } When I run the code using the patterns in the example, it runs fine (though of course the patterns in the dictionary were intended for C elegans, but that's irrelevant): runAnalysis2(ce2dict0cw15, outfile="ce2dict0cw15_ana2.txt") Instead I want to find if regions in the genome are repeated. So I made a dictionary that is class 'XStringViews' by using views to grab the regions: MAPLENGTH<-36 WRITELENGTH<-100 myDictViews<-views(Hsapiens$chr1, 1:WRITELENGTH, (1:WRITELENGTH)+MAPLENGTH-1) And I get the following error: runAnalysis2(myDictViews, outfile="sample.txt") >>> Finding all hits in strand + of chromosome chr1 ... Error in .match.CWdna_PDict.exact(pdict, subject, fixed, count.only) : Biostrings internal error in _init_chrtrtable(): invalid code -1 The problem is in matchPDict: mypdict<-PDict(myDictViews) subject <- Hsapiens[["chr1"]] mindex <- matchPDict(mypdict, subject) Error in .match.CWdna_PDict.exact(pdict, subject, fixed, count.only) : Biostrings internal error in _init_chrtrtable(): invalid code -1 I have checked and there are only A,C,G,T values in this region. Is it because I have a 'XStringViews' object rather than a DNAStringSet object? If this is a bug (I know this part is still under progress) is there a quick fix, like converting from 'XStringViews' to 'DNAStringSet'? Also, if I want to do this on a large scale, is this a reasonable way to go or are there tricks for this that would speed it up? I'm still not sure that I will ultimately do this in R, but just to know. Thanks, Elizabeth UC, Berkeley

celegans Biostrings celegans Biostrings • 946 views

ADD COMMENT • link updated 15.5 years ago by Hervé Pagès 16k • written 15.5 years ago by Elizabeth Purdom ▴ 210

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 23 hours ago

Seattle, WA, United States

Hi Elizabeth, Elizabeth Purdom wrote: > Hello, > > I am trying to run code like that in the pdf documentation of BSGenome > to match patterns to all the chromosomes of a genome. I have copied the > function 'runOneStrandAnalysis' from the documentation and have changed > runAnalysis2 to be the human genome (BSgenome.Hsapiens.UCSC.hg18) as > follows: > > runAnalysis2 <- function(dict0, outfile="") > { > seqnames <- seqnames(Hsapiens) > runOneStrandAnalysis(dict0, Celegans, seqnames, "+", outfile=outfile, > append=FALSE) > runOneStrandAnalysis(dict0, Celegans, seqnames, "-", outfile=outfile, > append=TRUE) > } You also need to replace the 2nd argument of your call to runOneStrandAnalysis() by Hsapiens. > > When I run the code using the patterns in the example, it runs fine > (though of course the patterns in the dictionary were intended for C > elegans, but that's irrelevant): > > runAnalysis2(ce2dict0cw15, outfile="ce2dict0cw15_ana2.txt") > > Instead I want to find if regions in the genome are repeated. So I made > a dictionary that is class 'XStringViews' by using views to grab the > regions: > > MAPLENGTH<-36 > WRITELENGTH<-100 > myDictViews<-views(Hsapiens$chr1, 1:WRITELENGTH, > (1:WRITELENGTH)+MAPLENGTH-1) > > And I get the following error: > runAnalysis2(myDictViews, outfile="sample.txt") > >>> Finding all hits in strand + of chromosome chr1 ... > Error in .match.CWdna_PDict.exact(pdict, subject, fixed, count.only) : > Biostrings internal error in _init_chrtrtable(): invalid code -1 > > The problem is in matchPDict: > mypdict<-PDict(myDictViews) > subject <- Hsapiens[["chr1"]] > mindex <- matchPDict(mypdict, subject) > Error in .match.CWdna_PDict.exact(pdict, subject, fixed, count.only) : > Biostrings internal error in _init_chrtrtable(): invalid code -1 > > I have checked and there are only A,C,G,T values in this region. Is it > because I have a 'XStringViews' object rather than a DNAStringSet > object? If this is a bug (I know this part is still under progress) is > there a quick fix, like converting from 'XStringViews' to 'DNAStringSet'? This is an old bug when one of the 4 base letters was missing (in your case the views contain no Gs). The bug was fixed in the devel branch of BioC a couple of months before the last BioC release. So I suspect you are still using an outdated version of Biostrings (providing your sessionInfo() would help us to figure out this). Please update your installation and make sure you use BioC 2.3 and the latest version of Biostrings (v 2.10.6). Then create the views with: myDictViews <- Views(unmasked(Hsapiens$chr1), start=1:WRITELENGTH, end=(1:WRITELENGTH)+MAPLENGTH-1) (note that views() is deprecated in favor of Views()) and try to run your code again with this 'myDictViews'. > > Also, if I want to do this on a large scale, is this a reasonable way to > go or are there tricks for this that would speed it up? I'm still not > sure that I will ultimately do this in R, but just to know. If you want to slide the 36-nucleotide window along the entire Human genome in order to find which views are repeated, yes it's going to be a huge analysis and will require a lot of computing power. I don't think the matchPDict() approach is the best way to go but it's certainly doable with it. Here are a few tips: o If you blindly slide the window (of width 36) along chromosome 1 (or any other chromosome), you will run thru N-blocks. These blocks are generally assembly gaps (inter-contig) but they can also be found inside a contig (intra-contig Ns). Of course you want to skip the Ns (or any other ambiguity letter but I don't think there are any in hg18), even isolated Ns, because PDict() doesn't support them. One easy way to do this is to split chr1 into N-free views with: as(Hsapiens$chr1, "XStringViews") o Even on a machine with a lot of memory, there are currently some limitations to the size of a PDict object. For a window of width 36, the max number of views that you will be able to store in the PDict object should be between 10 and 15 millions. Note that for 10 millions, the resulting PDict object will occupy about 7GB in memory. So make sure that you have at least 16GB or RAM if you plan to use PDict objects of this size. Otherwise you will need to use smaller PDict objects (5 millions views) and the entire analysis will be twice longer. o The MIndex object returned by matchPDict() contains the coordinates of all the hits and for each PDict object that you run along the genome will collect tens or hundreds of millions of hits. If you are only interested in knowing how many times each view is repeated in the entire genome, it's highly recommended that you use countPDict() instead of matchPDict(). That will help you save a lot of memory. o Use a streaming approach to save your results to a file (i.e. save as soon as you can so you don't have to keep everything in memory during the entire analysis). Also try to save in a compact way. For example you could just use a 4-column format, e.g.: chromosome strand start nb_of_occurences ... ... ... ... chr1 + 147776276 86 chr1 + 147776277 135 chr1 + 147776278 97 ... ... ... ... and save lines only for views that have other occurences in the genome (nb_of_occurences >= 2). But if you are only interested in views that are unique in the genome (nb_of_occurences = 1), you could only save those and then you don't need the nb_of_occurences column anymore. Note that this is maybe the kind of analysis where using an indexed genome (suffix tree or suffix array) could be more efficient. But I cannot give any advice on this since I'm not familiar with this kind of approach. Hope this helps. H. > > Thanks, > Elizabeth > UC, Berkeley > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 15.5 years ago Hervé Pagès 16k

Login before adding your answer.