Limit on number of sequence files for forging a BSgenome

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 18 hours ago

Seattle, WA, United States

Hi Marco, Sorry for the delay in answering this. Seems like you are hitting a limitation in the implementation of the read.dcf() function. Having so many entries in the 'seqnames' field of your seed file is probably a bad idea anyway. This field is typically expected to list the chromosomes or "top-level sequences" so it will always have a small number of entries (i.e. < 100). Things like contigs or scaffolds, which are generally counted by hundreds or thousands, should go in the 'mseqnames' field. Here they are grouped and stored in 1 (or a small number) of DNAStringSet objects (i.e. < 100) but each object can contain millions of sequences. One entry per object should be reported in the 'mseqnames' field. See for example the 'chrUn.scaffolds' multiple sequence in the BSgenome.Btaurus.UCSC.bosTau4 package, or the 'Zv9_NA' and 'Zv9_scaffold' multiple sequences in the BSgenome.Drerio.UCSC.danRer7 package. See the BSgenomeForge vignette for details about to handle the 'mseqnames' field when forging your own package. Hope this helps. Please let me know if you have further questions. H. On 03/29/2013 12:10 PM, Blanchette, Marco wrote: > I traced back the error "Error: Line longer than buffer size" that I am > getting from forgeBSgenomeDataPkg() to a call to read.dcf() made in the > forgeBSgenomeDataPkg() that is used to read the seed file. I came to the > realization that there is an upper limit to the number of character > allowed per single line for the DCF files. > > For instance: > > This works > cat("test:",paste(sample(letters,8184,TRUE),collapse=""),"\n",file=" test.dc > f");t <- read.dcf("test.dcf") > > While this breaks with the same error I get from forgeBSgenomeDataPkg() > cat("test:",paste(sample(letters,8184,TRUE),collapse=""),"\n",file=" test.dc > f");t <- read.dcf("test.dcf") > > Since the seqnames: field I creates in my seed file contains several > thousands entries, I am busting that upper limit. I can reproduce the > error just by trying to read the seed file with read.dcf("mySeedFile.txt") > > At this point, I am not sure if there is an easy workaround and whether > this should be consider a bug in BSgenome or read.dcf() that should be > reported... > > Advise? > > -- Marco Blanchette, Ph.D. > Stowers Institute for Medical Research > 1000 East 50th Street > Kansas City MO 64110 > www.stowers.org > > > Tel: 816-926-4071 > Cell: 816-726-8419 > Fax: 816-926-2018 > > > > > > > On 3/28/13 11:58 AM, "Blanchette, Marco" <mab at="" stowers.org=""> wrote: > >> Kasper, >> >> I see your line of thought, is there a particular fasta file causing >> forgeBSgenomeDataPkg() to break? >> >> The answer is no. Once I reach a certain number of fasta files, adding one >> more contig breaks the function. For instance, taking the first 454 >> contigs of C. brenneri breaks while removing the last or the first fasta >> file from the list (keeping only 453) compile without a problem (neither >> the last or the first fasta files are responsible for breaking the >> function, the number of file is the trigger) >> >> What's even more puzzling is that the number that breaks is not a fixed >> number. Selecting a random selection of contigs or changing genome will >> change the number that triggers the function to break... However it's >> always around 440 files, which might be due to the size of the fasta files >> being all of very similar sizes. >> >> Any clues? >> >> >> -- Marco Blanchette, Ph.D. >> Stowers Institute for Medical Research >> 1000 East 50th Street >> Kansas City MO 64110 >> www.stowers.org >> >> >> Tel: 816-926-4071 >> Cell: 816-726-8419 >> Fax: 816-926-2018 >> >> >> >> >> >> >> On 3/27/13 8:22 PM, "Kasper Daniel Hansen" <kasperdanielhansen at="" gmail.com=""> >> wrote: >> >>> Marco, >>> >>> You are probably right in diagnosing the problem, but sometimes I >>> think I have seen FASTA files with the entire sequence on a single >>> line, instead of (say) 80 nucleotides and then a newline. I could >>> believe that a really long contig on a single line without a newline, >>> could cause an error like this. You could quickly check if there is a >>> suspicious file by >>> wc -l * >>> and look for files with #lines like 2-3. Somehow 460 seems a weird >>> number to fail at. >>> >>> This may not be your problem, and I am sure Herve will respond in due >>> time. >>> >>> Best, >>> Kasper >>> >>> On Wed, Mar 27, 2013 at 4:28 PM, Blanchette, Marco <mab at="" stowers.org=""> >>> wrote: >>>> Hi, >>>> >>>> Is there a maximum number of sequence files (chromosomes or contigs in >>>> my case) that can be fed to the forgeBSgenomeDataPkg() function? I am >>>> trying to build a BSgenome for C. brenneri and C. japonica available >>> >from EnsemblGenomes. These genomes are made from thousands of contigs >>>> with genes annotated to them. Currently, I get the following error when >>>> running "Error: Line longer than buffer size" when running on the full >>>> set of contigs. However, it works fine on a seed file containing a >>>> subset of the contigs (I can forge a genome with 450 contigs but not >>>> with 460!) >>>> >>>> Any suggestions will be appreciated (I can provide a toy example but I >>>> am not sure what would be the merit of it at this point) >>>> >>>> Thanks >>>> >>>> -- Marco Blanchette, Ph.D. >>>> Stowers Institute for Medical Research >>>> 1000 East 50th Street >>>> Kansas City MO 64110 >>>> www.stowers.org >>>> >>>> Tel: 816-926-4071 >>>> Cell: 816-726-8419 >>>> Fax: 816-926-2018 >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

GO Cancer BSgenome BSgenome genomes GO Cancer BSgenome BSgenome genomes • 1.9k views

ADD COMMENT • link 12.8 years ago Hervé Pagès 16k

0

Entering edit mode

Hi

I have a similar problem building a BSgenome for a Yeast strain. The 'Genome' consists ONLY of scaffold and unplaced sequences. There are around 500 scaffolds and 100 unplaced sequences (and nothing else). I CAN forge and load a BSgenome if I list the names of all sequences under 'seqnames' (I had to continue on extra lines in the _seed file with whitespace starting continuation lines). However, when I tried to use mseqnames (as above) - I made two fasta files, one with all the scaffolds another with all the unplaced and added the line

mseqnames: c("scaffold", "unplaced")

Then forgeBSgenomeDataPkg() failed with

Error in .forgeTwobitFile(seqnames, prefix, suffix, seqs_srcdir, seqs_destdir,  : 
  'seqnames' must be a character vector
In addition: Warning message:
In forgeSeqFiles(.seqnames, mseqnames = .mseqnames, seqfile_name = x@seqfile_name,  :
  'seqnames' is empty

Which is correct, but not very helpful. As far as I can tell, mseqnames only works when there is also some seqnames.

Otherwise, well done on some really useful software.

Dave Gerrard

ADD REPLY • link 10.3 years ago Dave Gerrard ▴ 10

0

Entering edit mode

Nevermind!

If I set

seqnames: character()

Then it builds.

ADD REPLY • link 10.3 years ago Dave Gerrard ▴ 10

Login before adding your answer.