Question about MEDIPS extend parameter : RE: Success: RE: Seed file error: RE: BioStrings for current R: RE: BSgenomeForge seed file

Question about MEDIPS extend parameter : RE: Success: RE: Seed file error: RE: BioStrings for current R: RE: BSgenomeForge seed file - seqnames field

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 3 hours ago

Seattle, WA, United States

Hi Kelly, I'm afraid you're asking the wrong person, sorry. The best place to ask this kind of question is on the mailing list (cc'ed) where there is a good chance that someone will be able to help you. Cheers, H. On 12/16/2013 02:55 PM, Vining, Kelly wrote: > Hi Herve, > I have a quick question about the 'extend' vs. 'shift' parameters in the MEDIPS package. From the documentation: > > All reads will be extended to a length of 300nt according to the given strandinformation: >> extend=300 > As an alternative to the extend parameter, the > shift parameter can be specifed. Here, the reads are not extended but shifted by the specifed number > of nucleotides with respect to the given strand infomation. One of the two parameters extend or > shift has to be 0. > > Is this parameter supposed to represent the IP'd fragment size? > > Thanks, > --Kelly > > -----Original Message----- > From: Vining, Kelly > Sent: Thursday, December 12, 2013 6:51 AM > To: Hervé Pagès > Subject: Success: RE: Seed file error: RE: BioStrings for current R: RE: [BioC] BSgenomeForge seed file - seqnames field > > Hi Herve, > I have now succeeded in forging the poplar BSgenome object. I will have to use it in order to confirm that it's exactly as I expect, but just wanted to let you know I got this far and thank you again for your help. > > Cheers, > --Kelly > ________________________________________ > From: Hervé Pagès [hpages at fhcrc.org] > Sent: Tuesday, December 10, 2013 2:38 PM > To: Vining, Kelly > Cc: bioconductor at r-project.org > Subject: Re: Seed file error: RE: BioStrings for current R: RE: [BioC] BSgenomeForge seed file - seqnames field > > Hi Kelly, > > Hard to tell without knowing exactly what you've downloaded but the 1st thing that strikes me when I look at your seed file below is that, depending on where I look exactly, the Populus trichocarpa genome seems to have either 12 chromosomes (based on your seqnames field), or 13 chromosomes (based on your SrcDataFiles1 field), or 19 chromosomes (based on the error reported by forgeBSgenomeDataPkg()). You need to figure out how many chromosomes there are and stick to that. > > A few other things: > > - No ".fa" extensions in the seqnames. > > - Make sure you have one .fa file per chromosome. Each file should be > named Chr01.fa, Chr02.fa, Chr03.fa, etc... Note that if you've > downloaded Ptrichocarpa_210.fa.gz from > > ftp://ftp.jgi- psf.org/pub/compgen/phytozome/v9.0/Ptrichocarpa/assembly/ > > this file is a single FASTA file that contains all the chromosome > sequences so you need to split it first in order to have one FASTA > file per chromosome. See the splitbigfasta.R script in > BSgenome/inst/extdata/GentlemanLab/BSgenome.Btaurus.UCSC.bosTau6-tools > for how to do this (you will need to adapt the script to your > situation). I realize this splitting step is a burden and I have on > my list to improve forgeBSgenomeDataPkg() so it won't be needed > anymore. > > - Finally make sure the Chr01.fa, Chr02.fa, Chr03.fa, etc... files > are in the folder specified in the seqs_srcdir field > (/raid1/home/pi/viningk/Poplar/genomedb in your case). > > Let me know if that still doesn't work for you. > > Cheers, > H. > > > On 12/10/2013 12:56 PM, Vining, Kelly wrote: >> Hi Herve, >> I'm almost there with my BSgenome package, but am still getting an error. It's clearly due to my seed file format, but I am not sure what to change at this point. It appears that BSgenome is not seeing all of my fasta files as separate. I tried not having the .fa suffixes in the seqnames field, but got a similar "files not found" error. >> >> Thanks again for your help. >> >> Error: >>> forgeBSgenomeDataPkg("BSgenome.Ptrichocarpa.Phytozome.v3-seed") >> Creating package in ./BSgenome.Ptrichocarpa.Phytozome.v3 >> Error in getSeqSrcpaths(seqnames, prefix = prefix, suffix = suffix, seqs_srcdir = seqs_srcdir) : >> /raid1/home/pi/viningk/Poplar/genomedb/Chr01.fa Chr02.fa Chr03.fa >> Chr04.fa Chr05.fa Chr06.fa Chr07.fa Chr08.fa Chr09.fa Chr10.fa >> Chr11.fa Chr12.fa Chr13.fa Chr14.fa Chr15.fa Chr16.fa Chr17.fa >> Chr18.fa Chr19.fa.fa: file(s) not found >> >> My seed file: >> >> Package: BSgenome.Ptrichocarpa.Phytozome.v3 >> Title: Populus trichocarpa full genome (Phytozome version 3) >> Description: Populus trichocarpa (Black cottonwood) genome as provided >> by Phytozome (v3, 2013) stored in Bioconductor >> Version: 3.0 >> organism: Populus trichocarpa >> species: Black cottonwood >> provider: Phytozome (JGI) >> provider_version: 3.0 >> release_date: January 2010 >> release_name: Populus trichocarpa v3.0 >> source_url: >> ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Ptrichocarpa/ >> organism_biocview: Populus_trichocarpa >> BSgenomeObjname: Ptrichocarpa >> seqnames: paste("Chr01.fa", "Chr02.fa", "Chr03.fa", "Chr04.fa", >> "Chr05.fa", "Chr06.fa", "Chr07.fa", "Chr08.fa", "Chr09.fa", >> "Chr10.fa", "Chr11.fa", "Chr12.fa",$ >> mseqnames: names(scaf_mseq) >> SrcDataFiles1: sequences: Chr01.fa, Chr02.fa, Chr03.fa, Chr04.fa, >> Chr05.fa, Chr06.fa, Chr07.fa, Chr08.fa, Chr09.fa, Chr10.fa, Chr11.fa, >> Chr12.fa, Chr13.fa, Chr$ >> PkgExamples: genome$Chr01 # same as genome[["Chr01"]] >> seqs_srcdir: /raid1/home/pi/viningk/Poplar/genomedb >> >> >> ________________________________________ >> From: Hervé Pagès [hpages at fhcrc.org] >> Sent: Monday, December 09, 2013 2:09 PM >> To: Vining, Kelly >> Cc: bioconductor at r-project.org >> Subject: Re: BioStrings for current R: RE: [BioC] BSgenomeForge seed >> file - seqnames field >> >> Hi Kelly, >> >> On 12/06/2013 12:23 PM, Vining, Kelly wrote: >>> Hi Herve, >>> It has been a while since I have worked with Biostrings and the MEDIPS package. When I went to install Biostrings just now, got an error message that Biostrings isn't available: >>> >>>> install.packages("Biostrings") >>> Installing package into ?/raid1/home/pi/viningk/R? (as ?lib? is >>> unspecified) >>> --- Please select a CRAN mirror for use in this session --- >> >> Here you are asked to choose a CRAN mirror but Biostrings is not a >> CRAN package. This suggests that you might be on the wrong path. >> >> >>> Warning message: >>> package ?Biostrings? is not available (for R version 3.0.0) >>> >>> Revisiting the online documentation, it does look like the package is still available. >>> Is there an available update, or a different alternative I should be using? >> >> The only proper way to install a Bioconductor package is to use >> biocLite(), as documented here: >> >> http://bioconductor.org/install/ >> >> Unlike install.packages(), biocLite() "knows" where to find >> Bioconductor packages. Also make sure you're using the latest BioC >> release (BioC 2.13). >> >> Cheers, >> H. >> >>> >>> Thanks, >>> --Kelly Vining >>> ________________________________________ >>> From: Hervé Pagès [hpages at fhcrc.org] >>> Sent: Thursday, May 02, 2013 10:49 AM >>> To: Vining, Kelly >>> Cc: Kelly V [guest]; bioconductor at r-project.org; MEDIPS Maintainer >>> Subject: Re: [BioC] BSgenomeForge seed file - seqnames field >>> >>> Hi Kelly, >>> >>> On 05/02/2013 06:44 AM, Vining, Kelly wrote: >>>> Hi Herve, >>>> Thanks for your helpful response, and for pointing me to the proper scripts. Given this, it appears that I should treat my gene feature annotation files (promoters, genes) as mseqnames objects, similarly to how I will treat the extrachromosomal scaffolds. Is there any way to include additional features that I may find of interest after I forge this initial package (say, using gff files), or am I limited such that I would have to re-forge a new reference? >>> >>> A BSgenome data package is for storing the sequences of a given >>> genome/assembly. No annotations should go there. >>> >>> In Bioconductor we try to maintain a clear separation between >>> packages that contain the sequences of a genome/assembly (BSgenome >>> data packages), and packages that contain annotations for that >>> genome/assembly (TxDb, FDb, OrganismDb, SNPlocs, etc... packages). >>> >>> There are at least 2 reasons for this: >>> >>> 1. For a given assembly there are many kinds of annotations available >>> in many places on the internet, and some of them tend to be updated >>> frequently. By contrast, the sequences of a given assembly >>> never change. So putting only the reference sequences in a >>> BSgenome package make it very stable. It almost never needs to >>> be updated, except for updating a man page or when something >>> changes in the way sequences are stored in the package. This is >>> good with such big packages (800 Mb for the biggest ones). >>> >>> 2. Storing the DNA sequences of the genes, promoters, or other genomic >>> feature in a BSgenome package would duplicate a lot of DNA >>> sequences that are already in the reference sequences (those >>> small sequences are substrings of the reference sequences). >>> This would make the BSgenome package unnecessarily bigger. >>> It's better to store only the locations of those genomic features, >>> not in the BSgenome package, but in a separate package, and then >>> to use those locations to extract the corresponding sequences from >>> the BSgenome object. This extraction can be done with getSeq() >>> (defined in the BSgenome software package) and is relatively >>> cheap. This approach gives you a lot more flexibility than having >>> those sequences pre-extracted and bundled with the BSgenome data >>> package. >>> >>> If your annotations are in GFF files, you don't really need to >>> put them in a package: you can load them directly in GRanges >>> objects with import() (defined in the rtracklayer package), >>> or if you are particularly interested in the exon/transcript >>> structure of your genes, you could use makeTranscriptDbFromGFF() >>> to load a GFF file containing genes, mRNAs, exons, and CDSs into >>> a TranscriptDb. See ?makeTranscriptDbFromGFF in the >>> GenomicFeatures package for more information. Then it's easy to >>> extract the locations of transcripts, exons, or CDSs as a GRanges >>> or GRangesList object from this TranscriptDb object. See >>> ?transcripts and ?exonsBy. Finally once you have the locations >>> in a GRanges object, you can use getSeq() to extract the >>> corresponding sequences from the BSgenome object. >>> >>> Other more specialized functions are promoters() and >>> getPromoterSeq() for extracting the promoter locations and >>> their sequences, respectively. Also extractTranscriptsFromGenome() >>> for extracting the full transcriptome from a BSgenome package. >>> >>> HTH, >>> H. >>> >>> PS: The upstream sequences that you see in some of our BSgenome data >>> packages are a relic of the past and will be removed soon. >>> >>>> >>>> Thanks much, >>>> --Kelly V. >>>> >>>> -----Original Message----- >>>> From: Hervé Pagès [mailto:hpages at fhcrc.org] >>>> Sent: Tuesday, April 30, 2013 11:12 PM >>>> To: Kelly V [guest] >>>> Cc: bioconductor at r-project.org; Vining, Kelly; MEDIPS Maintainer >>>> Subject: Re: [BioC] BSgenomeForge seed file - seqnames field >>>> >>>> Hi Kelly, >>>> >>>> On 04/30/2013 03:52 PM, Kelly V [guest] wrote: >>>>> >>>>> I'm preparing a custom reference genome for use with the MEDIPS package. I see that one field of the seed file, which is apparently not optional, is the 'seqnames' field. The example given in the documentation is this: >>>>> >>>>> paste("chr", c(1:20, "X", "M", "Un", paste(c(1:20, "X", "Un"), >>>>> "_random", sep="")), sep="") >>>>> >>>>> I have two simple questions about this. >>>>> >>>>> 1. Does R match this information with the source sequence file? For example, if I have a single fasta file with fasta headers chr_01...chr_20, must the seqnames entries exactly match those headers? >>>> >>>> No. You need to provide 1 FASTA file per single sequence, that is, 1 file per name you put in the 'seqnames' field. That means that each file is expected to contain only 1 sequence. What's in the FASTA header of each file is not important. What's important is that the name of each file be of the form <prefix><seqname><suffix>, where <seqname> is the name of the sequence as it appears in the 'seqnames' field, and <prefix> and <suffix> are a prefix and a suffix (eventually empty) that are the same for all the files. >>>> >>>> If what you have is a big FASTA file containing all the chromosome sequences, then you first need to split it into 1 file per chromosome. >>>> This is easy to do in R. For example, here is the script I used to split the big bosTau6.fa file provided by UCSC for the bosTau6 genome: >>>> >>>> library(Biostrings) >>>> bosTau6 <- readDNAStringSet("bosTau6.fa") >>>> >>>> ### Partitioning: >>>> is_chrUn <- grepl("^chrUn", names(bosTau6)) >>>> is_chrom <- !is_chrUn >>>> >>>> ### Send each chromosome to a FASTA file. >>>> seqnames <- paste("chr", c(1:29, "X", "M"), sep="") >>>> stopifnot(setequal(seqnames, names(bosTau6)[is_chrom])) >>>> >>>> for (seqname in seqnames) { >>>> seq <- bosTau6[match(seqname, names(bosTau6))] >>>> filename <- paste(seqname, ".fa", sep="") >>>> cat("writing ", filename, "\n", sep="") >>>> writeXStringSet(seq, file=filename, width=50L) >>>> } >>>> >>>> ### Send the 3286 chrUn_* sequences to 1 FASTA file. >>>> chrUn_mseq <- bosTau6[is_chrUn] >>>> writeXStringSet(chrUn_mseq, file="chrUn.fa", width=50L) >>>> >>>> The input is the bosTau6.fa file containing 3317 FASTA records: >>>> 31 records for the chromosomes, and 3286 for the chrUn_* sequences. >>>> The script produces 32 FASTA files: 1 per chromosome (chr1.fa, chr2.fa, chr3.fa, ..., chr29.fa, chrX.fa, chrM.fa), and the chrUn.fa file (containing the 3286 chrUn_* sequences). >>>> >>>> Note that you can find this script in the BSgenome package, and display its source with: >>>> >>>> > splitbigfasta_R <- system.file("extdata", >>>> "GentlemanLab", >>>> "BSgenome.Btaurus.UCSC.bosTau6-tools", >>>> "splitbigfasta.R", >>>> package="BSgenome") >>>> >>>> > cat(readLines(splitbigfasta_R), sep="\n") >>>> >>>> It should not be too hard to adapt this script to your own needs. >>>> >>>>> >>>>> 2. Revealing the reason for my first question:In my genome fasta file, I have 1427 extrachromosomal scaffolds, but they are not all sequentially numbered, so that I have scaffold_1..scaffold_3681. Do I need to use a regular expression in my seqnames field to tell R to look for scaffold_ followed by 1-4 digits? >>>> >>>> No, not in the 'seqnames' field, because those 1427 extrachromosomal scaffolds should not go there. They would need to go in the 'mseqnames' >>>> field. >>>> >>>> Unlike the 'seqnames' field where you enumerate objects that can >>>> only contain 1 sequence, in the 'mseqnames' field you can enumerate >>>> objects that contain more than 1 sequence. More precisely, in the >>>> final BSgenome data package, each entry in the 'seqnames' field will >>>> correspond to a DNAString object (the DNAString container can hold >>>> 1 sequence only), and each entry in the 'mseqnames' field will correspond to a DNAStringSet object (the DNAStringSet container can hold multiple sequences). >>>> >>>> So typically, all the extrachromosomal scaffolds would go in one DNAStringSet object in the final BSgenome package (this is for example what is done in the BSgenome.Drerio.UCSC.danRer7 package). >>>> To achieve this, you need to put 1 entry in the 'mseqnames' field (e.g. "scaffolds"), and to put the 1427 extrachromosomal scaffolds in one FASTA file named accordingly to that entry (e.g. scaffolds.fa). >>>> >>>> In the above script, replace: >>>> >>>> is_chrUn <- grepl("^chrUn", names(bosTau6)) >>>> >>>> with: >>>> >>>> is_scaffold <- grepl("^scaffold", names(<your_genome>)) >>>> >>>> then every occurrence of 'is_chrUn' with 'is_scaffold', and finally those 3 lines: >>>> >>>> ### Send the 3286 chrUn_* sequences to 1 FASTA file. >>>> chrUn_mseq <- bosTau6[is_chrUn] >>>> writeXStringSet(chrUn_mseq, file="chrUn.fa", width=50L) >>>> >>>> with: >>>> >>>> ### Send the 1427 scaffold_* sequences to 1 FASTA file. >>>> scaffolds_mseq <- <your_genome>[is_scaffold] >>>> writeXStringSet(scaffolds_mseq, file="scaffolds.fa", >>>> width=50L) >>>> >>>> and that should take care of sending all the scaffold sequences to the scaffolds.fa file. >>>> >>>> Let me know if you need further assistance with this. >>>> >>>> Cheers, >>>> H. >>>> >>>>> >>>>> Thanks for any help, >>>>> --Kelly V. >>>>> >>>>> -- output of sessionInfo(): >>>>> >>>>> R version 3.0.0 (2013-04-03) >>>>> Platform: i386-w64-mingw32/i386 (32-bit) >>>>> >>>>> locale: >>>>> [1] LC_COLLATE=English_United States.1252 [2] >>>>> LC_CTYPE=English_United >>>>> States.1252 [3] LC_MONETARY=English_United States.1252 [4] >>>>> LC_NUMERIC=C [5] LC_TIME=English_United States.1252 >>>>> >>>>> attached base packages: >>>>> [1] stats graphics grDevices utils datasets methods base >>>>> >>>>> -- >>>>> Sent via the guest posting facility at bioconductor.org. >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>> >>>> -- >>>> Hervé Pagès >>>> >>>> Program in Computational Biology >>>> Division of Public Health Sciences >>>> Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Ave. N, M1-B514 >>>> P.O. Box 19024 >>>> Seattle, WA 98109-1024 >>>> >>>> E-mail: hpages at fhcrc.org >>>> Phone: (206) 667-5791 >>>> Fax: (206) 667-1319 >>>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages at fhcrc.org >>> Phone: (206) 667-5791 >>> Fax: (206) 667-1319 >>> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

Annotation GO Cancer BSgenome OrganismDb SNPlocs TranscriptDb Biostrings BSgenome MEDIPS • 1.7k views

ADD COMMENT • link updated 11.3 years ago by Matthias Lienhard ▴ 280 • written 11.3 years ago by Hervé Pagès 16k

0

Entering edit mode

Matthias Lienhard ▴ 280

@matthias-lienhard-6292

Last seen 15 months ago

Max Planck Institute for molecular Gene…

Hi Kelly, yes, the MEDIPS extend parameter is supposed to be the size of the fragments that have been IP'd. Best, Matthias Am 16.12.2013 15:03, schrieb Hervé Pagès: > Hi Kelly, > > I'm afraid you're asking the wrong person, sorry. The best place to ask > this kind of question is on the mailing list (cc'ed) where there is a > good chance that someone will be able to help you. > > Cheers, > H. > > > On 12/16/2013 02:55 PM, Vining, Kelly wrote: >> Hi Herve, >> I have a quick question about the 'extend' vs. 'shift' parameters in >> the MEDIPS package. From the documentation: >> >> All reads will be extended to a length of 300nt according to the >> given strandinformation: >>> extend=300 >> As an alternative to the extend parameter, the >> shift parameter can be specifed. Here, the reads are not extended but >> shifted by the specifed number >> of nucleotides with respect to the given strand infomation. One of >> the two parameters extend or >> shift has to be 0. >> >> Is this parameter supposed to represent the IP'd fragment size? >> >> Thanks, >> --Kelly >> >> -----Original Message----- >> From: Vining, Kelly >> Sent: Thursday, December 12, 2013 6:51 AM >> To: Hervé Pagès >> Subject: Success: RE: Seed file error: RE: BioStrings for current R: >> RE: [BioC] BSgenomeForge seed file - seqnames field >> >> Hi Herve, >> I have now succeeded in forging the poplar BSgenome object. I will >> have to use it in order to confirm that it's exactly as I expect, but >> just wanted to let you know I got this far and thank you again for >> your help. >> >> Cheers, >> --Kelly >> ________________________________________ >> From: Hervé Pagès [hpages at fhcrc.org] >> Sent: Tuesday, December 10, 2013 2:38 PM >> To: Vining, Kelly >> Cc: bioconductor at r-project.org >> Subject: Re: Seed file error: RE: BioStrings for current R: RE: >> [BioC] BSgenomeForge seed file - seqnames field >> >> Hi Kelly, >> >> Hard to tell without knowing exactly what you've downloaded but the >> 1st thing that strikes me when I look at your seed file below is >> that, depending on where I look exactly, the Populus trichocarpa >> genome seems to have either 12 chromosomes (based on your seqnames >> field), or 13 chromosomes (based on your SrcDataFiles1 field), or 19 >> chromosomes (based on the error reported by forgeBSgenomeDataPkg()). >> You need to figure out how many chromosomes there are and stick to that. >> >> A few other things: >> >> - No ".fa" extensions in the seqnames. >> >> - Make sure you have one .fa file per chromosome. Each file should be >> named Chr01.fa, Chr02.fa, Chr03.fa, etc... Note that if you've >> downloaded Ptrichocarpa_210.fa.gz from >> >> ftp://ftp.jgi- psf.org/pub/compgen/phytozome/v9.0/Ptrichocarpa/assembly/ >> >> this file is a single FASTA file that contains all the chromosome >> sequences so you need to split it first in order to have one FASTA >> file per chromosome. See the splitbigfasta.R script in >> BSgenome/inst/extdata/GentlemanLab/BSgenome.Btaurus.UCSC.bosTau6-tools >> for how to do this (you will need to adapt the script to your >> situation). I realize this splitting step is a burden and I have on >> my list to improve forgeBSgenomeDataPkg() so it won't be needed >> anymore. >> >> - Finally make sure the Chr01.fa, Chr02.fa, Chr03.fa, etc... files >> are in the folder specified in the seqs_srcdir field >> (/raid1/home/pi/viningk/Poplar/genomedb in your case). >> >> Let me know if that still doesn't work for you. >> >> Cheers, >> H. >> >> >> On 12/10/2013 12:56 PM, Vining, Kelly wrote: >>> Hi Herve, >>> I'm almost there with my BSgenome package, but am still getting an >>> error. It's clearly due to my seed file format, but I am not sure >>> what to change at this point. It appears that BSgenome is not seeing >>> all of my fasta files as separate. I tried not having the .fa >>> suffixes in the seqnames field, but got a similar "files not found" >>> error. >>> >>> Thanks again for your help. >>> >>> Error: >>>> forgeBSgenomeDataPkg("BSgenome.Ptrichocarpa.Phytozome.v3-seed") >>> Creating package in ./BSgenome.Ptrichocarpa.Phytozome.v3 >>> Error in getSeqSrcpaths(seqnames, prefix = prefix, suffix = suffix, >>> seqs_srcdir = seqs_srcdir) : >>> /raid1/home/pi/viningk/Poplar/genomedb/Chr01.fa Chr02.fa Chr03.fa >>> Chr04.fa Chr05.fa Chr06.fa Chr07.fa Chr08.fa Chr09.fa Chr10.fa >>> Chr11.fa Chr12.fa Chr13.fa Chr14.fa Chr15.fa Chr16.fa Chr17.fa >>> Chr18.fa Chr19.fa.fa: file(s) not found >>> >>> My seed file: >>> >>> Package: BSgenome.Ptrichocarpa.Phytozome.v3 >>> Title: Populus trichocarpa full genome (Phytozome version 3) >>> Description: Populus trichocarpa (Black cottonwood) genome as provided >>> by Phytozome (v3, 2013) stored in Bioconductor >>> Version: 3.0 >>> organism: Populus trichocarpa >>> species: Black cottonwood >>> provider: Phytozome (JGI) >>> provider_version: 3.0 >>> release_date: January 2010 >>> release_name: Populus trichocarpa v3.0 >>> source_url: >>> ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Ptrichocarpa/ >>> organism_biocview: Populus_trichocarpa >>> BSgenomeObjname: Ptrichocarpa >>> seqnames: paste("Chr01.fa", "Chr02.fa", "Chr03.fa", "Chr04.fa", >>> "Chr05.fa", "Chr06.fa", "Chr07.fa", "Chr08.fa", "Chr09.fa", >>> "Chr10.fa", "Chr11.fa", "Chr12.fa",$ >>> mseqnames: names(scaf_mseq) >>> SrcDataFiles1: sequences: Chr01.fa, Chr02.fa, Chr03.fa, Chr04.fa, >>> Chr05.fa, Chr06.fa, Chr07.fa, Chr08.fa, Chr09.fa, Chr10.fa, Chr11.fa, >>> Chr12.fa, Chr13.fa, Chr$ >>> PkgExamples: genome$Chr01 # same as genome[["Chr01"]] >>> seqs_srcdir: /raid1/home/pi/viningk/Poplar/genomedb >>> >>> >>> ________________________________________ >>> From: Hervé Pagès [hpages at fhcrc.org] >>> Sent: Monday, December 09, 2013 2:09 PM >>> To: Vining, Kelly >>> Cc: bioconductor at r-project.org >>> Subject: Re: BioStrings for current R: RE: [BioC] BSgenomeForge seed >>> file - seqnames field >>> >>> Hi Kelly, >>> >>> On 12/06/2013 12:23 PM, Vining, Kelly wrote: >>>> Hi Herve, >>>> It has been a while since I have worked with Biostrings and the >>>> MEDIPS package. When I went to install Biostrings just now, got an >>>> error message that Biostrings isn't available: >>>> >>>>> install.packages("Biostrings") >>>> Installing package into ?/raid1/home/pi/viningk/R? (as ?lib? is >>>> unspecified) >>>> --- Please select a CRAN mirror for use in this session --- >>> >>> Here you are asked to choose a CRAN mirror but Biostrings is not a >>> CRAN package. This suggests that you might be on the wrong path. >>> >>> >>>> Warning message: >>>> package ?Biostrings? is not available (for R version 3.0.0) >>>> >>>> Revisiting the online documentation, it does look like the package >>>> is still available. >>>> Is there an available update, or a different alternative I should >>>> be using? >>> >>> The only proper way to install a Bioconductor package is to use >>> biocLite(), as documented here: >>> >>> http://bioconductor.org/install/ >>> >>> Unlike install.packages(), biocLite() "knows" where to find >>> Bioconductor packages. Also make sure you're using the latest BioC >>> release (BioC 2.13). >>> >>> Cheers, >>> H. >>> >>>> >>>> Thanks, >>>> --Kelly Vining >>>> ________________________________________ >>>> From: Hervé Pagès [hpages at fhcrc.org] >>>> Sent: Thursday, May 02, 2013 10:49 AM >>>> To: Vining, Kelly >>>> Cc: Kelly V [guest]; bioconductor at r-project.org; MEDIPS Maintainer >>>> Subject: Re: [BioC] BSgenomeForge seed file - seqnames field >>>> >>>> Hi Kelly, >>>> >>>> On 05/02/2013 06:44 AM, Vining, Kelly wrote: >>>>> Hi Herve, >>>>> Thanks for your helpful response, and for pointing me to the >>>>> proper scripts. Given this, it appears that I should treat my gene >>>>> feature annotation files (promoters, genes) as mseqnames objects, >>>>> similarly to how I will treat the extrachromosomal scaffolds. Is >>>>> there any way to include additional features that I may find of >>>>> interest after I forge this initial package (say, using gff >>>>> files), or am I limited such that I would have to re-forge a new >>>>> reference? >>>> >>>> A BSgenome data package is for storing the sequences of a given >>>> genome/assembly. No annotations should go there. >>>> >>>> In Bioconductor we try to maintain a clear separation between >>>> packages that contain the sequences of a genome/assembly (BSgenome >>>> data packages), and packages that contain annotations for that >>>> genome/assembly (TxDb, FDb, OrganismDb, SNPlocs, etc... packages). >>>> >>>> There are at least 2 reasons for this: >>>> >>>> 1. For a given assembly there are many kinds of annotations >>>> available >>>> in many places on the internet, and some of them tend to >>>> be updated >>>> frequently. By contrast, the sequences of a given assembly >>>> never change. So putting only the reference sequences in a >>>> BSgenome package make it very stable. It almost never >>>> needs to >>>> be updated, except for updating a man page or when something >>>> changes in the way sequences are stored in the package. >>>> This is >>>> good with such big packages (800 Mb for the biggest ones). >>>> >>>> 2. Storing the DNA sequences of the genes, promoters, or >>>> other genomic >>>> feature in a BSgenome package would duplicate a lot of DNA >>>> sequences that are already in the reference sequences (those >>>> small sequences are substrings of the reference sequences). >>>> This would make the BSgenome package unnecessarily bigger. >>>> It's better to store only the locations of those genomic >>>> features, >>>> not in the BSgenome package, but in a separate package, >>>> and then >>>> to use those locations to extract the corresponding >>>> sequences from >>>> the BSgenome object. This extraction can be done with >>>> getSeq() >>>> (defined in the BSgenome software package) and is relatively >>>> cheap. This approach gives you a lot more flexibility than >>>> having >>>> those sequences pre-extracted and bundled with the >>>> BSgenome data >>>> package. >>>> >>>> If your annotations are in GFF files, you don't really >>>> need to >>>> put them in a package: you can load them directly in GRanges >>>> objects with import() (defined in the rtracklayer package), >>>> or if you are particularly interested in the exon/transcript >>>> structure of your genes, you could use >>>> makeTranscriptDbFromGFF() >>>> to load a GFF file containing genes, mRNAs, exons, and >>>> CDSs into >>>> a TranscriptDb. See ?makeTranscriptDbFromGFF in the >>>> GenomicFeatures package for more information. Then it's >>>> easy to >>>> extract the locations of transcripts, exons, or CDSs as a >>>> GRanges >>>> or GRangesList object from this TranscriptDb object. See >>>> ?transcripts and ?exonsBy. Finally once you have the >>>> locations >>>> in a GRanges object, you can use getSeq() to extract the >>>> corresponding sequences from the BSgenome object. >>>> >>>> Other more specialized functions are promoters() and >>>> getPromoterSeq() for extracting the promoter locations and >>>> their sequences, respectively. Also >>>> extractTranscriptsFromGenome() >>>> for extracting the full transcriptome from a BSgenome >>>> package. >>>> >>>> HTH, >>>> H. >>>> >>>> PS: The upstream sequences that you see in some of our BSgenome data >>>> packages are a relic of the past and will be removed soon. >>>> >>>>> >>>>> Thanks much, >>>>> --Kelly V. >>>>> >>>>> -----Original Message----- >>>>> From: Hervé Pagès [mailto:hpages at fhcrc.org] >>>>> Sent: Tuesday, April 30, 2013 11:12 PM >>>>> To: Kelly V [guest] >>>>> Cc: bioconductor at r-project.org; Vining, Kelly; MEDIPS Maintainer >>>>> Subject: Re: [BioC] BSgenomeForge seed file - seqnames field >>>>> >>>>> Hi Kelly, >>>>> >>>>> On 04/30/2013 03:52 PM, Kelly V [guest] wrote: >>>>>> >>>>>> I'm preparing a custom reference genome for use with the MEDIPS >>>>>> package. I see that one field of the seed file, which is >>>>>> apparently not optional, is the 'seqnames' field. The example >>>>>> given in the documentation is this: >>>>>> >>>>>> paste("chr", c(1:20, "X", "M", "Un", paste(c(1:20, "X", "Un"), >>>>>> "_random", sep="")), sep="") >>>>>> >>>>>> I have two simple questions about this. >>>>>> >>>>>> 1. Does R match this information with the source sequence file? >>>>>> For example, if I have a single fasta file with fasta headers >>>>>> chr_01...chr_20, must the seqnames entries exactly match those >>>>>> headers? >>>>> >>>>> No. You need to provide 1 FASTA file per single sequence, that is, >>>>> 1 file per name you put in the 'seqnames' field. That means that >>>>> each file is expected to contain only 1 sequence. What's in the >>>>> FASTA header of each file is not important. What's important is >>>>> that the name of each file be of the form >>>>> <prefix><seqname><suffix>, where <seqname> is the name of the >>>>> sequence as it appears in the 'seqnames' field, and <prefix> and >>>>> <suffix> are a prefix and a suffix (eventually empty) that are the >>>>> same for all the files. >>>>> >>>>> If what you have is a big FASTA file containing all the chromosome >>>>> sequences, then you first need to split it into 1 file per >>>>> chromosome. >>>>> This is easy to do in R. For example, here is the script I used to >>>>> split the big bosTau6.fa file provided by UCSC for the bosTau6 >>>>> genome: >>>>> >>>>> library(Biostrings) >>>>> bosTau6 <- readDNAStringSet("bosTau6.fa") >>>>> >>>>> ### Partitioning: >>>>> is_chrUn <- grepl("^chrUn", names(bosTau6)) >>>>> is_chrom <- !is_chrUn >>>>> >>>>> ### Send each chromosome to a FASTA file. >>>>> seqnames <- paste("chr", c(1:29, "X", "M"), sep="") >>>>> stopifnot(setequal(seqnames, names(bosTau6)[is_chrom])) >>>>> >>>>> for (seqname in seqnames) { >>>>> seq <- bosTau6[match(seqname, names(bosTau6))] >>>>> filename <- paste(seqname, ".fa", sep="") >>>>> cat("writing ", filename, "\n", sep="") >>>>> writeXStringSet(seq, file=filename, width=50L) >>>>> } >>>>> >>>>> ### Send the 3286 chrUn_* sequences to 1 FASTA file. >>>>> chrUn_mseq <- bosTau6[is_chrUn] >>>>> writeXStringSet(chrUn_mseq, file="chrUn.fa", width=50L) >>>>> >>>>> The input is the bosTau6.fa file containing 3317 FASTA records: >>>>> 31 records for the chromosomes, and 3286 for the chrUn_* sequences. >>>>> The script produces 32 FASTA files: 1 per chromosome (chr1.fa, >>>>> chr2.fa, chr3.fa, ..., chr29.fa, chrX.fa, chrM.fa), and the >>>>> chrUn.fa file (containing the 3286 chrUn_* sequences). >>>>> >>>>> Note that you can find this script in the BSgenome package, and >>>>> display its source with: >>>>> >>>>> > splitbigfasta_R <- system.file("extdata", >>>>> "GentlemanLab", >>>>> "BSgenome.Btaurus.UCSC.bosTau6-tools", >>>>> "splitbigfasta.R", >>>>> package="BSgenome") >>>>> >>>>> > cat(readLines(splitbigfasta_R), sep="\n") >>>>> >>>>> It should not be too hard to adapt this script to your own needs. >>>>> >>>>>> >>>>>> 2. Revealing the reason for my first question:In my genome fasta >>>>>> file, I have 1427 extrachromosomal scaffolds, but they are not >>>>>> all sequentially numbered, so that I have >>>>>> scaffold_1..scaffold_3681. Do I need to use a regular expression >>>>>> in my seqnames field to tell R to look for scaffold_ followed by >>>>>> 1-4 digits? >>>>> >>>>> No, not in the 'seqnames' field, because those 1427 >>>>> extrachromosomal scaffolds should not go there. They would need to >>>>> go in the 'mseqnames' >>>>> field. >>>>> >>>>> Unlike the 'seqnames' field where you enumerate objects that can >>>>> only contain 1 sequence, in the 'mseqnames' field you can enumerate >>>>> objects that contain more than 1 sequence. More precisely, in the >>>>> final BSgenome data package, each entry in the 'seqnames' field will >>>>> correspond to a DNAString object (the DNAString container can hold >>>>> 1 sequence only), and each entry in the 'mseqnames' field will >>>>> correspond to a DNAStringSet object (the DNAStringSet container >>>>> can hold multiple sequences). >>>>> >>>>> So typically, all the extrachromosomal scaffolds would go in one >>>>> DNAStringSet object in the final BSgenome package (this is for >>>>> example what is done in the BSgenome.Drerio.UCSC.danRer7 package). >>>>> To achieve this, you need to put 1 entry in the 'mseqnames' field >>>>> (e.g. "scaffolds"), and to put the 1427 extrachromosomal scaffolds >>>>> in one FASTA file named accordingly to that entry (e.g. >>>>> scaffolds.fa). >>>>> >>>>> In the above script, replace: >>>>> >>>>> is_chrUn <- grepl("^chrUn", names(bosTau6)) >>>>> >>>>> with: >>>>> >>>>> is_scaffold <- grepl("^scaffold", names(<your_genome>)) >>>>> >>>>> then every occurrence of 'is_chrUn' with 'is_scaffold', and >>>>> finally those 3 lines: >>>>> >>>>> ### Send the 3286 chrUn_* sequences to 1 FASTA file. >>>>> chrUn_mseq <- bosTau6[is_chrUn] >>>>> writeXStringSet(chrUn_mseq, file="chrUn.fa", width=50L) >>>>> >>>>> with: >>>>> >>>>> ### Send the 1427 scaffold_* sequences to 1 FASTA file. >>>>> scaffolds_mseq <- <your_genome>[is_scaffold] >>>>> writeXStringSet(scaffolds_mseq, file="scaffolds.fa", >>>>> width=50L) >>>>> >>>>> and that should take care of sending all the scaffold sequences to >>>>> the scaffolds.fa file. >>>>> >>>>> Let me know if you need further assistance with this. >>>>> >>>>> Cheers, >>>>> H. >>>>> >>>>>> >>>>>> Thanks for any help, >>>>>> --Kelly V. >>>>>> >>>>>> -- output of sessionInfo(): >>>>>> >>>>>> R version 3.0.0 (2013-04-03) >>>>>> Platform: i386-w64-mingw32/i386 (32-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_COLLATE=English_United States.1252 [2] >>>>>> LC_CTYPE=English_United >>>>>> States.1252 [3] LC_MONETARY=English_United States.1252 [4] >>>>>> LC_NUMERIC=C [5] LC_TIME=English_United States.1252 >>>>>> >>>>>> attached base packages: >>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>> >>>>>> -- >>>>>> Sent via the guest posting facility at bioconductor.org. >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>> >>>>> -- >>>>> Hervé Pagès >>>>> >>>>> Program in Computational Biology >>>>> Division of Public Health Sciences >>>>> Fred Hutchinson Cancer Research Center >>>>> 1100 Fairview Ave. N, M1-B514 >>>>> P.O. Box 19024 >>>>> Seattle, WA 98109-1024 >>>>> >>>>> E-mail: hpages at fhcrc.org >>>>> Phone: (206) 667-5791 >>>>> Fax: (206) 667-1319 >>>>> >>>> >>>> -- >>>> Hervé Pagès >>>> >>>> Program in Computational Biology >>>> Division of Public Health Sciences >>>> Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Ave. N, M1-B514 >>>> P.O. Box 19024 >>>> Seattle, WA 98109-1024 >>>> >>>> E-mail: hpages at fhcrc.org >>>> Phone: (206) 667-5791 >>>> Fax: (206) 667-1319 >>>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M1-B514 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages at fhcrc.org >>> Phone: (206) 667-5791 >>> Fax: (206) 667-1319 >>> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >

ADD COMMENT • link 11.3 years ago Matthias Lienhard ▴ 280

Login before adding your answer.