Forging a BSGenome
1
0
Entering edit mode
A.J. ▴ 20
@aj-24333
Last seen 2.6 years ago
United States

I am trying to forge a genome from the microtus ochrogaster genome, with this command

forgeBSgenomeDataPkg("/mnt/Prairie_Vole_Data/Arjen_Folder/Arjen/genomefiles_mo/NCBI/BSgenome.Mochrogaster.NCBI.micOch1.0-seed")

but keep getting errors, either this one:

Error in .make_Seqinfo_from_genome(genome) : 
  "MicOch1.0" is not a registered NCBI assembly or UCSC genome (use registered_NCBI_assemblies() or registered_UCSC_genomes() to list the
  NCBI or UCSC assemblies/genomes currently registered in the GenomeInfoDb package)
In addition: Warning messages:
1: In forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir,  :
  field 'provider_version' is deprecated in favor of 'genome'
2: In forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir,  :
  field 'release_name' is deprecated

Or this one:

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'expr' in selecting a method for function 'eval': <text>:1:3055: unexpected input
1: W_004949236.1","NW_004949237.1","NW_004949238.1","NW_004949239.1","NW_004949240.1","NW_004949241.1","NW_004949242.1","NW_004949243.1","NW_004949244.1","NW_004949245.1","NW_004949246.1","NW_004

In addition: Warning messages:
1: In forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir,  :
  field 'provider_version' is deprecated in favor of 'genome'
2: In forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir,  :
  field 'release_name' is deprecated

The second error I can fix by simply removing the seqnames field from the seedfile, but then I get the first error.

Here is my seed file:

Package: BSgenome.Mochrogaster.NCBI.MicOch1 

Title: Full genome sequences for Michrotus Ochogaster (NCBI version GCA_000317375.1_MicOch1.0) 

Description: Full genome sequences for Michrotus Ochogaster (NCBI version GCF_000317375.1_MicOch1.0) and stored in Biostrings objects. 
Version: 1.0

organism: Microtus Ochrogaster

common_name: Prairie Vole

provider: NCBI

provider_version: MicOch1.0

release_date: July. 2012

release_name: GCA_000317375.1

source_url: https://www.ncbi.nlm.nih.gov/genome/?term=Microtus+ochrogaster

organism_biocview: Mic_Och

BSgenomeObjname: Microtus

seqnames: c("NC_022009.1","NC_022010.1","NC_022011.1","NC_022012.1","NC_022013.1","NW_004949095.1","NC_022014.1","NC_022015.1","NC_022016.1","NW_004949096.1","NW_004949097.1","NW_004949098.1","NC_022017.1","NC_022018.1","NC_022019.1","NC_022020.1","NC_022021.1","NC_022022.1","NC_022023.1","NC_022024.1","NC_022025.1","NC_022026.1","NC_022027.1","NC_022028.1","NC_022029.1","NC_022030.1","NC_022031.1","NC_022032.1","NC_022033.1","NC_022034.1","NC_022035.1","NC_022036.1","NW_004949099.1","NW_004949100.1","NW_004949101.1","NW_004949102.1","NW_004949103.1","NW_004949104.1","NW_004949105.1","NW_004949106.1","NW_004949107.1","NW_004949108.1","NW_004949109.1","NW_004949110.1","NW_004949111.1","NW_004949112.1","NW_004949113.1","NW_004949114.1","NW_004949115.1","NW_004949116.1","NW_004949117.1","NW_004949118.1","NW_004949119.1","NW_004949120.1","NW_004949121.1","NW_004949122.1","NW_004949123.1","NW_004949124.1","NW_004949125.1","NW_004949126.1","NW_004949127.1","NW_004949128.1","NW_004949129.1","NW_004949130.1","NW_004949131.1","NW_004949132.1","NW_004949133.1","NW_004949134.1","NW_004949135.1","NW_004949136.1","NW_004949137.1","NW_004949138.1","NW_004949139.1","NW_004949140.1","NW_004949141.1","NW_004949142.1","NW_004949143.1","NW_004949144.1","NW_004949145.1","NW_004949146.1","NW_004949147.1","NW_004949148.1","NW_004949149.1","NW_004949150.1","NW_004949151.1","NW_004949152.1","NW_004949153.1","NW_004949154.1","NW_004949155.1","NW_004949156.1","NW_004949157.1","NW_004949158.1","NW_004949159.1","NW_004949160.1","NW_004949161.1","NW_004949162.1","NW_004949163.1","NW_004949164.1","NW_004949165.1","NW_004949166.1","NW_004949167.1","NW_004949168.1","NW_004949169.1","NW_004949170.1","NW_004949171.1","NW_004949172.1","NW_004949173.1","NW_004949174.1","NW_004949175.1","NW_004949176.1","NW_004949177.1","NW_004949178.1","NW_004949179.1","NW_004949180.1","NW_004949181.1","NW_004949182.1","NW_004949183.1","NW_004949184.1","NW_004949185.1","NW_004949186.1","NW_004949187.1","NW_004949188.1","NW_004949189.1","NW_004949190.1","NW_004949191.1","NW_004949192.1","NW_004949193.1","NW_004949194.1","NW_004949195.1","NW_004949196.1","NW_004949197.1","NW_004949198.1","NW_004949199.1","NW_004949200.1","NW_004949201.1","NW_004949202.1","NW_004949203.1","NW_004949204.1","NW_004949205.1","NW_004949206.1","NW_004949207.1","NW_004949208.1","NW_004949209.1","NW_004949210.1","NW_004949211.1","NW_004949212.1","NW_004949213.1","NW_004949214.1","NW_004949215.1","NW_004949216.1","NW_004949217.1","NW_004949218.1","NW_004949219.1","NW_004949220.1","NW_004949221.1","NW_004949222.1","NW_004949223.1","NW_004949224.1","NW_004949225.1","NW_004949226.1","NW_004949227.1","NW_004949228.1","NW_004949229.1","NW_004949230.1","NW_004949231.1","NW_004949232.1","NW_004949233.1","NW_004949234.1","NW_004949235.1","NW_004949236.1","NW_004949237.1","NW_004949238.1","NW_004949239.1","NW_004949240.1","NW_004949241.1","NW_004949242.1","NW_004949243.1","NW_004949244.1","NW_004949245.1","NW_004949246.1","NW_004949247.1","NW_004949248.1","NW_004949249.1","NW_004949250.1","NW_004949251.1","NW_004949252.1","NW_004949253.1","NW_004949254.1","NW_004949255.1","NW_004949256.1","NW_004949257.1","NW_004949258.1","NW_004949259.1","NW_004949260.1","NW_004949261.1","NW_004949262.1","NW_004949263.1","NW_004949264.1","NW_004949266.1","NW_004949267.1","NW_004949268.1","NW_004949269.1","NW_004949270.1","NW_004949271.1","NW_004949272.1","NW_004949273.1","NW_004949274.1","NW_004949275.1","NW_004949276.1","NW_004949278.1","NW_004949279.1","NW_004949280.1","NW_004949282.1","NW_004949283.1","NW_004949284.1","NW_004949285.1","NW_004949286.1","NW_004949287.1","NW_004949288.1","NW_004949289.1","NW_004949290.1","NW_004949292.1","NW_004949293.1","NW_004949294.1","NW_004949296.1","NW_004949297.1","NW_004949298.1","NW_004949299.1","NW_004949300.1","NW_004949302.1","NW_004949303.1","NW_004949304.1","NW_004949305.1","NW_004949306.1","NW_004949307.1","NW_004949308.1","NW_004949309.1","NW_004949310.1","NW_004949311.1","NW_004949313.1","NW_004949315.1","NW_004949317.1","NW_004949318.1","NW_004949319.1","NW_004949320.1","NW_004949321.1","NW_004949323.1","NW_004949328.1","NW_004949329.1","NW_004949332.1","NW_004949333.1","NW_004949335.1","NW_004949336.1","NW_004949337.1","NW_004949339.1","NW_004949341.1","NW_004949346.1","NW_004949347.1","NW_004949348.1","NW_004949351.1","NW_004949352.1","NW_004949354.1","NW_004949358.1","NW_004949360.1","NW_004949364.1","NW_004949366.1","NW_004949367.1","NW_004949368.1","NW_004949369.1","NW_004949371.1","NW_004949373.1","NW_004949374.1","NW_004949375.1","NW_004949376.1","NW_004949378.1","NW_004949379.1","NW_004949384.1","NW_004949390.1","NW_004949392.1","NW_004949394.1","NW_004949398.1","NW_004949408.1","NW_004949410.1","NW_004949412.1","NW_004949418.1","NW_004949420.1","NW_004949427.1","NW_004949430.1","NW_004949431.1","NW_004949433.1","NW_004949450.1","NW_004949456.1","NW_004949457.1","NW_004949458.1","NW_004949459.1","NW_004949461.1","NW_004949466.1","NW_004949480.1","NW_004949485.1","NW_004949507.1","NW_004949511.1","NW_004949513.1","NW_004949518.1","NW_004949519.1","NW_004949522.1","NW_004949526.1","NW_004949531.1","NW_004949539.1","NW_004949549.1","NW_004949556.1","NW_004949574.1","NW_004949575.1","NW_004949582.1","NW_004949592.1","NW_004949593.1","NW_004949595.1","NW_004949603.1","NW_004949612.1","NW_004949630.1","NW_004949634.1","NW_004949637.1","NW_004949649.1","NW_004949651.1","NW_004949657.1","NW_004949668.1","NW_004949678.1","NW_004949680.1","NW_004949683.1","NW_004949694.1","NW_004949695.1","NW_004949703.1","NW_004949704.1","NW_004949717.1","NW_004949723.1","NW_004949728.1","NW_004949743.1","NW_004949744.1","NW_004949753.1","NW_004949755.1","NW_004949767.1","NW_004949773.1","NW_004949774.1","NW_004949776.1","NW_004949779.1","NW_004949780.1","NW_004949799.1","NW_004949803.1","NW_004949810.1","NW_004949816.1","NW_004949817.1","NW_004949818.1","NW_004949820.1","NW_004949824.1","NW_004949825.1","NW_004949826.1","NW_004949843.1","NW_004949845.1","NW_004949879.1","NW_004949881.1","NW_004949882.1","NW_004949885.1","NW_004949900.1","NW_004949905.1","NW_004949906.1","NW_004949908.1","NW_004949911.1","NW_004949912.1","NW_004949917.1","NW_004949918.1","NW_004949935.1","NW_004949936.1","NW_004949937.1","NW_004949946.1","NW_004949955.1","NW_004949956.1","NW_004949965.1","NW_004949972.1","NW_004949973.1","NW_004949977.1","NW_004949979.1","NW_004949981.1","NW_004949987.1","NW_004949992.1","NW_004950005.1","NW_004950012.1","NW_004950013.1","NW_004950014.1","NW_004950021.1","NW_004950030.1","NW_004950033.1","NW_004950065.1","NW_004950072.1","NW_004950085.1","NW_004950099.1","NW_004950102.1","NW_004950122.1","NW_004950143.1","NW_004950144.1","NW_004950176.1","NW_004950189.1","NW_004950195.1","NW_004950196.1","NW_004950260.1","NW_004950261.1","NW_004950262.1","NW_004950269.1","NW_004950271.1","NW_004950275.1","NW_004950291.1","NW_004950316.1","NW_004950326.1","NW_004950331.1","NW_004950353.1","NW_004950365.1","NW_004950370.1","NW_004950374.1","NW_004950419.1","NW_004950427.1","NW_004950441.1","NW_004950453.1","NW_004950455.1","NW_004950470.1","NW_004950491.1","NW_004950525.1","NW_004950542.1","NW_004950545.1","NW_004950568.1","NW_004950600.1","NW_004950606.1","NW_004950618.1","NW_004950619.1","NW_004950647.1","NW_004950648.1","NW_004950659.1","NW_004950671.1","NW_004950696.1","NW_004950709.1","NW_004950736.1","NW_004950738.1","NW_004950746.1","NW_004950749.1","NW_004950765.1","NW_004950766.1","NW_004950775.1","NW_004950788.1","NW_004950797.1","NW_004950841.1","NW_004950869.1","NW_004950873.1","NW_004950965.1","NW_004950996.1","NW_004951001.1","NW_004951014.1","NW_004951028.1","NW_004951061.1","NW_004951102.1","NW_004951123.1","NW_004951132.1","NW_004951148.1","NW_004951180.1","NW_004951203.1","NW_004951219.1","NW_004951266.1","NW_004951285.1","NW_004951365.1","NW_004951373.1","NW_004951466.1","NW_004951519.1","NW_004951529.1","NW_004951550.1","NW_004951572.1","NW_004951574.1","NW_004951581.1","NW_004951649.1","NW_004951673.1","NW_004951690.1","NW_004951693.1","NW_004951745.1","NW_004951762.1","NW_004951773.1","NW_004951784.1","NW_004951801.1","NW_004951838.1","NW_004951841.1","NW_004951842.1","NW_004951872.1","NW_004951930.1","NW_004951943.1","NW_004951950.1","NW_004951953.1","NW_004951961.1","NW_004952002.1","NW_004952026.1","NW_004952033.1","NW_004952065.1","NW_004952136.1","NW_004952137.1","NW_004952167.1","NW_004952179.1","NW_004952197.1","NW_004952245.1","NW_004952262.1","NW_004952349.1","NW_004952388.1","NW_004952408.1","NW_004952412.1","NW_004952429.1","NW_004952441.1","NW_004952476.1","NW_004952486.1","NW_004952574.1","NW_004952577.1","NW_004952620.1","NW_004952622.1","NW_004952703.1","NW_004952715.1","NW_004952782.1","NW_004952808.1","NW_004952836.1","NW_004952843.1","NW_004952891.1","NW_004952905.1","NW_004952913.1","NW_004952920.1","NW_004952951.1","NW_004952978.1","NW_004952990.1","NW_004953022.1","NW_004953030.1","NW_004953120.1","NW_004953124.1","NW_004953134.1","NW_004953273.1","NW_004953294.1","NW_004953296.1","NW_004953399.1","NW_004953483.1","NW_004953557.1","NW_004953569.1","NW_004953578.1","NW_004953634.1","NW_004953662.1","NW_004953690.1","NW_004953710.1","NW_004953725.1","NW_004953736.1","NW_004953821.1","NW_004953830.1","NW_004953850.1","NW_004953883.1","NW_004953897.1","NW_004953965.1","NW_004953966.1","NW_004953979.1","NW_004954025.1","NW_004954060.1","NW_004954061.1","NW_004954076.1","NW_004954332.1","NW_004954408.1","NW_004954461.1","NW_004954481.1","NW_004954484.1","NW_004954530.1","NW_004954564.1","NW_004954645.1","NW_004954648.1","NW_004954736.1","NW_004954739.1","NW_004954934.1","NW_004955311.1")

seqs_srcdir: /mnt/Prairie_Vole_Data/Arjen_Folder/Arjen/genomefiles_mo/NCBI/2bit

seqfile_name: PV_bit_out.2bit

If somebody can help me out, that'd be awesome!

BSgenome microtus forging • 2.6k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 13 hours ago
United States

You should take a look at the BSgenomeForge vignette. In particular 2.2.4, the 'other fields', under which it explains that you only need the seqnames entry if you are supplying bare sequence (fasta) sequences. Which you are not (you are supplying a .2bit file).

Probably the most relevant example you could use is the first one, for ferret:

Package: BSgenome.Mfuro.UCSC.musFur1
Title: Full genome sequences for Mustela putorius furo (UCSC version musFur1)
Description: Full genome sequences for Mustela putorius furo (Ferret) as provided by UCSC (musFur1, Apr. 2011) and stored in Biostrings objects.
Version: 1.4.2
organism: Mustela putorius furo
common_name: Ferret
provider: UCSC
provider_version: musFur1
release_date: Apr. 2011
release_name: Ferret Genome Sequencing Consortium MusPutFur1.0
source_url: http://hgdownload.soe.ucsc.edu/goldenPath/musFur1/bigZips/
organism_biocview: Mustela_furo
BSgenomeObjname: Mfuro
SrcDataFiles: musFur1.2bit from http://hgdownload.soe.ucsc.edu/goldenPath/musFur1/bigZips/
PkgExamples: genome$GL896898  # same as genome[["GL896898"]]
seqs_srcdir: /fh/fast/morgan_m/BioC/BSgenomeForge/srcdata/BSgenome.Mfuro.UCSC.musFur1/seqs
seqfile_name: musFur1.2bit

But do note that provider_version should now be 'genome', 'release_name' has been deprecated.

Anyway, I always find it easier to get an example file, a listing of which you can get by doing

dir(paste0(path.package("BSgenome"), "/extdata/GentlemanLab"))
ADD COMMENT
2
Entering edit mode

Also NCBI typically provides FASTA files with ugly sequence names (e.g. GenBank or RefSeq accessions) so forging a BSgenome data package for an NCBI assembly typically requires a preliminary step. This step takes the NCBI FASTA file (GCA or GCF) and turns it into a clean 2bit file. For the GCA_000317375.1 assembly (MicOch1.0), this can be done with:

### Download GCA_000317375.1_MicOch1.0_genomic.fna.gz from
### https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/317/375/GCA_000317375.1_MicOch1.0/

library(Biostrings)   
dna <- readDNAStringSet("GCA_000317375.1_MicOch1.0_genomic.fna.gz")

### Check the seqnames.
current_GenBankAccn <- unlist(heads(strsplit(names(dna), " ", fixed=TRUE), n=1L))
library(GenomeInfoDb)
chrominfo <- getChromInfoFromNCBI("GCA_000317375.1")
expected_GenBankAccn <- chrominfo[ , "GenBankAccn"]
stopifnot(setequal(expected_GenBankAccn, current_GenBankAccn))

### Reorder the sequences.
dna <- dna[match(expected_GenBankAccn, current_GenBankAccn)]

### Rename the sequences.
names(dna) <- chrominfo[ , "SequenceName"]

### Export as 2bit.
library(rtracklayer)
export.2bit(dna, "MicOch1.0.sorted.2bit")

Then use the following seed file (copied from BSgenome.Amellifera.NCBI.AmelHAv3.1-seed located in the BSgenome package, and edited for MicOch1.0):

Package: BSgenome.Mochrogaster.NCBI.MicOch1.0
Title: Full genome sequences for Microtus ochrogaster (MicOch1.0)
Description: Full genome sequences for Microtus ochrogaster as provided by NCBI (assembly MicOch1.0, assembly accession GCA_000317375.1) and stored in Biostrings objects.
Version: 1.5.0
organism: Microtus ochrogaster
common_name: Prairie vole
genome: MicOch1.0
provider: NCBI
release_date: 2012/12/07
source_url: https://www.ncbi.nlm.nih.gov/assembly/GCA_000317375.1
organism_biocview: Microtus_ochrogaster
BSgenomeObjname: Mochrogaster
circ_seqs: character(0)
SrcDataFiles: GCA_000317375.1_MicOch1.0_genomic.fna.gz from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/317/375/GCA_000317375.1_MicOch1.0/
PkgExamples: genome[["chr1"]]
seqs_srcdir: /path/to/folder/where/you/have/the/2bit/file
seqfile_name: MicOch1.0.sorted.2bit

FWIW I used the above file to forge my own BSgenome.Mochrogaster.NCBI.MicOch1.0 package:

library(BSgenome.Mochrogaster.NCBI.MicOch1.0)
BSgenome.Mochrogaster.NCBI.MicOch1.0
# Prairie vole genome:
## organism: Microtus ochrogaster (Prairie vole)
## genome: MicOch1.0
## provider: NCBI
## release date: 2012/12/07
## 6335 sequences:
##   chr1           chr2           chr4           chr5           chr6          
##   chr7           chr8           chr10          chr15          chr16         
##   chr17          chr18          chr19          chr21          chr22         
##   chr24          chr26          chrX           LG1            LG2           
##   LG3            LG4            LG5            LG7_11         LG8           
##   ...            ...            ...            ...            ...           
##   UNK6284        UNK6285        UNK6286        UNK6287        UNK6288       
##   UNK6289        UNK6290        UNK6291        UNK6292        UNK6293       
##   UNK6294        UNK6295        UNK6296        UNK6297        UNK6299       
##   UNK6300        UNK6301        UNK6302        UNK6303        UNK6304       
##   UNK6305        UNK6306        UNK6307        UNK6308        UNK6309       
## (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
## to access a given sequence)

BSgenome.Mochrogaster.NCBI.MicOch1.0$chr1
# 126672627-letter DNAString object
# seq: TCATTTGGGATACTATTTTTCCCTTCTATTTCAAAC...TACATTTGTCAGAGCCTCCGTTTATTAGAGATGGGG

seqinfo(BSgenome.Mochrogaster.NCBI.MicOch1.0)
# Seqinfo object with 6335 sequences from MicOch1.0 genome:
#   seqnames seqlengths isCircular    genome
#   chr1      126672627      FALSE MicOch1.0
#   chr2       98408828      FALSE MicOch1.0
#   chr4       92994323      FALSE MicOch1.0
#   chr5       96435083      FALSE MicOch1.0
#   chr6       88288230      FALSE MicOch1.0
#   ...             ...        ...       ...
#   UNK6305        1000      FALSE MicOch1.0
#   UNK6306        1000      FALSE MicOch1.0
#   UNK6307        1000      FALSE MicOch1.0
#   UNK6308        1000      FALSE MicOch1.0
#   UNK6309        1000      FALSE MicOch1.0

Hope this helps,

H.

ADD REPLY

Login before adding your answer.

Traffic: 945 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6