Question

Forge a BSgenome data package

gtho123 ▴ 40

My supervisor has requested that I create coverage plots to visualize BAM alignments of RNA-Seq data. I thought a good way to do this would be to use Gviz. We work on the model legume Medicago truncatula, which does not have a BSgenome package, so I thought I'd try to make one.

Following the vignette, I have placed all the chromosomes in their own FASTA files and gzipped them. I then created a seed file like so:

Package: BSgenome.Mtruncatula.JCVI.v4
Title: Full genome sequences for Medicago truncatula A17 (JCVI version 4)
Description: Full genome sequences for Medicago truncatula A17 (Barrell medic) as provided by JCVI (v4, 2014) and stored in Biostrings objects. See Tang et al. (2014) BMC Genomics 15:312
Version: 4.0
organism: Medicago truncatula A17
common_name: Barrell medic
provider: JCVI
provider_version: v4
release_date: 2014
release_name: Mt4.0
source_url: ftp://ftp.jcvi.org/pub/data/m_truncatula/Mt4.0/Assembly/JCVI.Medtr.v4.20130313.fasta
organism_biocview: Medicago_truncatula
BSgenomeObjname: Mtruncatula
seqs_srcdir: /home/gthomson/Documents/Scratch/Alignment_visualisation/Medtr4_0.tar.gz
seqnames: c(paste0("Medtr4_0_", "chr",1:8), paste0("Medtr4_0_", "scaffold",sprintf("%04d", 1:2179)))

However, when I run forgeBSgenomeDataPkg(), I get this error:

Creating package in ./BSgenome.Mtruncatula.JCVI.v4
Error in getSeqSrcpaths(seqname, prefix = prefix, suffix = suffix, seqs_srcdir = seqs_srcdir) :
  file(s) not found: /home/gthomson/Documents/Scratch/Alignment_visualisation/Medtr4_0.tar.gz/Medtr4_0_chr1.fa

This is weird because I can look at this folder and it is there:

How can I do this? Any easier methods to generate coverage plots are also welcome.

bsgenome biostrings Gviz • 1.2k views

updated 8.6 years ago by Hervé Pagès 16k • written 8.6 years ago by gtho123 ▴ 40

score 1 · Accepted Answer · 2015-09-30

Hi,

Medtr4_0.tar.gz doesn't look like a folder to me (try to cd into it at the command line; I doubt this will work). It looks more like a tarball, that is, a single file created with the tar command to bundle several files together. You can click on its icon to see its contents, but that still doesn't make it a folder. So forgeBSgenomeDataPkg() is right to complain that the file

/home/gthomson/Documents/Scratch/Alignment_visualisation/Medtr4_0.tar.gz/Medtr4_0_chr1.fa

doesn't exist. You can easily confirm this by trying to access this file at the command line with e.g.

file /home/gthomson/Documents/Scratch/Alignment_visualisation/Medtr4_0.tar.gz/Medtr4_0_chr1.fa

That should give you an error.
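If it helps, here is a minimal sketch of the same check, run in a throwaway directory so it doesn't touch your real data. The tarball built here is only a small stand-in with one made-up FASTA file, not your actual Medtr4_0.tar.gz.

```shell
# Self-contained demo in a throwaway directory; all paths here are hypothetical.
workdir=$(mktemp -d)
cd "$workdir"

# Build a small stand-in tarball containing one FASTA file.
mkdir src
printf '>Medtr4_0_chr1\nACGT\n' > src/Medtr4_0_chr1.fa
tar -czf Medtr4_0.tar.gz -C src Medtr4_0_chr1.fa

# A tarball is a regular file, not a directory.
if [ -d Medtr4_0.tar.gz ]; then
    kind="directory"
elif [ -f Medtr4_0.tar.gz ]; then
    kind="regular file"
fi
echo "Medtr4_0.tar.gz is a $kind"

# Listing the archive shows its members without extracting them.
members=$(tar -tzf Medtr4_0.tar.gz)
echo "$members"

# A path that goes "through" the tarball does not exist on disk.
if [ -e Medtr4_0.tar.gz/Medtr4_0_chr1.fa ]; then
    inside="exists"
else
    inside="missing"
fi
echo "Medtr4_0.tar.gz/Medtr4_0_chr1.fa is $inside"
```

The same `[ -d ... ]` and `tar -tzf` checks should work on the real tarball if you substitute its path.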

It seems that your genome has 2187 sequences (8 chromosomes plus 2179 scaffolds, going by the seqnames field in your seed file). forgeBSgenomeDataPkg() wants either 1 file per sequence or 1 single 2bit file containing all the sequences together. You said that you've placed all the chromosomes in their own FASTA files and gzipped them, so I guess that means you chose the former. Please read the BSgenomeForge vignette carefully if you want to pursue this. Note that one of the requirements is:

Some basic knowledge of the Unix/Linux command line is required. The commands that you will most likely need are: cd, mkdir, mv, rmdir, tar, gunzip, unzip, ftp and wget. Also you will need to create and edit some text files.
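As a small command-line exercise, the sequence count can be reproduced by expanding the seqnames pattern from the seed file in the shell. This is only a sketch that mirrors the R expression (paste0 plus sprintf("%04d", ...)); it does not read the seed file itself.

```shell
# Expand the seqnames pattern from the seed file into one name per line.
names=$(
    for i in $(seq 1 8); do
        printf 'Medtr4_0_chr%d\n' "$i"
    done
    for i in $(seq 1 2179); do
        printf 'Medtr4_0_scaffold%04d\n' "$i"
    done
)

# Count the names and show the first and last one.
count=$(printf '%s\n' "$names" | wc -l | tr -d ' ')
first=$(printf '%s\n' "$names" | head -n 1)
last=$(printf '%s\n' "$names" | tail -n 1)
echo "$count sequences, from $first to $last"
# prints "2187 sequences, from Medtr4_0_chr1 to Medtr4_0_scaffold2179"
```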

All the files must be placed in the same folder and seqs_srcdir must point to that folder. Maybe you just need to extract them from Medtr4_0.tar.gz and place them in the folder of your choice to achieve this. Depending on how you've named the files, you might also need to define seqfiles_prefix and/or seqfiles_suffix in your seed file. See the vignette for how to use these fields.
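Something along these lines might do it. This is a sketch on a made-up two-file tarball in a throwaway directory. It assumes each file is looked up as seqs_srcdir/prefix + seqname + suffix, which is what the path in your error message suggests (with an empty prefix and a ".fa" suffix). The folder name Medtr4_0 and the prefix and suffix variables are just illustrative choices.

```shell
# Throwaway demo; directory and file names are hypothetical stand-ins.
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in tarball with two of the expected FASTA files.
mkdir staging
printf '>Medtr4_0_chr1\nACGT\n' > staging/Medtr4_0_chr1.fa
printf '>Medtr4_0_chr2\nTTGA\n' > staging/Medtr4_0_chr2.fa
tar -czf Medtr4_0.tar.gz -C staging Medtr4_0_chr1.fa Medtr4_0_chr2.fa

# Extract into a real folder; seqs_srcdir would point here.
seqs_srcdir="$workdir/Medtr4_0"
mkdir "$seqs_srcdir"
tar -xzf Medtr4_0.tar.gz -C "$seqs_srcdir"

# Check <seqs_srcdir>/<prefix><seqname><suffix> for each expected name.
# (Assumed lookup rule; adjust prefix/suffix to match your file names.)
prefix=""
suffix=".fa"
missing=""
for name in Medtr4_0_chr1 Medtr4_0_chr2 Medtr4_0_chr3; do
    path="$seqs_srcdir/$prefix$name$suffix"
    if [ ! -f "$path" ]; then
        missing="$missing $name"
    fi
done
echo "missing:$missing"
```

Here Medtr4_0_chr3 is deliberately left out of the tarball, so it is the one name reported as missing. Running the same loop over your full list of sequence names before calling forgeBSgenomeDataPkg() should tell you whether seqs_srcdir, seqfiles_prefix and seqfiles_suffix line up with the files on disk.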

H.