Search
Question: problem creating txdb object and bsgenome for solanum pennellii
0
gravatar for sekawaiwai2006
22 months ago by
sekawaiwai200620 wrote:

Hi everyone,

so i am trying to extract the 1000bp upstream, for every gene in solanum pennellii, i know i would need to build the txdb object and the bsgenome for this, firstly i downloaded the fasta format genome and the gff file from ncbi genome database

http://www.ncbi.nlm.nih.gov/genome/?term=solanum+pennellii

after that , i used this perl script to split the genome of s.pennellii into chromosomes and build the bsgenome from these 12 chromosomes.

#!/usr/bin/perl

$f = $ARGV[0]; #get the file name

open (INFILE, "<$f")
or die "Can't open: $f $!";

while (<INFILE>) {
$line = $_;
chomp $line;
if ($line =~ /\>/) { #if has fasta >
close OUTFILE;
$new_file = substr($line,1);
$new_file .= ".fa";
open (OUTFILE, ">$new_file")
or die "Can't open: $new_file $!";
}
print OUTFILE "$line\n";
}
close OUTFILE;

__________________________________________

then i created the seed file and forge the bsgenome package as the instructions. there are actually no errors when checking the built package. 

Package:BSgenome.Slycopersicum.ensembl.Heinz1706
Title:Full genome sequences for Solanum lycopersicum
Description:Full genome sequences for Solanmum lycopersicum provided by ensembl
Version:1.0
organism:Solanum lycopersicum
common_name:Tomato
provider:ensembl
provider_version:Heinz1706
release_date:Jan.2016
release_name:S.lyco Genome Sequencing
source_url:ftp://ftp.ensemblgenomes.org/pub/plants/release-30/fasta/solanum_lycopersicum/dna/
organism_biocview: Solanum_lycopersicum
BSgenomeObjname: Slycopersicum
seqnames: paste("chr", c(1:12), sep="")
circ_seqs: NULL
mseqnames: NULL
SrcDataFiles:ftp://ftp.ensemblgenomes.org/pub/plants/release-30/fasta/solanum_lycopersicum/dna/ PkgExamples: genome$GL896898 # same as genome[["GL896898"]] 
seqs_srcdir: ~/Desktop/S.lyco

_________________________

Unfortunately, when i typed the code, the test return something weird , also the seqinfo returns an error too

library("BSgenome.Spennellii.ncbi.SPENNV200")
test<-library("BSgenome.Spennellii.ncbi.SPENNV200")
> test
 [1] "BSgenome.Athaliana.TAIR.TAIR9"      
 [2] "TxDb.Athaliana.BioMart.plantsmart28"
 [3] "GenomicFeatures"                    
 [4] "AnnotationDbi"                      
 [5] "Biobase"                            
 [6] "BSgenome.Spennellii.ncbi.SPENNV200" 
 [7] "BSgenome"                           
 [8] "rtracklayer"                        
 [9] "Biostrings"                         
[10] "XVector"                            
[11] "GenomicRanges"                      
[12] "GenomeInfoDb"                       
[13] "IRanges"                            
[14] "S4Vectors"                          
[15] "stats4"                             
[16] "BiocGenerics"                       
[17] "parallel"                           
[18] "stats"                              
[19] "graphics"                           
[20] "grDevices"                          
[21] "utils"                              
[22] "datasets"                           
[23] "methods"                            
[24] "base" 

> seqinfo(test)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘seqinfo’ for signature ‘"character"’

_______________________

for the txdb object , the code i used was as below and it looks wrong to me as well.

 

gtffile <- file.path("~/Desktop/s.pen/gff/GCF_001406875.1_SPENNV200_genomic.gff")
makeTxDbFromGFF(gtffile)
txdb <- makeTxDbFromGFF(gtffile)

seqinfo(txdb)
Seqinfo object with 12 sequences from an unspecified genome; no seqlengths:
  seqnames    seqlengths isCircular genome
  NC_028637.1       <NA>       <NA>   <NA>
  NC_028638.1       <NA>       <NA>   <NA>
  NC_028639.1       <NA>       <NA>   <NA>
  NC_028640.1       <NA>       <NA>   <NA>
  NC_028641.1       <NA>       <NA>   <NA>
  ...                ...        ...    ...
  NC_028644.1       <NA>       <NA>   <NA>
  NC_028645.1       <NA>       <NA>   <NA>
  NC_028646.1       <NA>       <NA>   <NA>
  NC_028647.1       <NA>       <NA>   <NA>
  NC_028648.1       <NA>       <NA>   <NA>

can anybody tell me may be i have done something wrong?? thank you very much.

> sessionInfo()

R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BSgenome.Athaliana.TAIR.TAIR9_1.3.1000   
 [2] TxDb.Athaliana.BioMart.plantsmart28_3.2.2
 [3] GenomicFeatures_1.22.12                  
 [4] AnnotationDbi_1.32.3                     
 [5] Biobase_2.30.0                           
 [6] BSgenome.Spennellii.ncbi.SPENNV200_1.0   
 [7] BSgenome_1.38.0                          
 [8] rtracklayer_1.30.1                       
 [9] Biostrings_2.38.3                        
[10] XVector_0.10.0                           
[11] GenomicRanges_1.22.4                     
[12] GenomeInfoDb_1.6.3                       
[13] IRanges_2.4.6                            
[14] S4Vectors_0.8.11                         
[15] BiocGenerics_0.16.1                      

loaded via a namespace (and not attached):
 [1] zlibbioc_1.16.0            GenomicAlignments_1.6.3   
 [3] BiocParallel_1.4.3         tools_3.2.3               
 [5] SummarizedExperiment_1.0.2 DBI_0.3.1                 
 [7] lambda.r_1.1.7             futile.logger_1.4.1       
 [9] futile.options_1.0.0       bitops_1.0-6              
[11] biomaRt_2.26.1             RCurl_1.95-4.7            
[13] RSQLite_1.0.0              Rsamtools_1.22.0          
[15] XML_3.98-1.3    

 

 

ADD COMMENTlink modified 22 months ago by Martin Morgan ♦♦ 21k • written 22 months ago by sekawaiwai200620

Instead of saying test <- library("BSgenome.Spennellii.ncbi.SPENNV200"), say

library("BSgenome.Spennellii.ncbi.SPENNV200"),
​test = BSgenome.Spennellii.ncbi.SPENNV200
ADD REPLYlink modified 22 months ago • written 22 months ago by Martin Morgan ♦♦ 21k
1
gravatar for Martin Morgan
22 months ago by
Martin Morgan ♦♦ 21k
United States
Martin Morgan ♦♦ 21k wrote:

An alternative approach is to load relevant libraries

library(GenomicFeatures); library(rtracklayer); library(BSgenome)

Make the TxDb from the GFF

txdb = makeTxDbFromGFF("GCF_001406875.1_SPENNV200_genomic.gff.gz")

Import the fasta file, and export as an (indexed) 2bit file, and reference the 2bit file

fa = FastaFile("GCF_001406875.1_SPENNV200_genomic.fna.gz")
export(import(fa), "GCF_001406875.1_SPENNV200_genomic.2bit")
twob = TwoBitFile("GCF_001406875.1_SPENNV200_genomic.2bit")

For rtracklayer versions before 1.31.4, the names in the fasta file need to be cleaned

dna = import(fa)
names(dna) = sapply(strsplit(names(dna), " "), "[[", 1)
export(dna, "GCF_001406875.1_SPENNV200_genomic.2bit")

You're then ready to go, e.g., 

> p = promoters(txdb)
> getSeq(twob, p)
  A DNAStringSet instance of length 43195
        width seq
    [1]  2200 AAATGCCAAAAAATTTTGTGGACGTCCGTTAA...CTCAAAACTTCAGGTAATTAATTATTATTAT
    [2]  2200 CCAAAGTAAAAATTGATTGAAAAATATTAAAA...GAACAAAAAGAATGAAGAAACGAAGTTACTA
    [3]  2200 CAAAGTAAAAATTGATTGAAAAATATTAAAAC...AACAAAAAGAATGAAGAAACGAAGTTACTAA
    [4]  2200 AAAGTAAAAATTGATTGAAAAATATTAAAACA...ACAAAAAGAATGAAGAAACGAAGTTACTAAG
    [5]  2200 AAAGTAAAAATTGATTGAAAAATATTAAAACA...ACAAAAAGAATGAAGAAACGAAGTTACTAAG
    ...   ... ...
[43191]  2200 TATGAGATCGTAAATAAGATAGGAAGAGGAGC...TATGATACACCTGTGCCAAGTTGGCGCATTT
[43192]  2200 GAAAGAGCTAGCTGAGTAGTGTACCCAATAAT...TTGATTTTCAAGCAAAAAACTTAGGAAAAAT
[43193]  2200 CGTTTGAATTCTTCAACAAGTTTTCAACAGCG...TTGATGGGACCTCGAAGAAGAGGGTCTGGTC
[43194]  2200 TTCCAATTAAGTTTAAAAATAAAACGTTGAAG...GCTTAAATGGGTAAGTCTTGAACCACTTTCC
[43195]  2200 AAGCAAAATTAATCAATAAAAAACAAAGCAGA...ACTTCATCTCAACTCACGAAGAGTGTCCACC

See also getPromoterSeq().

 

ADD COMMENTlink modified 22 months ago • written 22 months ago by Martin Morgan ♦♦ 21k

Hi Martin ,

thank you so much for your response, everything works except for the last step

> getSeq(twob,p)
Error in .local(con, format, text, ...) : UCSC library operation failed
In addition: Warning message:
In .local(con, format, text, ...) :
  NC_028637.1 is not in GCF_001406875.1_SPENNV200_genomic.2bit
ADD REPLYlink modified 22 months ago • written 22 months ago by sekawaiwai200620

Yes, sorry, I updated my answer to include instructions for telling rtracklayer (in the current release) that the file is a FastaFile, and cleaning the fasta file names when creating the 2bit file

fa = FastaFile("GCF_001406875.1_SPENNV200_genomic.fna.gz")
dna = import(fa)
names(dna) = sapply(strsplit(names(dna), " "), "[[", 1)
export(dna, "GCF_001406875.1_SPENNV200_genomic.2bit")
ADD REPLYlink modified 22 months ago • written 22 months ago by Martin Morgan ♦♦ 21k

Thank you so much, it works now!

ADD REPLYlink written 22 months ago by sekawaiwai200620

Thank you so much, it works now!

ADD REPLYlink written 22 months ago by sekawaiwai200620
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 151 users visited in the last hour