problem creating txdb object and bsgenome for solanum pennellii
1
0
Entering edit mode
@sekawaiwai2006-9531
Last seen 8.3 years ago

Hi everyone,

so i am trying to extract the 1000bp upstream, for every gene in solanum pennellii, i know i would need to build the txdb object and the bsgenome for this, firstly i downloaded the fasta format genome and the gff file from ncbi genome database

http://www.ncbi.nlm.nih.gov/genome/?term=solanum+pennellii

after that , i used this perl script to split the genome of s.pennellii into chromosomes and build the bsgenome from these 12 chromosomes.

#!/usr/bin/perl

$f = $ARGV[0]; #get the file name

open (INFILE, "<$f")
or die "Can't open: $f $!";

while (<INFILE>) {
$line = $_;
chomp $line;
if ($line =~ /\>/) { #if has fasta >
close OUTFILE;
$new_file = substr($line,1);
$new_file .= ".fa";
open (OUTFILE, ">$new_file")
or die "Can't open: $new_file $!";
}
print OUTFILE "$line\n";
}
close OUTFILE;

__________________________________________

then i created the seed file and forge the bsgenome package as the instructions. there are actually no errors when checking the built package. 

Package:BSgenome.Slycopersicum.ensembl.Heinz1706
Title:Full genome sequences for Solanum lycopersicum
Description:Full genome sequences for Solanmum lycopersicum provided by ensembl
Version:1.0
organism:Solanum lycopersicum
common_name:Tomato
provider:ensembl
provider_version:Heinz1706
release_date:Jan.2016
release_name:S.lyco Genome Sequencing
source_url:ftp://ftp.ensemblgenomes.org/pub/plants/release-30/fasta/solanum_lycopersicum/dna/
organism_biocview: Solanum_lycopersicum
BSgenomeObjname: Slycopersicum
seqnames: paste("chr", c(1:12), sep="")
circ_seqs: NULL
mseqnames: NULL
SrcDataFiles:ftp://ftp.ensemblgenomes.org/pub/plants/release-30/fasta/solanum_lycopersicum/dna/ PkgExamples: genome$GL896898 # same as genome[["GL896898"]] 
seqs_srcdir: ~/Desktop/S.lyco

_________________________

Unfortunately, when i typed the code, the test return something weird , also the seqinfo returns an error too

library("BSgenome.Spennellii.ncbi.SPENNV200")
test<-library("BSgenome.Spennellii.ncbi.SPENNV200")
> test
 [1] "BSgenome.Athaliana.TAIR.TAIR9"      
 [2] "TxDb.Athaliana.BioMart.plantsmart28"
 [3] "GenomicFeatures"                    
 [4] "AnnotationDbi"                      
 [5] "Biobase"                            
 [6] "BSgenome.Spennellii.ncbi.SPENNV200" 
 [7] "BSgenome"                           
 [8] "rtracklayer"                        
 [9] "Biostrings"                         
[10] "XVector"                            
[11] "GenomicRanges"                      
[12] "GenomeInfoDb"                       
[13] "IRanges"                            
[14] "S4Vectors"                          
[15] "stats4"                             
[16] "BiocGenerics"                       
[17] "parallel"                           
[18] "stats"                              
[19] "graphics"                           
[20] "grDevices"                          
[21] "utils"                              
[22] "datasets"                           
[23] "methods"                            
[24] "base" 

> seqinfo(test)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘seqinfo’ for signature ‘"character"’

_______________________

for the txdb object , the code i used was as below and it looks wrong to me as well.

 

gtffile <- file.path("~/Desktop/s.pen/gff/GCF_001406875.1_SPENNV200_genomic.gff")
makeTxDbFromGFF(gtffile)
txdb <- makeTxDbFromGFF(gtffile)

seqinfo(txdb)
Seqinfo object with 12 sequences from an unspecified genome; no seqlengths:
  seqnames    seqlengths isCircular genome
  NC_028637.1       <NA>       <NA>   <NA>
  NC_028638.1       <NA>       <NA>   <NA>
  NC_028639.1       <NA>       <NA>   <NA>
  NC_028640.1       <NA>       <NA>   <NA>
  NC_028641.1       <NA>       <NA>   <NA>
  ...                ...        ...    ...
  NC_028644.1       <NA>       <NA>   <NA>
  NC_028645.1       <NA>       <NA>   <NA>
  NC_028646.1       <NA>       <NA>   <NA>
  NC_028647.1       <NA>       <NA>   <NA>
  NC_028648.1       <NA>       <NA>   <NA>

can anybody tell me may be i have done something wrong?? thank you very much.

> sessionInfo()

R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BSgenome.Athaliana.TAIR.TAIR9_1.3.1000   
 [2] TxDb.Athaliana.BioMart.plantsmart28_3.2.2
 [3] GenomicFeatures_1.22.12                  
 [4] AnnotationDbi_1.32.3                     
 [5] Biobase_2.30.0                           
 [6] BSgenome.Spennellii.ncbi.SPENNV200_1.0   
 [7] BSgenome_1.38.0                          
 [8] rtracklayer_1.30.1                       
 [9] Biostrings_2.38.3                        
[10] XVector_0.10.0                           
[11] GenomicRanges_1.22.4                     
[12] GenomeInfoDb_1.6.3                       
[13] IRanges_2.4.6                            
[14] S4Vectors_0.8.11                         
[15] BiocGenerics_0.16.1                      

loaded via a namespace (and not attached):
 [1] zlibbioc_1.16.0            GenomicAlignments_1.6.3   
 [3] BiocParallel_1.4.3         tools_3.2.3               
 [5] SummarizedExperiment_1.0.2 DBI_0.3.1                 
 [7] lambda.r_1.1.7             futile.logger_1.4.1       
 [9] futile.options_1.0.0       bitops_1.0-6              
[11] biomaRt_2.26.1             RCurl_1.95-4.7            
[13] RSQLite_1.0.0              Rsamtools_1.22.0          
[15] XML_3.98-1.3    

 

 

bsgenome txdb promoter • 2.1k views
ADD COMMENT
0
Entering edit mode

Instead of saying test <- library("BSgenome.Spennellii.ncbi.SPENNV200"), say

library("BSgenome.Spennellii.ncbi.SPENNV200"),
​test = BSgenome.Spennellii.ncbi.SPENNV200
ADD REPLY
1
Entering edit mode
@martin-morgan-1513
Last seen 4 months ago
United States

An alternative approach is to load relevant libraries

library(GenomicFeatures); library(rtracklayer); library(BSgenome)

Make the TxDb from the GFF

txdb = makeTxDbFromGFF("GCF_001406875.1_SPENNV200_genomic.gff.gz")

Import the fasta file, and export as an (indexed) 2bit file, and reference the 2bit file

fa = FastaFile("GCF_001406875.1_SPENNV200_genomic.fna.gz")
export(import(fa), "GCF_001406875.1_SPENNV200_genomic.2bit")
twob = TwoBitFile("GCF_001406875.1_SPENNV200_genomic.2bit")

For rtracklayer versions before 1.31.4, the names in the fasta file need to be cleaned

dna = import(fa)
names(dna) = sapply(strsplit(names(dna), " "), "[[", 1)
export(dna, "GCF_001406875.1_SPENNV200_genomic.2bit")

You're then ready to go, e.g., 

> p = promoters(txdb)
> getSeq(twob, p)
  A DNAStringSet instance of length 43195
        width seq
    [1]  2200 AAATGCCAAAAAATTTTGTGGACGTCCGTTAA...CTCAAAACTTCAGGTAATTAATTATTATTAT
    [2]  2200 CCAAAGTAAAAATTGATTGAAAAATATTAAAA...GAACAAAAAGAATGAAGAAACGAAGTTACTA
    [3]  2200 CAAAGTAAAAATTGATTGAAAAATATTAAAAC...AACAAAAAGAATGAAGAAACGAAGTTACTAA
    [4]  2200 AAAGTAAAAATTGATTGAAAAATATTAAAACA...ACAAAAAGAATGAAGAAACGAAGTTACTAAG
    [5]  2200 AAAGTAAAAATTGATTGAAAAATATTAAAACA...ACAAAAAGAATGAAGAAACGAAGTTACTAAG
    ...   ... ...
[43191]  2200 TATGAGATCGTAAATAAGATAGGAAGAGGAGC...TATGATACACCTGTGCCAAGTTGGCGCATTT
[43192]  2200 GAAAGAGCTAGCTGAGTAGTGTACCCAATAAT...TTGATTTTCAAGCAAAAAACTTAGGAAAAAT
[43193]  2200 CGTTTGAATTCTTCAACAAGTTTTCAACAGCG...TTGATGGGACCTCGAAGAAGAGGGTCTGGTC
[43194]  2200 TTCCAATTAAGTTTAAAAATAAAACGTTGAAG...GCTTAAATGGGTAAGTCTTGAACCACTTTCC
[43195]  2200 AAGCAAAATTAATCAATAAAAAACAAAGCAGA...ACTTCATCTCAACTCACGAAGAGTGTCCACC

See also getPromoterSeq().

 

ADD COMMENT
0
Entering edit mode

Hi Martin ,

thank you so much for your response, everything works except for the last step

> getSeq(twob,p)
Error in .local(con, format, text, ...) : UCSC library operation failed
In addition: Warning message:
In .local(con, format, text, ...) :
  NC_028637.1 is not in GCF_001406875.1_SPENNV200_genomic.2bit
ADD REPLY
0
Entering edit mode

Yes, sorry, I updated my answer to include instructions for telling rtracklayer (in the current release) that the file is a FastaFile, and cleaning the fasta file names when creating the 2bit file

fa = FastaFile("GCF_001406875.1_SPENNV200_genomic.fna.gz")
dna = import(fa)
names(dna) = sapply(strsplit(names(dna), " "), "[[", 1)
export(dna, "GCF_001406875.1_SPENNV200_genomic.2bit")
ADD REPLY
0
Entering edit mode

Thank you so much, it works now!

ADD REPLY
0
Entering edit mode

Thank you so much, it works now!

ADD REPLY

Login before adding your answer.

Traffic: 871 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6