I am trying to create a full TXDB object including Seqinfo() from a GFF3 and metadata.
REM: The example genome below is present in public DBs but NOT my genome of interest only available as GFF and fasta (denovo assembly).
I could load the TxDB from the GFF but do not succeed to add Seqinfo (and later sequence from the genome fasta)
I found posts explaining the use of chrominfo (https://github.com/lawremi/rtracklayer/issues/13) but in my hands it fails although I made sure my chromosome vector was of type character
Any help would be very welcome as well as for attaching the FASTA genome sequence to this database for future sequence extractions (not found in doc)
Best Stephane
# relates to unsolved https://support.bioconductor.org/p/104198/
# download example gff3 file from
# ftp://ftp.ensemblgenomes.org/pub/plants/release-30/gff3/arabidopsis_lyrata/Arabidopsis_lyrata.v.1.0.30.chr.gff3.gz
library('GenomicFeatures')
# create missing info table from ensemble info
chr.info <- data.frame(chrom = as.character(c(1:8)),
length = as.integer(c(33132434, 19320645, 24464544, 23328337, 21221946, 25113588, 24649068, 22951293)),
is_circular = as.vector(rep(FALSE, 8)),
stringsAsFactors = FALSE)
# load GFF and info
gr <- makeTxDbFromGFF("Arabidopsis_lyrata.v.1.0.30.chr.gff3.gz",
format="gff3",
chrominfo = chr.info,
dataSource="gtf file for Arabidopsis lyrata",
organism="Arabidopsis lyrata")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
gr
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: gtf file for Arabidopsis lyrata
# Organism: Arabidopsis lyrata
# Taxonomy ID: 59689
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 31478
# exon_nrow: 170022
# cds_nrow: 154686
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2019-05-02 12:31:31 +0200 (Thu, 02 May 2019)
# GenomicFeatures version at creation time: 1.34.8
# RSQLite version at creation time: 2.1.1
# DBSCHEMAVERSION: 1.2
so far so good - but then !
Seqinfo(gr)
Error in .normargSeqlevels(seqnames) : supplied 'seqlevels' must be a character vector
> strchr.info)
'data.frame': 8 obs. of 3 variables:
$ chrom : chr "1" "2" "3" "4" ...
$ length : int 33132434 19320645 24464544 23328337 21221946 25113588 24649068 22951293
$ is_circular: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
but they ARE characters!
not good here
I also create a Seqinfo object (good in itself) that I could not combine with the gff3 import command
seq.info <- Seqinfo(seqnames = as.vector(as.character(c(1:8))),
seqlengths = c(33132434, 19320645, 24464544, 23328337, 21221946, 25113588, 24649068, 22951293),
isCircular = as.vector(rep(FALSE, 8)),
genome = "Arabidopsis_lyrata")
seq.info
Seqinfo object with 8 sequences from Arabidopsis_lyrata genome:
seqnames seqlengths isCircular genome
1 33132434 FALSE Arabidopsis_lyrata
2 19320645 FALSE Arabidopsis_lyrata
3 24464544 FALSE Arabidopsis_lyrata
4 23328337 FALSE Arabidopsis_lyrata
5 21221946 FALSE Arabidopsis_lyrata
6 25113588 FALSE Arabidopsis_lyrata
7 24649068 FALSE Arabidopsis_lyrata
8 22951293 FALSE Arabidopsis_lyrata
gr2 <- makeTxDbFromGFF("Arabidopsis_lyrata.v.1.0.30.chr.gff3.gz",
format="gff3",
chrominfo = seq.info,
dataSource="gtf file for Arabidopsis lyrata",
organism="Arabidopsis lyrata")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Seqinfo(gr2)
Error in .normargSeqlevels(seqnames) :
supplied 'seqlevels' must be a character vector
My session info in case you find something there!
> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] rtracklayer_1.42.2 GenomicFeatures_1.34.8 AnnotationDbi_1.44.0 Biobase_2.42.0 GenomicRanges_1.34.0 GenomeInfoDb_1.18.2 IRanges_2.16.0 S4Vectors_0.20.1
[9] BiocGenerics_0.28.0 BiocManager_1.30.4
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 compiler_3.5.3 XVector_0.22.0 prettyunits_1.0.2 bitops_1.0-6 tools_3.5.3
[7] zlibbioc_1.28.0 progress_1.2.0 biomaRt_2.38.0 digest_0.6.18 bit_1.1-14 lattice_0.20-38
[13] RSQLite_2.1.1 memoise_1.1.0 pkgconfig_2.0.2 rlang_0.3.4 Matrix_1.2-17 DelayedArray_0.8.0
[19] DBI_1.0.0 rstudioapi_0.10 GenomeInfoDbData_1.2.0 stringr_1.4.0 httr_1.4.0 Biostrings_2.50.2
[25] hms_0.4.2 grid_3.5.3 bit64_0.9-7 R6_2.4.0 BiocParallel_1.16.6 XML_3.98-1.19
[31] blob_1.1.1 magrittr_1.5 matrixStats_0.54.0 GenomicAlignments_1.18.1 Rsamtools_1.34.1 SummarizedExperiment_1.12.0
[37] assertthat_0.2.1 stringi_1.4.3 RCurl_1.95-4.12 crayon_1.3.4