create a full TXDB object from GFF3 data for a non-system organism / denovo assembly
1
0
Entering edit mode
@stephaneplaisance-11262
Last seen 5.0 years ago

I am trying to create a full TXDB object including Seqinfo() from a GFF3 and metadata.

REM: The example genome below is present in public DBs but NOT my genome of interest only available as GFF and fasta (denovo assembly).

I could load the TxDB from the GFF but do not succeed to add Seqinfo (and later sequence from the genome fasta)

I found posts explaining the use of chrominfo (https://github.com/lawremi/rtracklayer/issues/13) but in my hands it fails although I made sure my chromosome vector was of type character

Any help would be very welcome as well as for attaching the FASTA genome sequence to this database for future sequence extractions (not found in doc)

Best Stephane

# relates to unsolved https://support.bioconductor.org/p/104198/
# download example gff3 file from
# ftp://ftp.ensemblgenomes.org/pub/plants/release-30/gff3/arabidopsis_lyrata/Arabidopsis_lyrata.v.1.0.30.chr.gff3.gz

library('GenomicFeatures')

# create missing info table from ensemble info
chr.info <- data.frame(chrom = as.character(c(1:8)), 
                       length = as.integer(c(33132434, 19320645, 24464544, 23328337, 21221946, 25113588, 24649068, 22951293)), 
                       is_circular = as.vector(rep(FALSE, 8)), 
                       stringsAsFactors = FALSE)
# load GFF and info
gr <- makeTxDbFromGFF("Arabidopsis_lyrata.v.1.0.30.chr.gff3.gz", 
                      format="gff3", 
                      chrominfo = chr.info,
                      dataSource="gtf file for Arabidopsis lyrata",
                      organism="Arabidopsis lyrata")

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK

gr

TxDb object:
  # Db type: TxDb
  # Supporting package: GenomicFeatures
  # Data source: gtf file for Arabidopsis lyrata
  # Organism: Arabidopsis lyrata
  # Taxonomy ID: 59689
  # miRBase build ID: NA
  # Genome: NA
  # transcript_nrow: 31478
  # exon_nrow: 170022
  # cds_nrow: 154686
  # Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2019-05-02 12:31:31 +0200 (Thu, 02 May 2019)
# GenomicFeatures version at creation time: 1.34.8
# RSQLite version at creation time: 2.1.1
# DBSCHEMAVERSION: 1.2

so far so good - but then !

Seqinfo(gr)

Error in .normargSeqlevels(seqnames) : supplied 'seqlevels' must be a character vector

> strchr.info)
'data.frame':   8 obs. of  3 variables:
 $ chrom      : chr  "1" "2" "3" "4" ...
 $ length     : int  33132434 19320645 24464544 23328337 21221946 25113588 24649068 22951293
 $ is_circular: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

but they ARE characters!

not good here

I also create a Seqinfo object (good in itself) that I could not combine with the gff3 import command

seq.info <- Seqinfo(seqnames = as.vector(as.character(c(1:8))), 
                    seqlengths = c(33132434, 19320645, 24464544, 23328337, 21221946, 25113588, 24649068, 22951293), 
                    isCircular = as.vector(rep(FALSE, 8)), 
                    genome = "Arabidopsis_lyrata")

seq.info

Seqinfo object with 8 sequences from Arabidopsis_lyrata genome:
  seqnames seqlengths isCircular             genome
  1          33132434      FALSE Arabidopsis_lyrata
  2          19320645      FALSE Arabidopsis_lyrata
  3          24464544      FALSE Arabidopsis_lyrata
  4          23328337      FALSE Arabidopsis_lyrata
  5          21221946      FALSE Arabidopsis_lyrata
  6          25113588      FALSE Arabidopsis_lyrata
  7          24649068      FALSE Arabidopsis_lyrata
  8          22951293      FALSE Arabidopsis_lyrata

gr2 <- makeTxDbFromGFF("Arabidopsis_lyrata.v.1.0.30.chr.gff3.gz", 
                      format="gff3", 
                      chrominfo = seq.info,
                      dataSource="gtf file for Arabidopsis lyrata",
                      organism="Arabidopsis lyrata")

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK

Seqinfo(gr2)
Error in .normargSeqlevels(seqnames) : 
  supplied 'seqlevels' must be a character vector

My session info in case you find something there!

> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rtracklayer_1.42.2     GenomicFeatures_1.34.8 AnnotationDbi_1.44.0   Biobase_2.42.0         GenomicRanges_1.34.0   GenomeInfoDb_1.18.2    IRanges_2.16.0         S4Vectors_0.20.1      
 [9] BiocGenerics_0.28.0    BiocManager_1.30.4    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1                  compiler_3.5.3              XVector_0.22.0              prettyunits_1.0.2           bitops_1.0-6                tools_3.5.3                
 [7] zlibbioc_1.28.0             progress_1.2.0              biomaRt_2.38.0              digest_0.6.18               bit_1.1-14                  lattice_0.20-38            
[13] RSQLite_2.1.1               memoise_1.1.0               pkgconfig_2.0.2             rlang_0.3.4                 Matrix_1.2-17               DelayedArray_0.8.0         
[19] DBI_1.0.0                   rstudioapi_0.10             GenomeInfoDbData_1.2.0      stringr_1.4.0               httr_1.4.0                  Biostrings_2.50.2          
[25] hms_0.4.2                   grid_3.5.3                  bit64_0.9-7                 R6_2.4.0                    BiocParallel_1.16.6         XML_3.98-1.19              
[31] blob_1.1.1                  magrittr_1.5                matrixStats_0.54.0          GenomicAlignments_1.18.1    Rsamtools_1.34.1            SummarizedExperiment_1.12.0
[37] assertthat_0.2.1            stringi_1.4.3               RCurl_1.95-4.12             crayon_1.3.4
GenomicFeatures rtracklayer • 1.1k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 15 hours ago
Seattle, WA, United States

Bonjour Stéphane,

When you do Seqinfo(gr) you're calling the wrong function. You need to do seqinfo(gr) instead.

Seqinfo() is the constructor function for Seqinfo objects while seqinfo() is a getter. You're not trying to construct a Seqinfo object out of GRanges object gr, you're trying to get the Seqinfo object stored in it.

Seqinfo() has the following arguments:

> args(Seqinfo)
function (seqnames = NULL, seqlengths = NA, isCircular = NA,  genome = NA) 
NULL

The first argument must be a character vector (see ?Seqinfo for the details). It can't be a GRanges object, hence the error you get.

seqinfo() has only 1 argument:

> args(seqinfo)
function (x) 
NULL

x is the object to extract the Seqinfo from. It can be a GRanges object, or a GRangesList, or a GAlignments, or a SummarizedExperiment, or any object that has a Seqinfo component.

Note that there is also a seqinfo() setter for modifying the Seqinfo component of an object but it's a little bit tricky to use and is rarely needed. People should preferably use the seqlevelsStyle() setter instead if they only need to rename the seqlevels of an object.

Hope this helps,

H.

ADD COMMENT

Login before adding your answer.

Traffic: 727 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6