Dear Hervé or Marc,
I would like to create a TxDb object for the pig genome (Sus scrofa 11.1) from the Ensembl GTF annotation (ftp://ftp.ensembl.org/pub/release-96/gtf/sus_scrofa) that contains all the columns from the original GTF file.
> library(GenomicFeatures)
> TxDb_ss11.1.96 <- makeTxDbFromGFF("./Sus_scrofa.Sscrofa11.1.96.gtf")
> saveDb(TxDb_ss11.1.96, file="./TxDb_Dbss11.1.96.sqlite")
> columns(TxDb_ss11.1.96)
[1] "CDSCHROM" "CDSEND" "CDSID" "CDSNAME" "CDSPHASE"
[6] "CDSSTART" "CDSSTRAND" "EXONCHROM" "EXONEND" "EXONID"
[11] "EXONNAME" "EXONRANK" "EXONSTART" "EXONSTRAND" "GENEID"
[16] "TXCHROM" "TXEND" "TXID" "TXNAME" "TXSTART"
[21] "TXSTRAND" "TXTYPE"
> keytypes(TxDb_ss11.1.96)
[1] "CDSID" "CDSNAME" "EXONID" "EXONNAME" "GENEID" "TXID" "TXNAME"
In particular, I want the TxDb object to include the gene symbol column (called "gene_name" in the original GTF, shown below).
> system(command = "head ./Sus_scrofa.Sscrofa11.1.96.gtf")
#!genome-build Sscrofa11.1
#!genome-version Sscrofa11.1
#!genome-date 2016-12
#!genome-build-accession NCBI:GCA_000003025.6
#!genebuild-last-updated 2017-06
1 ensembl gene 3472 18696 . - . gene_id "ENSSSCG00000037372"; gene_version "1"; gene_name "TBP"; gene_source "ensembl"; gene_biotype "protein_coding";
1 ensembl transcript 3472 18546 . - . gene_id "ENSSSCG00000037372"; gene_version "1"; transcript_id "ENSSSCT00000065539"; transcript_version "1"; gene_name "TBP"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "TBP-201"; transcript_source "ensembl"; transcript_biotype "protein_coding";
1 ensembl exon 18493 18546 . - . gene_id "ENSSSCG00000037372"; gene_version "1"; transcript_id "ENSSSCT00000065539"; transcript_version "1"; exon_number "1"; gene_name "TBP"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "TBP-201"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSSSCE00000190268"; exon_version "2";
1 ensembl CDS 18493 18546 . - 0 gene_id "ENSSSCG00000037372"; gene_version "1"; transcript_id "ENSSSCT00000065539"; transcript_version "1"; exon_number "1"; gene_name "TBP"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "TBP-201"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSSSCP00000053834"; protein_version "1";
1 ensembl start_codon 18544 18546 . - 0 gene_id "ENSSSCG00000037372"; gene_version "1"; transcript_id "ENSSSCT00000065539"; transcript_version "1"; exon_number "1"; gene_name "TBP"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "TBP-201"; transcript_source "ensembl"; transcript_biotype "protein_coding";
Why do I want to do this? Because org.Ss.eg.db does not provide Ensembl identifiers:
> keytypes(org.Ss.eg.db)
[1] "ACCNUM" "ALIAS" "ENTREZID" "ENZYME" "EVIDENCE"
[6] "EVIDENCEALL" "GENENAME" "GO" "GOALL" "ONTOLOGY"
[11] "ONTOLOGYALL" "PATH" "PMID" "REFSEQ" "SYMBOL"
[16] "UNIGENE" "UNIPROT"
Unfortunately, the TxDb object I created has no symbol (or gene_name) column. Is there a way to explicitly specify additional columns from the original GTF file to be included in the TxDb object created by makeTxDbFromGFF?
Thank you for any help and best regards. Massimo
> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS/LAPACK: /home/anaconda/anaconda2/envs/r-343/lib/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GenomicFeatures_1.30.3 AnnotationDbi_1.40.0 Biobase_2.38.0
[4] GenomicRanges_1.30.3 GenomeInfoDb_1.14.0 IRanges_2.12.0
[7] S4Vectors_0.16.0 BiocGenerics_0.24.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.16 compiler_3.4.3
[3] XVector_0.18.0 prettyunits_1.0.2
[5] bitops_1.0-6 tools_3.4.3
[7] zlibbioc_1.24.0 progress_1.1.2
[9] biomaRt_2.34.2 digest_0.6.15
[11] bit_1.1-12 lattice_0.20-35
[13] RSQLite_2.1.1 memoise_1.1.0
[15] pkgconfig_2.0.1 Matrix_1.2-12
[17] DelayedArray_0.4.1 DBI_1.0.0
[19] GenomeInfoDbData_1.0.0 rtracklayer_1.38.3
[21] stringr_1.3.0 httr_1.3.1
[23] Biostrings_2.46.0 grid_3.4.3
[25] bit64_0.9-7 R6_2.2.2
[27] XML_3.98-1.11 RMySQL_0.10.14
[29] BiocParallel_1.12.0 blob_1.1.1
[31] magrittr_1.5 Rsamtools_1.30.0
[33] matrixStats_0.53.1 GenomicAlignments_1.14.2
[35] assertthat_0.2.0 SummarizedExperiment_1.8.1
[37] stringi_1.2.2 RCurl_1.95-4.10
Thank you, James, for the reply and the clarification; it was really helpful. I'll switch to biomaRt. Best, Massimo
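For anyone who prefers a file-based lookup to a biomaRt query, the gene_id to gene_name pairs can also be read directly from the GTF, since every "gene" row carries both attributes. The following is only a minimal base-R sketch under stated assumptions: gtf_attr() is a hypothetical helper (not part of GenomicFeatures), the small temporary file stands in for Sus_scrofa.Sscrofa11.1.96.gtf, and genes without a gene_name attribute would simply get NA.

```r
# Hypothetical helper (not part of any package): extract the value of one key,
# e.g. gene_name, from GTF attribute strings; returns NA where the key is absent.
gtf_attr <- function(attributes, key) {
  pattern <- paste0('(^|; *)', key, ' "([^"]*)"')
  hits <- regmatches(attributes, regexec(pattern, attributes))
  vapply(hits, function(x) if (length(x) == 3) x[3] else NA_character_,
         character(1))
}

# Stand-in for the real GTF: two tab-separated records copied from the post.
gtf_lines <- c(
  "#!genome-build Sscrofa11.1",
  paste("1", "ensembl", "gene", "3472", "18696", ".", "-", ".",
        paste0('gene_id "ENSSSCG00000037372"; gene_version "1"; ',
               'gene_name "TBP"; gene_source "ensembl"; ',
               'gene_biotype "protein_coding";'),
        sep = "\t"),
  paste("1", "ensembl", "transcript", "3472", "18546", ".", "-", ".",
        paste0('gene_id "ENSSSCG00000037372"; gene_version "1"; ',
               'transcript_id "ENSSSCT00000065539"; transcript_version "1"; ',
               'gene_name "TBP"; gene_source "ensembl"; ',
               'gene_biotype "protein_coding"; transcript_name "TBP-201";'),
        sep = "\t")
)
gtf_file <- tempfile(fileext = ".gtf")
writeLines(gtf_lines, gtf_file)

# Read the nine GTF columns; quote = "" keeps the quoted attribute values intact,
# and comment.char = "#" skips the "#!" header lines.
gtf <- read.delim(gtf_file, header = FALSE, comment.char = "#", quote = "",
                  stringsAsFactors = FALSE,
                  col.names = c("seqname", "source", "feature", "start", "end",
                                "score", "strand", "frame", "attribute"))

# One row per gene: Ensembl gene ID and its symbol.
genes <- gtf[gtf$feature == "gene", ]
gene_map <- unique(data.frame(
  GENEID = gtf_attr(genes$attribute, "gene_id"),
  SYMBOL = gtf_attr(genes$attribute, "gene_name"),
  stringsAsFactors = FALSE
))
print(gene_map)
```

With the real file, point gtf_file at the downloaded GTF instead of the temporary one. Given a vector ids of GENEID values taken from the TxDb, gene_map$SYMBOL[match(ids, gene_map$GENEID)] then returns the matching symbols.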