Question

makeTxDbFromGFF, GTF columns not included in the TxDb object

0

Entering edit mode

massimo.x.petretich • 0

@massimoxpetretich-20461

Last seen 3.2 years ago

United Kingdom

Dear Herve' or Marc,

I would like to create a TxDb object for the Pig genome Sus Scrofa 11.1 from the Ensembl gtf annotation (ftp://ftp.ensembl.org/pub/release-96/gtf/sus_scrofa) that contains all the columns from the original gtf file.

> library(GenomicFeatures) 
> TxDb_ss11.1.96 <- makeTxDbFromGFF("./Sus_scrofa.Sscrofa11.1.96.gtf")
> saveDb(TxDb_ss11.1.96, file="./TxDb_Dbss11.1.96.sqlite")

> columns(TxDb_ss11.1.96)
 [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSPHASE"
 [6] "CDSSTART"   "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"
[11] "EXONNAME"   "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"
[16] "TXCHROM"    "TXEND"      "TXID"       "TXNAME"     "TXSTART"
[21] "TXSTRAND"   "TXTYPE"

> keytypes(TxDb_ss11.1.96)
[1] "CDSID"    "CDSNAME"  "EXONID"   "EXONNAME" "GENEID"   "TXID"     "TXNAME"

In particular I wanted to obtain in the TxDb object the symbol column (in the original gtf below it goes under: "gene_name" ).

> system(command = "head ./Sus_scrofa.Sscrofa11.1.96.gtf")
#!genome-build Sscrofa11.1
#!genome-version Sscrofa11.1
#!genome-date 2016-12
#!genome-build-accession NCBI:GCA_000003025.6
#!genebuild-last-updated 2017-06
1       ensembl gene    3472    18696   .       -       .       gene_id "ENSSSCG00000037372"; gene_version "1"; gene_name "TBP"; gene_source "ensembl"; gene_biotype "protein_coding";
1       ensembl transcript      3472    18546   .       -       .       gene_id "ENSSSCG00000037372"; gene_version "1"; transcript_id "ENSSSCT00000065539"; transcript_version "1"; gene_name "TBP"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "TBP-201"; transcript_source "ensembl"; transcript_biotype "protein_coding";
1       ensembl exon    18493   18546   .       -       .       gene_id "ENSSSCG00000037372"; gene_version "1"; transcript_id "ENSSSCT00000065539"; transcript_version "1"; exon_number "1"; gene_name "TBP"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "TBP-201"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSSSCE00000190268"; exon_version "2";
1       ensembl CDS     18493   18546   .       -       0       gene_id "ENSSSCG00000037372"; gene_version "1"; transcript_id "ENSSSCT00000065539"; transcript_version "1"; exon_number "1"; gene_name "TBP"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "TBP-201"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSSSCP00000053834"; protein_version "1";
1       ensembl start_codon     18544   18546   .       -       0       gene_id "ENSSSCG00000037372"; gene_version "1"; transcript_id "ENSSSCT00000065539"; transcript_version "1"; exon_number "1"; gene_name "TBP"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "TBP-201"; transcript_source "ensembl"; transcript_biotype "protein_coding";

Why do I want to do this? Because org.Ss.eg.db does not have Ensembl annotation:

> keytypes(org.Ss.eg.db)
 [1] "ACCNUM"      "ALIAS"       "ENTREZID"    "ENZYME"      "EVIDENCE"
 [6] "EVIDENCEALL" "GENENAME"    "GO"          "GOALL"       "ONTOLOGY"
[11] "ONTOLOGYALL" "PATH"        "PMID"        "REFSEQ"      "SYMBOL"
[16] "UNIGENE"     "UNIPROT"

Unfortunately the TxDb object I created there is no symbol (or gene_name) column. Is there a way to explicitly specify additional columns from the ofiginal gtf file to be included in the TxDb object created by makeTxDbFromGFF?

Thank you for any help and best regards. Massimo

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /home/anaconda/anaconda2/envs/r-343/lib/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] GenomicFeatures_1.30.3 AnnotationDbi_1.40.0   Biobase_2.38.0
[4] GenomicRanges_1.30.3   GenomeInfoDb_1.14.0    IRanges_2.12.0
[7] S4Vectors_0.16.0       BiocGenerics_0.24.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16               compiler_3.4.3
 [3] XVector_0.18.0             prettyunits_1.0.2
 [5] bitops_1.0-6               tools_3.4.3
 [7] zlibbioc_1.24.0            progress_1.1.2
 [9] biomaRt_2.34.2             digest_0.6.15
[11] bit_1.1-12                 lattice_0.20-35
[13] RSQLite_2.1.1              memoise_1.1.0
[15] pkgconfig_2.0.1            Matrix_1.2-12
[17] DelayedArray_0.4.1         DBI_1.0.0
[19] GenomeInfoDbData_1.0.0     rtracklayer_1.38.3
[21] stringr_1.3.0              httr_1.3.1
[23] Biostrings_2.46.0          grid_3.4.3
[25] bit64_0.9-7                R6_2.2.2
[27] XML_3.98-1.11              RMySQL_0.10.14
[29] BiocParallel_1.12.0        blob_1.1.1
[31] magrittr_1.5               Rsamtools_1.30.0
[33] matrixStats_0.53.1         GenomicAlignments_1.14.2
[35] assertthat_0.2.0           SummarizedExperiment_1.8.1
[37] stringi_1.2.2              RCurl_1.95-4.10

annotation • 1.4k views

ADD COMMENT • link updated 5.5 years ago by James W. MacDonald 67k • written 5.5 years ago by massimo.x.petretich • 0

score 0 · Answer 1 · 2019-04-09

The annotation packages are roughly split into packages that have to do with genomic locations (the TxDb packages), and packages that have to do with annotations (e.g., names for genes, what they do, pathways they are in, etc, which go into OrgDb packages). The gene symbol is part of the latter, so there isn't a place in the DB schema of the former for the gene symbols to go.

The org.Ss.eg.db package is based on NCBI annotations, and if you want to use EBI/EMBL data for the genomic locations, you should also use EBI/EMBL annotations as well, because there are any number of disagreements between the two groups as to what is a gene, where they are, etc, that limit the ability to map say Ensembl IDs to Gene IDs. If you want to use EBI/EMBL data, you could use the GFF file you have in hand, and then do further annotations using the biomaRt package.

Alternatively you could use the AnnotationHub package for the genomic locations:

> library(AnnotationHub)
> hub <- AnnotationHub()
> query(hub, c("sus scrofa","txdb"))
AnnotationHub with 5 records
# snapshotDate(): 2018-10-24 
# $dataprovider: UCSC
# $species: Sus scrofa
# $rdataclass: TxDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH52273"]]' 

            title                                    
  AH52273 | TxDb.Sscrofa.UCSC.susScr3.refGene.sqlite 
  AH61788 | TxDb.Sscrofa.UCSC.susScr11.refGene.sqlite
  AH61800 | TxDb.Sscrofa.UCSC.susScr3.refGene.sqlite 
  AH66180 | TxDb.Sscrofa.UCSC.susScr11.refGene.sqlite
  AH66181 | TxDb.Sscrofa.UCSC.susScr3.refGene.sqlite 
> ssdb <- hub[["AH66180"]]
downloading 1 resources
retrieving 1 resource

> ssdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: susScr11
# Organism: Sus scrofa
# Taxonomy ID: 9823
# UCSC Table: refGene
# UCSC Track: RefSeq Genes
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: Entrez Gene ID
# Full dataset: yes
# miRBase build ID: NA
# transcript_nrow: 4570
# exon_nrow: 36566
# cds_nrow: 34704
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2018-10-19 16:55:42 +0000 (Fri, 19 Oct 2018)
# GenomicFeatures version at creation time: 1.33.4
# RSQLite version at creation time: 2.1.1
# DBSCHEMAVERSION: 1.2
> genes(ssdb)
GRanges object with 4370 ranges and 1 metadata column:
            seqnames              ranges strand |     gene_id
               <Rle>           <IRanges>  <Rle> | <character>
  100037269     chr2   60204948-60235588      - |   100037269
  100037270    chr10   17365666-17416642      + |   100037270
  100037272     chr3   94167759-94253662      - |   100037272
  100037273     chr1 138417886-138508639      + |   100037273
  100037275     chr8 131387573-131431420      + |   100037275
        ...      ...                 ...    ... .         ...
     780437     chr8   67059284-67068117      + |      780437
     780438     chrX   17889026-17938206      - |      780438
     780439     chr8   18800695-18801429      + |      780439
     780440     chr4   35979827-36007835      + |      780440
     791125     chr1   77585914-77642342      - |      791125
  -------
  seqinfo: 613 sequences (1 circular) from susScr11 genome

Which is based on NCBI Gene IDs, and for which you can use the org.Ss.eg.db package to get the HUGO gene symbols.