Question

Importing Gene Symbols with makeTxDbFromGFF

2

Dario Strbenac ★ 1.6k

@dario-strbenac-5916

Last seen 12 weeks ago

Australia

I'd like to import the GENCODE Genes GFF3 file with its gene symbols. Calling columns on the TxDb object shows that only the gene_id field is imported, which has entries such as ENSG00000000003.14. How can I also import the gene_name column, which has values like TSPAN6?

GenomicFeatures GFF3 • 4.0k views

Updated 8.7 years ago by Valerie Obenchain ★ 6.8k • written 8.7 years ago by Dario Strbenac ★ 1.6k

score 1 · Answer 1 · 2017-03-21

1

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 3.9 years ago

United States

The decision was made to not include a gene_name column in the TxDbs. This is explained on the ?transcripts man page:

    Finally, \code{use.names=TRUE} cannot be used when grouping
    by gene \code{by="gene"}. This is because, unlike for the
    other features, the gene ids are external ids (e.g. Entrez
    Gene or Ensembl ids) so the db doesn't have a \code{"gene_name"}
    column for storing alternate gene names.

You can convert from Entrez or Ensembl ids to gene name with an OrgDb package:

> columns(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"        
[16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
[21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"       "UNIGENE"     
[26] "UNIPROT"
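
A minimal sketch of such a lookup, assuming the org.Hs.eg.db package is installed. GENCODE gene ids carry a version suffix (the ".14" in ENSG00000000003.14), whereas the ENSEMBL keys in the OrgDb are unversioned, so the suffix is removed first. The helper name strip_version is only illustrative and not part of any package.

```r
# Illustrative helper (not from any package): drop everything from the
# first dot onwards, e.g. "ENSG00000000003.14" -> "ENSG00000000003".
strip_version <- function(ids) sub("\\..*$", "", ids)

# Example id taken from the question.
gene_ids <- c("ENSG00000000003.14")

# Only attempt the lookup if the annotation package is available.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  symbols <- AnnotationDbi::mapIds(
    org.Hs.eg.db::org.Hs.eg.db,
    keys      = strip_version(gene_ids),
    keytype   = "ENSEMBL",
    column    = "SYMBOL",
    multiVals = "first"  # keep one symbol when an id maps to several
  )
  print(symbols)
}
```

Ids that have no entry in the OrgDb come back as NA, which is where this approach can lose genes that GENCODE itself has already named.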

Valerie

8.7 years ago • Valerie Obenchain ★ 6.8k

5

It's obviously unfortunate that the user starts with the gene names but then is forced to discard them and get them back. It would be nice if TxDb supported arbitrary meta columns, e.g., through a NoSQL or EAV approach.

8.7 years ago • Michael Lawrence ★ 11k

1

Another solution is to read the file twice, once with makeTxDbFromGFF and a second time with import.gff3. Then, the matching of IDs is easy and doesn't miss those newly discovered genes which GENCODE has annotated with symbols.
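
A rough sketch of this two-pass approach. The file name is a placeholder and lookup_symbols is an illustrative helper, not a package function. The sketch also assumes that the gene_id attribute on the file's gene rows is the same id the TxDb stores. In recent Bioconductor releases makeTxDbFromGFF is, as far as I know, provided by the txdbmaker package rather than GenomicFeatures, so the call may need adjusting.

```r
# Illustrative helper: for each queried id, return the symbol recorded
# in the file, or NA when the id does not occur there.
lookup_symbols <- function(query_ids, file_ids, file_symbols) {
  file_symbols[match(query_ids, file_ids)]
}

gff_file <- "gencode.annotation.gff3.gz"  # placeholder path, adjust as needed

if (file.exists(gff_file) &&
    requireNamespace("GenomicFeatures", quietly = TRUE) &&
    requireNamespace("rtracklayer", quietly = TRUE)) {
  library(GenomicFeatures)
  library(rtracklayer)

  # Pass 1: build the TxDb (gene symbols are not retained).
  txdb <- makeTxDbFromGFF(gff_file, format = "gff3")

  # Pass 2: read the raw records and keep the gene rows, which carry
  # both the gene_id and gene_name attributes in GENCODE files.
  gff <- import.gff3(gff_file)
  gene_rows <- gff[gff$type == "gene"]

  # Attach the symbols to the gene ranges extracted from the TxDb.
  gr <- genes(txdb)
  gr$gene_name <- lookup_symbols(gr$gene_id,
                                 gene_rows$gene_id,
                                 gene_rows$gene_name)
  print(head(gr))
}
```

Because the symbols come from the same file as the TxDb, no version stripping is needed and genes that are absent from an OrgDb still receive their GENCODE name.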

8.7 years ago • Dario Strbenac ★ 1.6k