I'd like to import the GENCODE Genes GFF3 file with its gene symbols. Using columns on the TxDb object shows that only the gene_id field is imported, which has entries such as ENSG00000000003.14. How can I also import the gene_name column, which has values like TSPAN6?
The decision was made not to include a gene_name column in the TxDbs. This is explained on the ?transcripts man page:
Finally, \code{use.names=TRUE} cannot be used when grouping
by gene \code{by="gene"}. This is because, unlike for the
other features, the gene ids are external ids (e.g. Entrez
Gene or Ensembl ids) so the db doesn't have a \code{"gene_name"}
column for storing alternate gene names.
You can convert from Entrez or Ensembl ids to gene names with an OrgDb package.
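A minimal sketch of that conversion, assuming the human OrgDb package org.Hs.eg.db is installed and that the ids come from GENCODE. The helper names strip_version and ensembl_to_symbol are made up for illustration. GENCODE ids carry a version suffix (the .14 in ENSG00000000003.14) that the OrgDb ENSEMBL keys do not, so the suffix is stripped before the lookup:

```r
# Hypothetical helper: drop everything from the first dot onward,
# e.g. "ENSG00000000003.14" becomes "ENSG00000000003".
strip_version <- function(ids) {
  sub("\\..*$", "", ids)
}

# Hypothetical helper: map GENCODE/Ensembl gene ids to gene symbols.
# Assumes AnnotationDbi and org.Hs.eg.db are installed; they are only
# loaded when the function is actually called.
ensembl_to_symbol <- function(gene_ids) {
  keys <- strip_version(gene_ids)
  AnnotationDbi::mapIds(
    org.Hs.eg.db::org.Hs.eg.db,
    keys      = keys,
    column    = "SYMBOL",
    keytype   = "ENSEMBL",
    multiVals = "first"   # keep one symbol when an id maps to several
  )
}

# Usage, given a TxDb object called txdb (not run here):
# g <- GenomicFeatures::genes(txdb)
# g$gene_name <- unname(ensembl_to_symbol(g$gene_id))
```

Ids that the OrgDb does not know come back as NA, which is the limitation the next comment points out.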
It's obviously unfortunate that the user starts with the gene names but then is forced to discard them and get them back. It would be nice if TxDb supported arbitrary meta columns, e.g., through a NoSQL or EAV approach.
Another solution is to read the file twice, once with makeTxDbFromGFF and a second time with import.gff3. Then, the matching of IDs is easy and doesn't miss those newly discovered genes which GENCODE has annotated with symbols.
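The two-pass approach could look roughly like this. It is a sketch, not a tested recipe: the function names and the file path are placeholders, it assumes the GFF3 gene rows carry gene_id and gene_name attributes as in the GENCODE files, and newer Bioconductor releases may expose makeTxDbFromGFF from txdbmaker rather than GenomicFeatures.

```r
# Hypothetical helper (base R only): for each id, return the name recorded
# for it in the GFF3 gene rows, or NA if the id is not present there.
lookup_gene_names <- function(ids, gff_ids, gff_names) {
  gff_names[match(ids, gff_ids)]
}

# Hypothetical wrapper: read the same file twice and attach gene_name
# to the gene ranges. Packages are only loaded when this is called.
genes_with_names <- function(gff_file) {
  # Pass 1: build the TxDb, which keeps gene_id but not gene_name
  txdb <- GenomicFeatures::makeTxDbFromGFF(gff_file, format = "gff3")

  # Pass 2: import the raw records, which keep all GFF3 attributes
  gff <- rtracklayer::import.gff3(gff_file)
  gene_rows <- gff[gff$type == "gene"]

  g <- GenomicFeatures::genes(txdb)
  g$gene_name <- lookup_gene_names(
    g$gene_id, gene_rows$gene_id, gene_rows$gene_name
  )
  g
}

# Usage (placeholder path, not run here):
# g <- genes_with_names("gencode.annotation.gff3.gz")
# head(g$gene_name)
```

Because both passes read the same file, the ids match exactly, version suffix included, and no stripping is needed.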