Add gene symbol when making TxDb
Jake (@jake-7236)

Hi,

I am trying to make a TxDb object from the human Gencode GTF file. However, I can't get it to add the gene names/symbols.

The GTF contains gene_name:

##description: evidence-based annotation of the human genome (GRCh38), version 34 (Ensembl 100)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2020-03-24
chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.1";

Making the TxDb object:

gtf <- makeTxDbFromGFF('~/Downloads/gencode.v34.basic.annotation.gtf')

The resulting TxDb object (gtf) has no gene name column available:

> columns(gtf)
 [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSPHASE"   "CDSSTART"   "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"     "EXONNAME"   "EXONRANK"  
[13] "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"    "TXEND"      "TXID"       "TXNAME"     "TXSTART"    "TXSTRAND"   "TXTYPE" 


> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.0                            stringr_1.4.0                            dplyr_1.0.1                             
 [4] purrr_0.3.4                              readr_1.3.1                              tidyr_1.1.1                             
 [7] tibble_3.0.3                             ggplot2_3.3.2                            tidyverse_1.3.0                         
[10] EnsDb.Hsapiens.v86_2.99.0                ensembldb_2.12.1                         AnnotationFilter_1.12.0                 
[13] TxDb.Hsapiens.UCSC.hg38.knownGene_3.10.0 GenomicFeatures_1.40.1                   AnnotationDbi_1.50.3                    
[16] Biobase_2.48.0                           Gviz_1.32.0                              GenomicRanges_1.40.0                    
[19] GenomeInfoDb_1.24.2                      IRanges_2.22.2                           S4Vectors_0.26.1                        
[22] BiocGenerics_0.34.0                     

loaded via a namespace (and not attached):
  [1] colorspace_1.4-1            ellipsis_0.3.1              biovizBase_1.36.0           htmlTable_2.0.1             XVector_0.28.0             
  [6] fs_1.5.0                    base64enc_0.1-3             dichromat_2.0-0             rstudioapi_0.11             bit64_4.0.2                
 [11] fansi_0.4.1                 lubridate_1.7.9             xml2_1.3.2                  splines_4.0.2               R.methodsS3_1.8.0          
 [16] knitr_1.29                  Formula_1.2-3               jsonlite_1.7.0              Rsamtools_2.4.0             broom_0.7.0                
 [21] cluster_2.1.0               dbplyr_1.4.4                png_0.1-7                   R.oo_1.23.0                 BiocManager_1.30.10        
 [26] compiler_4.0.2              httr_1.4.2                  backports_1.1.8             assertthat_0.2.1            Matrix_1.2-18              
 [31] lazyeval_0.2.2              cli_2.0.2                   acepack_1.4.1               htmltools_0.5.0             prettyunits_1.1.1          
 [36] tools_4.0.2                 gtable_0.3.0                glue_1.4.1                  GenomeInfoDbData_1.2.3      rappdirs_0.3.1             
 [41] tinytex_0.25                Rcpp_1.0.5                  cellranger_1.1.0            styler_1.3.2                vctrs_0.3.2                
 [46] Biostrings_2.56.0           rtracklayer_1.48.0          xfun_0.16                   rvest_0.3.6                 lifecycle_0.2.0            
 [51] XML_3.99-0.5                zlibbioc_1.34.0             scales_1.1.1                BSgenome_1.56.0             VariantAnnotation_1.34.0   
 [56] hms_0.5.3                   ProtGenerics_1.20.0         SummarizedExperiment_1.18.2 RMariaDB_1.0.9              RColorBrewer_1.1-2         
 [61] curl_4.3                    memoise_1.1.0               gridExtra_2.3               biomaRt_2.44.1              rpart_4.1-15               
 [66] latticeExtra_0.6-29         stringi_1.4.6               RSQLite_2.2.0               checkmate_2.0.0             BiocParallel_1.22.0        
 [71] rlang_0.4.7                 pkgconfig_2.0.3             matrixStats_0.56.0          bitops_1.0-6                lattice_0.20-41            
 [76] GenomicAlignments_1.24.0    htmlwidgets_1.5.1           bit_4.0.4                   tidyselect_1.1.0            magrittr_1.5               
 [81] R6_2.4.1                    generics_0.0.2              Hmisc_4.4-0                 DelayedArray_0.14.1         DBI_1.1.0                  
 [86] withr_2.2.0                 pillar_1.4.6                haven_2.3.1                 foreign_0.8-80              survival_3.2-3             
 [91] RCurl_1.98-1.2              nnet_7.3-14                 modelr_0.1.8                crayon_1.3.4                utf8_1.1.4                 
 [96] BiocFileCache_1.12.1        jpeg_0.1-8.1                progress_1.2.2              readxl_1.3.1                data.table_1.13.0          
[101] blob_1.2.1                  reprex_0.3.0                digest_0.6.25               R.cache_0.14.0              R.utils_2.9.2              
[106] openssl_1.4.2               munsell_0.5.0               askpass_1.1
@james-w-macdonald-5106

You won't be able to get the HGNC symbol into a TxDb object, as TxDb objects aren't designed to contain functional information such as gene symbols. That's the purpose of an OrgDb package.
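
The symbols are still sitting in the GTF attribute column, so one workaround is to keep a small gene_id to gene_name lookup next to the TxDb rather than inside it. Below is a minimal base R sketch of that idea. gtf_attr() and gene_symbol_lookup() are hypothetical helper names made up for this example, not functions from GenomicFeatures, and the regex assumes GENCODE-style key "value"; attributes. In practice rtracklayer::import() is the more usual way to read a GTF; base R is used here only to keep the sketch free of dependencies.

```r
# Hypothetical helper: pull one attribute (e.g. gene_name) out of
# GTF column-9 strings. Returns NA where the key is absent.
gtf_attr <- function(attributes, key) {
  pattern <- sprintf('(^|;\\s*)%s "([^"]*)"', key)
  hits <- regmatches(attributes, regexec(pattern, attributes))
  vapply(hits,
         function(h) if (length(h) == 3) h[3] else NA_character_,
         character(1))
}

# Hypothetical helper: read the GTF and build a named vector whose
# names are gene_id values and whose elements are gene_name values.
gene_symbol_lookup <- function(gtf_path) {
  gtf <- read.delim(gtf_path, header = FALSE, comment.char = "#",
                    quote = "", stringsAsFactors = FALSE)
  genes <- gtf[gtf[[3]] == "gene", ]   # column 3 is the feature type
  setNames(gtf_attr(genes[[9]], "gene_name"),
           gtf_attr(genes[[9]], "gene_id"))
}

# Attribute string copied from the GTF line quoted in the question.
example_attr <- 'gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.1";'

gtf_attr(example_attr, "gene_id")    # "ENSG00000223972.5"
gtf_attr(example_attr, "gene_name")  # "DDX11L1"
```

Calling gene_symbol_lookup() on the GTF path from the question should give a named character vector that can be indexed with the GENEID values of the TxDb, since both come from the gene_id attribute of the same file. Reading the whole file with read.delim() will be slow for a full GENCODE release, so treat this as an illustration rather than something tuned for speed.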

You can either do the mapping separately, or better yet use an OrganismDb package, which encapsulates the two and allows cross-db joins:

> library(Homo.sapiens)
> Homo.sapiens
OrganismDb Object:
# Includes GODb Object:  GO.db 
# With data about:  Gene Ontology 
# Includes OrgDb Object:  org.Hs.eg.db 
# Gene data about:  Homo sapiens 
# Taxonomy Id:  9606 
# Includes TxDb Object:  TxDb.Hsapiens.UCSC.hg19.knownGene 
# Transcriptome data about:  Homo sapiens 
# Based on genome:  hg19 
# The OrgDb gene id ENTREZID is mapped to the TxDb gene id GENEID .

## I don't have your TxDb so I use a different one as an example.
> library(TxDb.Hsapiens.UCSC.hg38.knownGene)

## Switch TxDbs
> TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg38.knownGene
> Homo.sapiens
OrganismDb Object:
# Includes GODb Object:  GO.db 
# With data about:  Gene Ontology 
# Includes OrgDb Object:  org.Hs.eg.db 
# Gene data about:  Homo sapiens 
# Taxonomy Id:  9606 
# Includes TxDb Object:  TxDb.Hsapiens.UCSC.hg38.knownGene 
# Transcriptome data about:  Homo sapiens 
# Based on genome:  hg38 
# The OrgDb gene id ENTREZID is mapped to the TxDb gene id GENEID .


## get transcripts along with HGNC symbols
> tx <- transcriptsBy(Homo.sapiens, columns = "SYMBOL")
> tx
GRangesList object of length 27363:
$`1`
GRanges object with 8 ranges and 2 metadata columns:
      seqnames            ranges strand |           tx_name          SYMBOL
         <Rle>         <IRanges>  <Rle> |       <character> <CharacterList>
  [1]    chr19 58345178-58347634      - | ENST00000596924.1            A1BG
  [2]    chr19 58345183-58353492      - | ENST00000263100.8            A1BG
  [3]    chr19 58346854-58356225      - | ENST00000600123.5            A1BG
  [4]    chr19 58346858-58353491      - | ENST00000595014.1            A1BG
  [5]    chr19 58346860-58347657      - | ENST00000598345.1            A1BG
  [6]    chr19 58348466-58362751      - | ENST00000599109.5            A1BG
  [7]    chr19 58350594-58353129      - | ENST00000600966.1            A1BG
  [8]    chr19 58353021-58356083      - | ENST00000596636.1            A1BG
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome

$`10`
GRanges object with 2 ranges and 2 metadata columns:
      seqnames            ranges strand |           tx_name          SYMBOL
         <Rle>         <IRanges>  <Rle> |       <character> <CharacterList>
  [1]     chr8 18391282-18401218      + | ENST00000286479.4            NAT2
  [2]     chr8 18391287-18400993      + | ENST00000520116.1            NAT2
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome

$`100`
GRanges object with 9 ranges and 2 metadata columns:
      seqnames            ranges strand |           tx_name          SYMBOL
         <Rle>         <IRanges>  <Rle> |       <character> <CharacterList>
  [1]    chr20 44619522-44626491      - | ENST00000464097.5             ADA
  [2]    chr20 44619522-44651699      - | ENST00000372874.9             ADA
  [3]    chr20 44619579-44651681      - | ENST00000536532.5             ADA
  [4]    chr20 44619810-44651691      - | ENST00000492931.5             ADA
  [5]    chr20 44619810-44651691      - | ENST00000537820.1             ADA
  [6]    chr20 44619810-44651691      - | ENST00000539235.5             ADA
  [7]    chr20 44626323-44651661      - | ENST00000545776.5             ADA
  [8]    chr20 44626517-44652114      - | ENST00000536076.1             ADA
  [9]    chr20 44636071-44652233      - | ENST00000535573.1             ADA
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome

...
<27360 more elements>
>
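
For the "do the mapping separately" option with the GENCODE TxDb from the question, note that its GENEID values are versioned Ensembl IDs (e.g. ENSG00000223972.5), whereas the ENSEMBL keys in org.Hs.eg.db carry no version suffix. A rough sketch, assuming org.Hs.eg.db and AnnotationDbi are installed; strip_version() and map_to_symbol() are hypothetical helpers named for this example:

```r
# Hypothetical helper: drop everything from the first dot onwards,
# so "ENSG00000223972.5" becomes "ENSG00000223972".
strip_version <- function(ids) sub("\\..*$", "", ids)

# Hypothetical helper: look up HGNC symbols for GENCODE gene IDs in the
# human OrgDb. Assumes org.Hs.eg.db and AnnotationDbi are installed;
# they are only needed when the function is actually called.
map_to_symbol <- function(gene_ids) {
  AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
                        keys = strip_version(gene_ids),
                        column = "SYMBOL",
                        keytype = "ENSEMBL",
                        multiVals = "first")
}

# Gene ID taken from the GTF line quoted in the question.
strip_version("ENSG00000223972.5")  # "ENSG00000223972"
```

With the TxDb from the question this would be called as map_to_symbol(keys(gtf, keytype = "GENEID")). IDs without a match come back as NA, and multiVals = "first" silently picks one symbol when an Ensembl ID maps to several, so that choice is worth revisiting for your use case.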