Question

Coding vs Noncoding Genes in Hit List

3

Steve Lowe ▴ 30

@steve-lowe-21801

Last seen 4.9 years ago

Does anyone know of a method I can use to find out which genes in my gene list are coding and which are non-coding? I'm trying to avoid having to look up each individual gene. Thanks!

--Dr. S

biomaRt coding vs noncoding high throughput • 3.9k views

ADD COMMENT • link updated 4.9 years ago by Martin Morgan 25k • written 4.9 years ago by Steve Lowe ▴ 30

1

Martin Morgan 25k

@martin-morgan-1513

Last seen 1 day ago

United States

Another approach uses the ensembldb package and resources from AnnotationHub. Load the packages:

library(ensembldb)
library(AnnotationHub)
library(dplyr)

Discover and retrieve the appropriate database, here the EnsDb for Homo sapiens, Ensembl release 97:

hub = AnnotationHub()
query(hub, c("EnsDb", "Homo sapiens", "97"))
edb = hub[["AH73881"]]

Discover the fields available for querying (keytypes()) and for retrieval (columns()), then map all gene names (HGNC symbols) to Ensembl and Entrez identifiers and gene biotypes, converting the result to a tibble for convenience:

keytypes(edb)
columns(edb)
keys = keys(edb, "GENENAME")
columns =  c("GENEID", "ENTREZID", "GENEBIOTYPE")
tbl =
    ensembldb::select(edb, keys, columns, keytype = "GENENAME") %>%
    as_tibble()

The result is

> tbl
# A tibble: 68,027 x 4
   GENENAME  GENEID          ENTREZID GENEBIOTYPE
   <chr>     <chr>              <int> <chr>
 1 A1BG      ENSG00000121410        1 protein_coding
 2 A1BG-AS1  ENSG00000268895       NA lncRNA
 3 A1CF      ENSG00000148584    29974 protein_coding
 4 A2M       ENSG00000175899        2 protein_coding
 5 A2M-AS1   ENSG00000245105   144571 lncRNA
 6 A2ML1     ENSG00000166535   144568 protein_coding
 7 A2ML1-AS1 ENSG00000256661       NA lncRNA
 8 A2ML1-AS2 ENSG00000256904       NA lncRNA
 9 A2MP1     ENSG00000256069        3 transcribed_unprocessed_pseudogene
10 A3GALT2   ENSG00000184389   127550 protein_coding
# … with 68,017 more rows

Filters are a very useful feature. List the available filters with supportedFilters(), then use a filter expression to retrieve the same results as above, restricted to the protein_coding biotype:

supportedFilters()
filter = ~ gene_name %in% keys & gene_biotype == "protein_coding"
tbl =
    ensembldb::select(edb, filter, columns) %>%
    as_tibble()
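
To answer the original question for a specific hit list, one option is to split on the GENEBIOTYPE column. The following is a minimal sketch in base R, not part of the workflow above: `hits` is a hypothetical hit list, and `annot` is a small hand-built data frame that copies a few rows from the output above as a stand-in for `tbl`, so the example runs without a database connection.

```r
# Stand-in for `tbl`: a few rows copied from the output shown above.
annot <- data.frame(
    GENENAME = c("A1BG", "A1BG-AS1", "A1CF", "A2M-AS1", "A2MP1"),
    GENEBIOTYPE = c("protein_coding", "lncRNA", "protein_coding",
                    "lncRNA", "transcribed_unprocessed_pseudogene"),
    stringsAsFactors = FALSE
)

# Hypothetical hit list; "NOT_A_GENE" is deliberately absent from the annotation.
hits <- c("A1BG", "A2M-AS1", "A2MP1", "NOT_A_GENE")

# Look up the biotype of each hit; symbols missing from the annotation give NA.
biotype <- annot$GENEBIOTYPE[match(hits, annot$GENENAME)]

# Three groups: coding, annotated but not protein_coding, and not found at all.
coding    <- hits[!is.na(biotype) & biotype == "protein_coding"]
noncoding <- hits[!is.na(biotype) & biotype != "protein_coding"]
unmapped  <- hits[is.na(biotype)]
```

Treating every biotype other than protein_coding as non-coding is a simplification (it lumps pseudogenes together with lncRNAs), so it may be worth tabulating `biotype` before deciding on the grouping.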

ADD COMMENT • link 4.9 years ago Martin Morgan 25k

score 3 · Accepted Answer · 2019-09-06

3

Kevin Blighe ★ 4.0k

@kevin

Last seen 5 weeks ago

Republic of Ireland

Hey,

biomaRt is one solution. Take a look at this example, starting with HGNC symbols:

genes <- c('BRCA1', 'XIST', 'TXNIP', 'AFG3L1P')

library(biomaRt)
# connect to the Ensembl BioMart and select the human gene dataset
mart <- useMart("ENSEMBL_MART_ENSEMBL", host = "useast.ensembl.org")
mart <- useDataset("hsapiens_gene_ensembl", mart)
# look up IDs and biotype for each HGNC symbol in 'genes'
annotLookup <- getBM(
  mart = mart,
  attributes = c(
    "hgnc_symbol",
    "entrezgene_id",
    "ensembl_gene_id",
    "gene_biotype"),
  filters = "hgnc_symbol",
  values = genes,
  uniqueRows = TRUE)

annotLookup
  hgnc_symbol entrezgene_id ensembl_gene_id                   gene_biotype
1     AFG3L1P           172 ENSG00000223959 transcribed_unitary_pseudogene
2       BRCA1           672 ENSG00000012048                 protein_coding
3       TXNIP         10628 ENSG00000265972                 protein_coding
4        XIST            NA ENSG00000229807                         lncRNA

Kevin

ADD COMMENT • link 4.9 years ago Kevin Blighe ★ 4.0k

1

Hi Kevin,

I used biomaRt in the following way to select the protein-coding genes from my list, but I think it is missing some genes. Any suggestions?

library("biomaRt")
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
all_coding_genes <- getBM(
  attributes = c("hgnc_symbol"),
  filters = c("biotype"),
  values = list(biotype = "protein_coding"),
  mart = mart)

rawcount <- rawcount[row.names(rawcount) %in%  all_coding_genes$hgnc_symbol,]

all_coding_genes contains 19,391 gene names, of which 19,081 match my data. But in the non-coding list (rawcount <- rawcount[!(row.names(rawcount) %in% all_coding_genes$hgnc_symbol),]), I can still find some genes that GeneCards lists as protein_coding, such as SEPT14 and PRR26.

ADD REPLY • link 4.0 years ago thind.amarinder ▴ 10

1

There will always be some discrepancies between the different gene annotation databases, because they are constantly being updated.

In this case, it looks like SEPT14 is actually there, but has a different symbol:

all_coding_genes <- getBM(attributes = c('ensembl_gene_id', 'hgnc_symbol', 'gene_biotype'),
  mart = mart)
all_coding_genes[grep('ENSG00000154997', all_coding_genes$ensembl_gene_id),]
      ensembl_gene_id hgnc_symbol   gene_biotype
27677 ENSG00000154997    SEPTIN14 protein_coding
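
A rough way to spot such cases in bulk is to separate symbols that are annotated as non-coding from symbols that simply have no match in the annotation. This is a minimal sketch only: `all_genes` is a hand-built stand-in for the full getBM() result, using rows that appear in this thread, and `my_symbols` is a hypothetical set of row names.

```r
# Stand-in for the full getBM() table (all biotypes), built from rows shown in this thread.
all_genes <- data.frame(
    ensembl_gene_id = c("ENSG00000012048", "ENSG00000229807", "ENSG00000154997"),
    hgnc_symbol = c("BRCA1", "XIST", "SEPTIN14"),
    gene_biotype = c("protein_coding", "lncRNA", "protein_coding"),
    stringsAsFactors = FALSE
)

# Hypothetical row names of a count matrix; SEPT14 is the older symbol.
my_symbols <- c("BRCA1", "XIST", "SEPT14")

# Symbols with no match at all: candidates for outdated or alias symbols.
unmatched <- setdiff(my_symbols, all_genes$hgnc_symbol)

# Symbols that are present in the annotation and not protein_coding.
noncoding_symbols <- all_genes$hgnc_symbol[all_genes$gene_biotype != "protein_coding"]
noncoding <- intersect(my_symbols, noncoding_symbols)
```

Here `unmatched` holds the symbols that need a closer look (for example a check for a renamed symbol), rather than being counted as non-coding by default.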

ADD REPLY • link 4.0 years ago Kevin Blighe ★ 4.0k

0

Thanks for your reply. Do you know how to deal with this for a large number of genes?

ADD REPLY • link 4.0 years ago thind.amarinder ▴ 10

1

It may be better to deal with this issue at the source of the data. For example, start with Ensembl or Entrez IDs, which are unique, in place of gene symbols. These IDs were introduced precisely because of the discrepancies that can exist between gene symbols. Dealing with the issue now is problematic.
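
As an illustration of working from IDs, here is a minimal sketch that assumes the count matrix has Ensembl gene IDs as row names. The count values and sample names are made up, and `annot` is a hand-built stand-in for a getBM() result using rows shown in this thread.

```r
# Stand-in annotation table keyed by Ensembl gene ID.
annot <- data.frame(
    ensembl_gene_id = c("ENSG00000012048", "ENSG00000229807", "ENSG00000154997"),
    hgnc_symbol = c("BRCA1", "XIST", "SEPTIN14"),
    gene_biotype = c("protein_coding", "lncRNA", "protein_coding"),
    stringsAsFactors = FALSE
)

# Toy count matrix with Ensembl IDs as row names (values are invented).
rawcount <- matrix(
    c(10L, 0L, 5L, 3L, 7L, 2L), nrow = 3,
    dimnames = list(
        c("ENSG00000154997", "ENSG00000229807", "ENSG00000012048"),
        c("sample1", "sample2"))
)

# Match on the stable ID, so a renamed symbol (SEPT14 vs SEPTIN14) does not matter.
biotype <- annot$gene_biotype[match(rownames(rawcount), annot$ensembl_gene_id)]
is_coding <- !is.na(biotype) & biotype == "protein_coding"

coding_counts    <- rawcount[is_coding, , drop = FALSE]
noncoding_counts <- rawcount[!is.na(biotype) & !is_coding, , drop = FALSE]
```

Rows whose ID is missing from the annotation end up in neither subset, which keeps them from being silently counted as non-coding.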

ADD REPLY • link 4.0 years ago Kevin Blighe ★ 4.0k