Question

how to extract "Gene type" from org.Mm.eg.db?

0

Entering edit mode

Guido Hooiveld ★ 4.1k

@guido-hooiveld-2020

Last seen 20 hours ago

Wageningen University, Wageningen, the …

Hi all,

Question: is it possible to extract the "Gene type" information from (e.g.) the org.Mm.eg.db package?

I would like to identify/extract all genes labelled "ncRNA" and "protein coding" in the "Gene type" field at the summary section of the EntrezGene database.

See e.g. (for ncRNA) http://www.ncbi.nlm.nih.gov/gene/?term=102638436 and (for protein coding) http://www.ncbi.nlm.nih.gov/gene/12773

I checked with keytypes, but could not find this "Gene type" field, but likely this is not the way to check for this.

>  keytypes(Mus.musculus)
 [1] "GOID"         "TERM"         "ONTOLOGY"     "DEFINITION"   "ENTREZID"     "PFAM"         "IPI"          "PROSITE"     
 [9] "ACCNUM"       "ALIAS"        "CHR"          "CHRLOC"       "CHRLOCEND"    "ENZYME"       "PATH"         "PMID"        
[17] "REFSEQ"       "SYMBOL"       "UNIGENE"      "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "GENENAME"     "UNIPROT"     
[25] "GO"           "EVIDENCE"     "GOALL"        "EVIDENCEALL"  "ONTOLOGYALL"  "MGI"          "GENEID"       "TXID"        
[33] "TXNAME"       "EXONID"       "EXONNAME"     "CDSID"        "CDSNAME"     
>

Any hints would be appreciated!

Thanks,

Guido

annotation org.Mm.eg.db • 5.7k views

ADD COMMENT • link 11.2 years ago Guido Hooiveld ★ 4.1k

0

Entering edit mode

Sorry, I obviously should have used the columns function, but that does not reveal this info either...

>  columns(Mus.musculus)
 [1] "GOID"         "TERM"         "ONTOLOGY"     "DEFINITION"   "ENTREZID"     "PFAM"         "IPI"          "PROSITE"     
 [9] "ACCNUM"       "ALIAS"        "CHR"          "CHRLOC"       "CHRLOCEND"    "ENZYME"       "PATH"         "PMID"        
[17] "REFSEQ"       "SYMBOL"       "UNIGENE"      "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "GENENAME"     "UNIPROT"     
[25] "GO"           "EVIDENCE"     "GOALL"        "EVIDENCEALL"  "ONTOLOGYALL"  "MGI"          "CDSID"        "CDSNAME"     
[33] "CDSCHROM"     "CDSSTRAND"    "CDSSTART"     "CDSEND"       "EXONID"       "EXONNAME"     "EXONCHROM"    "EXONSTRAND"  
[41] "EXONSTART"    "EXONEND"      "GENEID"       "TXID"         "EXONRANK"     "TXNAME"       "TXCHROM"      "TXSTRAND"    
[49] "TXSTART"      "TXEND"       
>

ADD REPLY • link 11.2 years ago Guido Hooiveld ★ 4.1k

score 6 · Answer 1 · 2014-12-11

Hi Guido,

A gene can have more than 1 transcript, and each transcript can be "protein coding" or not. FWIW here is some code that produces a 4-column data.frame with 1 row per gene. The 1st column is the gene id (Entrez Gene), the 2nd column its nb of transcripts, and the 3rd and 4th column the nb of coding and non-coding transcripts, respectively. All the information we need for making this data.frame is extracted from the TxDb.Mmusculus.UCSC.mm10.knownGene package. Transcripts with no CDS are considered to be non-coding.

summarizeProteinCodingGenes <- function(txdb)
{
    stopifnot(is(txdb, "TxDb"))
    protein_coding_tx <- names(cdsBy(txdb, use.names=TRUE))
    all_tx <- mcols(transcripts(txdb, columns=c("gene_id", "tx_name")))
    all_tx$gene_id <- as.character(all_tx$gene_id)
    all_tx$is_coding <- all_tx$tx_name %in% protein_coding_tx
    tmp <- splitAsList(all_tx$is_coding, all_tx$gene_id)
    gene <- names(tmp)
    nb_tx <- unname(elementLengths(tmp))
    nb_coding <- unname(sum(tmp))
    nb_non_coding <- nb_tx - nb_coding
    data.frame(gene, nb_tx, nb_coding, nb_non_coding, stringsAsFactors=FALSE)
}

Then:

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
df <- summarizeProteinCodingGenes(txdb)
head(df)
##        gene nb_tx nb_coding nb_non_coding
## 1 100009600     2         2             0
## 2 100009609     1         1             0
## 3 100009614     1         1             0
## 4 100009664     1         0             1
## 5    100012     1         1             0
## 6    100017     2         1             1

Hope this helps.

H.

score 2 · Answer 2 · 2014-12-11

No, it isn't possible to get it from org.Mm.eg.db. To get NCBI's "gene type" annotation you need to download the gene information file directly from NCBI:

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Mus_musculus.gene_info.gz

The org.Mm.eg.db package is closely based on this file but the "Gene type" column has been omitted. No doubt there is a reason for that, but I don't know what it is. I personally would have liked for it to be included.

score 0 · Answer 3 · 2014-12-15

Dear Herve and Gordon,

Thanks very much for your answers, also for the code example.

For now I made use of the gene information downloaded from NCBI, but Herve's approach is also very handy because it easily provides info on the level of transcripts (which I don't need yet).

Guido

My code (for the archive):

# download Gene Info directly from NCBI

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz  #Note that this file contains info for all genes/species, not only Mm


#After downloading, set taxonomy ID for column 1 (e.g. 9606=Hs, 10090=Mm), and extract relevant columns (2=GeneID, 3=Symbol, 10=Type_of_gene). Do this using AWK/*nix.

gzip -cd gene_info.gz | awk 'BEGIN {FS="\t"} $1==10090 {print $2 "\t" $3 "\t" $10}' > geneInfo.txt


# Then load resulting file into R-session:

> genes<-read.table("geneInfo.txt",sep="\t",quote="\"",na.strings="-",fill=TRUE, col.names=c("GeneID","Symbol","TypeOfGene"))
> dim(genes)
[1] 69281     3
>

> head(genes)
  GeneID Symbol     TypeOfGene
1  11287    Pzp protein-coding
2  11298  Aanat protein-coding
3  11302   Aatk protein-coding
4  11303  Abca1 protein-coding
5  11304  Abca4 protein-coding
6  11305  Abca2 protein-coding
> tail(genes)
         GeneID    Symbol TypeOfGene
69276 104795667    Mir935      ncRNA
69277 104795949   Mir9769      ncRNA
69278 104796139 Mir6967-2      ncRNA
69279 104797221   Mir3552      ncRNA
69280 104797311   Mir9768      ncRNA
69281 104797374 Mir466c-3      ncRNA
>