Question: how to extract "Gene type" from org.Mm.eg.db?
0
gravatar for Guido Hooiveld
4.9 years ago by
Guido Hooiveld2.5k
Wageningen University, Wageningen, the Netherlands
Guido Hooiveld2.5k wrote:

Hi all,

Question: is it possible to extract the "Gene type" information from (e.g.) the org.Mm.eg.db package?

I would like to identify/extract all genes labelled "ncRNA" and "protein coding" in the "Gene type" field at the summary section of the EntrezGene database.

See e.g. (for ncRNA) http://www.ncbi.nlm.nih.gov/gene/?term=102638436 and (for protein coding) http://www.ncbi.nlm.nih.gov/gene/12773

I checked with keytypes, but could not find this "Gene type" field, but likely this is not the way to check for this.

>  keytypes(Mus.musculus)
 [1] "GOID"         "TERM"         "ONTOLOGY"     "DEFINITION"   "ENTREZID"     "PFAM"         "IPI"          "PROSITE"     
 [9] "ACCNUM"       "ALIAS"        "CHR"          "CHRLOC"       "CHRLOCEND"    "ENZYME"       "PATH"         "PMID"        
[17] "REFSEQ"       "SYMBOL"       "UNIGENE"      "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "GENENAME"     "UNIPROT"     
[25] "GO"           "EVIDENCE"     "GOALL"        "EVIDENCEALL"  "ONTOLOGYALL"  "MGI"          "GENEID"       "TXID"        
[33] "TXNAME"       "EXONID"       "EXONNAME"     "CDSID"        "CDSNAME"     
>

Any hints would be appreciated!

Thanks,

Guido

annotation org.mm.eg.db • 2.6k views
ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by Guido Hooiveld2.5k

Sorry, I obviously should have used the columns function, but that does not reveal this info either...

>  columns(Mus.musculus)
 [1] "GOID"         "TERM"         "ONTOLOGY"     "DEFINITION"   "ENTREZID"     "PFAM"         "IPI"          "PROSITE"     
 [9] "ACCNUM"       "ALIAS"        "CHR"          "CHRLOC"       "CHRLOCEND"    "ENZYME"       "PATH"         "PMID"        
[17] "REFSEQ"       "SYMBOL"       "UNIGENE"      "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "GENENAME"     "UNIPROT"     
[25] "GO"           "EVIDENCE"     "GOALL"        "EVIDENCEALL"  "ONTOLOGYALL"  "MGI"          "CDSID"        "CDSNAME"     
[33] "CDSCHROM"     "CDSSTRAND"    "CDSSTART"     "CDSEND"       "EXONID"       "EXONNAME"     "EXONCHROM"    "EXONSTRAND"  
[41] "EXONSTART"    "EXONEND"      "GENEID"       "TXID"         "EXONRANK"     "TXNAME"       "TXCHROM"      "TXSTRAND"    
[49] "TXSTART"      "TXEND"       
>
ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by Guido Hooiveld2.5k
Answer: how to extract "Gene type" from org.Mm.eg.db?
6
gravatar for Hervé Pagès
4.9 years ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi Guido,

A gene can have more than 1 transcript, and each transcript can be "protein coding" or not. FWIW here is some code that produces a 4-column data.frame with 1 row per gene. The 1st column is the gene id (Entrez Gene), the 2nd column its nb of transcripts, and the 3rd and 4th column the nb of coding and non-coding transcripts, respectively. All the information we need for making this data.frame is extracted from the TxDb.Mmusculus.UCSC.mm10.knownGene package. Transcripts with no CDS are considered to be non-coding.

summarizeProteinCodingGenes <- function(txdb)
{
    stopifnot(is(txdb, "TxDb"))
    protein_coding_tx <- names(cdsBy(txdb, use.names=TRUE))
    all_tx <- mcols(transcripts(txdb, columns=c("gene_id", "tx_name")))
    all_tx$gene_id <- as.character(all_tx$gene_id)
    all_tx$is_coding <- all_tx$tx_name %in% protein_coding_tx
    tmp <- splitAsList(all_tx$is_coding, all_tx$gene_id)
    gene <- names(tmp)
    nb_tx <- unname(elementLengths(tmp))
    nb_coding <- unname(sum(tmp))
    nb_non_coding <- nb_tx - nb_coding
    data.frame(gene, nb_tx, nb_coding, nb_non_coding, stringsAsFactors=FALSE)
}

Then:

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
df <- summarizeProteinCodingGenes(txdb)
head(df)
##        gene nb_tx nb_coding nb_non_coding
## 1 100009600     2         2             0
## 2 100009609     1         1             0
## 3 100009614     1         1             0
## 4 100009664     1         0             1
## 5    100012     1         1             0
## 6    100017     2         1             1

Hope this helps.

H.

 

ADD COMMENTlink written 4.9 years ago by Hervé Pagès ♦♦ 14k
Answer: how to extract "Gene type" from org.Mm.eg.db?
2
gravatar for Gordon Smyth
4.9 years ago by
Gordon Smyth39k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth39k wrote:

No, it isn't possible to get it from org.Mm.eg.db. To get NCBI's "gene type" annotation you need to download the gene information file directly from NCBI:

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Mus_musculus.gene_info.gz

The org.Mm.eg.db package is closely based on this file but the "Gene type" column has been omitted. No doubt there is a reason for that, but I don't know what it is. I personally would have liked for it to be included.

ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by Gordon Smyth39k
Answer: how to extract "Gene type" from org.Mm.eg.db?
0
gravatar for Guido Hooiveld
4.9 years ago by
Guido Hooiveld2.5k
Wageningen University, Wageningen, the Netherlands
Guido Hooiveld2.5k wrote:

Dear Herve and Gordon,

Thanks very much for your answers, also for the code example.

For now I made use of the gene information downloaded from NCBI, but Herve's approach is also very handy because it easily provides info on the level of transcripts (which I don't need yet).

Guido

 

My code (for the archive):

# download Gene Info directly from NCBI

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz  #Note that this file contains info for all genes/species, not only Mm


#After downloading, set taxonomy ID for column 1 (e.g. 9606=Hs, 10090=Mm), and extract relevant columns (2=GeneID, 3=Symbol, 10=Type_of_gene). Do this using AWK/*nix.

gzip -cd gene_info.gz | awk 'BEGIN {FS="\t"} $1==10090 {print $2 "\t" $3 "\t" $10}' > geneInfo.txt


# Then load resulting file into R-session:

> genes<-read.table("geneInfo.txt",sep="\t",quote="\"",na.strings="-",fill=TRUE, col.names=c("GeneID","Symbol","TypeOfGene"))
> dim(genes)
[1] 69281     3
>

> head(genes)
  GeneID Symbol     TypeOfGene
1  11287    Pzp protein-coding
2  11298  Aanat protein-coding
3  11302   Aatk protein-coding
4  11303  Abca1 protein-coding
5  11304  Abca4 protein-coding
6  11305  Abca2 protein-coding
> tail(genes)
         GeneID    Symbol TypeOfGene
69276 104795667    Mir935      ncRNA
69277 104795949   Mir9769      ncRNA
69278 104796139 Mir6967-2      ncRNA
69279 104797221   Mir3552      ncRNA
69280 104797311   Mir9768      ncRNA
69281 104797374 Mir466c-3      ncRNA
>

 

ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by Guido Hooiveld2.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 404 users visited in the last hour