Question: is it possible to extract the "Gene type" information from (e.g.) the org.Mm.eg.db package?
I would like to identify/extract all genes labelled "ncRNA" or "protein coding" in the "Gene type" field of the summary section of the Entrez Gene database.
A gene can have more than one transcript, and each transcript can be "protein coding" or not. FWIW, here is some code that produces a 4-column data.frame with one row per gene. The 1st column is the gene ID (Entrez Gene), the 2nd column its number of transcripts, and the 3rd and 4th columns the number of coding and non-coding transcripts, respectively. All the information needed to build this data.frame is extracted from the TxDb.Mmusculus.UCSC.mm10.knownGene package. Transcripts with no CDS are considered to be non-coding.
No, it isn't possible to get it from org.Mm.eg.db. To get NCBI's "gene type" annotation you need to download the gene information file directly from NCBI:
The org.Mm.eg.db package is closely based on this file but the "Gene type" column has been omitted. No doubt there is a reason for that, but I don't know what it is. I personally would have liked for it to be included.
Thanks very much for your answers, also for the code example.
For now I have used the gene information downloaded from NCBI, but Herve's approach is also very handy because it easily provides info at the transcript level (which I don't need yet).
Guido
My code (for the archive):
# Download gene info directly from NCBI.
# Note that this file contains info for all genes/species, not only Mm.
wget ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
# After downloading, filter on the taxonomy ID in column 1 (e.g. 9606=Hs, 10090=Mm)
# and extract the relevant columns (2=GeneID, 3=Symbol, 10=Type_of_gene) using AWK/*nix.
gzip -cd gene_info.gz | awk 'BEGIN {FS="\t"} $1==10090 {print $2 "\t" $3 "\t" $10}' > geneInfo.txt
# Then load the resulting file into an R session:
> genes <- read.table("geneInfo.txt", sep="\t", quote="\"", na.strings="-", fill=TRUE, col.names=c("GeneID","Symbol","TypeOfGene"))
> dim(genes)
[1] 69281 3
>
> head(genes)
  GeneID Symbol     TypeOfGene
1  11287    Pzp protein-coding
2  11298  Aanat protein-coding
3  11302   Aatk protein-coding
4  11303  Abca1 protein-coding
5  11304  Abca4 protein-coding
6  11305  Abca2 protein-coding
> tail(genes)
         GeneID    Symbol TypeOfGene
69276 104795667    Mir935      ncRNA
69277 104795949   Mir9769      ncRNA
69278 104796139 Mir6967-2      ncRNA
69279 104797221   Mir3552      ncRNA
69280 104797311   Mir9768      ncRNA
69281 104797374 Mir466c-3      ncRNA
>
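A possible follow-up step, also for the archive: the original question asked only for the genes typed "protein-coding" or "ncRNA", and that subset can be pulled out of geneInfo.txt with one more awk filter. This is just a minimal sketch. It assumes the three tab-separated columns written by the awk command above. The small sample file, the temporary directory and the output name geneInfo.selected.txt are made up for illustration, and the "pseudo" row is a placeholder, not a real gene.

```shell
#!/bin/sh
# Illustrative stand-in for geneInfo.txt: GeneID, Symbol, Type_of_gene (tab-separated).
# The first two rows are copied from the R output above; the third row is a
# made-up placeholder so that there is something to filter out.
workdir=$(mktemp -d)
printf '11287\tPzp\tprotein-coding\n'   >  "$workdir/geneInfo.txt"
printf '104795667\tMir935\tncRNA\n'     >> "$workdir/geneInfo.txt"
printf '0\tPlaceholderGene\tpseudo\n'   >> "$workdir/geneInfo.txt"

# Keep only rows whose 3rd column is exactly "protein-coding" or "ncRNA".
awk 'BEGIN {FS="\t"} $3=="protein-coding" || $3=="ncRNA"' \
    "$workdir/geneInfo.txt" > "$workdir/geneInfo.selected.txt"

# Count genes per type in the unfiltered file, as a quick sanity check.
cut -f3 "$workdir/geneInfo.txt" | sort | uniq -c
```

On the real geneInfo.txt, drop the printf lines and point awk at the file directly. The same subset could equally be taken inside R, e.g. genes[genes$TypeOfGene %in% c("protein-coding", "ncRNA"), ].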
Sorry, I obviously should have used the columns function, but that does not reveal this info either...