Aliases of Drosophila gene names
Chise ▴ 10
@9cb59de3
United States

Hello, I am using a GTF file from the iGenomes site for bulk RNA-seq of the whole Drosophila brain (Drosophila_melanogaster/UCSC/dm6/Annotation/Genes/genes.gtf).

I did the annotation using Rsubread and got a file with gene symbols. However, some genes have the same spelling and differ only in whether the first letter is uppercase or lowercase, and they have different gene IDs (e.g. Crc and crc).

When I search NCBI for "Crc", I am not sure whether it means "cryptocephal" or "Calreticulin". I found a description saying that a Drosophila gene name starts with a lowercase letter if it was named for a recessive mutant and with an uppercase letter if it was named for a dominant mutant, but this was hard to confirm from the search results.

Is there a good way to determine the correct official full name? Or is there a way to get an annotation file with both gene symbol and gene ID? I would appreciate it if someone could let me know.

Drosophila Rsubread dm6 • 1.3k views
@gordon-smyth
WEHI, Melbourne, Australia

I like to download the gene_info file directly from NCBI: https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Invertebrates/Drosophila_melanogaster.gene_info.gz

Then

> library(limma)
> Dm <- read.delim("Drosophila_melanogaster.gene_info.gz",sep="\t")
> Cols <- c("GeneID","Symbol","description","chromosome","type_of_gene")
> Alias <- c("Crc", "crc")
> alias2SymbolUsingNCBI(Alias, Dm, required.columns=Cols)
      GeneID Symbol  description chromosome   type_of_gene
8372   41166   Calr Calreticulin         3R protein-coding
12543  47767    crc cryptocephal         2L protein-coding

alias2SymbolUsingNCBI produces an annotation data.frame with one row for each input alias, in the same order as the input, so it can be combined with your existing annotation without any re-sorting. If the input alias is the official symbol for one gene but also a synonym for another gene, then only the official symbol is output. If the alias can't be found, then a row of NAs is included.
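
The three rules just described (one row per alias, official symbol beats synonym, NAs when nothing matches) can be illustrated with a small base-R sketch. This is not limma's implementation. `resolve_alias` is a hypothetical helper, and the table is a toy stand-in for the gene_info file: the third gene and the Synonyms values are made up for illustration, assuming only that synonyms are separated by "|" and that "-" marks an empty field.

```r
# Toy stand-in for the NCBI gene_info table. The first two rows reuse the
# GeneID/Symbol pairs shown above; "toyGene" and all Synonyms are invented.
gene_info <- data.frame(
  GeneID   = c(41166L, 47767L, 1L),
  Symbol   = c("Calr", "crc", "toyGene"),
  Synonyms = c("Crc", "-", "crc"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: case-sensitive lookup, one output row per input alias.
resolve_alias <- function(alias, info) {
  # 1. An exact match to an official symbol takes priority.
  idx <- match(alias, info$Symbol)
  # 2. Otherwise search the pipe-separated synonyms.
  syn     <- strsplit(info$Synonyms, "|", fixed = TRUE)
  syn_row <- rep(seq_along(syn), lengths(syn))
  syn_idx <- syn_row[match(alias, unlist(syn))]
  idx[is.na(idx)] <- syn_idx[is.na(idx)]
  # 3. Indexing with NA gives a row of NAs for aliases that match nothing.
  out <- info[idx, c("GeneID", "Symbol")]
  rownames(out) <- NULL
  out
}

res <- resolve_alias(c("Crc", "crc", "notagene"), gene_info)
print(res)
```

Here "Crc" resolves through the synonym column to Calr, "crc" resolves to its own official symbol even though the toy gene lists it as a synonym, and "notagene" yields NAs, so the result has exactly as many rows as the input.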

Following on from your previous question a couple of days ago, you can create a DGEList object with all the desired annotation like this:

library(Rsubread)
library(edgeR)   # edgeR depends on limma, which provides alias2SymbolUsingNCBI()
targets <- readTargets("wholeflyseq.txt")
fc <- featureCounts(files=targets$OutputFile, annot.ext="genes.gtf",
      isGTFAnnotationFile=TRUE, isPairedEnd=TRUE)
y <- featureCounts2DGEList(fc)
# Look up each row name of y as an alias; Dm and Cols are defined above
Ann <- alias2SymbolUsingNCBI(row.names(y), Dm, required.columns=Cols)
y$genes <- data.frame(y$genes, Ann)
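
Since the original problem was symbols that differ only by case, it may be worth listing such pairs before any downstream step that lowercases or merges names case-insensitively. A minimal base-R sketch, using a made-up vector of symbols in place of row.names(y):

```r
# Hypothetical example symbols; in practice this would be row.names(y).
symbols <- c("Crc", "crc", "Act5C", "w")

# Symbols whose lowercase form occurs more than once differ only by case.
key <- tolower(symbols)
case_clash <- symbols[key %in% key[duplicated(key)]]
print(case_clash)
```

Any symbols reported this way refer to distinct genes and should be kept apart, for example by carrying the GeneID column alongside the symbol.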

Dear Dr. Gordon Smyth,

Thank you so much for the detailed answer. The problem is completely solved.

alias2SymbolUsingNCBI is amazing indeed. Also, with edgeR, limma, and Rsubread, I was able to complete my PhD work. I deeply appreciate your creating such great packages.

Sincerely,

Chise
