Hi everyone!
I did differential gene expression analysis on an RNA-seq dataset. In my top genes, I am getting a lot of genes with no symbol or that do not map to a chromosome and I am not sure if this is normal? I also get mitochondrial DNA-encoded genes ... I feel like something is wrong i.e maybe I am using the wrong annotation files etc?
For example
Gm4724 NA
Gm14434 NA
Gm2007 NA
0610010B08Rik NA
Gm14326 chr2
Gm14325 chr2
Gm14305 chr2
Prdm16 chr4
Gm4631 NA
Gm14308 chr2
Briefly, I started off with the BAM files (alignment was done by sequencing facility using STAR) and
1) Generated read counts with RSubread featureCounts
counts<-featureCounts(files=c("WGTR_0001_Le_n_WT_OT1DTAM1.merged.sorted.bam", "WGTR_0001_Le_n_WT_OT1DTAM2.merged.sorted.bam", "WGTR_0001_Le_n_WT_OT1DTAM3.merged.sorted.bam", "WGTR_0003_Le_n_WT_OT1GITRKODTAM1.merged.sorted.bam", "WGTR_0003_Le_n_WT_OT1GITRKODTAM2.merged.sorted.bam", "WGTR_0003_Le_n_WT_OT1GITRKODTAM3.merged.sorted.bam"),annot.inbuilt="mm10", isPairedEnd=TRUE)
2) Created DGE list and added symbols and chromosomes
DGEListGITR <- DGEList(counts=counts$counts, genes=counts$annotation[,c("GeneID", "Length")], group=group)
geneid <- rownames(DGEListGITR)
genes <- select(Mus.musculus, keys=geneid, columns=c("SYMBOL", "TXCHROM"), keytype="ENTREZID")
genes <- genes[!duplicated(genes$ENTREZID),]
DGEListGITR$genes <- genes
3) Performed differential gene expression analysis using the QLF test method
Melanie
The Mus.musculus package isn't buggy. As a developer I would imagine you might find disrespectful if someone said that about one of your packages, particularly if it wasn't a bug, but instead was something documented in the help page that you apparently ignored.
From ?select
This isn't a bug - it's how databases work. If you ask for multiple joins, and there are many-to-one matches, you will get ever increasing numbers of rows.
Hi Jim,
It's always good to see Bioconductor documentation but I fail to see the relevance in this case. For one thing, there was no criticism of the
select
function in my post. For another, the problems reported by OP could not have arisen by any of the issues that you mention. In particular, it is impossible to disassociate chromosome from symbol for any gene by usingselect
with two columns, so OP's main problem, which is getting gene symbols without a chromosome, could not possibly has arisen that way. The advice in the help page you quote, to runselect
multiple times, would simply complicate OP's workflow for no purpose.The problem is rather with the content of the Mus.musculus package itself.
I use Entrez Gene Ids in most of my workflows and I like to use the canonical gene-level annotation provided by the NCBI through its Entrez Gene website or its gene_info files. Every Entrez Id has a gene symbol and a chromosome. There is a 1-1 correspondence between Entrez Ids and symbols. For mouse, every gene is found on exactly one chromosome except for seven genes that are on both the X and Y chromosomes. Exactly the same mouse gene information as provided by NCBI can be found on the MGI website.
The NCBI/MGI mouse gene information is faithfully provided by the Bioconductor organism package org.Mm.eg.db. For example, for the genes shown by the OP we get:
and all Entrez Ids have both symbols and chromosomes. It is easy to verify that all the SYMBOL and CHR values returned do indeed correspond to current NCBI/MGI annotation.
On the other hand, using the Mus.musculus package for the same genes gives:
and now we have NA values as shown by the OP.
I see no logical explanation for why a geneid should ever have a valid symbol but not a chromosome. Any gene model, even a predicted gene, must be associated with a chromosome. If the Mus.musculus package didn't find any information for a particular Entrez Id, then I might put the problem down to differences in annotation databases. On the other hand, finding a symbol for a gene but not a chromosome seems to be a clear mistake. You don't like the word "buggy" (and fair enough, I have edited my post to remove the word), but this behaviour really does seem to me to be wrong.
As a developer, I find this issue very frustrating. For many years, I have been singing the praises of the org.Mm.eg.db and org.Hs.eg.db packages. But for the past few years they have issued a deprecation warning:
pushing users to packages like Mus.musculus instead. And then I have to spend my time explaining to users why they have NA chromosomes. I think it is unfortunate that users doing a gene-level analysis are being dissuaded from using the well-recognized and well understood NCBI/MGI gene-level annotation that is contained in org.Mm.eg.db.
Best wishes
Gordon
Hi Gordon,
You make a fair point - I didn't read the OP's post as closely as I should have, and seeing your use of
mapIds
and the OP's use ofduplicated
I assumed that the issue was loss of information due to multi-mapping probes.That's clearly not the case, and it is unfortunate that the mappings provided by the Mus.musculus package omit the chromosomes.. I do appreciate your edits, as this isn't a bug, but is due to mappings done at UCSC.
When asking the Mus.musculus package for both the symbol and the chromosome, one mapping (Gene ID to symbol) is handled by the org.Mm.eg.db package, and the other (Gene ID to chrom) is handled by the TxDb package. And therein lies the rub. The TxDb package is based on UCSC's knownGene table, which paradoxically uses Ensembl transcript IDs for the central ID, rather than NCBI transcript identifiers. So there is an extra step required when generating the TxDb package, where the Ensembl IDs have to be mapped to NCBI Gene IDs (maybe via RefSeq IDs? I haven't looked closely enough to know that for sure). The central IDs for the TxDb.Mmusculus.UCSC.m10.knownGene package are the Gene IDs, but that is after mapping the knownGene table IDs to Gene IDs.
One might assume that simply replacing the knownGene based TxDb with a RefSeq TxDb might help, but not so much.
And this is because the refGene table doesn't have mappings for the transcripts that map to those Gene IDs
But this is not true for other genes
The location data has been deprecated (but not removed, and may never be?) is due to a couple of things. First, the TxDb packages are intended to provide positional information, so the OrgDb packages are duplicating data, and second, the OrgDb packages have to do with functional annotations, and chromosomal locations are not functional descriptions.
Best,
Jim