Question: Why am I getting many differentially expressed genes that have no symbol or do not map to a chromosome (RNA-seq)
0
3 months ago by
melanie.girard0 wrote:

Hi everyone!

I did differential gene expression analysis on an RNA-seq dataset. In my top genes, I am getting a lot of genes with no symbol or that do not map to a chromosome and I am not sure if this is normal? I also get mitochondrial DNA-encoded genes ... I feel like something is wrong i.e maybe I am using the wrong annotation files etc?

For example

Gm4724                  NA
Gm14434         NA
Gm2007                  NA
0610010B08Rik   NA
Gm14326        chr2
Gm14325        chr2
Gm14305        chr2
Prdm16                 chr4
Gm4631                 NA
Gm14308        chr2


Briefly, I started off with the BAM files (alignment was done by sequencing facility using STAR) and

counts<-featureCounts(files=c("WGTR_0001_Le_n_WT_OT1DTAM1.merged.sorted.bam", "WGTR_0001_Le_n_WT_OT1DTAM2.merged.sorted.bam", "WGTR_0001_Le_n_WT_OT1DTAM3.merged.sorted.bam", "WGTR_0003_Le_n_WT_OT1GITRKODTAM1.merged.sorted.bam", "WGTR_0003_Le_n_WT_OT1GITRKODTAM2.merged.sorted.bam", "WGTR_0003_Le_n_WT_OT1GITRKODTAM3.merged.sorted.bam"),annot.inbuilt="mm10", isPairedEnd=TRUE)


2) Created DGE list and added symbols and chromosomes

DGEListGITR <- DGEList(counts=counts$counts, genes=counts$annotation[,c("GeneID", "Length")], group=group)
geneid <- rownames(DGEListGITR)
genes <- select(Mus.musculus, keys=geneid, columns=c("SYMBOL", "TXCHROM"), keytype="ENTREZID")
genes <- genes[!duplicated(genes$ENTREZID),] DGEListGITR$genes <- genes


3) Performed differential gene expression analysis using the QLF test method

Melanie

edger • 174 views
modified 3 months ago by Gordon Smyth38k • written 3 months ago by melanie.girard0
Answer: Why am I getting many differentially expressed genes that have no symbol or do n
0
3 months ago by
Gordon Smyth38k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth38k wrote:

It appears that Rsubread and edgeR have worked correctly, but (1) the Mus.musculus package is giving some missing annotation results and (2) you have not matched up ENTREZIDs from Mus.musculus with the corresponding row.names in your DGEList object (so your annotation could potentially be out of order).

You will have no such problems if you add annotation using the method that we recommend in the edgeR User's Guide or in the edgeR QL workflow:

DGEListGITR <- DGEList( as you did )
DGEListGITR$genes$SYMBOL <- mapIds(org.Mm.eg.db, row.names(DGEListGITR),
keytype="ENTREZID", column="SYMBOL")
DGEListGITR$genes$CHR <- mapIds(org.Mm.eg.db, row.names(DGEListGITR),
keytype="ENTREZID", column="CHR")


If you get a Deprecation warning, just ignore it.

There will still be a few NA symbols. Just delete these because they represent old Entrez Ids that have been retired in the past 2-3 years. You will still get entries like Gm4724 because that is a valid official Symbol, but you won't get 0610010B08Rik and you will never get an NA chromosome unless the Symbol is also missing.

The org.Mm.eg.db package contains canonical gene-level information from the NCBI for each Entrez Id and I use it myself gene-level analyses. I don't know why what Mus.musculus returns is in conflict with the org package and, mostly importantly, why it returns NA chromosome values even when it has connected an Entrez Id to the correct gene. I do not personally use transcript annotation packages for gene-level annotation.

BTW, it is always best to update to the latest version of R (which is 3.6.1, but 3.6.0 would also be ok) and the latest version of Bioconductor. A couple of the results you list (e.g. 0610010B08Rik) suggest that you are not using the latest version of the Mus.musculus package.

The Mus.musculus package isn't buggy. As a developer I would imagine you might find disrespectful if someone said that about one of your packages, particularly if it wasn't a bug, but instead was something documented in the help page that you apparently ignored.

From ?select


select will retrieve the data as a data.frame based on parameters for selected keys columns and keytype arguments. Users should be warned that if you call select and request columns that have multiple matches for your keys, select will return a data.frame with one row for each possible match. This has the effect that if you request multiple columns and some of them have a many to one relationship to the keys, things will continue to multiply accordingly. So it's not a good idea to request a large number of columns unless you know that what you are asking for should have a one to one relationship with the initial set of keys. In general, if you need to retrieve a column (like GO) that has a many to one relationship to the original keys, it is most useful to extract that separately.



This isn't a bug - it's how databases work. If you ask for multiple joins, and there are many-to-one matches, you will get ever increasing numbers of rows.

Hi Jim,

It's always good to see Bioconductor documentation but I fail to see the relevance in this case. For one thing, there was no criticism of the select function in my post. For another, the problems reported by OP could not have arisen by any of the issues that you mention. In particular, it is impossible to disassociate chromosome from symbol for any gene by using select with two columns, so OP's main problem, which is getting gene symbols without a chromosome, could not possibly has arisen that way. The advice in the help page you quote, to run select multiple times, would simply complicate OP's workflow for no purpose.

The problem is rather with the content of the Mus.musculus package itself.

I use Entrez Gene Ids in most of my workflows and I like to use the canonical gene-level annotation provided by the NCBI through its Entrez Gene website or its gene_info files. Every Entrez Id has a gene symbol and a chromosome. There is a 1-1 correspondence between Entrez Ids and symbols. For mouse, every gene is found on exactly one chromosome except for seven genes that are on both the X and Y chromosomes. Exactly the same mouse gene information as provided by NCBI can be found on the MGI website.

The NCBI/MGI mouse gene information is faithfully provided by the Bioconductor organism package org.Mm.eg.db. For example, for the genes shown by the OP we get:

> EntrezId
[1] "100043915" "668039"    "102639653" "665211"    "329575"
[6] "100043387" "70673"     "100043764" "100043381"
> select(org.Mm.eg.db,keys=EntrezId,columns=c("SYMBOL","CHR"),keytype="ENTREZID")
'select()' returned 1:1 mapping between keys and columns
ENTREZID  SYMBOL CHR
1 100043915  Gm4724   2
2    668039 Gm14434   2
3 102639653  Gm2007   2
4    665211 Gm14326   2
5    329575 Gm14325   2
6 100043387 Gm14305   2
7     70673  Prdm16   4
8 100043764  Gm4631   2
9 100043381 Gm14308   2


and all Entrez Ids have both symbols and chromosomes. It is easy to verify that all the SYMBOL and CHR values returned do indeed correspond to current NCBI/MGI annotation.

On the other hand, using the Mus.musculus package for the same genes gives:

> select(Mus.musculus,keys=EntrezId,columns=c("SYMBOL","TXCHROM"),keytype="ENTREZID")
'select()' returned 1:1 mapping between keys and columns
ENTREZID  SYMBOL TXCHROM
1 100043915  Gm4724    <NA>
2    668039 Gm14434    chr2
3 102639653  Gm2007    <NA>
4    665211 Gm14326    chr2
5    329575 Gm14325    chr2
6 100043387 Gm14305    chr2
7     70673  Prdm16    chr4
8 100043764  Gm4631    <NA>
9 100043381 Gm14308    chr2


and now we have NA values as shown by the OP.

I see no logical explanation for why a geneid should ever have a valid symbol but not a chromosome. Any gene model, even a predicted gene, must be associated with a chromosome. If the Mus.musculus package didn't find any information for a particular Entrez Id, then I might put the problem down to differences in annotation databases. On the other hand, finding a symbol for a gene but not a chromosome seems to be a clear mistake. You don't like the word "buggy" (and fair enough, I have edited my post to remove the word), but this behaviour really does seem to me to be wrong.

As a developer, I find this issue very frustrating. For many years, I have been singing the praises of the org.Mm.eg.db and org.Hs.eg.db packages. But for the past few years they have issued a deprecation warning:

Warning message:
In .deprecatedColsMessage() : Accessing gene location information via
'CHR','CHRLOC','CHRLOCEND' is deprecated. Please use a range
based accessor like genes(), or select() with columns values
like TXCHROM and TXSTART on a TxDb or OrganismDb object


pushing users to packages like Mus.musculus instead. And then I have to spend my time explaining to users why they have NA chromosomes. I think it is unfortunate that users doing a gene-level analysis are being dissuaded from using the well-recognized and well understood NCBI/MGI gene-level annotation that is contained in org.Mm.eg.db.

Best wishes
Gordon

Hi Gordon,

You make a fair point - I didn't read the OP's post as closely as I should have, and seeing your use of mapIds and the OP's use of duplicated I assumed that the issue was loss of information due to multi-mapping probes.

That's clearly not the case, and it is unfortunate that the mappings provided by the Mus.musculus package omit the chromosomes.. I do appreciate your edits, as this isn't a bug, but is due to mappings done at UCSC.

When asking the Mus.musculus package for both the symbol and the chromosome, one mapping (Gene ID to symbol) is handled by the org.Mm.eg.db package, and the other (Gene ID to chrom) is handled by the TxDb package. And therein lies the rub. The TxDb package is based on UCSC's knownGene table, which paradoxically uses Ensembl transcript IDs for the central ID, rather than NCBI transcript identifiers. So there is an extra step required when generating the TxDb package, where the Ensembl IDs have to be mapped to NCBI Gene IDs (maybe via RefSeq IDs? I haven't looked closely enough to know that for sure). The central IDs for the TxDb.Mmusculus.UCSC.m10.knownGene package are the Gene IDs, but that is after mapping the knownGene table IDs to Gene IDs.

One might assume that simply replacing the knownGene based TxDb with a RefSeq TxDb might help, but not so much.

> Mus.musculus
OrganismDb Object:
# Includes GODb Object:  GO.db
# With data about:  Gene Ontology
# Includes OrgDb Object:  org.Mm.eg.db
# Gene data about:  Mus musculus
# Taxonomy Id:  10090
# Includes TxDb Object:  TxDb.Mmusculus.UCSC.mm10.refGene  <--- refGene TxDb
# Transcriptome data about:  Mus musculus
# Based on genome:  mm10
# The OrgDb gene id ENTREZID is mapped to the TxDb gene id GENEID .
> select(Mus.musculus, Entrezid, c("SYMBOL","TXCHROM"), "ENTREZID")
'select()' returned 1:1 mapping between keys and columns
ENTREZID  SYMBOL TXCHROM
1 100043915  Gm4724    chr2
2    668039 Gm14434    chr2
3 102639653  Gm2007    <NA>
4    665211 Gm14326    chr2
5    329575 Gm14325    chr2
6 100043387 Gm14305    chr2
7     70673  Prdm16    chr4
8 100043764  Gm4631    <NA>
9 100043381 Gm14308    chr2



And this is because the refGene table doesn't have mappings for the transcripts that map to those Gene IDs

> select(org.Mm.eg.db, "102639653", "REFSEQ")
'select()' returned 1:many mapping between keys and columns
ENTREZID       REFSEQ
1 102639653 XM_017319407
2 102639653 XP_017174896
> con <- dbConnect(MySQL(), user="genome", host="genome-mysql.soe.ucsc.edu", dbname="mm10")
> dbGetQuery(con, "select * from refGene where name in ('XM_017319407','XP_017174896');")
[1] bin          name         chrom        strand       txStart      txEnd        cdsStart     cdsEnd       exonCount    exonStarts   exonEnds
[12] score        name2        cdsStartStat cdsEndStat   exonFrames
<0 rows> (or 0-length row.names)


But this is not true for other genes

> select(org.Mm.eg.db, "100043915", "REFSEQ")
'select()' returned 1:many mapping between keys and columns
ENTREZID       REFSEQ
1 100043915 NM_001256480
2 100043915 NP_001243409
3 100043915 XM_017314489
4 100043915 XP_017169978

> dbGetQuery(con, "select name, chrom, strand from refGene where name in ('NM_001256480','NP_001243409','XM_017314489','XP_017169978');")
name chrom strand
1 NM_001256480  chr2      -
2 NM_001256480  chr2      -
3 NM_001256480  chr2      +
4 NM_001256480  chr2      +
5 NM_001256480  chr2      -
6 NM_001256480  chr2      -



The location data has been deprecated (but not removed, and may never be?) is due to a couple of things. First, the TxDb packages are intended to provide positional information, so the OrgDb packages are duplicating data, and second, the OrgDb packages have to do with functional annotations, and chromosomal locations are not functional descriptions.

Best,

Jim