DeSeq2 results converting ENSG IDs to Gene Symbols, more than one ENSG ID per Gene symbol
1
4
Entering edit mode
casey.rimland ▴ 150
@caseyrimland-14915
Last seen 4.0 years ago
University of Cambridge, National Insti…

I am working with DeSeq2 and trying to convert ENSG IDs to Gene symbols after calling the results function in DeSeq2. When doing this however, I end up with duplicated (or sometimes triplicated or more) gene symbols each corresponding to a different ENSG ID. These duplicated symbols often have the same exact P-adj, logFC values, etc but sometimes they are slightly different.

I am a novice at RNA-Seq analysis so apologies if this is a silly question! I have tried looking on bioconductor, DeSeq vignette, etc and have not been able to figure out a reason or work around for this issue.

An example of one of the genes this is happening with is: AADACL2 which shows up with two ENSG IDs, ENSG00000197953 and ENSG00000261846.

My code is (hopefully) posted here:

#Creating a DeSeq Matrix from Tximport

samples <- read.table(file.path(dir,"samples_v3.txt"), header=TRUE)
files <- file.path(dir,"salmon", paste("salmon_",samples$run, ".txt", sep='') ) load("/Users/rimlandca/Desktop/R files/tx2gene_GRCh38.80.RData") txi <- tximport(files, type="salmon", tx2gene=tx2gene) ddsTxi <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ type)  #Collapsing Technical Replicates ddsColl<- collapseReplicates(ddsTxi, samples$name)

#Pre-Filter

ddsColl <- ddsColl[ rowSums(counts(ddsColl)) > 1, ]

#Subset for protein coding genes (we are only interested in protein coding genes)

ddsCollinfo <- data.frame(getBM(attributes=c("ensembl_gene_id","hgnc_symbol", "transcript_biotype"), filters="ensembl_gene_id",values=rownames(ddsColl), mart=ensembl))

ddsCollinfocoding <- subset(ddsCollinfo, ddsCollinfo$transcript_biotype == "protein_coding") ddsCollcoding <- subset(ddsColl, rownames(ddsColl) %in% ddsCollinfocoding$ensembl_gene_id)

#Run DeSeq on just a subset of the biological groups

ddsCollcoding_CBDonly<-ddsCollcoding[, ddsCollcoding$origin == "CBD"] ddsCollcoding_CBDonly$type<-droplevels(ddsCollcoding_CBDonly$type) dds_DGE_CBD <- DESeq(ddsCollcoding_CBDonly) resultsCtCc <- results(dds_DGE_CBD, contrast=c("type", "CBD_t", "CBD_c"), alpha=0.05) summary(resultsCtCc) #Add Gene Symbols to Results output and export library( org.Hs.eg.db ) library(AnnotationDbi) resultsCtCc$symbol <- mapIds(org.Hs.eg.db, keys=row.names(resultsCtCc), column="SYMBOL", keytype="ENSEMBL", multiVals="first")

resultsCtCc.df <- as.data.frame(resultsCtCc)

write.csv(resultsGtGc.df, file="/Users/rimlandca/Desktop/R Files/resultsGtGc.csv")​

deseq2 ensembl • 7.9k views
9
Entering edit mode
@james-w-macdonald-5106
Last seen 16 hours ago
United States

You will often find things like this, because the human genome is a complicated place and simple things like a gene symbol are not sufficient to uniquely identify a gene. Using the example you give, we can explore things.

> select(org.Hs.eg.db, c("ENSG00000197953","ENSG00000261846"), c("SYMBOL","GENENAME","ENTREZID"),"ENSEMBL")
'select()' returned 1:1 mapping between keys and columns
ENSEMBL  SYMBOL                         GENENAME ENTREZID
1 ENSG00000197953 AADACL2 arylacetamide deacetylase like 2   344752
2 ENSG00000261846 AADACL2 arylacetamide deacetylase like 2   344752

So it appears that NCBI thinks this is one single gene, and EBI thinks it's two. This is a consistent issue - different people think different things, and NCBI and EBI are different groups that don't always agree on things. For example, if we use the Homo.sapiens package to see where this gene lives, we get this:

> select(Homo.sapiens, c("ENSG00000197953","ENSG00000261846"), c("CDSCHROM","CDSSTART","CDSEND"), "ENSEMBL")
'select()' returned 1:many mapping between keys and columns
ENSEMBL CDSCHROM  CDSSTART    CDSEND
1  ENSG00000197953     chr3 151451824 151451961
2  ENSG00000197953     chr3 151458434 151458656
3  ENSG00000197953     chr3 151461881 151461950
4  ENSG00000197953     chr3 151463297 151463468
5  ENSG00000197953     chr3 151474780 151475382
6  ENSG00000197953     chr3 151474816 151475382
7  ENSG00000261846     chr3 151451824 151451961
8  ENSG00000261846     chr3 151458434 151458656
9  ENSG00000261846     chr3 151461881 151461950
10 ENSG00000261846     chr3 151463297 151463468
11 ENSG00000261846     chr3 151474780 151475382
12 ENSG00000261846     chr3 151474816 151475382

But if we use an EnsDb package (I just have an old one - you would do well to get an updated version), we get this:

> select(EnsDb.Hsapiens.v79, c("ENSG00000197953","ENSG00000261846"), c("GENEID","SEQNAME","GENESEQSTART","GENESEQEND"), "GENEID")
GENEID             SEQNAME GENESEQSTART GENESEQEND
1 ENSG00000197953                   3    151733916  151761339
2 ENSG00000261846 CHR_HSCHR3_1_CTG2_1    151744454  151770036

So EBI/Ensembl think this gene lives on an unplaced scaffold or haplotype or some such thing as well. In which case they evidently want to give it a different Ensembl ID. But if the sequence is identical, you may get most or all of your reads aligning to both places, in which case you are likely to get very similar fold changes and p-values.

0
Entering edit mode

Hi James,

Thank you for the detailed answer. Do you know what is the best to handle these situations when looking at differential expression between two groups? For instance I would like to state that biological group A has X number of unique genes differentially expressed from biological group B. Then do follow up analyses on a significant gene list using GO. And also make heat maps. However, when I go to make heatmaps, etc the duplicated genes cause issues as they show up more than once if using the gene symbols.

Is it best to delete the duplicated gene symbols? Or is there a way to merge them together? Or is it best to keep them as is?

This was a code I had tried to delete any gene symbols which were duplicated. When I do that, it is taking 20,000 protein coding genes down to 18,000 or so however, which seems like an awful lot of duplicates.

Thanks again!

resultsCtCc.df <- as.data.frame(resultsCtCc) !duplicated(resultsCtCc.df$symbol) index<- which(!duplicated(resultsCtCc.df$symbol)) resultsCtCc_subset <- resultsCtCc.df[index, ]

1
Entering edit mode

Any reasonable GO analysis software will automatically remove any duplicates, so that shouldn't be an issue. As for having duplicate symbols in a heatmap, you could simply remove any of the dups before you generate the heatmap.

I am in general not concerned with duplicate gene symbols because I don't really expect them to be unique. So dups may mean you measured the same gene twice, or it may mean (in the case of AADACL2) that at least one group thinks there are two versions of the gene in the genome that are different enough to require different IDs. That to me isn't something to be ignored, or 'fixed' but instead something to be explored and explained.

In other words, there are multiple groups of people who spend much of their time deciding how many genes there are, where they are located in the genome, etc, etc. And for somewhere around 2000 genes they have decided (for maybe 2000 different reasons) that things with the same gene symbol should have more than one (in this case, Ensembl) ID. Since I haven't spent any time investigating any of those genes, I would be hesitant to simply decide that I could simply 'fix' the problem by removing one of the duplicates. So in my analyses I don't.

0
Entering edit mode

Fair enough, that makes sense. So in your own analyses if you were to report on the total number of genes that are differentially expressed between groups, would you consider these duplicated gene instances as unique genes? Or as one gene?

For instance in my data set Tissue A has 700 genes up-regulated compared to Tissue B, however if I took out duplicates it may be 620 lets say (made up these numbers, may be higher or lower in reality). Would you report the 700 number? Or the 640? Or would you just go and investigate what the significant, duplicated genes were if its a manageable number?

1
Entering edit mode

I would report the 700 genes. But I consider the Ensembl IDs the genes, and not the HUGO symbols. That's probably because I'm a statistician. Seems to me that biologists tend to identify with gene symbols, to their detriment.

If one of my colleagues were to then come to me and ask 'wassup with all these dups', I would explain pretty much what I have already explained to you, and for those genes that were of interest, I would then provide more information to help figure out how different the duplicated symbols really are, and if it is worth it (for my colleague) to consider them separately or not.

0
Entering edit mode

Okay, thank you. That seems reasonable and helps clarify the issue a lot. It also makes me feel better to know that the duplicates are not arising due to something incorrect in my code which was another worry I had!

1
Entering edit mode

hehe, I would totally come to you and say: 'wassup with all these dups'!

Thanks again for taking the time to answer :D