Question

ensembl to hgnc symbols creates duplicates

Matan G. ▴ 60

Hi all,

My data is a data frame of estimated TPM values, with genes as rows and samples as columns. I'm using biomaRt (library(biomaRt)) to map Ensembl gene IDs to HGNC symbols. When I try to change the row names from Ensembl IDs to HGNC symbols, I get an error which, as I understand it, stems from duplicates among the HGNC symbols.

The error I get:

Error in .rowNamesDF<-(x, value = value) :
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘’, ‘ABCF2’, ‘LINC01238’, ‘POLR2J3’, ‘POLR2J4’, ‘TBCE’

How can I solve this issue? EDITED: Using .rowNamesDF(TPM_countdata, make.names=TRUE) I've managed to force the row names to be HGNC symbols, but I don't understand why the mapping produces duplicates in the first place rather than unique HGNC names.

Thanks and all the best

data screenshot https://ibb.co/8M853bm

r biomart genemap TPM • 2.2k views

updated 4.0 years ago by Kevin Blighe ★ 4.0k • written 4.0 years ago by Matan G. ▴ 60

score 1 · Answer 1 · 2020-08-07

Hey Matan,

This is expected when mapping between annotation systems, in this case Ensembl to HGNC. To understand why, please look at these answers on Biostars, one of which is from the Ensembl Outreach Project Leader:

What I usually do is merge the Ensembl and HGNC IDs via an underscore '_', which can be removed when it comes to exporting your final result or generating plots.
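As a rough illustration of the underscore approach, here is a minimal sketch in base R. The toy table, the made-up Ensembl IDs, and the variable names (tpm, map, new_names, display) are all assumptions for the example; map stands in for whatever biomaRt returns with the columns ensembl_gene_id and hgnc_symbol. Putting the Ensembl ID first and falling back to the bare Ensembl ID when no symbol exists are also choices made here, not something the answer specifies.

```r
# Toy TPM table; the Ensembl IDs and values are made up for illustration.
tpm <- data.frame(
  sample1 = c(1.2, 0.0, 5.4, 3.3),
  sample2 = c(0.8, 0.1, 4.9, 2.7),
  row.names = c("ENSG00000000001", "ENSG00000000002",
                "ENSG00000000003", "ENSG00000000004")
)

# Hypothetical mapping standing in for a biomaRt result with the
# columns "ensembl_gene_id" and "hgnc_symbol".
map <- data.frame(
  ensembl_gene_id = rownames(tpm),
  hgnc_symbol = c("ABCF2", "ABCF2", "", "TBCE"),
  stringsAsFactors = FALSE
)

# Symbols that would collide if used directly as row names.
dup_symbols <- unique(map$hgnc_symbol[duplicated(map$hgnc_symbol)])

# Align the mapping to the row order of the TPM table.
idx <- match(rownames(tpm), map$ensembl_gene_id)
symbol <- map$hgnc_symbol[idx]
symbol[is.na(symbol)] <- ""  # no mapping returned for this ID

# Combined label: unique as long as the Ensembl IDs are unique.
# Rows without a symbol keep the bare Ensembl ID (a choice made here).
new_names <- ifelse(symbol == "",
                    rownames(tpm),
                    paste(rownames(tpm), symbol, sep = "_"))
rownames(tpm) <- new_names

# For export or plotting, strip everything up to the first underscore.
# Ensembl gene IDs contain no underscore, so this leaves the symbol.
display <- sub("^[^_]*_", "", rownames(tpm))
```

Because the Ensembl ID stays in the row name, two Ensembl genes that share one HGNC symbol (or have none) remain distinguishable, which is not the case once make.names has appended suffixes to the symbols.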

Note that, while we define a gene as a static unit, the genome does not behave this way. Transcription is a pervasive process whereby, over millions of years of evolution, certain parts of the genome are transcribed more frequently under certain cellular / environmental conditions, and then translated into proteins. The vast majority of the genome is still transcribed to some level, but can be regarded as background 'transcriptional noise'.

Kevin