Filtering read counts matrix: how to deal with duplicated gene symbols, different ENSEMBL ids
es874 ▴ 20

I am analyzing RNA-seq data using edgeR, and my read count matrix is annotated with both gene symbols and ENSEMBL ids. I am finding that some gene symbols are assigned to more than one ENSEMBL id. Some of these are ncRNAs at different locations on the chromosome. How do you deal with such duplicates prior to CPM filtering, TMM normalization, and building the design matrix? Do you sum or average the read counts per duplicated gene across all samples, or do you remove all but one instance and keep the gene with the highest total read count? What is the best practice?

Aaron Lun ★ 28k

As far as I'm concerned - and others may disagree - the ENSEMBL IDs are the reference identifiers. The gene symbols are nice but are only fit for human consumption. If two genes have different ENSEMBL (or Entrez) IDs, then for the purpose of a DE analysis, they are different genes. This generally helps with interpretation, because then you know exactly which gene locus was differentially expressed. Otherwise, if you pool them together and the gene ends up being DE, you'll then have to try to figure out which locus should be targeted for further study; this is especially critical for lncRNAs, where the genomic context is important to the function of the transcript.

The flipside is that you might gain some power if you pool together counts from multiple locations. However, I'd imagine that there wouldn't be many counts for these genes in the first place; if they were truly duplicate sequences, then nothing should have aligned uniquely to them. And, of course, if reads did align uniquely, then there are clearly some differences between the loci, which is an argument for treating them separately.

In summary, unless the duplicated gene symbols are seriously interfering with the interpretation of your DE results (e.g., the top 100 genes are all duplicates of each other), I'd be inclined to let them be.
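To judge whether duplicated symbols are a problem at all, it helps to list which symbols map to more than one Ensembl ID. The following is a minimal Python sketch of that check, not part of the original answer. The annotation dictionary is hypothetical: the two EMG1 IDs are the ones quoted in this thread, and the third entry is a made-up placeholder. In R, a similar check could be built around duplicated() on the symbol column.

```python
from collections import defaultdict

# Hypothetical annotation: Ensembl gene ID -> gene symbol.
# The EMG1 IDs are from the thread; "ENSG_EXAMPLE_1"/"GENE_A" are placeholders.
annotation = {
    "ENSG00000268439": "EMG1",
    "ENSG00000126749": "EMG1",
    "ENSG_EXAMPLE_1": "GENE_A",
}


def duplicated_symbols(annotation):
    """Return {symbol: sorted Ensembl IDs} for symbols with more than one ID."""
    by_symbol = defaultdict(list)
    for ensembl_id, symbol in annotation.items():
        by_symbol[symbol].append(ensembl_id)
    return {s: sorted(ids) for s, ids in by_symbol.items() if len(ids) > 1}


dups = duplicated_symbols(annotation)
print(dups)
```

The Ensembl IDs stay the row identifiers throughout; the symbols are only inspected, which matches the suggestion to leave the duplicates in place unless they dominate the results.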


Thanks for the clarification. As I mentioned in my other comment, I used STAR for the alignment and am not interested in transcript variants. The duplicates are definitely not interfering with interpretation of the DE results, so I am inclined to just leave them be, as you suggested.


This depends on how you generated your count matrix, and why exactly you have multiple Ensembl IDs. If it's just a case of ncRNAs that are found in different positions (and have roughly the same transcript length), then you could probably use rowsum to sum the counts over rows that share a gene symbol, collapsing the matrix to a unique set of gene symbols.
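As an illustration of what collapsing the counts to a unique set of gene symbols means, here is a minimal Python sketch of the operation that base R's rowsum(x, group) performs: rows sharing a group label are summed column by column. The function name collapse_by_symbol, the placeholder symbol GENE_A, and all count values are made up for the example; only the EMG1 symbol comes from this thread.

```python
def collapse_by_symbol(counts, symbols):
    """Sum the rows of `counts` that share a gene symbol.

    counts  : list of rows, one row per gene, one column per sample
    symbols : list of gene symbols, same length and order as `counts`
    Returns a dict {symbol: summed row}, with symbols in sorted order.
    """
    if len(counts) != len(symbols):
        raise ValueError("counts and symbols must have the same length")
    summed = {}
    for row, symbol in zip(counts, symbols):
        if symbol in summed:
            # Add this row to the running per-sample totals for the symbol.
            summed[symbol] = [a + b for a, b in zip(summed[symbol], row)]
        else:
            summed[symbol] = list(row)
    return dict(sorted(summed.items()))


# Made-up counts for three genes (rows) in three samples (columns).
counts = [
    [10, 0, 3],  # first EMG1 locus
    [2, 5, 1],   # second EMG1 locus
    [7, 7, 7],   # placeholder gene
]
symbols = ["EMG1", "EMG1", "GENE_A"]

collapsed = collapse_by_symbol(counts, symbols)
print(collapsed)
```

A plain sum like this only makes sense when the pooled features have comparable lengths, which is the condition stated above.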

If you have generated read counts using a transcript-aware aligner, and have multiple Ensembl IDs because these are Ensembl transcript IDs, then you can't really assume the lengths are all the same, so a simple sum might not be the way to go. If that is the case, you should consider using the tximport package to do the between-transcript summing. See the tximport vignette for more information.


Thanks, James. I used STAR for the alignment and am doing a basic DE analysis. Some are ncRNAs, but there are a few others that are protein coding genes. For example, EMG1 with ENSEMBL ids ENSG00000268439 and ENSG00000126749.

