Question

Filtering read counts matrix: how to deal with duplicated gene symbols, different ENSEMBL ids

es874 ▴ 20

I am analyzing RNA-seq data using edgeR, and my read count matrix is annotated with both gene symbols and ENSEMBL IDs. I am finding that some gene symbols are assigned to more than one ENSEMBL ID. Some of these are ncRNAs at different locations on the chromosome. How do you deal with such duplicates prior to CPM filtering, TMM normalization, and setting up the design matrix? Do you sum or average the read counts per duplicated gene across all samples, or do you remove the duplicates and keep only the instance with the highest total read count? What is the best practice?

edger • 8.9k views

updated 7.4 years ago by Aaron Lun ★ 28k • written 7.4 years ago by es874 ▴ 20

score 3 · Answer 1 · 2016-12-09

As far as I'm concerned - and others may disagree - the ENSEMBL IDs are the reference identifiers. The gene symbols are nice but are only fit for human consumption. If two genes have different ENSEMBL (or Entrez) IDs, then for the purpose of a DE analysis, they are different genes. This generally helps with interpretation, because then you know exactly which gene locus was differentially expressed. Otherwise, if you pool them together and the gene ends up being DE, you'll then have to try to figure out which locus should be targeted for further study; this is especially critical for lncRNAs, where the genomic context is important to the function of the transcript.

The flip side is that you might gain some power if you pool together counts from multiple loci. However, I'd imagine that there wouldn't be many counts for these genes in the first place; if they were truly duplicate sequences, then nothing should have aligned uniquely to them. And, of course, if reads did align uniquely, then there are clearly some differences between the loci, which is an argument for treating them separately.

In summary, unless the duplicated gene symbols are seriously interfering with the interpretation of your DE results (e.g., the top 100 genes are all duplicates of each other), I'd be inclined to let them be.
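One way to judge whether duplicated symbols are actually getting in the way is to keep the ENSEMBL ID as the row key, carry the symbol along as an annotation, and check how many of the top-ranked genes share a symbol. Below is a minimal Python sketch of that check using only the standard library. The helper name `duplicated_symbols_in_top`, the toy ranking, and the placeholder IDs `ID_A`/`ID_B` are made up for illustration; only the two EMG1 IDs come from this thread. In R, a similar check could be done with `duplicated()` on the symbol column of the top table.

```python
from collections import Counter

def duplicated_symbols_in_top(ranked, n):
    """Return {symbol: [ensembl_ids]} for symbols that occur more than once
    among the first n rows of a ranked (ensembl_id, symbol) list."""
    top = ranked[:n]
    counts = Counter(symbol for _, symbol in top)
    dups = {}
    for ensembl_id, symbol in top:
        if counts[symbol] > 1:
            dups.setdefault(symbol, []).append(ensembl_id)
    return dups

# Toy ranking, best first. Not real results; the placeholder IDs are hypothetical.
ranked = [
    ("ENSG00000126749", "EMG1"),
    ("ID_A", "GENE_A"),
    ("ENSG00000268439", "EMG1"),
    ("ID_B", "GENE_B"),
]

dups = duplicated_symbols_in_top(ranked, n=3)
print(dups)  # {'EMG1': ['ENSG00000126749', 'ENSG00000268439']}
```

If the returned dictionary is empty or small for the top of the list, the duplicates are probably not worth special handling.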

score 0 · Answer 2 · 2016-12-09

James W. MacDonald 65k

This depends on how you generated your count matrix, and why exactly you have multiple Ensembl IDs. If it's just a case of ncRNAs that are found in different positions (and have roughly the same transcript length), then you could probably use rowsum to collapse the counts to a unique set of gene symbols.
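To make the collapsing step concrete, here is a small Python sketch of what summing rows by gene symbol does: rows that share a symbol are added together sample by sample. The function name `collapse_by_symbol` and the toy counts are assumptions for illustration, not part of any package. In R, the base function `rowsum(counts, group = symbols)` performs this kind of grouped sum directly on a matrix.

```python
def collapse_by_symbol(counts, symbols):
    """Sum per-sample counts over all rows that share a gene symbol.

    counts  : list of rows, one list of per-sample counts per gene
    symbols : gene symbol for each row, in the same order as counts
    """
    collapsed = {}
    for row, symbol in zip(counts, symbols):
        if symbol in collapsed:
            # Add this row to the running per-sample totals for the symbol.
            collapsed[symbol] = [a + b for a, b in zip(collapsed[symbol], row)]
        else:
            collapsed[symbol] = list(row)
    return collapsed

# Toy matrix: 3 genes x 3 samples, with GENE_A appearing twice (made-up values).
counts = [[10, 12, 8], [3, 0, 5], [7, 7, 7]]
symbols = ["GENE_A", "GENE_B", "GENE_A"]

collapsed = collapse_by_symbol(counts, symbols)
print(collapsed)  # {'GENE_A': [17, 19, 15], 'GENE_B': [3, 0, 5]}
```

Note that a plain sum like this ignores transcript length, which is why it is only reasonable when the collapsed features are of similar length.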

If you have generated read counts using a transcript-aware aligner, and have multiple Ensembl IDs because these are Ensembl transcript IDs, then you can't really assume the lengths are all the same, so a simple sum might not be the way to go. If that is the case, you should consider using the tximport package to do the between-transcript summing. See the vignette for more information.

7.4 years ago • James W. MacDonald 65k

Thanks, James. I used STAR for the alignment and am doing a basic DE analysis. Some of the duplicates are ncRNAs, but a few others are protein-coding genes. For example, EMG1 has the ENSEMBL IDs ENSG00000268439 and ENSG00000126749.

ADD REPLY • link 7.4 years ago es874 ▴ 20