Question: Filtering read counts matrix: how to deal with duplicated gene symbols, different ENSEMBL ids
es874 wrote, 2.8 years ago:

I am analyzing RNA-seq data using edgeR, and my read count matrix is annotated with both gene symbols and Ensembl IDs. I am finding that some gene symbols are assigned to more than one Ensembl ID. Some are ncRNAs at different locations on the chromosome. I am wondering how you deal with these duplicates prior to CPM filtering, TMM normalization, and setting up the design matrix: do you sum or average the read counts per duplicated gene across all samples, or do you remove the duplicates and keep only the entry with the highest total read count? What is the best practice?

Answer: Filtering read counts matrix: how to deal with duplicated gene symbols, differen
Aaron Lun (Cambridge, United Kingdom) wrote, 2.8 years ago:

As far as I'm concerned - and others may disagree - the ENSEMBL IDs are the reference identifiers. The gene symbols are nice but are only fit for human consumption. If two genes have different ENSEMBL (or Entrez) IDs, then for the purpose of a DE analysis, they are different genes. This generally helps with interpretation, because then you know exactly which gene locus was differentially expressed. Otherwise, if you pool them together and the gene ends up being DE, you'll then have to try to figure out which locus should be targeted for further study; this is especially critical for lncRNAs, where the genomic context is important to the function of the transcript.

The flipside is that you might gain some power if you pool together counts from multiple locations. However, I'd imagine that there wouldn't be many counts for these genes in the first place; if they were truly duplicate sequences, then nothing should have aligned uniquely to them. And, of course, if reads did align uniquely, then there are clearly some differences between the loci, so that's an argument for treating them separately.

In summary, unless the duplicated gene symbols are seriously interfering with the interpretation of your DE results (e.g., the top 100 genes are all duplicates of each other), I'd be inclined to let them be.
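
A minimal base R sketch of "letting them be", assuming a toy count matrix: the Ensembl IDs stay as the row identifiers, and the symbols are kept as an annotation column that is only made unique for display. The counts, the sample names and the `label` column are made up for illustration; the two EMG1 IDs are the ones quoted elsewhere in this thread.

```r
# Toy count matrix: rows are Ensembl gene IDs, columns are samples.
# All counts and sample names are invented for illustration.
counts <- matrix(
  c(10, 12,  9,
     0,  1,  0,
    55, 60, 48),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("ENSG00000126749", "ENSG00000268439", "ENSG00000000003"),
    c("s1", "s2", "s3")
  )
)

# Annotation kept alongside the counts; symbols are labels only.
genes <- data.frame(
  ensembl = rownames(counts),
  symbol  = c("EMG1", "EMG1", "TSPAN6"),
  stringsAsFactors = FALSE
)

# Flag every row whose symbol occurs more than once.
is_dup <- genes$symbol %in% genes$symbol[duplicated(genes$symbol)]

# Readable but unique labels: append the Ensembl ID only where needed.
genes$label <- ifelse(is_dup,
                      paste0(genes$symbol, "_", genes$ensembl),
                      genes$symbol)

print(genes)
```

The count matrix itself is untouched, so filtering, normalization and testing would still run per Ensembl ID; only the labels shown in result tables change.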


Thanks for the clarification. As I mentioned above, I used STAR for the alignment and am not interested in transcript variants. The duplicates are definitely not interfering with interpretation of the DE results, so I am inclined to just leave them be, as you suggested.

Reply written 2.8 years ago by es874
Answer: Filtering read counts matrix: how to deal with duplicated gene symbols, differen
James W. MacDonald (United States) wrote, 2.8 years ago:

This depends on how you generated your count matrix, and on why exactly you have multiple Ensembl IDs. If it's just a case of ncRNAs that are found in different positions (and have roughly the same transcript length), then you could probably use rowsum to collapse the counts to a unique set of gene symbols.
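
As a rough sketch of the rowsum approach, assuming a gene-level count matrix with Ensembl IDs as row names and a parallel vector of symbols (the counts, sample names and the `symbols` vector below are invented for illustration):

```r
# Toy gene-level counts: rows are Ensembl gene IDs, columns are samples.
counts <- matrix(
  c(10, 12,  9,
     0,  1,  0,
    55, 60, 48),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("ENSG00000126749", "ENSG00000268439", "ENSG00000000003"),
    c("s1", "s2", "s3")
  )
)

# One gene symbol per row of `counts`, in the same order.
symbols <- c("EMG1", "EMG1", "TSPAN6")

# rowsum() adds up the rows that share a group label, column by column.
# The result has one row per unique symbol, sorted by symbol.
collapsed <- rowsum(counts, group = symbols)

print(collapsed)
```

Because this is a plain sum, the library size of each sample is unchanged; the Ensembl IDs, however, are lost from the row names, so it is worth keeping the original matrix around.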

If you generated read counts using a transcript-aware aligner, and you have multiple Ensembl IDs because these are Ensembl transcript IDs, then you can't assume the lengths are all the same, so a simple sum might not be the way to go. If that is the case, you should consider using the tximport package to do the between-transcript summing. See the tximport vignette for more information.


Thanks, James. I used STAR for the alignment and am doing a basic DE analysis. Some are ncRNAs, but there are a few others that are protein coding genes. For example, EMG1 with ENSEMBL ids ENSG00000268439 and ENSG00000126749.

Reply written 2.8 years ago by es874