I am analyzing RNA-seq data using edgeR, and my read count matrix is annotated with both gene symbols and ENSEMBL IDs. I am finding that some gene symbols are assigned to more than one ENSEMBL ID. Some of these are ncRNAs at different locations on the chromosome. How do you deal with such duplicates prior to CPM filtering, TMM normalization, and building the design matrix: do you sum or average the read counts per duplicated gene symbol across all samples, or do you remove the duplicates and keep only the entry with the highest total read count? What is the best practice?
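To make the two options concrete, here is a minimal sketch in plain Python (not edgeR code) of what I mean by "sum" versus "keep the highest". The IDs, symbols, and counts are made-up toy values, and the function names are hypothetical, just for illustration.

```python
# Hypothetical toy data: (ensembl_id, gene_symbol, raw counts per sample).
# Two ENSEMBL IDs share the symbol GENE_A; GENE_B is unique.
rows = [
    ("ENSG_A1", "GENE_A", [10, 20, 30]),
    ("ENSG_A2", "GENE_A", [1, 2, 3]),
    ("ENSG_B1", "GENE_B", [5, 5, 5]),
]


def sum_by_symbol(rows):
    """Option 1: add up the counts of all rows sharing a symbol, per sample."""
    summed = {}
    for _, symbol, counts in rows:
        if symbol in summed:
            summed[symbol] = [a + b for a, b in zip(summed[symbol], counts)]
        else:
            summed[symbol] = list(counts)
    return summed


def keep_highest_by_symbol(rows):
    """Option 2: per symbol, keep only the row with the largest total count."""
    best = {}
    for ensembl_id, symbol, counts in rows:
        if symbol not in best or sum(counts) > sum(best[symbol][1]):
            best[symbol] = (ensembl_id, list(counts))
    return best


summed = sum_by_symbol(rows)
kept = keep_highest_by_symbol(rows)
```

In both cases the collapsing would happen on the raw counts, before any CPM filtering or TMM normalization.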
Thanks for the clarification. As I mentioned above, I used STAR for the alignment and am not interested in transcript variants. The duplicates are definitely not interfering with interpretation of the DE results, so I am inclined to just leave them be, as you suggested.