DESeq2: removal of duplicated genes during statistical analysis
thkapell ▴ 10
@tkapell-14647
Helmholtz Center Munich, Germany

Hi all,

In my latest analysis with the DESeq2 package, I noticed a few paralogous genes (~30) that share the same Ensembl ID and have identical statistics. I could simply keep the unique genes, but I was wondering whether it would be more appropriate to remove the duplicated genes before the statistical analysis, since they carry redundant information that worsens my statistics. Would you recommend this, or would you still include them in the analysis and perhaps discard them later, in downstream visualization? And if you would remove them, would you do so before running results() or before DESeq()? Hope this makes sense.

 

deseq2 gene symbol ensembl • 1.9k views
@mikelove
United States

What is your quantification setup? Why do you end up with multiple rows with the same ID?


I did a differential gene expression analysis using transcriptome-level quantification, so my count table has versioned IDs (e.g. ENSG00000000003.14). When I convert those to Ensembl IDs (e.g. ENSG00000000003), about 30 genes end up sharing the same Ensembl ID because they are paralogs on the Y chromosome (their versioned IDs end in _PAR_Y, but the Ensembl ID is the same).


I haven't thought about what to do with these specifically, but in general, if the sequences are nearly identical, I would collapse the redundant transcripts by adding their counts together. Salmon does this by default for identical transcripts (otherwise the counts would be split equally among the identical sequences).
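
To make the "collapse by adding counts" suggestion concrete, here is a minimal sketch in plain Python (not part of DESeq2 or Salmon). It assumes the count table is available as a mapping from versioned ID to a list of per-sample counts; the helper names `strip_version` and `collapse_counts`, the second ID, and all count values are made up for illustration. In R, base `rowsum()` on the count matrix, grouped by the stripped IDs, would presumably do the same job.

```python
import re

# Matches a trailing version such as ".14", plus an optional "_PAR_Y" tag.
_SUFFIX = re.compile(r"\.\d+(_PAR_Y)?$")


def strip_version(versioned_id):
    """Return the unversioned ID, e.g. 'ENSG00000000003.14' -> 'ENSG00000000003'."""
    return _SUFFIX.sub("", versioned_id)


def collapse_counts(counts):
    """Sum the per-sample counts of all rows that share an unversioned ID.

    counts: dict mapping versioned ID -> list of raw counts, one per sample.
    """
    collapsed = {}
    for versioned_id, row in counts.items():
        gene_id = strip_version(versioned_id)
        if gene_id in collapsed:
            # Same gene seen before (e.g. the _PAR_Y copy): add sample by sample.
            collapsed[gene_id] = [a + b for a, b in zip(collapsed[gene_id], row)]
        else:
            collapsed[gene_id] = list(row)
    return collapsed


# Hypothetical counts for three samples; only the first ID appears in the thread.
counts = {
    "ENSG00000000003.14": [10, 12, 8],
    "ENSG00000999999.2": [5, 7, 6],
    "ENSG00000999999.2_PAR_Y": [1, 0, 2],
}
collapsed = collapse_counts(counts)
print(collapsed)
```

Because the sum is taken over raw counts, a step like this would have to be applied to the count matrix before the dataset is built and DESeq() is run, not between DESeq() and results().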
