I'm using DESeq2 for analysis on salmon-derived counts. It's a bit of a niche question, but advice would be appreciated.
I am trying to compare expression of orthologous genes across multiple closely related species (taking care over the various pitfalls that entails). One issue I have, however, is that one of these species has a reference genome that has been assembled into diploid chromosomes, whereas the others have not. In other words, that reference provides two separate ORFs for each gene (one per haplotype), whereas the other species have only one ORF per gene. I am wondering how best to treat the data to allow direct comparison.
So far I have either:
1) edited the reference transcriptome so that it contains only one version of each ORF (they are all denoted "_A" or "_B")
2) mapped to the full transcriptome and then summed the counts across the two ORF versions of each gene
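To make option 2 concrete, this is roughly what I mean by summing across the two ORF versions. It is only a minimal sketch: it assumes every haplotype-specific ID ends in "_A" or "_B" as in my reference, the function names are made up for illustration, and the numbers are invented rather than real data. The counts are kept as floats because salmon reports estimated read counts.

```python
from collections import defaultdict

# Suffixes used in my reference to mark the two haplotype ORFs (assumption:
# every haplotype-specific ID ends in exactly one of these).
HAPLOTYPE_SUFFIXES = ("_A", "_B")


def strip_haplotype_suffix(orf_id):
    """Return the shared gene ID for an ORF ID such as 'GENE1_A'."""
    for suffix in HAPLOTYPE_SUFFIXES:
        if orf_id.endswith(suffix):
            return orf_id[: -len(suffix)]
    return orf_id  # IDs without a haplotype suffix are left unchanged


def sum_haplotype_counts(orf_counts):
    """Collapse per-ORF counts into per-gene counts by summing both haplotypes."""
    gene_counts = defaultdict(float)
    for orf_id, count in orf_counts.items():
        gene_counts[strip_haplotype_suffix(orf_id)] += count
    return dict(gene_counts)


# Invented numbers, purely for illustration
example = {"GENE1_A": 120.0, "GENE1_B": 95.5, "GENE2_A": 3.0, "GENE2_B": 0.0}
collapsed = sum_haplotype_counts(example)
```

The collapsed IDs are then the shared gene IDs I use to match against the other species.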
Whilst the counts from the two approaches are pretty similar for the most part, option 1 seems to underestimate counts for a small subset of genes. On the other hand, I feel like option 2 is more likely to violate the assumptions of downstream analysis, e.g. how features with very low read counts are handled. After this step, I compare species using shared gene IDs. If anyone has any advice on which of these two options (or any other approach) might be better, that would be very helpful.
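In case it helps, this is the kind of check I used to find the genes where option 1 comes out lower than option 2. Again it is just a sketch: the function name, the 0.5 ratio cut-off and the minimum count of 10 are arbitrary choices of mine, and the example values are invented.

```python
def flag_underestimated(single_hap, summed, min_ratio=0.5, min_count=10.0):
    """List genes whose option 1 count is well below the option 2 count.

    single_hap: gene ID -> count from the single-ORF reference (option 1)
    summed:     gene ID -> count summed over both ORF versions (option 2)
    Thresholds are arbitrary illustrative values, not recommendations.
    """
    flagged = []
    for gene, total in summed.items():
        if total < min_count:
            continue  # skip very low counts, where ratios are too noisy
        ratio = single_hap.get(gene, 0.0) / total
        if ratio < min_ratio:
            flagged.append((gene, ratio))
    # Most strongly underestimated genes first
    return sorted(flagged, key=lambda item: item[1])


# Invented numbers, purely for illustration
option1 = {"GENE1": 210.0, "GENE2": 40.0, "GENE3": 2.0}
option2 = {"GENE1": 215.5, "GENE2": 100.0, "GENE3": 4.0}
suspect = flag_underestimated(option1, option2)
```

Here only GENE2 would be flagged; GENE3 is skipped because its summed count is below the minimum.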
P.S. it's worth noting that since the other species' transcriptomes are not resolved into two haplotypes, one can assume this scenario is actually pretty similar to my edited transcriptome with only one ORF sequence per gene.
P.P.S. Alternative splicing isn't a consideration for this organism.