Hi there,
I'm using DESeq2 for analysis on salmon-derived counts. It's a bit of a niche question, but advice would be appreciated.
I am trying to compare expression of orthologous genes across multiple closely related species (obviously taking care over the various pitfalls that entails). One issue I have, however, is that one of these species has a reference genome that has been assembled into diploid chromosomes, whereas the others have not. In other words, two separate ORFs are provided for each gene, where there is only one for the other species. I wondered how best to treat the data in order to allow direct comparison.
So far I have either:
1) edited the reference trancriptome to give me only one type of ORF (they are all denoted "_A" or "_B")
or
2) aligned to the full transcriptome and then summed the counts across the two ORF versions
Whilst the counts are pretty similar for the most part, option 1 seems to underestimate counts for a small subset of genes. On the other hand, I feel like option 2 is more likely to flout the expectations of subsequent analysis e.g. how features with very low read counts are handled. After this step I am comparing by using shared gene IDs. If anyone had any advice on which of these two options (or any other approaches) might be better, that would be very helpful.
Kind regards,
Andrew
P.S. it's worth noting that since the other species' transcriptomes are not resolved into two haplotypes, one can assume this scenario is actually pretty similar to my edited transcriptome with only one ORF sequence per gene.
P.P.S. Alternative splicing isn't a consideration for this organism.
Rob has tested this and it works reasonably well.
I think I'd go with (2), just to have higher counts. You may have lower counts for genes in the other species if there were so many SNPs that it trips up the read mapping, but do you have a rough idea how many SNPs are in the genes?
Hi Mike,
Thanks for getting back to me. I haven't looked at SNP frequency, it's my first time working with these genomes. I think there's a reasonable amount of divergence between haplotypes but haven't quantified. Something to look into. My feeling is that the genes that are substantially underestimated in (1) vs (2) number in the dozens out of several thousand, just by eyeballing a plot of the two. As for the reference genomes for the other species, I have no idea about the degree of heterozygosity.
Where possible I am trying to avoid directly comparing counts for individual between species though, and focusing on condition effects and species/condition interactions - hopefully this at least partially avoids issues arising from differing mapping efficiences/biases.
Thanks,
Andrew
Hi Ryan,
Thanks for your response. Yes that's definitely worth a try to compare. I'm assuming the tx2gene functionality in tximport just sums alternative transcripts for the final gene count, I'll check up on this.
Thanks,
Andrew
Yes it does.
Thanks for clarifying!
Andrew