Handling diploid vs haploid reference genomes in salmon > DESeq2 analysis
1
0
Entering edit mode
@andrewpountain-15816
Last seen 6.6 years ago

Hi there,

I'm using DESeq2 for analysis on salmon-derived counts. It's a bit of a niche question, but advice would be appreciated.

I am trying to compare expression of orthologous genes across multiple closely related species (obviously taking care over the various pitfalls that entails). One issue I have, however, is that one of these species has a reference genome that has been assembled into diploid chromosomes, whereas the others have not. In other words, two separate ORFs are provided for each gene, where there is only one for the other species. I wondered how best to treat the data in order to allow direct comparison.

So far I have either:

1) edited the reference trancriptome to give me only one type of ORF (they are all denoted "_A" or "_B")

or

2) aligned to the full transcriptome and then summed the counts across the two ORF versions

Whilst the counts are pretty similar for the most part, option 1 seems to underestimate counts for a small subset of genes. On the other hand, I feel like option 2 is more likely to flout the expectations of subsequent analysis e.g. how features with very low read counts are handled. After this step I am comparing by using shared gene IDs. If anyone had any advice on which of these two options (or any other approaches) might be better, that would be very helpful.

Kind regards,

Andrew

P.S. it's worth noting that since the other species' transcriptomes are not resolved into two haplotypes, one can assume this scenario is actually pretty similar to my edited transcriptome with only one ORF sequence per gene.

P.P.S. Alternative splicing isn't a consideration for this organism.

deseq2 salmon • 1.3k views
ADD COMMENT
0
Entering edit mode
@ryan-c-thompson-5618
Last seen 9 weeks ago
Icahn School of Medicine at Mount Sinaiā€¦

You could probably treat the two haplotypes of each gene as if they were alternative splice forms of a single gene, and Salmon should give you a sane answer for the gene counts and effective lengths.

ADD COMMENT
0
Entering edit mode

Rob has tested this and it works reasonably well.

I think I'd go with (2), just to have higher counts. You may have lower counts for genes in the other species if there were so many SNPs that it trips up the read mapping, but do you have a rough idea how many SNPs are in the genes?

ADD REPLY
0
Entering edit mode

Hi Mike,

Thanks for getting back to me. I haven't looked at SNP frequency, it's my first time working with these genomes. I think there's a reasonable amount of divergence between haplotypes but haven't quantified. Something to look into. My feeling is that the genes that are substantially underestimated in (1) vs (2) number in the dozens out of several thousand, just by eyeballing a plot of the two. As for the reference genomes for the other species, I have no idea about the degree of heterozygosity.

Where possible I am trying to avoid directly comparing counts for individual between species though, and focusing on condition effects and species/condition interactions - hopefully this at least partially avoids issues arising from differing mapping efficiences/biases.

Thanks,

Andrew

ADD REPLY
0
Entering edit mode

Hi Ryan,

Thanks for your response. Yes that's definitely worth a try to compare. I'm assuming the tx2gene functionality in tximport just sums alternative transcripts for the final gene count, I'll check up on this.

Thanks,

Andrew

ADD REPLY
0
Entering edit mode

Yes it does.

ADD REPLY
0
Entering edit mode

Thanks for clarifying!

Andrew

ADD REPLY

Login before adding your answer.

Traffic: 977 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6