Dear Prof. Love,
I was wondering if you could help me with a query about the DESeq2 package in R please.
I have sequenced RNA transcripts from six separate species and am comparing gene expression between pairs of species. The issue I am facing is that I cannot work out how to account for gene length when normalising the read count data, but I understand this is essential to do so when comparing two different species.
My homemade solution was to calculate the transcripts per kilobase million (TPM) values of each orthogroup (OG), then feed these data into DESeq2 as if they were raw read counts. To do so, I calculated the reads per kilobase (RPK) using:
read count / gene length
Then the TPM using:
(RPK / sum of all RPK of sample) * 10^6
Then, because DESeq2 is expecting raw counts, and therefore integers, I multiplied all TPM values by 100 and rounded to the nearest integer. I then built the DESeq object using these data, which I understand has built-in normalisation based on library size. Therefore, as far as I can tell, I have normalised the count data by gene length and library size, and should be OK to proceed with DEG analysis. So my question is, does that all sound OK, or am I violating some internal assumptions/mechanisms of the DESeq2 package that will punish me later down the line?
Thank you very much for taking the time to read my query, any advice you can offer will be greatly appreciated!
Kind regards, Callum
I think it's not the DTU issue here (which is covered by any if the tximport methods -- counts+offset, scaled TPM, etc.) but something harder to deal with. When you compare different species there are all kinds of reasons the counts are different, from biology, to differential mis-mapping because of different quality of reference, to splicing/gene length changes, and many more you can dream up.
Also tximport won't allow you to import data across two species because the index is different, it will complain (as it should).
I think we need more information to give advice on this one.
Thank you for your and James' responses.
The six species are all Heliconiini butterflies, and each of the three pairs that I am comparing are close relatives. All of the brain tissue samples (5 samples per species) were collected at the same time, from the same place, and the same structures (brain and suboesophageal ganglion) were used in the RNA extraction. I also used corresponding genomes for each species for mapping, all from the same source (a colleague in my lab) and of similar quality.
I was hoping that the relatively short phylogenetic distance between each of my species pairs meant that accounting for gene length (and library size) would be sufficient to perform accurate DEG analyses. Does this sound like the case to you? Please let me know if you require any further information. Many thanks.
If you are trying to compare gene expression directly between species, then as Mike already noted, there are any number of not really biological reasons that you might get different counts/gene (even if the amount of transcript were +/- the same!), so that comparison will be fraught. I can't imagine trying to sell that idea to a skeptical reviewer.
The
cqn
package is meant to normalize out GC-bias and gene length bias, so you might be able to sell the idea that aligning to separate genomes and then computing RPKM usingcqn
'fixed' the issue, and that might get some traction. You might also align all the samples to the same genome if they are that closely related (I've used closely related genomes to align non-model species for which there isn't a credible genome, and it worked OK), and then usecqn
to further remove technical biases.I wouldn't use
DESeq2
regardless, instead I would use limma-trend. And if you are just trying to generate hypotheses that you will test more rigorously later, then you could probably just use the RPKM you already have with limma-trend.Thank you for your help. I'll have a look into the cqn and limma packages then and go from there.
With regards to your suggestion of aligning samples to a common genome, do you mean a seventh, separate genome (e.g. whichever butterfly species has the most complete annotated genome), or pick one genome of my current six species and align all other five to this?
I meant the latter, but you know more about the available genomes and the former might be something to do as well.
I have a similar study underway (comparing gene expression between human/mouse/acomys), and our collaborator just pointed out this paper, which might be useful.
OK will check it out. Thanks for all your help James!
I've found that, no matter how well we try, species-specific mapping issues persist. This will be solved eventually with long reads I suppose (each species can have its own T2T custom genome from reads spanning problematic regions), but for now I wouldn't assume that the mis-mapping rate for a given gene due to genome quality is equal across the species. So you'll never know what is DE and what is differential mapping.