Question

DESeq2 gene length normalisation

0

Entering edit mode

cm15245 • 0

@66ac6c52

Last seen 6 months ago

United Kingdom

Dear Prof. Love,

I was wondering if you could help me with a query about the DESeq2 package in R please.

I have sequenced RNA transcripts from six separate species and am comparing gene expression between pairs of species. The issue I am facing is that I cannot work out how to account for gene length when normalising the read count data, but I understand this is essential to do so when comparing two different species.

My homemade solution was to calculate the transcripts per kilobase million (TPM) values of each orthogroup (OG), then feed these data into DESeq2 as if they were raw read counts. To do so, I calculated the reads per kilobase (RPK) using:

read count / gene length

Then the TPM using:

(RPK / sum of all RPK of sample) * 10^6

Then, because DESeq2 is expecting raw counts, and therefore integers, I multiplied all TPM values by 100 and rounded to the nearest integer. I then built the DESeq object using these data, which I understand has built-in normalisation based on library size. Therefore, as far as I can tell, I have normalised the count data by gene length and library size, and should be OK to proceed with DEG analysis. So my question is, does that all sound OK, or am I violating some internal assumptions/mechanisms of the DESeq2 package that will punish me later down the line?

Thank you very much for taking the time to read my query, any advice you can offer will be greatly appreciated!

Kind regards, Callum

Normalization DESeq2 • 1.0k views

ADD COMMENT • link updated 6 months ago by Michael Love 41k • written 6 months ago by cm15245 • 0

score 0 · Answer 1 · 2023-10-11

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 23 minutes ago

United States

If you are aligning using kallisto or salmon, you can read in the transcript counts using tximport and use the lengthScaledTPM argument to adjust for the gene lengths. Alternatively, you could use the cqn package to normalize for GC content as well as gene length, convert to RPKM and then use limma (probably limma-trend).

ADD COMMENT • link 6 months ago James W. MacDonald 65k

0

Entering edit mode

how to account for gene length when normalising the read count data, but I understand this is essential to do so when comparing two different species

I think it's not the DTU issue here (which is covered by any if the tximport methods -- counts+offset, scaled TPM, etc.) but something harder to deal with. When you compare different species there are all kinds of reasons the counts are different, from biology, to differential mis-mapping because of different quality of reference, to splicing/gene length changes, and many more you can dream up.

Also tximport won't allow you to import data across two species because the index is different, it will complain (as it should).

I think we need more information to give advice on this one.

ADD REPLY • link 6 months ago Michael Love 41k

0

Entering edit mode

Thank you for your and James' responses.

The six species are all Heliconiini butterflies, and each of the three pairs that I am comparing are close relatives. All of the brain tissue samples (5 samples per species) were collected at the same time, from the same place, and the same structures (brain and suboesophageal ganglion) were used in the RNA extraction. I also used corresponding genomes for each species for mapping, all from the same source (a colleague in my lab) and of similar quality.

I was hoping that the relatively short phylogenetic distance between each of my species pairs meant that accounting for gene length (and library size) would be sufficient to perform accurate DEG analyses. Does this sound like the case to you? Please let me know if you require any further information. Many thanks.

ADD REPLY • link 6 months ago cm15245 • 0

0

Entering edit mode

If you are trying to compare gene expression directly between species, then as Mike already noted, there are any number of not really biological reasons that you might get different counts/gene (even if the amount of transcript were +/- the same!), so that comparison will be fraught. I can't imagine trying to sell that idea to a skeptical reviewer.

The cqn package is meant to normalize out GC-bias and gene length bias, so you might be able to sell the idea that aligning to separate genomes and then computing RPKM using cqn 'fixed' the issue, and that might get some traction. You might also align all the samples to the same genome if they are that closely related (I've used closely related genomes to align non-model species for which there isn't a credible genome, and it worked OK), and then use cqn to further remove technical biases.

I wouldn't use DESeq2 regardless, instead I would use limma-trend. And if you are just trying to generate hypotheses that you will test more rigorously later, then you could probably just use the RPKM you already have with limma-trend.

ADD REPLY • link 6 months ago James W. MacDonald 65k

0

Entering edit mode

Thank you for your help. I'll have a look into the cqn and limma packages then and go from there.

With regards to your suggestion of aligning samples to a common genome, do you mean a seventh, separate genome (e.g. whichever butterfly species has the most complete annotated genome), or pick one genome of my current six species and align all other five to this?

ADD REPLY • link 6 months ago cm15245 • 0

0

Entering edit mode

I meant the latter, but you know more about the available genomes and the former might be something to do as well.

ADD REPLY • link 6 months ago James W. MacDonald 65k

1

Entering edit mode

I have a similar study underway (comparing gene expression between human/mouse/acomys), and our collaborator just pointed out this paper, which might be useful.

ADD REPLY • link 6 months ago James W. MacDonald 65k

0

Entering edit mode

OK will check it out. Thanks for all your help James!

ADD REPLY • link 6 months ago cm15245 • 0

0

Entering edit mode

I also used corresponding genomes for each species for mapping, all from the same source (a colleague in my lab) and of similar quality.

I've found that, no matter how well we try, species-specific mapping issues persist. This will be solved eventually with long reads I suppose (each species can have its own T2T custom genome from reads spanning problematic regions), but for now I wouldn't assume that the mis-mapping rate for a given gene due to genome quality is equal across the species. So you'll never know what is DE and what is differential mapping.

ADD REPLY • link 6 months ago Michael Love 41k

score 0 · Answer 2 · 2023-10-12

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 5 hours ago

United States

If you had fixed gene lengths, just wanted to provide a pointer to this recent answer:

Proper way to implement DESeq2 when comparing samples with trisomies?

ADD COMMENT • link 6 months ago Michael Love 41k