Question: DESeq2 bias when genes are missing from the annotation?
8 months ago by
corend0
corend0 wrote:

As this concerns bioinformatics in general, I also posted here.

I am working on RNA-seq data. I built my count table with kallisto and then used tximport to bring it into DESeq2.

My reference is a set of cDNAs (supposed to correspond to all the genes of my species), but the annotation is quite poor: when I align against these cDNAs I get a 60% mapping rate, instead of 95% against the whole genome.

I have 2 conditions (A and B) and 3 replicates in each condition.

My fear is this: if a gene is over-expressed in A, not expressed in B, and missing from my cDNA list, I expect to have fewer mapped reads in A than in B. Could the DESeq2 normalization then create a bias?

Example (each number is one read, labelled with the gene it comes from):

A: 1 1 1 1 2 2 2 2 3 3

B: 1 1 1 1 2 3 3 3 3 3

Gene 3 is not annotated, so after normalization by DESeq2:

A: 1 1 1 1 1 2 2 2 2 2

B: 1 1 1 1 1 1 1 1 2 2

Gene 1 now appears over-expressed in B, but that is not true.
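The numbers in this example can be reproduced with a short Python sketch. It assumes that normalization simply rescales every sample to the same total number of annotated reads (total-count scaling); the function name `total_count_scaled` and the `depth` parameter are made up for illustration and are not part of DESeq2.

```python
from collections import Counter

# One entry per read, labelled with the gene it comes from (toy data from the example).
reads_a = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]
reads_b = [1, 1, 1, 1, 2, 3, 3, 3, 3, 3]
annotated = {1, 2}  # gene 3 is missing from the cDNA set


def total_count_scaled(reads, annotated, depth=10):
    """Count reads on annotated genes, then rescale so they sum to `depth`.

    Hypothetical helper: this is plain total-count scaling, used only to
    illustrate the concern raised in the question.
    """
    counts = Counter(gene for gene in reads if gene in annotated)
    total = sum(counts.values())
    return {gene: counts[gene] * depth / total for gene in sorted(annotated)}


scaled_a = total_count_scaled(reads_a, annotated)
scaled_b = total_count_scaled(reads_b, annotated)
print("A:", scaled_a)
print("B:", scaled_b)
```

Gene 1 has 4 reads in both samples before scaling, but after rescaling it gets 5 in A and 8 in B, which is the spurious difference I am worried about.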

How can I deal with this kind of problem?

Should I add a row for "unmapped reads" to my table to get a better normalization?

modified 8 months ago by Michael Love18k • written 8 months ago by corend0

Do you expect or observe that the proportion of unmapped reads is different across groups or samples?

Yes indeed, I map 65% of my reads in condition B and 55% in condition A.

And how about at the genomic level?

90% in condition B and 93% in condition A.

8 months ago by
Michael Love18k
United States
Michael Love18k wrote:

If I understand your question correctly, you are assuming that DESeq2 uses total count normalization, but it does not. DESeq2 (like all other methods in Bioconductor I can think of) uses a robust method to estimate a scaling factor for each sample. You can read about the scaling method ("median ratio" normalization) in the DESeq2 paper.
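To make the idea concrete, here is a minimal Python sketch of median ratio scaling as described in the DESeq2 paper. It is an illustration under simplifying assumptions, not the DESeq2 code itself: the function name `median_ratio_size_factors` and the toy counts are invented for this example, and genes with a zero count in any sample are simply skipped.

```python
import math
from statistics import median


def median_ratio_size_factors(counts):
    """Estimate one scaling factor per sample by the median-of-ratios idea.

    counts: dict mapping sample name -> list of per-gene counts (same gene order).
    Illustrative sketch only; not the DESeq2 implementation.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean of each gene across samples (the pseudo-reference).
    # Genes with a zero in any sample are excluded (marked None).
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    # Size factor = median over genes of (count / geometric mean), on the log scale.
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Toy data: four unchanged genes and one gene that is 10x lower in B.
toy = {
    "A": [100, 200, 300, 400, 500],
    "B": [100, 200, 300, 400, 50],
}
factors = median_ratio_size_factors(toy)
totals = {s: sum(v) for s, v in toy.items()}
print("size factors:", factors)
print("total counts:", totals)
```

In this toy table the total counts differ (1500 versus 1050) because of the single changing gene, yet both estimated scaling factors come out at essentially 1, so the four unchanged genes stay unchanged after normalization. Total count scaling would instead inflate them in B.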