Question

DESeq2 bias when genes are missing from the annotation?

corend • 0

@corend-14293

As this concerns bioinformatics in general, I also posted here.

I am working on RNA-seq data. I built my count table with kallisto and then imported it with tximport to work with DESeq2.

My reference is a set of cDNAs that is supposed to correspond to all the genes of my species, but the annotation is quite poor: when I align to these cDNAs, about 60% of reads map, compared with 95% against the whole genome.

I have 2 conditions: (A and B) and 3 replicates in each condition.

My fear is this: if a gene is over-expressed in A, not expressed in B, and missing from my cDNA list, I expect fewer mapped reads in A than in B, and the normalization performed by DESeq2 could then create a bias.

Example (each number is one read, labeled by the gene it comes from):

A: 1 1 1 1 2 2 2 2 3 3

B: 1 1 1 1 2 3 3 3 3 3

Gene 3 is not annotated, so after normalization by DESeq2 I would get:

A: 1 1 1 1 1 2 2 2 2 2

B: 1 1 1 1 1 1 1 1 2 2

Gene 1 now looks over-expressed in B, but that is not true.

How can I deal with this kind of problem?

Should I add a row for "unmapped reads" to my count table to get a better normalization?

rnaseq deseq2 • 1.6k views

ADD COMMENT • link updated 7.0 years ago by Michael Love 43k • written 7.0 years ago by corend • 0

Do you expect or observe that the proportion of unmapped reads is different across groups or samples?

ADD REPLY • link 7.0 years ago Sean Davis 21k

Yes indeed, I map 65% of my reads in condition B and 55% in condition A.

ADD REPLY • link 7.0 years ago corend • 0

And how about at the genomic level?

ADD REPLY • link 7.0 years ago Sean Davis 21k

90% condition B

93% condition A

ADD REPLY • link 7.0 years ago corend • 0

score 3 · Accepted Answer · 2017-11-14

If I understand your question correctly, you are assuming that DESeq2 uses total count normalization, but it does not. DESeq2 (like all other methods in Bioconductor I can think of) uses a robust method to estimate the scaling factors for each sample. You can read about the scaling method ("median ratio" normalization) in the DESeq2 paper.
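
To make the difference concrete, here is a minimal sketch in plain Python of median-of-ratios size factors next to total count scaling. It is not DESeq2's code (in DESeq2 the size factors come from `estimateSizeFactors()`), and the count table is made up for illustration: five annotated genes, of which only the last one really differs between the two samples. The function names are hypothetical.

```python
import math
from statistics import median


def median_ratio_size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of per-gene counts,
    with genes in the same order for every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean of each gene across samples.
    # Genes with a zero count in any sample are skipped.
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    # Size factor = median over genes of (count / geometric mean).
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


def total_count_factors(counts):
    """Naive scaling factors proportional to each sample's total count."""
    totals = {s: sum(c) for s, c in counts.items()}
    mean_total = sum(totals.values()) / len(totals)
    return {s: t / mean_total for s, t in totals.items()}


# Hypothetical data: genes 1-4 are unchanged, gene 5 drops tenfold in B.
counts = {
    "A": [100, 200, 300, 400, 500],
    "B": [100, 200, 300, 400, 50],
}

mr = median_ratio_size_factors(counts)
tc = total_count_factors(counts)

# Normalized counts for gene 1 under each scheme.
gene1_mr = {s: counts[s][0] / mr[s] for s in counts}
gene1_tc = {s: counts[s][0] / tc[s] for s in counts}

print("median-ratio size factors:", mr)
print("total-count factors:      ", tc)
print("gene 1, median ratio:     ", gene1_mr)
print("gene 1, total count:      ", gene1_tc)
```

In this toy table, total count scaling inflates gene 1 in B because B's total is lower, which is the artifact described in the question. The median of the per-gene ratios is driven by the four unchanged genes, so both samples get the same size factor and gene 1 stays equal. This relies on most genes in the table not being differentially expressed; it says nothing about reads that never reach the count table.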