Question

Clarify some basic concepts about DESeq normalization

xavi • 0

@xavi-15151

Hi there,

I would like to clarify some concepts about DESeq normalization:

1) How does DESeq account for RNA composition in the "median of ratios" method (and if it does, in which step of the code/formula)?

2) More of a stats question: why is the geometric mean more appropriate than the arithmetic mean for the
pseudo-sample created to obtain the size factors?

3) How reliable is DESeq with ~10 samples per condition and neither biological nor technical
replicates? I've read in a previous post that there is no support for this scenario. Sequencing is
expensive; does this mean it doesn't make sense to use DESeq2 for DEG analysis in this case?
What would you recommend?

Thank you so much in advance,
Xavi

deseq2 rnaseq normalization • 906 views

ADD COMMENT • link written 6.2 years ago by xavi • 0

Thank you for your explanations, Michael. This would be the design, for instance, for 10 samples per condition (healthy/patient), with neither biological nor technical replicates:

Sample   Condition
------   ---------
S1       HEALTHY
S2       HEALTHY
S3       HEALTHY
S4       HEALTHY
S5       HEALTHY
S6       ...
S10      HEALTHY
S11      PATIENT
S12      PATIENT
S13      PATIENT
S14      PATIENT
S15      PATIENT
S16      ...
S20      PATIENT

As you mentioned in other threads, with this design DESeq can only be used for exploratory analysis. On a scale from 0 to 100%, how much confidence could we place in a result that comes out statistically significant?

Many thanks,

Xavi

ADD REPLY • link 6.1 years ago xavi • 0

You can use DESeq2 here. You would use a design of ~condition. The healthy donors and patients would be considered "biological replicates" for this analysis, although I see that the terminology is strange. If you had paired samples, you could control for individual donor variation, but here I would use ~condition.

ADD REPLY • link 6.1 years ago Michael Love 41k

score 0 · Answer 1 · 2018-03-05

The median-of-ratios normalization corrects for a scaling factor that affects the total amount of sequencing. It does so by looking at the ratios of all genes between a sample and a reference sample. The geometric mean is used for the reference sample because the arithmetic mean is too strongly influenced by the sample with the largest sequencing depth. The geometric mean is a better "middle value" for count data with a large range (another way to make this argument is that effects on counts are multiplicative, and the geometric mean is the arithmetic mean on the log scale). If you have specific genes you would like to use for normalization (spike-ins, housekeeping genes, etc.), you can use the controlGenes argument of estimateSizeFactors.
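
To make the steps concrete, here is a minimal plain-Python sketch of the median-of-ratios idea. It is not DESeq2's actual code: the function names `geometric_mean` and `size_factors` and the toy counts are made up for illustration, and the sketch assumes the usual convention of skipping genes that have a zero count in any sample. The reference sample is the per-gene geometric mean, and each sample's size factor is the median of its gene-wise ratios to that reference.

```python
import math
from statistics import median

def geometric_mean(values):
    # Arithmetic mean on the log scale, mapped back with exp.
    return math.exp(sum(math.log(v) for v in values) / len(values))

def size_factors(counts):
    """counts maps sample name -> list of raw counts (same gene order everywhere)."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Assumption: only genes with a nonzero count in every sample are used,
    # because a single zero would make the geometric mean zero.
    usable = [g for g in range(n_genes) if all(counts[s][g] > 0 for s in samples)]
    # Reference (pseudo-)sample: per-gene geometric mean across samples.
    reference = {g: geometric_mean([counts[s][g] for s in samples]) for g in usable}
    # Size factor: median over genes of sample count / reference count.
    return {s: median(counts[s][g] / reference[g] for g in usable) for s in samples}

# Toy data (hypothetical): S2 is sequenced twice as deeply as S1,
# and the last gene is also strongly up in S2.
counts = {
    "S1": [100, 200, 300, 400, 50],
    "S2": [200, 400, 600, 800, 5000],
}
factors = size_factors(counts)
print(factors["S2"] / factors["S1"])           # about 2.0
print(sum(counts["S2"]) / sum(counts["S1"]))   # about 6.67 (total-count ratio)
```

In this toy example the single highly expressed gene inflates the total-count ratio to about 6.67, while the ratio of the size factors stays at about 2, the depth difference shared by the other genes. Taking the median over genes is what keeps a few dominant genes from driving the scaling factor.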

What do you mean by 10 samples per condition with no biological replicates? Can you show your design in more detail?