Clarify some basic concepts about DESeq normalization
xavi • 0
@xavi-15151
Last seen 3.4 years ago

Hi there,

I would like to clarify some concepts about DESeq normalization:

1) How does DESeq account for RNA composition in the "median of ratios" method (and if it does, in which step of the code/formula)?

2) A statistics question: why is the geometric mean more appropriate than the arithmetic mean for the
pseudo-sample created to obtain the size factors?

3) How reliable is DESeq with ~10 samples per condition and neither biological nor technical
replicates? I've read in a previous post that there is no support for this scenario. Sequencing is
expensive; does this mean that it doesn't make sense to use DESeq2 for DEG analysis in this case?
What would you recommend?

Thank you so much in advance,
Xavi

deseq2 rnaseq normalization • 390 views

Thank you for your explanations, Michael. For instance, this would be the design for 10 samples per condition (healthy/patient), with neither biological nor technical replicates:

Sample  Condition
------  ---------
S1      HEALTHY
S2      HEALTHY
S3      HEALTHY
S4      HEALTHY
S5      HEALTHY
S6      ...
S10     HEALTHY
S11     PATIENT
S12     PATIENT
S13     PATIENT
S14     PATIENT
S15     PATIENT
S16     ...
S20     PATIENT

As you mentioned in other threads, with this kind of design DESeq can only be used for exploratory analysis. On a scale from 0 to 100%, what degree of confidence would we have if we found something statistically significant?

Many thanks,

Xavi


You can use DESeq2 here. You would use a design of ~condition. The healthy donors and patients would be considered "biological replicates" for this analysis, although I see that the terminology is strange. If you had paired samples, you could control for individual donor variation, but here I would use ~condition.

@mikelove
Last seen 1 day ago
United States

The median-of-ratios normalization corrects for a scaling factor affecting the total amount of sequencing. It does so by looking at the ratios, across all genes, between a sample and a reference sample. The geometric mean is used for the reference sample because the arithmetic mean is too influenced by the sample with the largest sequencing depth. The geometric mean is a better "middle value" for count data that have a large range (another way to make this argument is that effects on counts are multiplicative, and the geometric mean corresponds to the arithmetic mean on the log scale).

If you have specific genes you would like to use for normalization (spike-in or housekeeping genes, etc.), you can use the controlGenes argument of estimateSizeFactors.
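
As a rough illustration of the description above, here is a minimal sketch of the median-of-ratios calculation in plain Python. It is not DESeq2's code. The function name size_factors, the toy counts, and the choice to skip genes with a zero count in any sample are assumptions made for this example.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples table of raw counts."""
    n_samples = len(counts[0])
    # Assumption of this sketch: drop genes with a zero in any sample,
    # since their geometric mean is zero and the log ratio is undefined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Reference pseudo-sample on the log scale: the log of the geometric mean
    # of a gene equals the arithmetic mean of its log counts.
    log_ref = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of each gene in sample j to the reference, then the median
        log_ratios = [math.log(row[j]) - ref for row, ref in zip(usable, log_ref)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Made-up toy data: sample 2 is sequenced twice as deeply as sample 1,
# and the last gene is additionally much higher in sample 2.
toy_counts = [
    [10, 20],
    [20, 40],
    [30, 60],
    [40, 80],
    [50, 5000],
]

factors = size_factors(toy_counts)
print([round(f, 3) for f in factors])  # [0.707, 1.414]

# Library totals, for comparison with the ratio of the size factors (2.0)
totals = [sum(row[j] for row in toy_counts) for j in range(2)]
print(totals)  # [150, 5200]

# One gene across three samples where the third is sequenced 10x deeper
gene = [10, 10, 100]
arithmetic = sum(gene) / len(gene)
geometric = math.exp(sum(math.log(c) for c in gene) / len(gene))
print(arithmetic, round(geometric, 1))  # 40.0 21.5
```

In this toy case the two size factors differ by a factor of 2, matching the depth difference shared by most genes, while the library totals differ roughly 35-fold because of the single very high gene. The last lines show the arithmetic mean being pulled toward the deepest sample more strongly than the geometric mean.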

What do you mean by 10 samples per condition with no biological replicates? Can you show your design in more detail?