Question

DESeq2: which normalized data matrix should I take?

Xiaokuan Wei ▴ 230

Hi,

I want to extract the normalized data matrix (read counts) and do the differential gene expression analysis myself, instead of using the Wald test the package provides, because I don't have replicates in either comparison group. So there is no statistical test, only the fold change between two samples. I am going to use the vsn-normalized matrix for this. My questions: what are the advantages and drawbacks of using a normalized data matrix instead of the raw counts? And which count matrix is appropriate for this type of analysis?

Thank you

-W

ADD COMMENT • link 10.3 years ago Xiaokuan Wei ▴ 230

Xiaokuan Wei ▴ 230

Ryan, Michael:

Thank you for your informative answers to my question. I just have another question regarding this process.

As I understand it, the fold changes obtained from DESeq() are computed from the raw counts rather than from normalized values (rlog or vsn).

So if I extract a normalized matrix and then calculate the fold changes between two samples, the results will differ slightly from the fold changes obtained by DESeq(). Is this right?

Thank you.

-W

ADD COMMENT • link 10.3 years ago Xiaokuan Wei ▴ 230

The fold changes calculated by DESeq(), whether the moderated (default) or the unmoderated ones (using addMLE or betaPrior=FALSE), will not be the same as the ones obtained from rlog or VST data. The moderated fold changes are calculated as described in the paper. The unmoderated fold changes in a simple group comparison are equal to the following, where K_ij is the raw count for gene i in sample j and s_j is the size factor of sample j (if you allow some pseudo latex):

mean_{j in group B}(K_ij / s_j) / mean_{j in group A}(K_ij / s_j)
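To make the formula concrete, here is a minimal base-R sketch. The count matrix, the size factors and the group labels are all invented for illustration; in a real analysis the size factors would come from DESeq2 rather than being typed in by hand.

```r
# Toy counts K (genes x samples); every value here is made up for illustration.
K <- matrix(c( 10,  12,  30,  36,
              200, 180, 210, 190),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("gene1", "gene2"), c("A1", "A2", "B1", "B2")))

# Assumed size factors s_j, one per sample (hypothetical values).
s <- c(A1 = 1.0, A2 = 1.2, B1 = 1.0, B2 = 1.2)
group <- c("A", "A", "B", "B")

# Divide each column j by its size factor: K_ij / s_j.
norm <- sweep(K, 2, s, "/")

# Ratio of the group means of the normalized counts, gene by gene.
fc <- rowMeans(norm[, group == "B", drop = FALSE]) /
      rowMeans(norm[, group == "A", drop = FALSE])
log2fc <- log2(fc)

print(cbind(fold_change = fc, log2_fold_change = log2fc))
```

Note that the division by s_j happens before averaging, so the result is a ratio of means of normalized counts, not a difference of transformed (rlog or VST) values.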

ADD REPLY • link 10.3 years ago Michael Love 43k

Got it, I think so. Thank you Michael. -W

ADD REPLY • link 10.3 years ago Xiaokuan Wei ▴ 230

score 3 · Accepted Answer · 2015-03-07

If you're just going to be doing a descriptive analysis using fold changes, then you probably want to just do variance stabilization using the regularized log transformation in DESeq2, which will give you library-size-normalized, variance-stabilized values on the log2 scale. See the help page for the rlog function. Since you have no replicates, you'll need to use blind=TRUE. You can also use varianceStabilizingTransformation for similar purposes, but it is more sensitive to differences in sequencing depth and sample complexity.
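A rough sketch of that route, assuming DESeq2 is installed. The count matrix is simulated and the sample names (sampleA, sampleB) are hypothetical; the design is set to ~ 1 here simply because there is nothing to model with one sample per condition.

```r
library(DESeq2)

set.seed(1)
n_genes <- 1000
# Simulated counts for two unreplicated samples (stand-in for real data).
mu <- 2^runif(n_genes, 1, 12)
counts <- matrix(rnbinom(2 * n_genes, mu = rep(mu, 2), size = 2), ncol = 2,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 c("sampleA", "sampleB")))
coldata <- data.frame(condition = factor(c("A", "B")),
                      row.names = colnames(counts))

dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                              design = ~ 1)

# blind = TRUE ignores the sample labels when estimating dispersions.
rld <- rlog(dds, blind = TRUE)
rlog_mat <- assay(rld)   # genes x samples, log2 scale

# Descriptive log2 fold change: difference of rlog values between the samples.
lfc <- rlog_mat[, "sampleB"] - rlog_mat[, "sampleA"]
head(lfc[order(abs(lfc), decreasing = TRUE)])
```

Because the rlog values are already on the log2 scale, the fold change is a subtraction rather than a ratio. Swapping rlog() for varianceStabilizingTransformation() should give the VST equivalent.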

The primary effect of variance stabilization in RNA-seq data is to reduce the magnitude of fold changes for low-count genes. This counteracts the tendency of low-count genes to have very large fold changes (and high variance) since small random variations in low-count genes are larger relative to the counts themselves. You can see an example of how the regularization affects the data here: http://www.sthda.com/english/wiki/rna-seq-differential-expression-work-flow-using-deseq2#the-rlog-transform
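The low-count effect is easy to reproduce with a small base-R simulation. This is only an illustration under simplifying assumptions: counts are Poisson, both "samples" share the same true mean (so the true fold change is 1), and the 0.5 pseudocount is an arbitrary choice to avoid taking the log of zero.

```r
set.seed(42)
n <- 5000   # number of simulated genes per expression level

# Log2 ratio between two samples drawn from the same Poisson mean.
sim_lfc <- function(mu) {
  a <- rpois(n, mu)
  b <- rpois(n, mu)
  log2((b + 0.5) / (a + 0.5))   # pseudocount is an assumption, not a DESeq2 step
}

sd_low  <- sd(sim_lfc(5))     # low-count genes
sd_high <- sd(sim_lfc(500))   # high-count genes

cat("SD of log2 ratio at mean 5:  ", round(sd_low, 3), "\n")
cat("SD of log2 ratio at mean 500:", round(sd_high, 3), "\n")
```

Even though nothing is differentially expressed in this simulation, the low-count genes show a much wider spread of log2 ratios than the high-count genes, which is the noise that rlog and VST are designed to dampen.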

score 2 · Accepted Answer · 2015-03-08

In addition to Ryan's suggestions, note that you can just run DESeq() to calculate fold changes. It will detect there are no replicates, and automatically perform "blind" dispersion estimation by treating the different samples as replicates (it will print a warning that this was done). At the results stage, you can use the moderated fold changes (log2FoldChange column) or if you set addMLE=TRUE to results(), you can compare the moderated fold changes to the MLE (maximum likelihood estimate / unmoderated) fold changes.