DESeq2: Sample outlier patterns 'switching' after normalization

wunderl ▴ 40

@wunderl-20805

::Edit:: The embedded images are not formatting well, so I have changed them to external links.

I have been following the DESeq2 vignette for bulk RNA-Seq data and got to the section where it is recommended that you make a heatmap of the Euclidean distances between samples.

In addition to the recommended heat map using vst-transformed counts, I decided to also make heat maps using the raw and DESeq-normalized counts. What I found, however, did not make sense.

When using the raw counts to generate the sample distances, I found that one sample stood out as an outlier (S1). As shown below, this sample has high distances to the other samples, with distances to two samples in particular being especially high (S12 and S6).

This was expected since I knew that sample showed elevated levels of RNA degradation compared to the others.

Raw Count Heat Map Here

However, when using the normalized counts, the result did not make sense. Sample S1 no longer looked like an outlier; it looked normal. Instead, a new sample, S13, looked like an outlier with a nearly identical pattern (overall higher distances, with two samples that are especially high, S6 and S7).

I found this to be concerning as it almost looks like the normalization procedure swapped the sample labels (after some testing, however, I determined this was not the case). This is not what I would have expected from sample normalization, especially with the outlier pattern 'switching' to a different sample.

Normalized Count Heat Map Here

Does anyone have an idea what might be going on? Or is there something I could do to better understand what is happening here?

deseq2 normalization • 1.1k views

ADD COMMENT • link updated 5.4 years ago by Michael Love 43k • written 5.4 years ago by wunderl ▴ 40

score 3 · Accepted Answer · 2019-07-26

Michael Love 43k

@mikelove

hi,

So if you don't use VST or log-transformed counts, the major factor driving the clustering will be the genes with the very highest counts. If you don't perform scaling, the clustering will be roughly based on library size. If you do perform scaling but don't transform, the clustering will be based on whatever patterns remain in the expression levels of the top 1% of genes. (I'm just making up a percentage here, but the point is that the distances, and therefore the cluster diagrams, will not be representative of the whole experiment, because the variance hasn't been stabilized.)
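To make the three cases concrete, here is a minimal Python sketch on invented toy counts. It is not DESeq2 code, and the numbers, sample names, and helper functions are all hypothetical. `log2(x + 1)` is used only as a crude stand-in for a variance-stabilizing transformation, and dividing by relative library size stands in for scaling. S1 is sequenced twice as deep as S2, S3 differs only in the two highest-count genes, and S4 has an 8-fold change in one moderately expressed gene.

```python
import math

# Hypothetical toy counts: one list per sample, same gene order in each.
counts = {
    "S1": [120000, 80000, 1000, 400, 100, 20],  # S2 at twice the depth
    "S2": [60000, 40000, 500, 200, 50, 10],     # baseline profile
    "S3": [70000, 30000, 500, 200, 50, 10],     # shift within the two top genes
    "S4": [60000, 40000, 500, 1600, 50, 10],    # 8-fold change, moderate gene
}


def scale_by_library_size(samples):
    """Divide each sample by its library size relative to the mean library size."""
    totals = {s: sum(v) for s, v in samples.items()}
    mean_total = sum(totals.values()) / len(totals)
    return {s: [x / (totals[s] / mean_total) for x in v]
            for s, v in samples.items()}


def log2_plus_one(samples):
    """Crude stand-in for a variance-stabilizing transformation (assumption)."""
    return {s: [math.log2(x + 1) for x in v] for s, v in samples.items()}


def mean_distances(samples):
    """Mean Euclidean distance from each sample to all the others."""
    names = list(samples)
    return {a: sum(math.dist(samples[a], samples[b]) for b in names if b != a)
               / (len(names) - 1)
            for a in names}


def most_distant(samples):
    """Name of the sample with the largest mean distance to the rest."""
    d = mean_distances(samples)
    return max(d, key=d.get)


scaled = scale_by_library_size(counts)
logged = log2_plus_one(scaled)

for label, data in [("raw", counts), ("scaled", scaled), ("log2(scaled + 1)", logged)]:
    print(label, "-> most distant sample:", most_distant(data))
```

In this toy example the most distant sample changes at each step: the deeply sequenced sample on raw counts, the sample that differs in the highest-count genes after scaling, and the sample with the large fold change once the data are on a log scale. A plain log with a pseudocount tends to overweight noise in low-count genes in real data, which is one reason to prefer the VST or rlog for this kind of plot.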

ADD COMMENT • link 5.4 years ago Michael Love 43k

Thanks, that makes sense.

I can confirm from my data that the 'outlier' sample from the raw count plot had both the largest library size and highest variance.

What fully convinced me was doing a simple normalization by library size. This ended up producing the same pattern of sample distances that DESeq's median ratio method did. I was actually quite surprised how similar the sample distance heat maps were between the two methods.
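For readers wondering why the two normalizations can agree so closely, the following is a rough Python sketch of both size-factor calculations on invented counts. It follows the general median-of-ratios idea (per-gene geometric mean as a reference, then the median ratio per sample) and is not the code behind DESeq2's `estimateSizeFactors`. The data and function names are hypothetical, and the library-size factors are centred on their geometric mean purely to make the two sets comparable.

```python
import math
import statistics

# Hypothetical toy counts: one list per sample, same gene order in each.
counts = {
    "A": [1000, 200, 100, 50, 20],
    "B": [2000, 400, 200, 100, 40],  # A sequenced twice as deep
    "C": [5000, 200, 100, 50, 20],   # A with one dominant gene 5x higher
}


def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))


def library_size_factors(samples):
    """Total count per sample, divided by the geometric mean of the totals."""
    totals = {s: sum(v) for s, v in samples.items()}
    centre = geometric_mean(list(totals.values()))
    return {s: t / centre for s, t in totals.items()}


def median_ratio_factors(samples):
    """Median over genes of count / per-gene geometric mean across samples."""
    names = list(samples)
    n_genes = len(samples[names[0]])
    # Only genes with non-zero counts in every sample have a usable reference.
    usable = [g for g in range(n_genes)
              if all(samples[s][g] > 0 for s in names)]
    reference = {g: geometric_mean([samples[s][g] for s in names])
                 for g in usable}
    return {s: statistics.median(samples[s][g] / reference[g] for g in usable)
            for s in names}


lib = library_size_factors(counts)
mor = median_ratio_factors(counts)
for s in counts:
    print(s, "library-size factor:", round(lib[s], 3),
          "median-ratio factor:", round(mor[s], 3))
```

In this sketch both methods see B as twice as deep as A, but they disagree on C: the library-size factor is inflated by the single dominant gene, while the median-ratio factor matches A. Near-identical heat maps from the two methods would be consistent with samples that differ mainly in sequencing depth rather than in composition.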

Library Size Normalized Heat Map

DESeq Normalized Count Heat Map

Also, when you refer to the top 1% of genes driving sample differences after normalization, do you mean the top genes by expression or by variance?

ADD REPLY • link 5.3 years ago wunderl ▴ 40

If you do perform scaling but don't transform the data (so the data are heteroskedastic), clustering will be driven by differences in the top genes by expression level.

ADD REPLY • link 5.3 years ago Michael Love 43k