Question: DESeq2: Sample outlier patterns 'switching' after normalization
3 months ago, wunderl wrote:

::Edit:: The embedded images are not formatting well, so I have changed them to external links.

I have been following the DESeq2 vignette for bulk RNA-Seq data and got to the section where it is recommended that you make a heatmap of the Euclidean distances between samples.

In addition to the recommended heat map using vst-transformed counts, I decided to also make heat maps using the raw and DESeq-normalized counts. What I found, however, did not make sense.

When using the raw counts to generate the sample distances, I found that one sample stood out as an outlier (S1). As shown below, this sample has high distances to the other samples, with distances to two samples in particular being especially high (S12 and S6).

This was expected since I knew that sample showed elevated levels of RNA degradation compared to the others.

Raw Count Heat Map Here

However, when using the normalized counts, the result did not make sense. Sample S1 no longer looked like an outlier; it looked normal. Instead, a different sample, S13, now looked like an outlier with a nearly identical pattern (overall higher distances, with two samples that are especially high, S6 and S7).

I found this to be concerning as it almost looks like the normalization procedure swapped the sample labels (after some testing, however, I determined this was not the case). This is not what I would have expected from sample normalization, especially with the outlier pattern 'switching' to a different sample.

Normalized Count Heat Map Here

Does anyone have an idea what might be going on? Or is there something I could do to better understand what is happening here?

Tags: normalization, deseq2
Answer: DESeq2: Sample outlier patterns 'switching' after normalization
3 months ago, Michael Love wrote:


So if you don't use VST- or log-transformed counts, the major factor driving the clustering will be the genes with the very highest counts. If you don't perform scaling, the clustering will be roughly based on library size. If you do perform scaling but don't transform, then the clustering will be based on whatever patterns remain in the expression levels of the top 1% of genes. (I'm just making up a percentage here, but the point is that the distances, and therefore the cluster diagrams, will not be representative of the whole experiment, because the variance hasn't been stabilized.)
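To illustrate the library-size point, here is a minimal Python sketch on simulated data (not the poster's data, and not DESeq2 code). The expression profile, the 10% multiplicative noise model, and the sample names A, B, and C are all assumptions made for the example. Three samples share one profile, but C is sequenced three times deeper, so on raw counts it looks like an outlier even though it is biologically the same.

```python
import math
import random

def euclidean(x, y):
    """Euclidean distance between two equal-length count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

random.seed(0)
n_genes = 2000
# Hypothetical expression profile spanning several orders of magnitude.
base = [math.exp(random.uniform(0, 10)) for _ in range(n_genes)]

def simulate(depth):
    # Toy model: profile * depth with ~10% multiplicative noise
    # (an assumption for illustration, not a negative binomial model).
    return [round(b * depth * math.exp(random.gauss(0, 0.1))) for b in base]

samples = {"A": simulate(1.0), "B": simulate(1.0), "C": simulate(3.0)}

def scale_by_library_size(counts):
    # Simple counts-per-million scaling.
    total = sum(counts)
    return [c / total * 1e6 for c in counts]

raw_AB = euclidean(samples["A"], samples["B"])
raw_AC = euclidean(samples["A"], samples["C"])

scaled = {name: scale_by_library_size(c) for name, c in samples.items()}
scaled_AB = euclidean(scaled["A"], scaled["B"])
scaled_AC = euclidean(scaled["A"], scaled["C"])

print("raw:    d(A,B) =", round(raw_AB), " d(A,C) =", round(raw_AC))
print("scaled: d(A,B) =", round(scaled_AB), " d(A,C) =", round(scaled_AC))
```

On raw counts the A to C distance is inflated by depth alone; after scaling, the two distances become comparable.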

written 3 months ago by Michael Love

Thanks, that makes sense.

I can confirm from my data that the 'outlier' sample from the raw count plot had both the largest library size and highest variance.

What fully convinced me was doing a simple normalization by library size. This ended up producing the same pattern of sample distances that DESeq's median ratio method did. I was actually quite surprised how similar the sample distance heat maps were between the two methods.
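The similarity between the two methods is what one would expect when there is little composition bias. Below is a simplified Python re-implementation of the median-of-ratios idea, compared against plain library-size factors on simulated counts. The function names, the depths, and the noise model are assumptions for illustration; this is a sketch of the technique, not DESeq2's actual code.

```python
import math
import random
from statistics import median

random.seed(1)
n_genes = 1000
# Hypothetical shared expression profile; every sample differs only by depth.
base = [math.exp(random.uniform(2, 9)) for _ in range(n_genes)]
depths = [0.5, 1.0, 2.0, 4.0]
counts = [
    [round(b * d * math.exp(random.gauss(0, 0.1))) for b in base]
    for d in depths
]

def median_ratio_size_factors(counts):
    """Median over genes of each sample's ratio to the per-gene geometric mean."""
    n_samples = len(counts)
    log_ratios = [[] for _ in range(n_samples)]
    for g in range(len(counts[0])):
        gene = [counts[s][g] for s in range(n_samples)]
        if min(gene) == 0:
            continue  # genes with a zero have no finite geometric mean
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for s in range(n_samples):
            log_ratios[s].append(math.log(gene[s]) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]

def library_size_factors(counts):
    """Total counts per sample, rescaled so the factors have geometric mean 1."""
    totals = [sum(c) for c in counts]
    log_geo_mean = sum(math.log(t) for t in totals) / len(totals)
    return [t / math.exp(log_geo_mean) for t in totals]

mr_factors = median_ratio_size_factors(counts)
ls_factors = library_size_factors(counts)
for d, mr, ls in zip(depths, mr_factors, ls_factors):
    print(f"depth {d}: median-ratio {mr:.3f}, library-size {ls:.3f}")
```

The two sets of factors diverge mainly when a few very highly expressed genes change between samples, because those genes shift the library total but barely move the median ratio.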

Library Size Normalized Heat Map

DESeq Normalized Count Heat Map

Also, when you are referring to the top 1% of genes after normalization driving sample differences, are you referring to top genes by expression or by variance?

written 3 months ago by wunderl

If you do perform scaling but don't transform the data (so the data are heteroskedastic), clustering will be driven by the differences in the top genes by expression level.
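A small Python sketch can make this concrete by measuring how much of the squared Euclidean distance between two samples comes from the top 1% of genes by expression. Everything here is an assumption for illustration: the log-normal expression profile, the fixed ±10% difference per gene, and the use of log2(count + 1) as a crude stand-in for a variance-stabilizing transformation. The two simulated samples have nearly equal depth, so scaling is skipped.

```python
import math
from statistics import NormalDist

n_genes = 2000
std_normal = NormalDist()
# Hypothetical log-normal expression profile built from evenly spaced quantiles.
base = [
    math.exp(4 + 2 * std_normal.inv_cdf((i + 0.5) / n_genes))
    for i in range(n_genes)
]
# Every gene differs by the same relative amount (+/-10%), alternating direction.
sign = [1 if i % 2 == 0 else -1 for i in range(n_genes)]
sample_a = [round(b * (1 + 0.1 * s)) for b, s in zip(base, sign)]
sample_b = [round(b * (1 - 0.1 * s)) for b, s in zip(base, sign)]

# Top 1% of genes ranked by mean untransformed expression.
mean_expr = [(x + y) / 2 for x, y in zip(sample_a, sample_b)]
n_top = n_genes // 100
ranked = sorted(range(n_genes), key=lambda g: mean_expr[g], reverse=True)
top_idx = set(ranked[:n_top])

def share_from_top(a, b):
    """Fraction of the squared Euclidean distance contributed by the top genes."""
    sq = [(x - y) ** 2 for x, y in zip(a, b)]
    return sum(sq[g] for g in top_idx) / sum(sq)

share_raw = share_from_top(sample_a, sample_b)

log_a = [math.log2(x + 1) for x in sample_a]
log_b = [math.log2(x + 1) for x in sample_b]
share_log = share_from_top(log_a, log_b)

print(f"top 1% share of squared distance, untransformed: {share_raw:.1%}")
print(f"top 1% share of squared distance, log2(x + 1):   {share_log:.1%}")
```

Even though every gene differs by the same relative amount, the untransformed distance is dominated by the handful of most highly expressed genes, while the log scale spreads the contribution across genes. A plain log inflates the influence of low-count genes instead, which is one reason a variance-stabilizing transformation is preferred.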

written 3 months ago by Michael Love