::Edit:: The embedded images are not formatting well, so I have changed them to external links.
I have been following the DESeq2 vignette for bulk RNA-Seq data and got to the section where it recommends making a heatmap of the Euclidean distances between samples.
In addition to the recommended heat map using vst-transformed counts, I decided to also make heat maps using the raw and DESeq-normalized counts. What I found, however, did not make sense.
When using the raw counts to generate the sample distances, I found that one sample stood out as an outlier (S1). As shown below, this sample has high distances to the other samples, with distances to two samples in particular being especially high (S12 and S6).
This was expected since I knew that sample showed elevated levels of RNA degradation compared to the others.
However, when using the normalized counts, the result did not make sense. Sample S1 no longer looked like an outlier. Instead, a different sample, S13, now looked like an outlier with a nearly identical pattern (overall higher distances, with two samples that are especially high, S6 and S7).
I found this concerning, as it almost looks like the normalization procedure swapped the sample labels (after some testing, however, I determined this was not the case). This is not what I would have expected from sample normalization, especially the outlier pattern 'switching' to a different sample.
Normalized Count Heat Map Here
Does anyone have an idea what might be going on? Or is there something I could do to better understand what is happening here?
Thanks, that makes sense.
I can confirm from my data that the 'outlier' sample from the raw count plot had both the largest library size and highest variance.
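For anyone who wants to see the library-size effect in isolation, here is a minimal sketch in Python/NumPy (not DESeq2, and not my actual data). It assumes a hypothetical expression profile shared by all samples and made-up sequencing depths, so the only thing that differs between samples is library size. Under those assumptions, the deepest sample ends up with the largest raw-count Euclidean distances to everything else.

```python
import numpy as np

rng = np.random.default_rng(0)

n_genes = 2000
# Hypothetical relative abundances, identical for every sample (assumption).
profile = rng.lognormal(mean=0.0, sigma=1.5, size=n_genes)
profile /= profile.sum()

# Hypothetical sequencing depths; the last sample is roughly 3x deeper.
depths = np.array([1e6, 1.1e6, 0.9e6, 1e6, 3e6])

# Poisson counts: same biology everywhere, only the depth differs.
counts = rng.poisson(profile * depths[:, None]).astype(float)  # (samples, genes)


def euclidean_distances(x):
    """Pairwise Euclidean distances between the rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


raw_dist = euclidean_distances(counts)
# Mean distance from each sample to the other samples (diagonal is zero).
mean_dist = raw_dist.sum(axis=1) / (len(depths) - 1)
deepest = int(np.argmax(counts.sum(axis=1)))

print("library sizes:", counts.sum(axis=1))
print("mean distance to other samples:", mean_dist.round(1))
print("largest mean distance: sample", int(np.argmax(mean_dist)),
      "| deepest sample:", deepest)
```

In this toy setup the "outlier" in the raw-count distances is simply the most deeply sequenced sample, which is consistent with what I saw for S1.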
What fully convinced me was doing a simple normalization by library size. This ended up producing the same pattern of sample distances that DESeq's median ratio method did. I was actually quite surprised how similar the sample distance heat maps were between the two methods.
Library Size Normalized Heat Map
DESeq Normalized Count Heat Map
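To illustrate why the two normalizations can agree so closely, here is a rough sketch in Python/NumPy. It is a hand-written version of the median-of-ratios idea, not a call into DESeq2, and the simulated counts and depths are assumptions for illustration only. Because the simulated samples differ only in depth, close agreement is expected here. On real data the two would diverge more when a few very highly expressed genes change between samples.

```python
import numpy as np

rng = np.random.default_rng(1)

n_genes = 2000
# Hypothetical shared expression profile and sequencing depths (assumptions).
profile = rng.lognormal(0.0, 1.5, n_genes)
profile /= profile.sum()
depths = np.array([1e6, 2e6, 0.5e6, 1.5e6, 3e6])
counts = rng.poisson(profile * depths[:, None]).astype(float)  # (samples, genes)

# Library-size factors, rescaled so that their geometric mean is 1.
lib = counts.sum(axis=1)
lib_factors = lib / np.exp(np.log(lib).mean())

# Median-of-ratios factors: keep genes with nonzero counts in every sample,
# compare each sample to the per-gene geometric mean, take the median ratio.
expressed = (counts > 0).all(axis=0)
log_counts = np.log(counts[:, expressed])
log_geo_mean = log_counts.mean(axis=0)
mor_factors = np.exp(np.median(log_counts - log_geo_mean, axis=1))

print("library-size factors:    ", lib_factors.round(3))
print("median-of-ratios factors:", mor_factors.round(3))

# Normalized counts are the raw counts divided by the per-sample factor.
norm_lib = counts / lib_factors[:, None]
norm_mor = counts / mor_factors[:, None]
```

Since both sets of factors are nearly proportional in this sketch, the normalized matrices (and therefore the sample distance heat maps) come out almost the same.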
Also, when you are referring to the top 1% of genes after normalization driving sample differences, are you referring to top genes by expression or by variance?
If you do perform scaling but don't transform the data (so the data remain heteroskedastic), clustering will be driven by the differences in the top genes by expression level.
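A small simulation can make this concrete. The sketch below (Python/NumPy) is an assumption-laden toy: gene means, the dispersion value, and the gamma-Poisson counts are all made up, and log2(count + 1) is used as a simple stand-in for a variance-stabilizing transform rather than DESeq2's vst. It measures what share of the total squared Euclidean distance between samples comes from the top 1% of genes by mean expression.

```python
import numpy as np

rng = np.random.default_rng(2)

n_genes, n_samples = 2000, 6
# Hypothetical per-gene means spanning a wide dynamic range (assumption).
gene_means = rng.lognormal(mean=4.0, sigma=2.0, size=n_genes)

# Gamma-Poisson (negative-binomial-like) counts with an assumed dispersion.
dispersion = 0.1
shape = 1.0 / dispersion
lam = rng.gamma(shape, gene_means / shape, size=(n_samples, n_genes))
counts = rng.poisson(lam).astype(float)


def top_fraction(x, top_idx):
    """Share of the summed squared pairwise distances due to genes in top_idx."""
    diff = x[:, None, :] - x[None, :, :]
    per_gene = (diff ** 2).sum(axis=(0, 1))
    return per_gene[top_idx].sum() / per_gene.sum()


# Top 1% of genes by mean count across samples.
n_top = n_genes // 100
top_idx = np.argsort(counts.mean(axis=0))[-n_top:]

frac_raw = top_fraction(counts, top_idx)
frac_log = top_fraction(np.log2(counts + 1.0), top_idx)

print(f"untransformed: top 1% of genes account for {frac_raw:.1%} of distance")
print(f"log2(x + 1):   top 1% of genes account for {frac_log:.1%} of distance")
```

Here "top" means top by expression level. Without a transform, the variance grows with the mean, so the handful of most highly expressed genes dominates the distances. After the log transform their share drops sharply.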