Dear all, I am new to the field. I have recently worked with two datasets and noticed that in both cases (in my hands) data transformation appears to change the sample clustering result. Is this expected? If so, which transformation method would be best suited for an agnostic clustering of samples? Many thanks in advance for your time and reply.
Please see here for an example of my clustering and heatmaps
Thank you for your prompt reply. Yes, I am following your recent (Oct 2019) package guideline and set blind=FALSE during the analysis.
As you might have noticed, in my data set different transformation methods might change the story. Are there any specific guidelines (criteria/references/resources you could recommend) on how to pick a normalization technique agnostically during experiment design/preliminary analysis?
Also this would be a good time for me to thank you for your activities throughout the years on this platform. I come across your replies/comments frequently and they have been really helpful.
I’d guess the PCA plot is similar here.
The hclust result on the distances is not very stable, so I wouldn’t base too much on that alone. It may be more consistent across transformations if you subset to the 500 or 1000 genes with highest variance.
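For readers who want to see what "subset to the genes with highest variance" amounts to, here is a minimal sketch in plain Python. It is not DESeq2 code: the function name top_variable_genes and the toy expression values are made up for illustration, and the input is assumed to be already-transformed values (e.g. from vst or rlog) with one row per gene and one column per sample.

```python
from statistics import variance


def top_variable_genes(matrix, gene_ids, n_top=500):
    """Return the ids of the n_top genes with the highest variance across samples.

    matrix   -- list of rows, one row per gene, one value per sample
                (assumed to be transformed expression values)
    gene_ids -- gene identifiers, in the same order as the rows of matrix
    """
    # Rank genes by their sample variance, largest first.
    ranked = sorted(
        zip(gene_ids, matrix),
        key=lambda pair: variance(pair[1]),
        reverse=True,
    )
    return [gene for gene, _ in ranked[:n_top]]


# Hypothetical toy data: three genes measured in four samples.
expr = {
    "geneA": [5.0, 5.1, 4.9, 5.0],  # nearly flat, carries little signal
    "geneB": [2.0, 8.0, 2.5, 7.5],  # strongly variable
    "geneC": [3.0, 3.5, 6.0, 6.5],  # moderately variable
}

selected = top_variable_genes(list(expr.values()), list(expr.keys()), n_top=2)
print(selected)
```

The sample distances and the clustering would then be computed on the selected rows only, so that nearly constant genes do not contribute noise to the distances.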
You are right. Using the top 500 most variable genes seems to improve the PCA and clustering, and the clustering results from the different transformations look more similar (links to generated images below). Nevertheless, the narrative would still be somewhat different. PCA: https://ibb.co/NLHHF12 clustering: https://ibb.co/DtfT4w7
I also found the publication below. I think that with their dataset and setting, log transformation performed relatively better (I do not think the authors included rlog transformation in their comparison). doi: 10.1371/journal.pone.0191758
That paper didn't use vst or rlog from DESeq2 (or DESeq), so may not help make your decision here at all.
Overall, all transformations agree on the major axis of variation being Y vs X/Z.
The fact that simple log on scaled counts separates X and Z could be meaningful or it could reflect that X and Z differ by a technical artifact which the other two transformations deal with more appropriately (e.g. total read count).
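To make "simple log on scaled counts" concrete, here is a rough Python sketch. It rests on some assumptions: size factors are taken from total read counts for simplicity (DESeq2 estimates them with a median-of-ratios method by default), the function name log_scaled_counts is hypothetical, and the counts are invented. It only shows the scaling step removing a pure sequencing-depth difference; it does not model the extra noise that low counts pick up on the log scale.

```python
import math


def log_scaled_counts(counts_by_sample, pseudocount=1.0):
    """log2(count / size_factor + pseudocount) for every gene in every sample.

    Size factors here are total-count based (an illustrative simplification):
    each sample's total divided by the mean total across samples.
    """
    totals = {sample: sum(counts) for sample, counts in counts_by_sample.items()}
    mean_total = sum(totals.values()) / len(totals)
    size_factors = {sample: total / mean_total for sample, total in totals.items()}
    return {
        sample: [math.log2(c / size_factors[sample] + pseudocount) for c in counts]
        for sample, counts in counts_by_sample.items()
    }


# Hypothetical counts: Z1 has the same composition as X1 but was sequenced twice as deep.
raw = {
    "X1": [10, 100, 890],
    "Z1": [20, 200, 1780],
}

transformed = log_scaled_counts(raw)
for sample, values in transformed.items():
    print(sample, [round(v, 3) for v in values])
```

After scaling, the two toy samples get the same values, so any remaining separation between real samples would have to come from something other than depth alone, for example noise in low counts or a genuine biological difference.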
Sorry I don't have a definite answer here. Maybe if you think the Y vs X/Z difference is a good thing to show in the heatmap then go with rlog.
This is an exploratory experiment for the effects of the treatment, so any results would be of interest as long as I can justify the analysis performed when reporting the results. Given the circumstances, I would probably use rld for the main figures and include vsd and ntd as supplemental data (or just mention them in a line or two in the text).