Question

Does DESeq2 Data_transformation (vsd or rld) Change Clustering (and PCA_plot) Results?

Arctic • 0

@arctic-22506

United States

Dear all, I am new to the field. I have recently worked with two datasets and noticed that in both cases (in my hands) the data transformation appears to change the sample clustering result. Is this expected? If so, which transformation method would be best suited for an agnostic clustering of samples? Many thanks in advance for your time and reply.

Please see here for an example of my clustering and heatmaps

DESeq2 rld vsd data transformation ntd • 2.5k views

ADD COMMENT • link updated 4.4 years ago by Michael Love 41k • written 4.4 years ago by Arctic • 0

score 2 · Accepted Answer · 2019-12-08

Michael Love 41k

@mikelove

United States

Yes, it's expected. For example, we show different performance across these methods in the 2014 paper.

Are you using blind=FALSE? This is recommended if you have large treatment or batch effects.
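For readers following along, here is a minimal sketch of how the three transformations discussed in this thread (vsd, rld, ntd) might be computed with blind=FALSE. The simulated dataset from `makeExampleDESeqDataSet()` is only a stand-in for real data, and the object names are illustrative assumptions, not the poster's actual code.

```r
# Sketch only: simulated counts stand in for the poster's dataset.
library(DESeq2)

set.seed(1)
# Simulated DESeqDataSet with a two-level 'condition' factor and design ~ condition
dds <- makeExampleDESeqDataSet(n = 2000, m = 8)
dds <- estimateSizeFactors(dds)

# blind = FALSE: the design is used when estimating the dispersion trend,
# so large treatment or batch effects do not inflate the dispersion estimates
vsd <- vst(dds, blind = FALSE)    # variance stabilizing transformation
rld <- rlog(dds, blind = FALSE)   # regularized log transformation
ntd <- normTransform(dds)         # log2(normalized count + 1)

# PCA coordinates for each transformation, using the 500 most variable genes
pca_list <- lapply(list(vsd = vsd, rld = rld, ntd = ntd), function(x) {
  plotPCA(x, intgroup = "condition", ntop = 500, returnData = TRUE)
})
```

Each element of `pca_list` is a data frame with PC1 and PC2 per sample, so the three transformations can be compared side by side.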

ADD COMMENT • link 4.4 years ago Michael Love 41k

Thank you for your prompt reply. Yes, I am following your recent (Oct 2019) package guidelines and set blind=FALSE during the analysis.

As you might have noticed, in my dataset, different transformation methods might change the story. Are there any specific guidelines (criteria/references/resources you could recommend) on how to pick a normalization technique agnostically during experiment design/preliminary analysis?

Also this would be a good time for me to thank you for your activities throughout the years on this platform. I come across your replies/comments frequently and they have been really helpful.

ADD REPLY • link 4.4 years ago Arctic • 0

I’d guess the PCA plot is similar here.

The hclust result on the distances is not very stable, so I wouldn’t base too much on that alone. It may be more consistent across transformations if you subset to the 500 or 1000 genes with highest variance.
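The subsetting suggested above can be sketched in base R as follows. The helper name `cluster_top_var` and the simulated matrix are hypothetical; with real data, `mat` would be something like `assay(vsd)` or `assay(rld)`.

```r
# Hypothetical helper: cluster samples using only the most variable genes.
cluster_top_var <- function(mat, ntop = 500) {
  rv <- apply(mat, 1, var)                       # per-gene variance across samples
  keep <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
  hclust(dist(t(mat[keep, , drop = FALSE])))     # Euclidean distance between samples
}

set.seed(1)
# Simulated stand-in for a transformed matrix: 2000 genes x 6 samples
mat <- matrix(rnorm(2000 * 6), nrow = 2000,
              dimnames = list(paste0("gene", 1:2000), paste0("s", 1:6)))
# Make the first 100 genes differ between s1-s3 and s4-s6
mat[1:100, 4:6] <- mat[1:100, 4:6] + 6

hc <- cluster_top_var(mat, ntop = 500)
groups <- cutree(hc, k = 2)   # two-cluster assignment per sample
```

Running the same helper on each transformed matrix with the same `ntop` makes the dendrograms easier to compare. Note that `plotPCA` in DESeq2 already restricts to the top 500 genes by variance by default (its `ntop` argument).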

ADD REPLY • link 4.4 years ago Michael Love 41k

You are right. Using the top 500 most variable genes seems to improve PCA and clustering, and the clustering results from the different transformations look more similar (links to generated images below). Nevertheless, the narrative would still be somewhat different. PCA: https://ibb.co/NLHHF12 clustering: https://ibb.co/DtfT4w7

I also found the publication below. I think that with their dataset and settings, log transformation performed relatively better (I do not think the authors included the rlog transformation in their comparison). doi: 10.1371/journal.pone.0191758

ADD REPLY • link 4.4 years ago Arctic • 0

That paper didn't use vst or rlog from DESeq2 (or DESeq), so it may not help you make your decision here at all.

Overall, all transformations agree on the major axis of variation being Y vs X/Z.

The fact that simple log on scaled counts separates X and Z could be meaningful or it could reflect that X and Z differ by a technical artifact which the other two transformations deal with more appropriately (e.g. total read count).
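One way to probe the technical-artifact possibility, which is not something proposed in this thread, is to check whether a principal component tracks total read count. The sketch below uses simulated counts and a hypothetical helper `pc_depth_cor`; with real data, `lib_size` could be `colSums(counts(dds))` and the matrix could be `assay(ntd)`, `assay(vsd)` or `assay(rld)`.

```r
# Hypothetical diagnostic: correlation of the leading PCs with log2 library size.
pc_depth_cor <- function(mat, lib_size, n_pc = 2) {
  pca <- prcomp(t(mat))   # samples as rows, centered by default
  sapply(seq_len(n_pc), function(i) cor(pca$x[, i], log2(lib_size)))
}

set.seed(1)
n_genes <- 1000
depth <- c(0.5, 0.6, 0.7, 1.5, 1.8, 2.0)    # assumed relative sequencing depths
base_mean <- rexp(n_genes, rate = 1 / 50)   # assumed per-gene expression levels
counts <- sapply(depth, function(d) rpois(n_genes, base_mean * d))
colnames(counts) <- paste0("s", seq_along(depth))
lib_size <- colSums(counts)

# Log of raw counts: depth differences are expected to dominate PC1
raw_cor <- pc_depth_cor(log2(counts + 1), lib_size)

# Log of counts scaled by relative library size, for comparison
scaled <- sweep(counts, 2, lib_size / mean(lib_size), "/")
scaled_cor <- pc_depth_cor(log2(scaled + 1), lib_size)
```

A strong correlation between a PC and library size after scaling would hint that the separation is at least partly technical, though it would not prove it.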

Sorry, I don't have a definitive answer here. Maybe if you think the Y vs X/Z difference is a good thing to show in the heatmap, then go with rlog.

ADD REPLY • link 4.4 years ago Michael Love 41k

This is an exploratory experiment on the effects of the treatment, so any results would be of interest as long as I can justify the analysis performed when reporting them. Given the circumstances, I would probably use the rld for the main figures and include the vsd and ntd as supplemental data (or just mention them in a line or two in the text).

ADD REPLY • link 4.4 years ago Arctic • 0