I want to transform my count data for clustering (with mfuzz
).
From my other question and web-search it seems that 'vst' (Variance Stabilising Transformation) is good.
So I assumed that transformation have to be done on entire count matrix, ie. not only on counts for DE genes (according to DESeq
function with LRT
).
Shortly: should I make vst
transformation on filtered (removing low-count) data? It is not explained in the vignette and for me the result shown below - changing zeroes (second row of data) to values very similar for genes with above zero counts is strange
Or should I transform count-matrix for DE genes only?
So here is an extract of my count matrix, original:
> head(assay(dds2), 3)
m15.1 m15.2 m22.1 m22.2 m29.1 m29.2 m36.1 m36.2 w15.1 w15.2
Zm00001eb015280 56 67 49 52 25 12 12 7 84 30
Zm00001eb000610 0 0 2 0 0 0 0 0 0 0
Zm00001eb033210 139 101 106 104 85 129 99 96 99 170
w22.1 w22.2 w29.1 w29.2 w36.1 w36.2
Zm00001eb015280 24 40 14 10 14 17
Zm00001eb000610 0 0 0 0 0 0
Zm00001eb033210 113 111 82 106 62 93
And after vst
> vsd <- vst(dds2, blind=T)
> head(assay(vsd), 3)
m15.1 m15.2 m22.1 m22.2 m29.1 m29.2 m36.1
Zm00001eb015280 6.501523 6.661304 6.531083 6.580195 6.171981 5.747250 5.747564
Zm00001eb000610 4.757653 4.757653 5.137827 4.757653 4.757653 4.757653 4.757653
Zm00001eb033210 7.323597 7.026938 7.222472 7.205562 7.177450 7.589318 7.310662
m36.2 w15.1 w15.2 w22.1 w22.2 w29.1 w29.2
Zm00001eb015280 5.505651 6.874850 6.111009 6.047239 6.381226 5.809937 5.654703
Zm00001eb000610 4.757653 4.757653 4.757653 4.757653 4.757653 4.757653 4.757653
Zm00001eb033210 7.240777 7.026773 7.610598 7.305663 7.275373 7.096502 7.358659
w36.1 w36.2
Zm00001eb015280 5.841384 5.944786
Zm00001eb000610 4.757653 4.757653
Zm00001eb033210 6.892187 7.279655
Thank you for response. However it is still not clear for me - I understand that for distance calculation
vst
result is ok but after all clustering gives plots and original values like 0 and 56 are transformed to 4.75 and 6.5, respectively. So value "nothing" is changed to 4.75 which is roughly 60% of transformed count value of 56 which equals 6.7It touches more general problem, for DE genes raw counts with size factors is used whereas for plotting either log2(FPKM + 1) or
vst
orrlog
is used. I haven't tested that but I can imagine that expression profile (calculated for example withvsd
) for DE gene (as deemed byDESeq
function) could look like profile for not DE gene.Of course genes with values 0 for all or majority of samples will be excluded by
DESeq
function. But it all make me think that I should rather use at least filtered data forvst
or maybe just matrix for DE genes (so majority of values would be not zero). I've done filtering with following code.The point of the VST is that the values near 0 are anyway so high variance that they cannot be used to inform distance between samples. Rather than have users pick an arbitrary threshold, we compute an approximate variance stabilizing transformation that obviates that decision/heuristic.
You can feel free to use something else if you like, like a threshold that makes sense to you and log2 of scaled counts or something else.
Thank you Michael for your time and explanations. I think I'll test both
log2
andvst
and check results of clustering.Clustering is done using a per-gene Z-score so it only matters how different samples are from the gene mean. Hence, it does not matter if vst puts zero counts to 0, 4 or 10 simce it is the relative deviation that counts.