Question

vst transformation and zero counts

0

Entering edit mode

boczniak767 ▴ 720

@maciej-jonczyk-3945

Last seen 4 weeks ago

Poland

I want to transform my count data for clustering (with mfuzz). From my other question and web-search it seems that 'vst' (Variance Stabilising Transformation) is good.

So I assumed that transformation have to be done on entire count matrix, ie. not only on counts for DE genes (according to DESeq function with LRT).

Shortly: should I make vst transformation on filtered (removing low-count) data? It is not explained in the vignette and for me the result shown below - changing zeroes (second row of data) to values very similar for genes with above zero counts is strange

Or should I transform count-matrix for DE genes only?

So here is an extract of my count matrix, original:

> head(assay(dds2), 3)
                m15.1 m15.2 m22.1 m22.2 m29.1 m29.2 m36.1 m36.2 w15.1 w15.2
Zm00001eb015280    56    67    49    52    25    12    12     7    84    30
Zm00001eb000610     0     0     2     0     0     0     0     0     0     0
Zm00001eb033210   139   101   106   104    85   129    99    96    99   170
                w22.1 w22.2 w29.1 w29.2 w36.1 w36.2
Zm00001eb015280    24    40    14    10    14    17
Zm00001eb000610     0     0     0     0     0     0
Zm00001eb033210   113   111    82   106    62    93

And after vst

> vsd <- vst(dds2, blind=T)
> head(assay(vsd), 3)
                   m15.1    m15.2    m22.1    m22.2    m29.1    m29.2    m36.1
Zm00001eb015280 6.501523 6.661304 6.531083 6.580195 6.171981 5.747250 5.747564
Zm00001eb000610 4.757653 4.757653 5.137827 4.757653 4.757653 4.757653 4.757653
Zm00001eb033210 7.323597 7.026938 7.222472 7.205562 7.177450 7.589318 7.310662
                   m36.2    w15.1    w15.2    w22.1    w22.2    w29.1    w29.2
Zm00001eb015280 5.505651 6.874850 6.111009 6.047239 6.381226 5.809937 5.654703
Zm00001eb000610 4.757653 4.757653 4.757653 4.757653 4.757653 4.757653 4.757653
Zm00001eb033210 7.240777 7.026773 7.610598 7.305663 7.275373 7.096502 7.358659
                   w36.1    w36.2
Zm00001eb015280 5.841384 5.944786
Zm00001eb000610 4.757653 4.757653
Zm00001eb033210 6.892187 7.279655

DESeq2 • 344 views

ADD COMMENT • link updated 9 weeks ago by ATpoint ★ 4.0k • written 9 weeks ago by boczniak767 ▴ 720

score 0 · Answer 1 · 2024-02-21

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 17 hours ago

United States

changing zeroes (second row of data) to values very similar for genes with above zero counts is strange

https://bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#the-variance-stabilizing-transformation-and-the-rlog

See scatterplot.

VST is useful for computing things like distances or clustering. The location doesn't matter when you compute a distance. The transformed data are asymptotically log2.

ADD COMMENT • link 9 weeks ago Michael Love 41k

0

Entering edit mode

Thank you for response. However it is still not clear for me - I understand that for distance calculation vst result is ok but after all clustering gives plots and original values like 0 and 56 are transformed to 4.75 and 6.5, respectively. So value "nothing" is changed to 4.75 which is roughly 60% of transformed count value of 56 which equals 6.7

It touches more general problem, for DE genes raw counts with size factors is used whereas for plotting either log2(FPKM + 1) or vst or rlog is used. I haven't tested that but I can imagine that expression profile (calculated for example with vsd) for DE gene (as deemed by DESeq function) could look like profile for not DE gene.

Of course genes with values 0 for all or majority of samples will be excluded by DESeq function. But it all make me think that I should rather use at least filtered data for vst or maybe just matrix for DE genes (so majority of values would be not zero). I've done filtering with following code.

keep <- rowSums(counts(dds2)) > 1
dds2.f = dds2[keep,]`)

ADD REPLY • link 9 weeks ago boczniak767 ▴ 720

1

Entering edit mode

The point of the VST is that the values near 0 are anyway so high variance that they cannot be used to inform distance between samples. Rather than have users pick an arbitrary threshold, we compute an approximate variance stabilizing transformation that obviates that decision/heuristic.

You can feel free to use something else if you like, like a threshold that makes sense to you and log2 of scaled counts or something else.

ADD REPLY • link 9 weeks ago Michael Love 41k

0

Entering edit mode

Thank you Michael for your time and explanations. I think I'll test both log2 and vst and check results of clustering.

ADD REPLY • link 9 weeks ago boczniak767 ▴ 720

0

Entering edit mode

Clustering is done using a per-gene Z-score so it only matters how different samples are from the gene mean. Hence, it does not matter if vst puts zero counts to 0, 4 or 10 simce it is the relative deviation that counts.

ADD REPLY • link 9 weeks ago ATpoint ★ 4.0k