Question

Variance stabilization transformation (VST), blind=TRUE

Jayesh Kumar • 0

@49806f54

United States

I have RNA-seq count data from early embryonic development time series (normal, mutant, and treated) collected across multiple studies, which I plan to use for clustering genes. To remove the study/batch effects, I am using ComBat-seq with study ID as the batch. For normalization and transformation, I am then using VST with blind=TRUE. After this, the mean expression of a gene is no longer correlated with its variance, which is good. However, in early embryonic development transcriptome data a lot of genes change in expression level. Given these large changes in expression, I am worried about using VST with blind=TRUE: I suspect that the gene dispersions are being overestimated.

As a simple check, I looked at the number of genes that are down-regulated from the early to the late time point. With VST, I get around 1600 genes with a log fold change <= -1. With log2(CPM+0.5) normalization, I get around 4000 genes at the same threshold. I understand that VST penalizes lowly expressed genes more in order to reduce noise, but I am not sure whether this large reduction in the number of down-regulated genes means I am losing many genes simply because they are highly variable over the embryonic development time course. Do you think this is okay? How should I determine whether blind=TRUE is appropriate, or should I use VST with blind=FALSE instead?

The only information I have about these samples is study, developmental time point, and treatment. The issue is that I might have only one replicate per treatment, so I am not sure how to use these as covariates in the analysis. I would be happy to hear any suggestions or feedback.

Just to mention, my clustering results are generally better with VST normalization than with the other approaches I have tried. But I want to be sure that I am not doing something wrong with VST and removing real biological variance from the data.
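
For reference, the log2(CPM+0.5) check described above can be sketched in base R roughly as follows. This is only an illustration on a made-up toy matrix: the object names (`counts`, `early`, `late`) and the use of raw library sizes for the CPM scaling are assumptions, not the exact code behind the numbers quoted in this post.

```r
# Toy count matrix: 4 genes x 4 samples (values are made up for illustration)
counts <- matrix(
  c(100, 120,  20,  25,
     50,  55,  60,  58,
      0,   1,   0,   2,
    400, 380, 900, 950),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4),
                  c("early1", "early2", "late1", "late2"))
)
early <- c("early1", "early2")
late  <- c("late1", "late2")

# Counts per million using the raw library sizes, then log2(CPM + 0.5)
lib_size <- colSums(counts)
cpm      <- t(t(counts) / lib_size) * 1e6
log_cpm  <- log2(cpm + 0.5)

# Log2 fold change of late vs. early from the group means on the log scale
lfc <- rowMeans(log_cpm[, late]) - rowMeans(log_cpm[, early])

# Number of genes called down-regulated at the threshold used above
n_down <- sum(lfc <= -1)
```

The same `rowMeans` difference and threshold can be applied to a VST matrix, since VST values are also on a log2-like scale, which makes the two gene counts directly comparable.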

DESeq2 Transcriptomics RNASeq • 1.2k views

updated 15 months ago by Michael Love 41k • written 15 months ago by Jayesh Kumar • 0

score 0 · Answer 1 · 2023-01-12

Michael Love 41k

@mikelove

United States

I recommend blind=FALSE. It does not pass much information about the design to the transformation: it only looks at the design to know the _global_ distribution of dispersion values, and it doesn't use the design in the transformation itself.
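
A rough sketch of how the two settings can be compared, using simulated data rather than the data from the question. `makeExampleDESeqDataSet()`, `vst()` and `assay()` are existing DESeq2 / SummarizedExperiment functions; the simulated data set and its `~ condition` design are stand-ins for the real study, time point and treatment design, which is not specified in this thread.

```r
library(DESeq2)

# Simulated counts with a two-level 'condition' factor and real differences
# between the groups (betaSD > 0); a stand-in for the actual experiment
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 2000, m = 12, betaSD = 1)

# blind = TRUE: dispersions are re-estimated with an intercept-only design (~ 1)
vsd_blind <- vst(dds, blind = TRUE)

# blind = FALSE: design(dds) is used when estimating the dispersions, which
# then feed the global dispersion trend used by the transformation
vsd_design <- vst(dds, blind = FALSE)

# Summarize how much the transformed values differ between the two settings
diff_values <- as.vector(assay(vsd_design) - assay(vsd_blind))
summary(diff_values)
```

On real data, `dds` would instead be built with `DESeqDataSetFromMatrix()` and a design that reflects the known sample information.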
