varianceStabilizingTransformation for clustering, deconvolution
bruce.moran ▴ 30
@brucemoran-8388
Ireland

Hi,

I have been using varianceStabilizingTransformation() for normalisation of gene expression data so that it can be used in deconvolution and clustering methods. However, on closer inspection I see that splitting samples by group (e.g. tissue type or another phenotypic classification) and then transforming each subset gives a different dataset from transforming the entire set at once.

I have tried using the design matrix to include group information, and setting blind=FALSE with VST, but I still get large variations between groups. This is likely because I am looking at tumour vs. normal tissues, and between disease types also.

My question is: should I be splitting by group or disease prior to VST? I had initially done this, but am now questioning the decision.

By way of example (N.B. I run estimateSizeFactors() and estimateDispersions() on the real data):

library(DESeq2)

# Toy counts: 2 genes x 20 samples (10 tumour, then 10 normal)
counts <- t(data.frame("GENEX" = c(sample.int(2000, 10, replace = TRUE),
                                   sample.int(200, 10, replace = TRUE)),
                       "GENEY" = c(sample.int(1000, 10, replace = TRUE),
                                   sample.int(100, 10, replace = TRUE))))
colnames(counts) <- c(paste0("T_", 1:10), paste0("N_", 1:10))
conds <- data.frame("sampleID" = c(paste0("T_", 1:10), paste0("N_", 1:10)),
                    "Type" = c(rep("Tumour", 10), rep("Normal", 10)))
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = conds,
                              design = ~ Type)
# Normal samples transformed together with the tumours, then subset
vst.all <- assay(varianceStabilizingTransformation(dds[, 1:20]))[, 11:20]
# Normal samples transformed on their own
vst.norm <- assay(varianceStabilizingTransformation(dds[, 11:20]))
vst.all; vst.norm

@mikelove
United States

The VST is a function, like f(x) = log(x+1), which is applied to the normalized counts. However, it takes the global dispersion into account, and that dispersion looks different depending on whether or not the design is known: it appears much higher if you do not allow for differences across groups. Does this answer your question?
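To make the point about the design concrete, here is a minimal sketch on simulated data. It is not taken from the data in this thread: the simulated dataset, the seed, and the object names (dds_design, dds_blind, vsd_design) are assumptions for illustration only. It estimates dispersions once with the group design and once with an intercept-only design, so the two sets of estimates can be compared before the transformation is applied.

```r
library(DESeq2)

set.seed(1)
# Simulated counts (illustrative only): 1000 genes, 12 samples, two conditions
# A and B; betaSD = 1 introduces true differences between the groups.
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 1)
dds <- estimateSizeFactors(dds)

# Dispersions estimated with the design (~ condition): group differences
# are modelled, so they do not count as extra variability.
dds_design <- estimateDispersions(dds)

# Dispersions estimated with an intercept-only design (groups ignored),
# which is what blind = TRUE does internally.
dds_blind <- dds
design(dds_blind) <- ~ 1
dds_blind <- estimateDispersions(dds_blind)

# Compare the typical gene-wise dispersion under the two designs
median(dispersions(dds_design), na.rm = TRUE)
median(dispersions(dds_blind), na.rm = TRUE)

# With blind = FALSE the VST reuses the dispersion trend already estimated
vsd_design <- varianceStabilizingTransformation(dds_design, blind = FALSE)
dim(assay(vsd_design))
```

The two medians can be read side by side to see how much of the apparent dispersion is explained by the grouping in a given dataset.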


Yes, sorry, I keep referring to VST as normalisation when it is a transformation. Your reply does address the central issue of global dispersion and its effect on the transformation.

What I really want to get is an opinion as to which is 'more appropriate':

1. create a single dds object which contains all samples, with blind=FALSE; theoretically more samples give a better estimate of dispersion, and as you say the design can be used to account for group differences when estimating it.

2. subset on groups (e.g. normal, disease types) and create multiple dds objects; there is then absolutely no influence on dispersion from other groups which are essentially different.

N.B. this is specifically for deconvolution and clustering analysis, for which I believe VST to be appropriate.

Appreciate any thoughts on this.


I recommend (1).

(2) is actually more problematic if you are concerned about influence, because you are applying f(x) to some samples and g(x) to other samples.
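The f(x) versus g(x) point can be sketched as follows, again on simulated data rather than the data from this thread. The dataset, the seed, and the names is_a, a_from_all and a_alone are assumptions made for illustration. The same group A samples are transformed once as part of the full dataset and once on their own, and the two results are compared.

```r
library(DESeq2)

set.seed(1)
# Simulated counts (illustrative only): 1000 genes, 6 samples in each of
# conditions A and B, with true differences between the groups.
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 1)
is_a <- dds$condition == "A"

# Option (1): one transformation f(x) fitted on all samples, then subset
vsd_all <- varianceStabilizingTransformation(dds, blind = FALSE)
a_from_all <- assay(vsd_all)[, is_a]

# Option (2): a separate transformation g(x) fitted on group A alone.
# Only one group remains, so the design is reduced to an intercept.
dds_a <- dds[, is_a]
dds_a$condition <- droplevels(dds_a$condition)
design(dds_a) <- ~ 1
vsd_a <- varianceStabilizingTransformation(dds_a, blind = FALSE)
a_alone <- assay(vsd_a)

# Same samples, same counts, but different transformed values
summary(as.vector(a_from_all - a_alone))
```

Any non-zero differences in the summary come from the size factors and dispersion trend being estimated on different sets of samples, which is why values transformed per group are not on a common scale for clustering or deconvolution.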


Great, didn't think of it that way, thanks.