Hi,
I have been using varianceStabilizingTransformation() for normalisation of gene expression data, for use in deconvolution and clustering methods. However, on closer inspection I now see that splitting samples by group (e.g. tissue type, or another phenotypic classification) and then transforming each group separately produces a different dataset than transforming the entire set at once.
I have tried including the group information in the design and setting blind=FALSE with VST, but I still get large variation between groups. This is likely because I am comparing tumour vs. normal tissues, and also different disease types.
My question is: should I be splitting by group or disease type prior to VST? I had initially done this, but am now questioning the decision.
By way of example (N.B. I run estimateSizeFactors() and estimateDispersions() on the real data):
library(DESeq2)

# Toy counts: two genes, 10 tumour samples (higher counts) then 10 normal samples
counts <- t(data.frame("GENEX" = c(sample.int(2000, 10, replace = TRUE),
                                   sample.int(200, 10, replace = TRUE)),
                       "GENEY" = c(sample.int(1000, 10, replace = TRUE),
                                   sample.int(100, 10, replace = TRUE))))
colnames(counts) <- c(paste0("T_", 1:10), paste0("N_", 1:10))
conds <- data.frame("sampleID" = c(paste0("T_", 1:10), paste0("N_", 1:10)),
                    "Type" = c(rep("Tumour", 10), rep("Normal", 10)))
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = conds,
                              design = ~ Type)

# Transform all 20 samples together, then keep only the normal columns
vst.all <- assay(varianceStabilizingTransformation(dds[, 1:20]))[, 11:20]
# Transform the 10 normal samples on their own
vst.norm <- assay(varianceStabilizingTransformation(dds[, 11:20]))
vst.all; vst.norm
Yes, sorry, I keep referring to VST as normalisation when it is a transformation. Your reply does address the central issue of global dispersion and its effect on the transformation.
What I really want to get is an opinion as to which is 'more appropriate':
(1) create a single dds object which contains all samples, with blind=FALSE; theoretically more samples give a better estimate of dispersion, and as you say the design can be used to account for group dispersion.
(2) subset on groups (e.g. normal, disease types) and create multiple dds objects; there is then absolutely no influence on dispersion from other groups which are essentially different.
N.B. this is specifically for deconvolution and clustering analysis, for which I believe VST to be appropriate.
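For concreteness, the two options could be sketched roughly as below. This is only an illustration under assumptions: it uses DESeq2's makeExampleDESeqDataSet() simulated data (with its built-in `condition` factor, levels A and B) in place of my real counts, and the object names are made up for the example.

```r
library(DESeq2)

# Simulated stand-in for real counts (assumption): 1000 genes, 20 samples,
# with a 'condition' factor whose levels are A and B.
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 20)
is_B <- dds$condition == "B"

# Option (1): one object, design-aware transformation, subset afterwards.
vsd_all <- varianceStabilizingTransformation(dds, blind = FALSE)
vst_all_B <- assay(vsd_all)[, is_B]

# Option (2): subset first, then transform the subset on its own.
dds_B <- dds[, is_B]
design(dds_B) <- ~ 1  # only one group remains, so there is no group term to fit
vst_sub_B <- assay(varianceStabilizingTransformation(dds_B, blind = FALSE))

# Same samples and same counts, but the two fitted transformations differ.
summary(as.vector(vst_all_B - vst_sub_B))
```

The point of the comparison is only that the per-sample values change depending on which samples were present when the transformation was fitted.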
Appreciate any thoughts on this.
I recommend (1).
(2) is actually more problematic if you are concerned about influence, because you are applying f(x) to some samples and g(x) to other samples.
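A minimal base-R sketch of the f(x) vs. g(x) point, under a simplifying assumption: each group gets one fixed dispersion alpha and the closed-form variance-stabilizing function for a negative binomial with variance mu + alpha * mu^2. DESeq2 itself works from a fitted dispersion-mean trend rather than a single alpha, and the two alpha values here are hypothetical.

```r
# Closed-form VST for variance mu + alpha * mu^2 (single fixed alpha: a simplification)
nb_vst <- function(x, alpha) (2 / sqrt(alpha)) * asinh(sqrt(alpha * x))

counts <- c(10, 100, 1000)

f <- function(x) nb_vst(x, alpha = 0.05)  # hypothetical dispersion fitted on one group
g <- function(x) nb_vst(x, alpha = 0.50)  # hypothetical dispersion fitted on another group

# Identical raw counts land on different transformed scales
data.frame(count = counts, f = f(counts), g = g(counts))
```

Because the same count maps to different values under f and g, differences between separately transformed groups partly reflect the transformation rather than the data.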
Great, didn't think of it that way, thanks.