Hi,
I have been using varianceStabilizingTransformation to normalize single-cell RNA-seq data. I have assembled a large data set of >1500 cells (samples) with counts for about 40,000 genes, and varianceStabilizingTransformation has now been running for about 24 hours. Will this process finish? I previously ran varianceStabilizingTransformation on a set of 406 cells/samples and 32,000 genes, which took 6 hours. If the increase in processor time is not linear, how bad is it? The DESeq2 documentation states that varianceStabilizingTransformation should be faster than rlog, but that comparison is on a data set of 20 samples and 1000 genes.
Another question: varianceStabilizingTransformation does not run as a parallel process. Is there a way to parallelize it, or is this not possible?
thanks
Brian
Example of what I ran (and am still running):
# build the DESeq2 object
ddsEmbF <- DESeqDataSetFromMatrix(countData = embryoF, colData = condEmbF, design = ~ Characteristics.developmental.stage.)
> ddsEmbF
class: DESeqDataSet
dim: 42761 1528
metadata(1): version
assays(1): counts
rownames(42761): 5S_rRNA 7SK ... snoZ6 yR211F11.2
rowData names(0):
colnames(1528): E3.1.443 E3.1.444 ... E7.9.573 E7.9.574
colData names(21): names Comment.ENA_SAMPLE. ...
Characteristics.inferred.pseudo.time. Factor.Value.cell.
# normalize the data
vstEmbF <- varianceStabilizingTransformation(ddsEmbF, blind = FALSE)
Thanks for the reply. I see the new vst() function now; it increases speed through sub-sampling. I was concerned about what this means for performance, since the transformation is already an estimate, and it was not clear whether the sampling routine would also need to be tuned to the data set. 1000 genes might be representative of a small data set with low diversity. How are the 1000 genes picked? Based on variance?
The vst() function uses 1000 genes that are equally spaced along a grid of mean normalized counts. This is sufficient for bulk RNA-seq datasets to estimate the dispersion trend line (which is all that the variance stabilizing transformation uses).
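For example (a minimal sketch, assuming the ddsEmbF object built in the question; nsub = 1000 is the default subset size):

# vst() fits the dispersion trend on a subset of nsub rows,
# then applies the variance stabilizing transformation to all rows
vstEmbF <- vst(ddsEmbF, blind = FALSE, nsub = 1000)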
However, depending on your input data, you may want to custom fit the dispersion trend line using a subset of rows that you pick to be representative. By choosing, say, 100-1000 rows, you obtain a large speedup in performing the VST. Note that the varianceStabilizingTransformation() function itself should take essentially no time once you have already estimated the trend line and specify blind=FALSE.
You can do so like this:
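(A minimal sketch of the approach described above; the random subset of 1000 rows is only illustrative, and you would substitute whatever representative rows you prefer.)

# size factors are needed before estimating dispersions
ddsEmbF <- estimateSizeFactors(ddsEmbF)

# pick a subset of rows on which to fit the dispersion trend
idx <- sample(nrow(ddsEmbF), 1000)
ddsSub <- ddsEmbF[idx, ]

# estimate gene-wise dispersions and fit the trend line on the subset only
ddsSub <- estimateDispersionsGeneEst(ddsSub)
ddsSub <- estimateDispersionsFit(ddsSub)

# copy the fitted trend back to the full object; with blind = FALSE
# the VST reuses it instead of re-estimating dispersions for all rows
dispersionFunction(ddsEmbF) <- dispersionFunction(ddsSub)
vstEmbF <- varianceStabilizingTransformation(ddsEmbF, blind = FALSE)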