DESEQ2 VST sample number limits?
3
0
Entering edit mode
b.cox • 0
@bcox-11191
Last seen 8.4 years ago

Hi,

I have been using variancestabilizingtransformation for normalizing single cell RNA-seq data. I have assembled a large data set of >1500 cells (samples) that contain information on about 40,000 genes. variancestabilizingtransformation has been running for about 24 hours. Will this process finish? previously I have used variancestabilizingtransformation on a sample set of 406 cells/samples and 32000 genes, which took 6 hours. If the increase in processor time is not linear how bad is this? The DESEQ manual stated that variancestabilizingtransformation should be faster than rlog, but they test on a sample data set of 20 samples and 1000 genes.

Another question, vst does not run as a parallel process, is there a method to do this is this not possible?

thanks

Brian

 

Example of what I ran/running:

#build the DESEQ2 object
ddsEmbF<- DESeqDataSetFromMatrix(countData = embryoF, colData=condEmbF, design = ~Characteristics.developmental.stage.)

> ddsEmbF
class: DESeqDataSet
dim: 42761 1528
metadata(1): version
assays(1): counts
rownames(42761): 5S_rRNA 7SK ... snoZ6 yR211F11.2
rowData names(0):
colnames(1528): E3.1.443 E3.1.444 ... E7.9.573 E7.9.574
colData names(21): names Comment.ENA_SAMPLE. ...
  Characteristics.inferred.pseudo.time. Factor.Value.cell.


#normalize the data
vstEmbF<-varianceStabilizingTransformation(ddsEmbF,blind=FALSE)

 

deseq2 variancestabilizingtransformation limits size • 2.8k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 3 days ago
United States

From the DESeq2 vignette section on transformations: "...a new function vst provides an even faster version of the varianceStabilizingTransformation..."

vst() may still take longer that you'd like. I'd try vst() on increasingly large subset of the columns to get a sense of timing, how long for 100 columns, 200 columns, etc. 

Even faster still would be normTransform(), while trying different pseudocounts to see which best stabilize the variance across the mean (see the vignette for more details).

 

ADD COMMENT
0
Entering edit mode

Thanks for the reply. I saw the new vst() function now. It increases speed through sub-sampling. I was concerned about this for the performance as it is already an estimate and it was not clear if the sampling routine would also need to be optimized for the data set. A 1000 genes might be representative of a small data set with low diversity. How would the 1000 genes be picked based on variance?

ADD REPLY
0
Entering edit mode

The vst() function uses 1000 genes which are equally spaced on a grid of mean normalized count. This is sufficient for bulk RNA-seq datasets to estimate the dispersion trend line (which is all that is used by the variance stabilizing transformation).

However, depending on what input data you use, you may want to custom fit the dispersion trend line using some subset of rows you pick to be representative. By choosing say 100-1000 rows, you obtain a large speedup in performing the VST. Note that the varianceStabilizingTranformation() function itself should take no time once you've already estimated the trend line and specify blind=FALSE.

You can do so like this:

dds.sub <- dds[idx,]
dds.sub <- estimateDispersionsGeneEst(dds.sub)
dds.sub <- estimateDispersionsFit(dds.sub)
dispersionFunction(dds) <- dispersionFunction(dds.sub)
# will not re-estimage dispsersions:
vsd <- varianceStabilizingTranformation(dds, blind=FALSE) 

 

ADD REPLY
1
Entering edit mode
b.cox • 0
@bcox-11191
Last seen 8.4 years ago

Thinking of Michael's answer above. The example in the DESeq2 manual stated that 20 samples by 1000 genes took on the order of 5 seconds. this is 20K data features or about 4K/sec. My smaller data set of 12 million features took about 4 hours to process, but if vst were linear it would take 0.8 hours. so four times longer than linear. I now have 60 million features that should take 4.1 hours if linear.

ADD COMMENT
0
Entering edit mode

Would be interesting to compare the results from that vs. using normTransform with a relatively high pseudocount (ie. pc=5, or pc=10) ... after you save the vst object, of course :-)

ADD REPLY
0
Entering edit mode

The dispersion estimation routine may not be optimized for the intercept-only case when blind=TRUE. I could potentially get some speed up there, but I've also got a number of other ideas for transformations for new kinds of experimental data that I'd like to work on.

When blind=FALSE, you need to form the design matrix and perform matrix multiplication to calculate Cox-Reid dispersion estimates, so you should not expect time to scale linearly in that case.

ADD REPLY
0
Entering edit mode
b.cox • 0
@bcox-11191
Last seen 8.4 years ago

variancestabilizingtransformation() crashed and produced no object returne4d. I am trying with vst(). if I can find the time I will try some different sizes and setting to determine what the cost is to increase the sample and gene dimensions, as it could be good to know.

ADD COMMENT

Login before adding your answer.

Traffic: 322 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6