Hi everyone,
We are about to update MLSeq package. We are indecisive on some point and your comments are very valuable.
Here, the objective is to split the raw mapped RNA-seq read count data into training and test sets, apply deseq normalization, use vst transformation to make the data hierarchically closer to microarrays. Then, we use conventional algorithms for classification purpose.
The question is the dispersion estimation in deseq. Which dispersion estimation method is most suitable for classification purpose? Test data may contain only one sample, is 'blind' method a better choice here?
Another point is that whether we should use information about dispersion estimation from training data for dispersion estimation of test data. The problem is similar to z-score standardization as how we use the training feature means and standard deviations to standardize test data. To estimate test set size factors, we use the geometric means of training data and divide each test count to these means. Should we use a similar strategy for dispersion estimation?
Below are some sample codes. Thanks,
##
library(MLSeq)
library(DESeq)
#Creation of example training and test sets
data(cervical)
tr = cervical[1:15, c(1:5, 30:34)]
ts = cervical[1:15, c(6:7, 35:36)]
tr = tr + 1
ts = ts + 1
train.cond = rep(1:2, c(5,5))
test.cond = rep(1:2, c(2,2))
#Calculation of test set size factors using geometric means from training data
geomts = ts / exp(rowMeans(log(tr)))
sizeF.ts = apply(geomts, 2, function(x) median(x))
#deseq normalization and vst transformation to training data
trS4 = newCountDataSet(tr, train.cond)
trS4 = estimateSizeFactors(trS4)
trS4 = estimateDispersions(trS4, method = "blind", sharingMode="fit-only", fitType="local")
#should we use a different method rather than 'blind'?
trS4exp = varianceStabilizingTransformation(trS4)
dataexpTrain = t(exprs(trS4exp))
#deseq normalization and vst transformation to test data
tsS4 = newCountDataSet(ts, test.cond)
tsS4 = estimateSizeFactors(tsS4)
sizeFactors(tsS4) = sizeF.ts #Test set size factors are calculated above
tsS4 = estimateDispersions(tsS4, method = "blind", sharingMode="fit-only", fitType="local")
#How to estimate the test set dispersions? And which method, 'blind'?
tsS4exp = varianceStabilizingTransformation(tsS4)
dataexpTest = t(exprs(tsS4exp))
##
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq_1.16.0 locfit_1.5-9.1 BiocInstaller_1.14.3
[4] MLSeq_1.0.0 edgeR_3.6.1 randomForest_4.6-7
[7] DESeq2_1.4.1 RcppArmadillo_0.4.300.0 Rcpp_0.11.1
[10] GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 IRanges_1.22.6
[13] caret_6.0-24 ggplot2_0.9.3.1 lattice_0.20-29
[16] limma_3.20.1 Biobase_2.24.0 BiocGenerics_0.10.0
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.26.0 DBI_0.2-7 MASS_7.3-33
[4] RColorBrewer_1.0-5 RSQLite_0.11.4 XML_3.98-1.1
[7] XVector_0.4.0 annotate_1.42.0 car_2.0-20
[10] codetools_0.2-8 colorspace_1.2-4 digest_0.6.4
[13] foreach_1.4.2 genefilter_1.46.0 geneplotter_1.42.0
[16] grid_3.1.1 gtable_0.1.2 iterators_1.0.7
[19] munsell_0.4.2 nnet_7.3-8 plyr_1.8.1
[22] proto_0.3-10 reshape2_1.4 scales_0.2.4
[25] splines_3.1.1 stats4_3.1.1 stringr_0.6.2
[28] survival_2.37-7 tools_3.1.1 xtable_1.7-3
Mike wrote:
> Then, in the example of applying the dispersion-mean-trend function from the training set to the test set, and then applying the exact same transformation, you should use blind=FALSE again.
In the code in OP's question, estimateDispersion is called again, on a new DESeqDataSet created from the test data. Mike seems to assume that you have one DESeqDataSet object with all data, whcih you subset after estimating dispersions. If it is two independent data sets, then you have to copy the the fitted dispersion function from one object to the other. Mike, could you give the code for this? (Not sure I remember correctly how to do this.)
Also, maybe consider using the rlogTransformation instead of the varianceStabilizingTransformation. This one is more robust if sequencing depth varies more than only a bit.
There is code giving an example of this in the examples under:
Thank you Michael and Simon, that is very helpful. Setter option makes it very easier. rlog will also be available in the package for transformation. Best,
Gokmen Zararsiz