Search
Question: DESeq Dispersion Estimation for Classification Purpose
0
3.9 years ago by
Turkey
gokmenzararsiz10 wrote:

Hi everyone,

We are about to update MLSeq package. We are indecisive on some point and your comments are very valuable.

Here, the objective is to split the raw mapped RNA-seq read count data into training and test sets, apply deseq normalization, use vst transformation to make the data hierarchically closer to microarrays. Then, we use conventional algorithms for classification purpose.

The question is the dispersion estimation in deseq. Which dispersion estimation method is most suitable for classification purpose? Test data may contain only one sample, is 'blind' method a better choice here?

Another point is that whether we should use information about dispersion estimation from training data for dispersion estimation of test data. The problem is similar to z-score standardization as how we use the training feature means and standard deviations to standardize test data. To estimate test set size factors, we use the geometric means of training data and divide each test count to these means. Should we use a similar strategy for dispersion estimation?

Below are some sample codes. Thanks,

##

library(MLSeq)
library(DESeq)

#Creation of example training and test sets
data(cervical)
tr = cervical[1:15, c(1:5, 30:34)]
ts = cervical[1:15, c(6:7, 35:36)]

tr = tr + 1
ts = ts + 1

train.cond = rep(1:2, c(5,5))
test.cond = rep(1:2, c(2,2))

#Calculation of test set size factors using geometric means from training data
geomts = ts / exp(rowMeans(log(tr)))
sizeF.ts = apply(geomts, 2, function(x) median(x))

#deseq normalization and vst transformation to training data
trS4 = newCountDataSet(tr, train.cond)
trS4 = estimateSizeFactors(trS4)
trS4 = estimateDispersions(trS4, method = "blind", sharingMode="fit-only", fitType="local")

#should we use a different method rather than 'blind'?
trS4exp = varianceStabilizingTransformation(trS4)
dataexpTrain = t(exprs(trS4exp))

#deseq normalization and vst transformation to test data
tsS4 = newCountDataSet(ts, test.cond)
tsS4 = estimateSizeFactors(tsS4)
sizeFactors(tsS4) = sizeF.ts #Test set size factors are calculated above
tsS4 = estimateDispersions(tsS4, method = "blind", sharingMode="fit-only", fitType="local")
#How to estimate the test set dispersions? And which method, 'blind'?
tsS4exp = varianceStabilizingTransformation(tsS4)
dataexpTest = t(exprs(tsS4exp))

##

sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] DESeq_1.16.0            locfit_1.5-9.1          BiocInstaller_1.14.3
[4] MLSeq_1.0.0             edgeR_3.6.1             randomForest_4.6-7
[10] GenomicRanges_1.16.3    GenomeInfoDb_1.0.2      IRanges_1.22.6
[13] caret_6.0-24            ggplot2_0.9.3.1         lattice_0.20-29
[16] limma_3.20.1            Biobase_2.24.0          BiocGenerics_0.10.0

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.26.0 DBI_0.2-7            MASS_7.3-33
[4] RColorBrewer_1.0-5   RSQLite_0.11.4       XML_3.98-1.1
[7] XVector_0.4.0        annotate_1.42.0      car_2.0-20
[10] codetools_0.2-8      colorspace_1.2-4     digest_0.6.4
[13] foreach_1.4.2        genefilter_1.46.0    geneplotter_1.42.0
[16] grid_3.1.1           gtable_0.1.2         iterators_1.0.7
[19] munsell_0.4.2        nnet_7.3-8           plyr_1.8.1
[22] proto_0.3-10         reshape2_1.4         scales_0.2.4
[25] splines_3.1.1        stats4_3.1.1         stringr_0.6.2
[28] survival_2.37-7      tools_3.1.1          xtable_1.7-3

modified 3.9 years ago • written 3.9 years ago by gokmenzararsiz10
0
3.9 years ago by
Michael Love19k
United States
Michael Love19k wrote:

hi,

I would transform the test data using the same transformation as was applied to the training data. You can do this in DESeq2, see the examples under ?varianceStabilizingTransformation.

The setting of 'blind' is a question whether the dispersion estimation should know that samples are in different groups. For the training dataset, I'd guess that you want this (as otherwise, differences between groups appear as extra biological variance), so to set blind=FALSE. Then, in the example of applying the dispersion-mean-trend function from the training set to the test set, and then applying the exact same transformation, you should use blind=FALSE again. This time, the usage of blind=FALSE means that the VST function will not re-estimate the dispersions with ~ 1, but just use the supplied trend function stored in dispersionFunction(dds). (This is not obvious from the help files, I will make a note to make this more clear.)

You can supply the geometric means from the training set to the function estimateSizeFactors(), or use your code above, it should be the same result.

Mike wrote:

> Then, in the example of applying the dispersion-mean-trend function from the training set to the test set, and then applying the exact same transformation, you should use blind=FALSE again.

In the code in OP's question, estimateDispersion is called again, on a new DESeqDataSet created from the test data. Mike seems to assume that you have one DESeqDataSet object with all data, whcih you subset after estimating dispersions. If it is two independent data sets, then you have to copy the the fitted dispersion function from one object to the other. Mike, could you give the code for this? (Not sure I remember correctly how to do this.)

Also, maybe consider using the rlogTransformation instead of the varianceStabilizingTransformation. This one is more robust if sequencing depth varies more than only a bit.

There is code giving an example of this in the examples under:

?varianceStabilizingTransformation
1

Thank you Michael and Simon, that is very helpful. Setter option makes it very easier. rlog will also be available in the package for transformation. Best,

Gokmen Zararsiz