Question

Variance estimation with DESeq2 and rLog transformation.

0

Entering edit mode

mattia.furlan776 • 0

@mattiafurlan776-11207

Last seen 7.7 years ago

Hello,

my name is Mattia Furlan, I am a student majoring in Physics of Complex Systems at the University of Turin (IT).

As my master thesis project, I am currently working on the inference of transcription rates in eukaryotic cells from time course RNA-seq data.

To pursue this aim I have recourse to DESeq2, in order to "prepare" the raw counts data coming from the sequencing experiments, according to the following framework:

- creation of a DESeq data set, function DESeqDataSetFromMatrix
- execution of the DESeq protocol, function DESeq
- normalization of the counts, function counts(... , normalized = TRUE)
- evaluation of the variance for each gene and for each time point,

mu + alpha*mu^2 with mu<-assays( ... )[['mu']] , alpha<-dispersions( ... )

I am seeking for feedbacks on the correctness of this procedure.

I am also evaluating the possibility of applying the rLog transformation to my data modifying the previous framework in the following manner:

- creation of a DESeq data set
- execution of the DESeq protocol

- rLog transformation, function rlog(... ,blind = TRUE)

- reversion of the transformation as 2^assay( "rlog data" )
- evaluation of the variance in the way described above

I wonder if this modification is licit, is still this specific variance a good estimation for the real counts dispersion after the rLog transformation?

Thank you all in advice for your attention, best regards.

Mattia F.

deseq2 rlog transformation • 1.3k views

ADD COMMENT • link updated 7.7 years ago by Wolfgang Huber ★ 13k • written 7.7 years ago by mattia.furlan776 • 0

score 0 · Answer 1 · 2016-08-01

0

Entering edit mode

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 11 days ago

EMBL European Molecular Biology Laborat…

Dear Mattia

The first procedure seems reasonable. Regarding the second, it is not the case that 2 ^ assay( "rlog data" ) is a reversion of the transformation.

However, there is a more important point: if you are looking at time course data, and if the number of sampled time points is such that you can expect smoothness, then you should use that for your inference and variance estimation. Namely, fit a smooth function (e.g. using a spline basis, you can do this with DESeq2), and use the residuals for variance estimation, in addition to the replicates.

Next time, please consider reporting the code as literally as possible, the more precisely you formulate your question, the more precisely others can answer.

Wolfgang

ADD COMMENT • link 7.7 years ago Wolfgang Huber ★ 13k

0

Entering edit mode

Dear Dr. Huber,

thank you for your prompt and kind reply.

Regarding the rLog transformation, I thought that the following expressions would be analogue ( even if not equivalent )

tpts <- c(0,0,0,1,1,1,2,2,2)
dds <- DESeqDataSetFromMatrix(countData = countData,
colData = colData,
design = ~ tpts)
dds <- DESeq(dds)

1)
normCountsExt<-counts(dds,normalized=TRUE)

2)
rlogDds<-rlog(dds,blind = TRUE)
normCountsExt<-2^assay(rlogDds)

because, according to the documentation, rLog(K_ij) = log2(q_ij) ( 2^log2(q_ij) = q_ij ) while counts( ... , normalized = TRUE) = K_ij/s_ij ~ q_ij.

In other words, I believed that the previous expressions would be indicative of the same quantity, specifically "proportional to the concentration of cDNA fragments from the gene in the sample, scaled by a normalization factor".

Numerically speaking I found that these two transformations return, indeed, very similar profiles.

Did I commit some conceptual mistake founding just a fortuitous similarity among profiles?
If not, is in your opinion the following variance evaluation a good estimation for both the procedures, even after count regularization?

normCounts<-t(apply(normCountsExt,1,function(x)tapply(x, tpts, mean)))

mu<-assays(dds)[['mu']]
alpha<-dispersions(dds)
varCounts<-t(sapply(1:nrow(dds),function(gene){mu[gene,]+alpha[gene]*mu[gene,]^2}))
varCounts<-t(apply(varCounts,1,function(x)tapply(x, tpts, mean)))

Regarding the approach which you suggested, fitting the time course data with a smooth function to estimate the variance using the residuals, is it any reference material available which I may consult in order to understand how to proceed through DESeq2?

Thank you very much for your consideration Dr. Huber; best regards.

ADD REPLY • link 7.7 years ago mattia.furlan776 • 0

0

Entering edit mode

Dear Mattia

If 1) and 2) were equivalent, we would hardly need a new function and a new name for it - rlog would be just the same as log2. It is not. Please have a look at the DESeq2 paper to read about shrinkage of the coefficients (betas). For highly expressed genes, it can (should) be that the results are very similar; but not so for genes with small counts.

Reference for smoothing: http://www.pnas.org/content/102/36/12837.full (John Storey et al.), and the limma User's Guide Section 9.6 Time Course Experiments, Subsection 9.6.2 Many time points -- the code suggested there can be used analogously for the GLM in DESeq2.

ADD REPLY • link 7.7 years ago Wolfgang Huber ★ 13k