DESeq2, plotting residuals vs.fitted values
1
0
Entering edit mode
Mike Miller ▴ 70
@mike-miller-6388
Last seen 9.6 years ago

Dear All,

I would like to plot residuals vs. fitted values for some of the candidate genes, after fitting the model and testing with DESeq2 package. Can someone tell me how I could extract the computed residuals and the fitted values per gene? Thank you in advance!

Best, Mike

deseq2 • 4.3k views
ADD COMMENT
0
Entering edit mode

Hello Mike,

Thank you so much for taking the time to provide this answer.  I have a simple question, but unfortunately I lack the necessary statistical expertise to ask this question correctly, but I will try to do my best.

Using the equations above, I extracted the residuals and plotted a histogram of them.  The distribution of residuals did not appear to be normal -- they were quite leptokurtic (perhaps to be expected from a Poisson distribution)?

The quick question I had was if you were using a specialized wald test -- I thought wald test required normal data?

I tried to dig into the wald function in your fitBeta function, but got a little lost in the c implementation.

    betaRes <- fitBeta(ySEXP = countsMatrix, xSEXP = modelMatrix,

        nfSEXP = normalizationFactors, alpha_hatSEXP = alpha_hat,

        contrastSEXP = contrast, beta_matSEXP = beta_mat, lambdaSEXP = lambda,

        tolSEXP = 1e-08, maxitSEXP = 0, useQRSEXP = FALSE)

 

Perhaps the above described residuals (assays(dds)[["mu"]])  are different to modelMatrix described in the fitBeta function?  (My apologies if I'm completely off base here) -- I'm just taking a shot in the dark.  Any illumination you could provide would be greatly appreciated.

ADD REPLY
0
Entering edit mode

hi Philip,

"Using the equations above, I extracted the residuals and plotted a histogram of them.  The distribution of residuals did not appear to be normal -- they were quite leptokurtic (perhaps to be expected from a Poisson distribution)?"

DESeq2 uses a Negative Binomial GLM (see full details of our method in the preprint which is the citation of the package) and the residuals in a GLM are not in general expected to be Normally distributed. Take a look at one of the classic GLM texts:

McCullagh P. and Nelder, J. A. (1989) Generalized Linear Models. London: Chapman and Hall.

"The quick question I had was if you were using a specialized wald test -- I thought wald test required normal data?"

The Wald test is a test of the estimated coefficient divided by its estimated standard error, this is also described in the preprint, so I'd check there first. The Wald test does not require that the response variable (typically the 'y' or 'K' as we write in our preprint) be Normally distributed.

ADD REPLY
0
Entering edit mode

Thanks Mike for your quick response -- I'll be sure to read up on these and get back to you with any additional questions I might have!

ADD REPLY
2
Entering edit mode
@mikelove
Last seen 1 day ago
United States

On the scale of the counts, the fitted values are:

assays(dds)[["mu"]]

to get the fitted values on the common scale, you just need to divide each column by the size factor:

fitted.common.scale = t(t(assays(dds)[["mu"]])/sizeFactors(dds))

So then the residuals are

counts(dds, normalized=TRUE) - fitted.common.scale 

Updated: however, it's important to note that "residuals" make less sense in a GLM context than in an LM context. For example, the likelihood ratio test does not involve the sum of squared residuals but instead the log likelihood of the data given the coefficient estimates.

ADD COMMENT
0
Entering edit mode

Follow up question - it appears that sizeFactors for DESeqDataSet objects made with tximport are NULL, even after running the complete DESeq pipeline. This is presumably because the information in assays(dds)[["avgTxLength"]] somehow substitutes for this step. But in that case, how would one go about making the fitted.common.scale matrix created above? Or perhaps there is some more general way to extract residuals from DESeqDataSet objects? 

 

Thanks,

David

ADD REPLY
0
Entering edit mode

hi David,

You are right, tximport places the average transcript length information in the assays. This information is then used to adjust for sequencing depth using estimateSizeFactors, and then the combined information about normalization per gene and per sample is a matrix in normalizationFactors(dds).  This then accounts for both sequencing depth and potential changes to feature length across samples for a given gene.

So you could just divide the mu matrix by normalizationFactors(dds).

Note that residuals are not typically used in the GLM setting (see my updated note above).

ADD REPLY

Login before adding your answer.

Traffic: 949 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6