Question: How to get the regression slope coefficient from DESeq2 analysis of continuous data
Edie Crosse wrote:

Hello,

I have used DESeq2 to find genes whose expression level correlates with a continuous variable, in this case the number of cells in a population (pop). We noticed a batch effect in our data that depends on the day we processed the experiment (day), so I set up my design as follows:

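    library(DESeq2)

    # Sample table: processing day (batch) as a factor, cell number as a continuous covariate
    cData <- data.frame(day = as.factor(df$day), pop = df[, x])
    rownames(cData) <- colnames(d)

    # Adjust for day, then estimate the effect of pop
    dds <- DESeqDataSetFromMatrix(countData = d, colData = cData, design = ~ day + pop)
    dds <- DESeq(dds)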

The output of the results gives you log2FoldChange which is the "per unit of change of that variable." I am wondering whether there is a way to extract or infer from the data the coefficient of the regression slope? We would like to be able to deduce the size of a cell population from gene expression levels within a tissue sample.

Many thanks! Edie

MacDonald is right...the value returned is the slope, no matter what the label is. You can verify this yourself by plotting the log2 of the normalized counts against your cell number; the number DESeq gives you should be the slope of that line. I've done that check myself against my own data, and it works out.
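The check described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the counts come from DESeq2's `makeExampleDESeqDataSet()`, and the `day` and `pop` columns are made-up stand-ins for the real sample table. Because DESeq2 fits a negative binomial GLM rather than ordinary least squares, the two slopes should be close but need not match exactly.

```r
# Sketch only: simulated counts and hypothetical covariates, not the poster's data.
library(DESeq2)

set.seed(1)
dds <- makeExampleDESeqDataSet(n = 200, m = 12)        # 200 genes, 12 samples

# Made-up covariates mirroring the design in the question
dds$day <- factor(rep(c("d1", "d2", "d3"), each = 4))  # processing day (batch)
dds$pop <- runif(12, min = 1, max = 10)                # hypothetical cell numbers
design(dds) <- ~ day + pop
dds <- DESeq(dds)

resultsNames(dds)                  # should list "Intercept" and "pop" among the coefficients
res <- results(dds, name = "pop")  # log2FoldChange = change in log2 expression per unit of pop

# Compare against a plain linear fit for one well-expressed gene
gene <- rownames(dds)[which.max(rowMeans(counts(dds)))]
fit_df <- data.frame(
  y   = log2(counts(dds, normalized = TRUE)[gene, ] + 0.5),
  day = dds$day,
  pop = dds$pop
)
fit <- lm(y ~ day + pop, data = fit_df)

c(DESeq2 = res[gene, "log2FoldChange"], lm = unname(coef(fit)["pop"]))

# Intercept and slope for the same gene, both on the log2 scale
coef(dds)[gene, c("Intercept", "pop")]
```

With real data, replace the simulated object with your own `dds`; the simulated counts have no true dependence on `pop`, so the slopes here will be near zero.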

Answer: How to get the regression slope coefficient from DESeq2 analysis of continuous d
James W. MacDonald wrote:

The canonical analysis for RNA-Seq is ANOVA, so the default for most software, including DESeq2, is to label the coefficient 'logFC'. In that setting it is definitely not the 'per unit change of that variable'; it is the log fold change between groups.

If you use a continuous variable, you still get the coefficient (in this case the regression slope, but still labeled 'logFC'), but now the interpretation is the log change in expression for a unit change in your continuous variable.
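Written out, the model behind this interpretation looks roughly like the following for a design of ~day+pop. The general form is the DESeq2 GLM; the subscript labels are chosen here for illustration.

```latex
% Gene i, sample j; s_j is the size factor, \alpha_i the dispersion
K_{ij} \sim \mathrm{NB}(\mu_{ij}, \alpha_i), \qquad \mu_{ij} = s_j \, q_{ij}

\log_2 q_{ij} = \beta_{i,0} + \beta_{i,\mathrm{day}(j)} + \beta_{i,\mathrm{pop}} \cdot \mathrm{pop}_j

% A one-unit increase in pop multiplies the expected normalized expression by
\frac{q_{ij}\,(\mathrm{pop}_j + 1)}{q_{ij}\,(\mathrm{pop}_j)} = 2^{\beta_{i,\mathrm{pop}}}
```

Here \beta_{i,pop} is the value reported in the log2FoldChange column for the pop coefficient, and \beta_{i,0} is the intercept on the log2 scale.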

Thank you very much for your response. We still have some questions about how to extract correlation coefficients that would allow us to predict cell population numbers from gene expression.

1) We need R or R2 (the correlation coefficient) and the significance of this coefficient. How does this relate to the logFC, which you mentioned is the slope coefficient? If we have a log2FC of 0.02, this gives a fold change of 2^0.02 = 1.014. How does this correspond to the correlation coefficient, which should be between 0 and 1?

2) Does the adjusted p-value show us how significantly our data fit the linear regression model?

3) Considering that the linear regression model is Y = B0 + B1 * X (where Y is the cell number, B0 is the y-intercept, B1 is the slope and X is the gene expression), is there a way to extract B0 from the data? And does B1 correspond to the logFC?

We don’t provide a correlation coefficient in DESeq2. The adjusted p-value helps you find a set of genes for which the FDR is bounded, given the specified model. It’s not a model fit statistic.

You may want to discuss this with a statistician.
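On question 1, a small base-R sketch may help separate the two quantities. It converts the log2 fold change from the question into a fold change, then simulates two hypothetical genes with the same slope but different amounts of noise. All numbers other than the 0.02 are invented for illustration, and `cor()` here is ordinary Pearson correlation computed outside DESeq2.

```r
# Base R only; simulated values, not real expression data.
log2fc <- 0.02
fold_change <- 2^log2fc   # about 1.014: expression ratio per one-unit increase in pop

set.seed(42)
pop <- seq(0, 100, length.out = 50)                     # hypothetical cell numbers
expr_tight <- 5 + log2fc * pop + rnorm(50, sd = 0.05)   # log2 expression, little noise
expr_noisy <- 5 + log2fc * pop + rnorm(50, sd = 2)      # same slope, much more noise

# The fitted slopes estimate the same underlying value of 0.02 ...
slope_tight <- unname(coef(lm(expr_tight ~ pop))["pop"])
slope_noisy <- unname(coef(lm(expr_noisy ~ pop))["pop"])

# ... but the correlation depends on how tightly the points follow the line
r_tight <- cor(pop, expr_tight)
r_noisy <- cor(pop, expr_noisy)

print(c(fold_change = fold_change,
        slope_tight = slope_tight, slope_noisy = slope_noisy,
        r_tight = r_tight, r_noisy = r_noisy))
```

The slope measures how much expression changes per unit of pop, while the correlation measures how closely the points follow the line, so one cannot be read off from the other. Note also that DESeq2 models expression as a function of pop, whereas the model in question 3 predicts cell number from expression; the intercept and slope of those two regressions are not interchangeable.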