Hi. I have recently worked with both microarray and RNA-Seq data. For differential expression analysis of microarrays, I used limma with log2-transformed intensities as input. For RNA-Seq, I used DESeq2 with raw counts (derived from salmon) as input.
Why does limma want/require log2-transformed intensities while DESeq2 wants untransformed counts?
I'm aware that limma uses a different regression model than DESeq2 (linear model vs. negative binomial GLM) due to the different types of data (intensities vs. counts), so I'm more interested in why the input data should or shouldn't be transformed before the regression analysis.
DESeq2 doesn't want log2 transformed values because the negative binomial GLM already handles heteroscedasticity and the log link function ensures the model coefficients are log2 fold changes.
limma's choice of an ordinary linear regression model means that, to meet the homoscedasticity assumption, it needs to reduce the heteroscedasticity of the intensity data with a log transform. If this is true, why doesn't limma avoid requiring log2-transformed input data and simply use an appropriate GLM with a log link function? Or alternatively, why doesn't DESeq2 avoid the extra computation required to fit a GLM and simply log2 transform the counts, which are then used as input to a linear regression?
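To make the heteroscedasticity point concrete, here is a toy simulation (in Python rather than R, and with made-up multiplicative noise of constant coefficient of variation — an illustrative assumption about intensity-like data, not either package's actual model). On the raw scale the spread grows with the mean; after log2, the spread is roughly constant:

```python
import math
import random
import statistics

random.seed(1)

# Toy "microarray intensities": multiplicative log-normal noise with a
# constant coefficient of variation (an assumption for illustration only)
def intensities(mu, n, cv=0.25):
    return [mu * math.exp(random.gauss(0, cv)) for _ in range(n)]

low = intensities(100, 5000)       # low-expression "gene"
high = intensities(10_000, 5000)   # high-expression "gene"

# Raw scale: the standard deviation scales with the mean (heteroscedastic)
print(statistics.stdev(low), statistics.stdev(high))

# log2 scale: the standard deviations are nearly equal (homoscedastic)
print(statistics.stdev(math.log2(v) for v in low),
      statistics.stdev(math.log2(v) for v in high))
```

Note this equalizing effect depends on the noise being multiplicative; for Poisson-like count noise a log transform does not stabilize the variance the same way, which is part of why count data gets its own treatment (e.g. voom's mean-variance weights or a negative binomial GLM).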
I ask because I'm pretty sure these approaches give different results: using untransformed values as input to a linear regression with a log link is not the same as using log2-transformed values as input to a linear regression with the identity link. Does one of these approaches to data transformation have more justification than the other?
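That the two approaches disagree can be seen even in the simplest, intercept-only case (a toy Python sketch with simulated log-normal data — not either package's estimator). With a log link, the fitted intercept is the log of the mean; with log-transformed input and an identity link, it is the mean of the logs. By Jensen's inequality these differ whenever the data have any spread:

```python
import math
import random
import statistics

random.seed(0)

# Toy skewed, positive data (log-normal by assumption, as a stand-in
# for expression-like values)
y = [math.exp(random.gauss(5, 1)) for _ in range(100_000)]

# Intercept-only fit with a log link: the coefficient is log2(mean(y))
log_link_coef = math.log2(statistics.fmean(y))

# Intercept-only fit on log2-transformed data: the coefficient is mean(log2(y))
transformed_coef = statistics.fmean(math.log2(v) for v in y)

# Jensen's inequality: log of the mean exceeds the mean of the logs
print(log_link_coef, transformed_coef)
```

The gap between the two estimates grows with the variability of the data, so for noisy genes the choice of where the log enters (in the link vs. in the data) is not a cosmetic detail.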