I want to use limma's removeBatchEffect(covariates = tumor_purity) to correct my log2-transformed protein intensity values for a continuous variable, tumor purity.
I am wondering whether limma is built to do this properly. I ask because when I corrected a particular gene (log2-transformed before correction), I noticed that the expression values were pulled up tremendously.
If slopes of the regression lines are extreme, are data correction values always extreme?
Any tips for how this can be overcome?
The main purpose of removeBatchEffect is to adjust data for nuisance variables prior to plotting. A nuisance variable is one you include to control for something you think may affect the apparent gene expression but that is not itself of interest, for example differences in cell composition or the sex of the subjects. It is not statistically defensible to use removeBatchEffect and then fit a linear model to the resulting data, because you would be incorrectly determining the degrees of freedom for your model. Instead you should include any nuisance variables (like tumor purity) in your linear model along with whatever coefficients you do care to interpret.
By including the nuisance variables in your model, your results can be interpreted as differences between the tumor (and normal adjacent tissue or whatever) after adjusting for tumor purity. An obvious alternative is to use the sva or RUVSeq packages (assuming this is RNA-Seq data) to estimate surrogate variables that are meant to control for unobserved variability like the tumor purity, and then include those surrogate variables in your model.
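For what it's worth, here is a minimal sketch of what "include purity in the linear model" could look like with limma. The data are simulated, and the names `subtype`, `purity` and `y` are placeholders I made up for illustration, not anything from the original post.

```r
library(limma)

set.seed(1)
n_samples  <- 12
n_proteins <- 100

# Hypothetical sample annotation: two subtypes and a purity fraction per sample
subtype <- factor(rep(c("A", "B"), each = n_samples / 2))
purity  <- runif(n_samples, min = 0.3, max = 0.9)

# Simulated log2 intensities (proteins in rows, samples in columns)
y <- matrix(rnorm(n_proteins * n_samples, mean = 20, sd = 0.5),
            nrow = n_proteins,
            dimnames = list(paste0("protein", seq_len(n_proteins)),
                            paste0("sample", seq_len(n_samples))))
# Give every protein a purity slope of 2 so there is something to adjust for
y <- y + outer(rep(2, n_proteins), purity)

# Purity enters the design as an ordinary column next to the factor of interest
design <- model.matrix(~ subtype + purity)

fit <- lmFit(y, design)
fit <- eBayes(fit)

# Subtype B versus A, adjusted for purity
topTable(fit, coef = "subtypeB", number = 5)
```

The `subtypeB` coefficient is then the subtype difference at equal purity, and the residual degrees of freedom already account for the purity term.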
As explained by James, to adjust for purity, you simply include purity in the linear model. removeBatchEffect() is only for plotting.
If you do use removeBatchEffect, you need to specify the design argument. It would also be a good idea to mean-correct the purity variable if using it with removeBatchEffect, which will result in the same batch correction but prevent the base level of the expression values being changed. Please see ?removeBatchEffect.
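To make the point about mean-correcting concrete, here is a rough sketch on simulated data (all object names are hypothetical). The uncentred subtraction is done by hand in base R so that the arithmetic is visible. I am assuming nothing about whether a given limma version centres covariates internally, so the limma call below centres purity explicitly, which is harmless either way.

```r
library(limma)

set.seed(2)
n_samples  <- 12
n_proteins <- 50
subtype <- factor(rep(c("A", "B"), each = n_samples / 2))
purity  <- runif(n_samples, min = 0.3, max = 0.9)

# Simulated log2 intensities with a steep purity slope (5 log2 units per unit of purity)
y <- matrix(rnorm(n_proteins * n_samples, mean = 20, sd = 0.3), nrow = n_proteins)
y <- y + outer(rep(5, n_proteins), purity)

# The design holds the biology to be protected
design <- model.matrix(~ subtype)

# Per-protein purity slopes from the full model, fitted with base R
slopes <- lm.fit(cbind(design, purity), t(y))$coefficients["purity", ]

# Subtracting slope * purity moves every value by roughly slope * mean(purity);
# subtracting slope * (purity - mean(purity)) leaves each protein's mean alone
manual_uncentred <- y - outer(slopes, purity)
manual_centred   <- y - outer(slopes, purity - mean(purity))

# limma call with design and an explicitly centred covariate
limma_centred <- removeBatchEffect(y, covariates = purity - mean(purity),
                                   design = design)

# Change in each protein's mean after correction
shift_uncentred <- rowMeans(manual_uncentred) - rowMeans(y)
shift_centred   <- rowMeans(limma_centred) - rowMeans(y)
range(shift_uncentred)
range(shift_centred)
```

So a steep slope does mean a large shift when the covariate is not centred, because the shift is minus the slope times the mean purity. With centring, the slope only changes how samples move relative to each other.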
Thank you.
I understand that it is not statistically defensible to use removeBatchEffect and then fit a linear model to the resulting data. But would it be statistically defensible to use removeBatchEffect and then apply tests like t-tests, enrichment scores, etc. to the resulting data? These tests calculate p-values for the data but do not fit a linear model to the data.
In addition, do you have experience with applying removeBatchEffect only to genes whose regression line showed statistical significance?
Thanks
The issue here has nothing to do with t-tests versus linear models. It is not statistically defensible to use removeBatchEffect and then to apply any statistical test of any sort. In any case, a t-test is exactly equivalent to fitting a linear model with two groups.
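The equivalence is easy to check in base R. This is a small sketch on simulated values for a single gene (the names are made up). Note that it is the pooled-variance t-test (`var.equal = TRUE`) that matches the linear model exactly; the Welch version that `t.test` uses by default is a slightly different test.

```r
set.seed(3)
group <- factor(rep(c("A", "B"), each = 8))
expr  <- rnorm(16, mean = 20) + ifelse(group == "B", 1, 0)

# Classical two-sample t-test with pooled variance
tt <- t.test(expr ~ group, var.equal = TRUE)

# The same comparison written as a linear model with a two-level factor
lm_summary <- summary(lm(expr ~ group))

# The p-values agree, and the t-statistics agree up to sign
c(t_test       = tt$p.value,
  linear_model = lm_summary$coefficients["groupB", "Pr(>|t|)"])
```

Both use 14 residual degrees of freedom here. If purity had been regressed out of `expr` beforehand, neither test would know that one more parameter had already been estimated from the same data.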
I am not understanding what is stopping you from doing the analysis properly, by including any necessary covariates in the linear model, which is easier and better than cooking the data with complicated and statistically dubious pre-processing.
Here's why: I am not ONLY interested in doing differential expression analysis with my data. While that is one test that I am going to do using limma, I want to have a dataset that is corrected for tumor purity that I can use for other types of testing, such as KEGG enrichment analysis, that are NOT performed on differential expression results.
I am using removeBatchEffect with design to preserve the variance associated with my biological variables (molecular subtype, grade) and then using tumor purity as the covariate in the model. I wish to receive a corrected matrix that I can use for downstream analysis that is not just DE.
I guess I'd like some further information on the use of the covariates argument within limma for a continuous variable, when there is no batch being corrected for.