I have a dataset in which I want to examine differential expression of genes in a single body fluid and its relationship with a histochemical outcome that is a repeated measure. I would prefer not to collapse this repeated measure into one variable for input into limma, because missing data would bias such a composite variable.
Therefore, I was wondering whether there is a statistical issue with modeling the genes as independent variables rather than as the dependent variable, so that my repeated measure could serve as the outcome. That way, could I then run a mixed model using dream in the variancePartition package, or use duplicateCorrelation in limma?
The repeated measure is a quantitative metric of post-mortem pathology across multiple sections. Not all patients have the same number of sections assessed, due to availability, etc. The genes/proteins are measured once, in the serum. The goal is to identify serum biomarkers of this pathological hallmark. Given that missing values are present in the pathological data, I was concerned about generating a composite score to use as an independent variable.
My question is therefore how to address this: could the genes/proteins be tested one at a time, each as the independent variable in a mixed model, with the p-values adjusted by FDR? Any insight into why this would not make statistical sense would be helpful for me to understand. Other suggestions/options are much appreciated. Apologies for the naivety.
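To make the proposal concrete, one way the per-protein model could be written is sketched below. The notation is introduced here purely for illustration and is an assumption, not something taken from limma or variancePartition: $y_{ij}$ is the pathology score for section $j$ of patient $i$, $x_{ig}$ is the single serum measurement of protein $g$ for patient $i$, and $u_{ig}$ is a patient-level random intercept that accounts for the repeated sections.

```latex
% Sketch of the proposed model, fitted separately for each protein g.
% Assumes a random intercept per patient and Gaussian errors.
\begin{align}
  y_{ij} &= \beta_{0g} + \beta_{1g}\, x_{ig} + u_{ig} + \varepsilon_{ijg}, \\
  u_{ig} &\sim \mathcal{N}(0, \sigma_{u,g}^{2}), \qquad
  \varepsilon_{ijg} \sim \mathcal{N}(0, \sigma_{g}^{2}),
\end{align}
% Hypothesis tested for each protein g, with the resulting p-values
% then adjusted across all proteins:
\[
  H_{0}\colon \beta_{1g} = 0 .
\]
```

Under this formulation, patients with fewer assessed sections simply contribute fewer rows, so no composite score is needed.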
Such an analysis cannot be done in limma. Sorry, I cannot tell you how to do it or even whether it is possible.
Also, the reason why gene abundance is the dependent variable and is iterated over is that there are far more genes than there are samples in the majority of data sets. So, it is not possible to fit a linear model with all genes as covariates.
Thanks. The proposal was not to include all proteins as covariates in one model but to iterate with one protein serving as the independent variable in each model and then correcting the p-values from all models.
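As a rough illustration of the last step of that proposal, here is a minimal Python sketch of a Benjamini-Hochberg adjustment applied to one p-value per protein. It assumes the per-protein mixed models have already been fitted elsewhere; the function name `benjamini_hochberg`, the protein labels, and the p-values are all hypothetical and only show the mechanics of the correction.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values, one per protein-specific mixed model
# (made-up numbers, not results from any real analysis).
protein_p = {
    "protein_A": 0.001,
    "protein_B": 0.02,
    "protein_C": 0.03,
    "protein_D": 0.5,
}
adjusted = benjamini_hochberg(list(protein_p.values()))
protein_q = dict(zip(protein_p, adjusted))

for name, q in protein_q.items():
    print(f"{name}: raw p = {protein_p[name]:.3g}, BH-adjusted = {q:.3g}")
```

The adjustment only uses the list of p-values, so it does not matter which software produced them; whether the underlying per-protein models are sensible is a separate question.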
An alternative to that is to use the glmnet package to fit a regularized regression using all proteins at once.
Thanks! I was thinking about this but do not believe glmnet allows for regularized mixed models unless I am mistaken?
glmnet does not allow mixed models. It also serves a different purpose from what your aims seem to be: it is purely for prediction rather than for testing hypotheses.
Both true, so a poor suggestion on my part.
I am also unclear as to how one would use a single observation (the gene expression measures) as a predictor for repeated measures. Presumably the repeated measures of the histochemical outcome change over time (or why did you measure it repeatedly?), and trying to infer something about that process by regressing on a static value seems pointless. Ideally there would be repeated measures of the gene expression, and the goal would be to find changes in gene expression that vary as the histochemical outcome varies.