Hi, first of all I would like to thank you in advance for your help and suggestions.
I'm studying the changes in gene expression of 70 runners before and after a mountain race. The main objective is to study the changes in expression that are not explained by the cell counts (either because they come from cell types not included in the model, or because they come from free RNA in plasma). We also want to explain the changes in expression associated with each blood cell type.
Data
- Expression values before and after the race extracted from the peripheral whole blood.
- Cell count values of the main cells in the blood (erythrocytes, neutrophils, monocytes, basophils, eosinophils and platelets).
- Some possible confounding variables (age, sex, running time).
- 3 race categories of roughly equal size, each defined by the distance of the race (14 km, 35 km and 55 km).
Alternatives
1. First approach:
deltaG ~ 1 + deltaErythrocytes + deltaNeutrophils + ... + deltaPlatelets
where
deltaG = (GEafter - GEbefore)/GEbefore
deltaCell = (CellAfter - CellBefore)/CellBefore
That is, the response is the change in gene expression relative to the initial expression, and the predictors are the relative changes in the count of each blood cell type. The main problem with this approach is that the intercept also absorbs the differences in expression that come from changes in expression within each cell type. So we are only studying the changes in expression associated with changes in cell count, and not the changes in expression associated with cell count in a broader sense.
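A minimal sketch of the first approach for a single gene, using NumPy only. The data are simulated placeholders (not real measurements), and names such as `relative_change` are hypothetical; the fit is plain ordinary least squares rather than any particular differential-expression package.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 70
cell_types = ["Erythrocytes", "Neutrophils", "Monocytes",
              "Basophils", "Eosinophils", "Platelets"]

# Simulated placeholder data for ONE gene (assumption: all values are positive).
ge_before = rng.uniform(5.0, 10.0, size=n_subjects)
ge_after = ge_before * rng.uniform(0.8, 1.5, size=n_subjects)
cells_before = rng.uniform(1.0, 5.0, size=(n_subjects, len(cell_types)))
cells_after = cells_before * rng.uniform(0.7, 1.6, size=cells_before.shape)

def relative_change(after, before):
    """(after - before) / before, as in the deltaG and deltaCell definitions."""
    return (after - before) / before

delta_g = relative_change(ge_after, ge_before)            # shape (70,)
delta_cells = relative_change(cells_after, cells_before)  # shape (70, 6)

# Design matrix: intercept column plus one relative-change column per cell type.
X = np.column_stack([np.ones(n_subjects), delta_cells])
coef, _, rank, _ = np.linalg.lstsq(X, delta_g, rcond=None)

intercept = coef[0]   # expression change left over when all cell deltas are zero
slopes = coef[1:]     # one coefficient per cell type
```

Here `intercept` is the part of the relative expression change that remains when every cell-count delta is zero, which is where any change not tied to a change in cell count ends up.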
2. Second approach:
G ~ PRE + POST + SubjectID + ErythrocytesCount + NeutrophilsCount + ... + PlateletsCount
This is a model without intercept, where PRE (before race) and POST (after race) are dummy variables (PRE=1 and POST=0 for before-race values, and vice versa) and SubjectID identifies the runner. I would then test the contrast between PRE and POST. The result of this contrast can be interpreted as the change in expression between PRE and POST that is not explained by the cell counts.
The problem with this approach is that I can't study the differences in expression associated with each cell type. This model fits a single beta for each cell type, regardless of whether the sample is pre- or post-race.
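A rough sketch of how the design matrix and the PRE vs POST contrast of the second approach could be built, again with NumPy and simulated placeholder data for one gene. One assumption worth flagging: with no intercept, PRE + POST equals the sum of all subject indicators, so one subject indicator is dropped here to keep the matrix full rank.

```python
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_cells = 70, 6

# Two samples per subject: row order is (subject 0 pre, subject 0 post, ...).
subject = np.repeat(np.arange(n_subjects), 2)
post = np.tile([0.0, 1.0], n_subjects)
pre = 1.0 - post

# Subject indicators, dropping subject 0 to avoid collinearity with PRE + POST.
subj_dummies = (subject[:, None] == np.arange(1, n_subjects)[None, :]).astype(float)
cells = rng.uniform(1.0, 5.0, size=(2 * n_subjects, n_cells))

# Simulated expression with a known POST - PRE shift (placeholder values).
true_shift = 0.5
subject_effect = rng.normal(0.0, 1.0, n_subjects)
cell_betas = rng.normal(0.0, 0.2, n_cells)
g = (8.0 + true_shift * post + subject_effect[subject]
     + cells @ cell_betas + rng.normal(0.0, 0.1, 2 * n_subjects))

X = np.column_stack([pre, post, subj_dummies, cells])
coef, _, rank, _ = np.linalg.lstsq(X, g, rcond=None)

# Contrast POST - PRE: +1 on the POST column, -1 on the PRE column.
contrast = np.zeros(X.shape[1])
contrast[0], contrast[1] = -1.0, 1.0
post_minus_pre = contrast @ coef

# Keeping all 70 subject indicators would make the design rank deficient.
all_dummies = (subject[:, None] == np.arange(n_subjects)[None, :]).astype(float)
X_full = np.column_stack([pre, post, all_dummies, cells])
rank_full = np.linalg.matrix_rank(X_full)  # one less than the number of columns
```

In this simulation `post_minus_pre` recovers a value close to `true_shift`, because the subject effects and the shared cell-count slopes are adjusted for in the same fit.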
3. Third approach:
G ~ PRE + POST + SubjectID + ErythrocytesCount:PRE + ErythrocytesCount:POST + ...
With this model I could test the contrast between PRE and POST (the same as in the second model), and also the contrast between Cell_PRE and Cell_POST for each cell type. This model could be more accurate because most cell counts change significantly between before and after the race and therefore follow different distributions. The only drawback I see is that I am doubling the number of cell-count covariates (two per cell type), thus reducing statistical power.
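The third approach can be sketched the same way: each cell-count column is split into a PRE copy and a POST copy, and the Cell_PRE vs Cell_POST contrast is the difference between the two slopes. The data below are simulated placeholders for one gene, the column layout is an assumption of this sketch, and one subject indicator is dropped to keep the design full rank.

```python
import numpy as np

rng = np.random.default_rng(2)
n_subjects = 70
cell_types = ["Erythrocytes", "Neutrophils", "Monocytes",
              "Basophils", "Eosinophils", "Platelets"]
n_cells = len(cell_types)

subject = np.repeat(np.arange(n_subjects), 2)
post = np.tile([0.0, 1.0], n_subjects)
pre = 1.0 - post
subj_dummies = (subject[:, None] == np.arange(1, n_subjects)[None, :]).astype(float)
cells = rng.uniform(1.0, 5.0, size=(2 * n_subjects, n_cells))

# Count:PRE and Count:POST terms: the count is zeroed outside its own time point.
cells_pre = cells * pre[:, None]
cells_post = cells * post[:, None]

# Simulated expression: the neutrophil slope differs between PRE and POST.
b_pre = np.full(n_cells, 0.3)
b_post = b_pre.copy()
k = cell_types.index("Neutrophils")
b_post[k] = 0.8
subject_effect = rng.normal(0.0, 1.0, n_subjects)
g = (8.0 + 0.5 * post + subject_effect[subject]
     + cells_pre @ b_pre + cells_post @ b_post
     + rng.normal(0.0, 0.1, 2 * n_subjects))

X = np.column_stack([pre, post, subj_dummies, cells_pre, cells_post])
coef, _, rank, _ = np.linalg.lstsq(X, g, rcond=None)

# Column offsets of the two interaction blocks in X.
pre_start = 2 + subj_dummies.shape[1]
post_start = pre_start + n_cells

# Contrast Cell_POST - Cell_PRE for neutrophils.
contrast = np.zeros(X.shape[1])
contrast[post_start + k] = 1.0
contrast[pre_start + k] = -1.0
slope_diff = contrast @ coef
```

One thing this sketch makes visible: once the slopes differ by time point, the PRE and POST coefficients are intercepts at a cell count of zero, so the PRE vs POST contrast only matches the one from the second model if the counts are centered in some agreed way.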
Summing up
My questions are:
- Should I discard the first approach?
- Between the second and third method, which do you think is more appropriate? Or, do you think there is a more appropriate alternative?
- Should I include the confounding variables in the model? What worries me about including such variables is that the statistical power decreases even more, considering that in the third model I am already including 15 explanatory variables and I have 140 samples (70 subjects, each measured before and after the race).
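To make the power concern concrete, here is a rough count of residual degrees of freedom for the third model, assuming SubjectID enters as one indicator per runner (with one dropped) and using random placeholder cell counts. It also checks what happens when a covariate that is constant within a runner, such as age, is appended; the `age` values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_subjects, n_cells = 70, 6

subject = np.repeat(np.arange(n_subjects), 2)
post = np.tile([0.0, 1.0], n_subjects)
pre = 1.0 - post
subj_dummies = (subject[:, None] == np.arange(1, n_subjects)[None, :]).astype(float)
cells = rng.uniform(1.0, 5.0, size=(2 * n_subjects, n_cells))

# Third-model design: PRE, POST, subject indicators, Count:PRE and Count:POST.
X3 = np.column_stack([pre, post, subj_dummies,
                      cells * pre[:, None], cells * post[:, None]])

n_samples = X3.shape[0]
rank3 = np.linalg.matrix_rank(X3)
resid_df = n_samples - rank3   # degrees of freedom left for the error variance

# A per-runner covariate (hypothetical ages) takes the same value in both rows.
age = rng.integers(20, 60, size=n_subjects).astype(float)[subject]
X3_age = np.column_stack([X3, age])
rank_age = np.linalg.matrix_rank(X3_age)  # unchanged: age is aliased with SubjectID
```

Under these assumptions most of the degrees of freedom are spent on the subject indicators rather than on the cell-count terms, and a covariate that does not vary within a runner adds no new information once SubjectID is in the model.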
Any comments, observations or advice are very welcome. Thank you very much in advance.
Kind regards,
Pol
Thank you very much for your recommendation. I will follow the third approach and try to compare it with other methods for testing DE on a cell-type-specific basis. Kind regards,
Pol