Question

design matrix and correlate variable

mouton.alice ▴ 10

Hello Bioconductor community, I know my question might seem redundant, but I can't figure out a good answer to it even after browsing and googling around.

Here is the situation: I am doing a gene expression analysis and I see an association between expression and weight. I also have a batch effect, so 1) initially I did:

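library(limma)

# Additive model: weight is the covariate of interest, batch is a blocking factor
design <- model.matrix(~ weight + batch)
v_norm <- voomWithQualityWeights(dge3, design, plot = FALSE)
fit <- lmFit(v_norm, design)
vfit <- contrasts.fit(fit, coefficients = 2)  # keep only the "weight" coefficient
efit <- eBayes(vfit)
summary(decideTests(efit))  # counts of down / not significant / up genes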

This gives me many differentially expressed genes.

2) In my phenotypic data, weight and height are correlated. If I build the design on height only, I get 0 genes at FDR 0.05 but 700 at FDR 0.1.

My question is: do you think I should include height as a correlated variable as well? If they are correlated, I don't understand why I should include both, but I can't find an answer to that.

I tried designs such as:

design <- model.matrix(~ weight + height + batch)

or

design <- model.matrix(~ weight * height + batch)

but then I lose all the differentially expressed genes. I don't really understand what to do and would highly appreciate any suggestions.

Many thanks,
Best regards,
Alice

limma lmfit • 881 views

ADD COMMENT • link updated 4.9 years ago by Aaron Lun ★ 28k • written 5.0 years ago by mouton.alice ▴ 10

The Pearson correlation between weight and height is 0.8710369.

ADD REPLY • link 5.0 years ago mouton.alice ▴ 10

score 1 · Answer 1 · 2019-05-09

This is a standard case of a confounding variable. You can't distinguish between the "effects" of weight and height because they're so well correlated. Any linear model involving both terms will find it difficult to yield significant results for either, because if you drop either term in the null model, the other will just replace it.
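To make the loss of power concrete, here is a small base R simulation. It is only a sketch and does not use the poster's data: `size` is a hypothetical latent variable, and the sample size, noise levels and effect size are arbitrary assumptions, chosen so that weight and height end up about as correlated as in the question.

```r
set.seed(1)
n <- 40

# Hypothetical latent "body size" that drives both covariates (an assumption)
size   <- rnorm(n)
weight <- size + rnorm(n, sd = 0.4)
height <- size + rnorm(n, sd = 0.4)
r <- cor(weight, height)

# One simulated gene whose expression follows the latent variable
y <- 0.5 * size + rnorm(n)

fit_weight <- lm(y ~ weight)           # weight alone
fit_both   <- lm(y ~ weight + height)  # weight and height together

# Standard error of the weight coefficient in each model
se_weight_alone <- coef(summary(fit_weight))["weight", "Std. Error"]
se_weight_joint <- coef(summary(fit_both))["weight", "Std. Error"]

# Variance inflation factor for two correlated covariates
vif <- 1 / (1 - r^2)

print(c(correlation = r, vif = vif,
        se_alone = se_weight_alone, se_joint = se_weight_joint))
```

The variance of the weight coefficient is inflated by roughly `vif` once height is in the model, so its t-statistic shrinks even though the gene really is associated with body size. The same happens to height when weight is present.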

Now, I have mentioned "effects" in quotes because I doubt that you actually experimentally modified the height and weight. What you're really testing are associations between gene expression and weight/height, which changes the tone of the analysis considerably. In particular, you are probably not interested in the "effect" of weight itself - which you can't get anyway, because you weren't perturbing it experimentally.

Rather, you are probably interested in some unmeasured causal process or activity that happens to be correlated to weight. This provides a lot more freedom in how that process is modeled:

1. Use weight directly and ignore height.
2. Use weight + height together and do an ANOVA for both terms at once.
3. Use some function of weight and height to create a new term. For example, principal components regression performs PCA on the covariate matrix and takes the first PC as the "consolidated" covariate.

The first option is simplest and probably satisfactory given the tight correlation between weight and height.
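The three options could look roughly like this in limma. This is a sketch on simulated data, not the poster's analysis: the matrix `expr` stands in for the voom output (`v_norm` in the question), and `size`, the gene count and all numeric values are made-up assumptions.

```r
library(limma)
set.seed(1)

n <- 24
size   <- rnorm(n)                              # hypothetical latent variable
weight <- 70 + 10 * size + rnorm(n, sd = 4)
height <- 170 + 8 * size + rnorm(n, sd = 3.2)
batch  <- factor(rep(c("a", "b"), each = n / 2))

# Simulated log-expression matrix: 200 genes x n samples,
# with the first 20 genes associated with the latent variable
expr <- matrix(rnorm(200 * n), nrow = 200,
               dimnames = list(paste0("gene", 1:200), NULL))
expr[1:20, ] <- expr[1:20, ] + matrix(size, nrow = 20, ncol = n, byrow = TRUE)

# Option 1: weight only, moderated t-test on the weight coefficient
design1 <- model.matrix(~ weight + batch)
fit1 <- eBayes(lmFit(expr, design1))
top1 <- topTable(fit1, coef = "weight", number = Inf)

# Option 2: weight + height, tested jointly with a moderated F-test
design2 <- model.matrix(~ weight + height + batch)
fit2 <- eBayes(lmFit(expr, design2))
top2 <- topTable(fit2, coef = c("weight", "height"), number = Inf)

# Option 3: first principal component of the scaled covariates
pc1 <- prcomp(cbind(weight, height), scale. = TRUE)$x[, 1]
design3 <- model.matrix(~ pc1 + batch)
fit3 <- eBayes(lmFit(expr, design3))
top3 <- topTable(fit3, coef = "pc1", number = Inf)
```

Passing several coefficients to `topTable` gives one F-statistic per gene, which asks whether expression is associated with weight or height in any combination, rather than testing each term after adjusting for the other. With real counts, `expr` would be replaced by the object returned from `voomWithQualityWeights` for the matching design.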

Finally, I hope that you didn't decide that the weight was interesting after you looked at the data. This would be a textbook case of data dredging and will result in an increased number of false positives.

P.S. weight*height makes very little sense.