Controlling for confounding variables in bumphunter
1
0
Entering edit mode
gft219 • 0
@gft219-12695
Last seen 5.2 years ago

Hi,

I'm currently using the bump hunter function and I have one fixed effect (control and treatment) and three covariates I need to include in the model (sex, age, white blood cell count). How do I include these variables in the designMatrix and in the model?

Thank you.

minfi bumphunter confounders • 1.3k views
0
Entering edit mode
@james-w-macdonald-5106
Last seen 3 hours ago
United States

The bumphunter package just uses the model fitting code that exists in base R, so your question could be reformulated to

How do I specify a design matrix in R that includes my coefficient of interest, plus other confounders?

And the answer is to use model.matrix, and to note that in any model there isn't a difference between what you consider the effect of interest and any nuisance variables. In other words, you don't say 'I care about this thing, but I do not really care about these other things, but do want to adjust for them'. You just put them all in the model and then test the coefficient you care about for any differences (this is the purpose of the coef argument in bumphunter).

So in simple terms, using fake data

> sex <- factor(rep(c("M","F"), each  = 4))
> age <- c(30, 46, 54, 76, 13, 55, 43, 23)
> treat <- factor(rep(c("Treated","Control"), times = 4))
> wbc <- runif(8)*50

> design <- model.matrix(~age+sex+wbc+treat)
> design
(Intercept) age sexM        wbc treatTreated
1           1  30    1 12.5810146            1
2           1  46    1  2.9467961            0
3           1  54    1 21.2905942            1
4           1  76    1 45.6129080            0
5           1  13    0 13.7550424            1
6           1  55    0 47.4945138            0
7           1  43    0  0.8778084            1
8           1  23    0 44.9691603            0
attr(,"assign")
[1] 0 1 2 3 4
attr(,"contrasts")
attr(,"contrasts")$sex [1] "contr.treatment" attr(,"contrasts")$treat
[1] "contr.treatment"

And the fifth coefficient is the one you care about, so you would specify coef = 5.

But note that I am just showing an example and pointing out which coefficient is the one of interest. There is a huge difference between somebody showing you an example, and you understanding what you are doing. You cannot get that sort of knowledge from answers on a forum - you have to do your own reading. I really like Julian Faraway's book, which you can read for free. There are also like 5200 worked examples in the limma User's Guide, which uses model.matrix to specify various models. Plus a google search for 'design matrix R' turned up 3.8 million hits in .64 seconds, so there's evidently more where that came from.

0
Entering edit mode

@gft219 thanks for the question and @james thanks for answer and the links.But I was unable to understand if we have multiple targets under a variable;

{

sex <- factor(rep(c(“M”,”F”,”Un”), each  = 2))
age <- c(30, 46, 54, 76, 13, 55)
treat <- factor(rep(c("Treated","Control"), times = 3))
wbc <- runif(6)*50
design <- model.matrix(~age+sex+wbc+treat)

(Intercept) age sexM sexUn        wbc treatTreated

1           1  30    1    0  0.7787705            1

2           1  46    1    0 46.5060076            0

3           1  54    0    0 40.0520252            1

4           1  76    0    0 29.7359324            0

5           1  13    0    1  6.8423813            1

6           1  55    0    1 14.5680916            0

attr(,"assign")

[1] 0 1 2 2 3 4

attr(,"contrasts")

attr(,"contrasts")$sex [1] "contr.treatment" attr(,"contrasts")$treat

[1] "contr.treatment"

}

In this case if we run nee to run bumphunter I have the following 2 questions

1. if i need to select my coefficient of interest  be sex What should be the number to be used?

2. If I used coefficient of 6 (WBC: 5 from ur example) and does this consider difference with respect to sex as a whole or individually consider sexM sexUn?
0
Entering edit mode

1.) You don't have a single coefficient of interest, you have two, the sexM and sexUn coefficients, which estimate the difference in methylation between males and females, and unsexed(?) and females, respectively.

2.) Your coefficient 6 is the treatment factor, not WBC. Anyway, you have continuous covariates and factors in  your model and no interaction terms. It's a bit more complicated to talk about when you have two continuous covariates (because you are fitting a plane, rather than a line), so let's pretend you just have one (WBC).

In that situation, one way to describe what you are doing is to say that you are fitting a line on the female untreated samples (defined by the intercept and WBC terms, which are the estimated M-value at a WBC of zero and the slope of the line, respectively). This is your 'baseline'. The treatTreated coefficient estimates the change in intercept between the female treated and female untreated samples, and the sexM and sexUn estimate the change in intercept between the male and female untreated samples, and the Un and female untreated samples, respectively.

This isn't exactly true, as it makes it seem like you are fitting a line to just one subset of samples, then adding in the others later. Instead what is really happening is you are simultaneously estimating all these coefficients where you allow for different intercepts for some subsets of the samples.

Note that I haven't said anything about the slope (WBC) value for any of the other sample types (treatTreated, sexM, sexUn). This is because this model constrains the slope to be equal for all of those factors! You are plotting parallel lines, where they can go up or down relative to the 'baseline' of female untreated, but the slope can never change. You are in effect saying that the relationship between WBC and gene expression is always the same, regardless of sex or treatment. This may or may not be true, but it is a simplifying assumption.

So you could say that the treatTreated coefficient estimates the difference between the female treated and untreated samples. But it is equally valid to say that treatTreated estimates the difference between treated and untreated samples, after adjusting for differences due to sex (because you fit the slope based on all the samples, not just the females, but you do allow for different intercepts for different groups).