Question: limma lmFit coefficients and p-values
3
3.0 years ago by
CPR30
Chicago
CPR30 wrote:

I'm running a series of linear regressions on a subset of candidate genes using lmFit(.) from the limma package. Two, hopefully simple questions:

1. I am trying to create a matrix of residuals controlling for another factor (sex) so I can look at the effects of a series of variables in a loop. Does it still make sense to use the fit <-eBayes(fit), as I've seen recommended?

I ask because I don't entirely understand how eBayes is creating the variance, and whether the 'residuals' in this instance really make sense. I've noticed the unadjusted t- and p-values are slightly different from if I just run a simple linear regression for any given probe.

2. Assuming I can proceed with the above residuals and have run my analyses using the residuals matrix, I print the top results. I have some options for coef(.).  I've read the manual but am still a bit confused. Is the first coefficient (1) the intercept? If I am interested variable x, should I simply ask for the second coefficient (2). If I put coef(1, 2), I get quite different p-values than with coef(2)alone, with important effects on the significance of the adjusted p-values.

BONUS QUESTION:

3. Finally, if I am interested in the relationship between methylation and a variable with several levels (i.e. low, medium, high, very high), how ask the basic question "are the levels different", without asking "is low different from medium/high/very high", a la anova?

Hope these are not overly trivial questions and are helpful to someone else.

Many thanks!

Calen

limma methylation ewas • 3.2k views
modified 3.0 years ago by Steve Lianoglou12k • written 3.0 years ago by CPR30

The coef(2) call doesn't refer to anything that's I know of. Did you mean coef=2? And coef=c(1,2)? These will obviously give you different results as they are describing different contrasts.

Right, coef(2) was a syntax error for coef = 2, my apologies.

Where I'm struggling a bit is that you say coef = 2 is not equivalent to coef = c(1, 2). Why? Isn't a two level factor always compared to the value of the intercept in a way? So in summary(lm(model) the intercept will be for one of the two factor levels (e.g. sex = male), and the variable sex in the model will say sexF, and imply that there is a comparison to the intercept, which here is coef = 1 (so coef = c(1, 2))?

Sorry if I'm way off here, or if I'm making weird comparisons (lm(.) to lmFit(.)), just want to make sure I understand what is going on for the contrasts.

3

If you want to know which coefficient is which, you need to look at the column names of your design matrix. Unless you've done something weird or specifically excluded the intercept, the first column should be your intercept. The coef argument indicates which coefficients you are setting to zero in your null hypothesis. Assuming that your first coefficient is the intercept, you definitely don't want to test the hypothesis that your intercept is equal to zero, because this hypothesis has no biological interpretation and will be rejected for nearly 100% of probes. So you shouldn't be putting 1 in the coef argument.

Unlike lm, limma does not hide the design matrix from you, but rather requires you to manipulate it directly. This is because the hypotheses being tested with limma often involve individual coefficients from multi-level factors that cannot be expressed simply by dropping terms from the design formula, which is how one normally uses lm. So specifying your null hypothesis at the level of individual coefficients rather than design terms tends to be more convenient.

Ok awesome that makes sense, thanks @Ryan C. Thompson!

Answer: limma lmFit coefficients and p-values
3
3.0 years ago by
Denali
Steve Lianoglou12k wrote:

Let's use your bonus question to frame the basis for analysis. Let's say you want to look for differential methylation given drug dose ('low', 'medium', 'high') and you want to control for sex. Assume we have 6 replicates for each (3 male, 3 female) and that our data is in a matrix called E. You would setup your design like so:

dose <- factor(rep(c('low', 'medium', 'high'), each=6),
c('low', 'medium', 'high'))
sex <- rep(c('f', 'm'), 9)
des <- model.matrix(~ dose + sex)
#    (Intercept) dosemedium dosehigh sexm
# 1            1          0        0    0
# 2            1          0        0    1
# 3            1          0        0    0
# 4            1          0        0    1
# 5            1          0        0    0
fit <- eBayes(lmFit(E, des))
anova.res <- topTable(fit, 2:3, n=Inf, sort.by='none')

Now anova.res has your anova-like test across dose controlled for sex. A few take home points to address your other questions:

1. I understand it's generally preferable to include the batch factor in your design rather than trying to remove it and work over the residuals. Why you ask? I reckon it has something to do with the "true" degrees of freedom, ie. if you regress out sex and fit a model over the residuals using just your dose covariate, your design has one-less column/parameter and you therefore "buy" (steal(?) :-) yourself an extra degree of freedom which really isn't there; (take that guess with a grain of salt, since I'm not a card carrying statistician)
2. It sounds like you only want to test a (small(?)) subset of regions/probes/whatever (ie. rows of E). That's fine, but you want to fit the model on all of your data, then test the smaller subset (ie. you can just extract the subset and readjust the pvalues via calling p.adjust(res$P.Value) on the redueced set of results). Or subset out your fit object to the relevant rows, then run topTable on that. 3. (For your question two) The first coefficient is whatever it is in your design matrix. Here, my first coef is the (Intercept) which you wouldn't really want to test against. Here I set coef to 2:3 (ie. dosemedium and dosehigh), which is how I get my ANOVA results. There are other ways to get the same if you setup a contrast matrix (which you can see examples of in the user guide). ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by Steve Lianoglou12k That's so much @Steve Lianoglou for such a prompt and thorough response! My query is on an a priori set subset of candidate genes (~250 CpG probes total), so I only ran the lmFit for those probes. Fairly straightforward there. In some cases I'm also running a DMR analysis after the lmFit, which requires that I first "control" for any other variables of interest (i.e. sex), since it can't run the DMR analysis for more than one variable (i.e. sex AND dose) at a time. I am assuming that for a two level factor (e.g. Sex, 0 and 1 representing male and female), the reference group is baseline (0) for that variable. I'm just asking because say for a regular lm(.) in R the reference level is the intercept, correct? Which is why I was a bit confused why a contrast of coef = c(1, 2) isn't comparing the second variable to the baseline (the intercept). The one thing I'm a still a bit hazy on, and would like to learn a bit more of is what eBayes(.) is doing. If yourself of anybody can recommend an introductory treatment on that so I can at least feel comfortable with the bare essentials, I would be much obliged. Thanks and have a great day! ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by CPR30 3 It sounds like you want to use limma to perform batch correction for covariates (e.g. sex) and then feed the residuals into some other tool to perform the actual DMR analysis. I would recommend against this for the reasons that Steve has described: your p-values will be overconfident due to overcounting the degrees of freedom in the residuals. Unless there's something else preventing it, the preferred approach is always to fit a single model with the factor of interest (dose) and all relevant nuisance/batch covariates (sex). Then you can perform your hypothesis tests using contrasts within this model, and the tests will be performed while taking proper account of the batch covariates, including adjusting the degrees of freedom to reflect the fact that additional coefficients are being estimated for these covariates. As for your confusion about the coef argument, you are confusing both the purpose of that argument and the way in which R encodes categorical variables into the design matrix. You are correct that the baseline level of a factor is absorbed into the intercept, but the consequence of this is that each other coefficient already represents a fold change relative to the baseline level. So if male is the baseline level for sex, then the coefficient for female represents the difference (i.e. log fold change) between male and female. The same applies for low/medium/high dose: assuming low is the baseline, the coefficients for medium and high dose represent the difference relative to low dose. What this means for you is that for many contrasts that you might want to make, such as male vs female or low dose vs high dose, there is a coefficient in the model that already encodes that contrast, and all you have to do is select that coefficient for the coef argument to topTable. If you want a contrast that is not already encoded as a coefficient (e.g. medium dose vs high dose), then you need to construct a contrast matrix to represent that contrast, using the makeContrasts function. Lastly, if you want to know more about eBayes or any of the other methods in limma, I recommend you follow the advice given in citation("limma"): start with the 2015 paper, which gives a general overview, and then look in the Limma User's Guide Section 2.1: "Citing limma", which gives citations for all the methods implemented in limma. ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Ryan C. Thompson7.3k 1 As I mentioned, you really want to fit the data to all of your probes (except ones that are failing some sort of qc or something (sorry, never really worked with methylation array data), then just test over a subset. It's because the eBayes procedure borrows information from all of your data to get better variance estimates for each individual observation (probe). Take a look at the blog post about empirical bayes esimtation usinb baseball statistics for a non-genomics tutorial of these ideas. There the author is using an empirical bayes approach to get better estimates of a given batter's batting average, here you're using an empirical bayes approach to get a better estimate of a particular gene's (probe's) variance. I don't understand what you mean when you say this: In some cases I'm also running a DMR analysis after the lmFit, which requires that I first "control" for any other variables of interest (i.e. sex), since it can't run the DMR analysis for more than one variable (i.e. sex AND dose) at a time All I can say is that you should be doing your DMR analysis based on the same design matrix you used in your lmFit. If your'e still not sure how to proceed, we need to stop talking in the abstract and start talking about your actual data. Show us code that will setup the design of your experiment in the way that I did above and let us know what are the difference you actually want to test. Or show us the $targets of your EList ... something.

Thanks for your help and patience Steve. That makes a lot of sense about using ALL probes for eBayes(.). I'll make sure to keep that in mind from now on, and try the approaches (subset 'fit' or readjust p-values) you described.

With respect to the DMR analysis I mentioned, I'll clarify:

In another approach to the same data, I'm using the package seqlm to look for differentially methylated regions (DMRs) with respect to a variable, here birthweight.

seqlm can only create the segments and find significant regions for one variable. Therefore if I want to 'control' for other factors (i.e. sex), I have to model this first, create a matrix of the residuals, and then run the seqlm DMR analysis with that. Does that make sense? It looks like this:

# design and model fit...

dim(matrix2) # 23058 373
design <- model.matrix(~  sex)
fit2 <-lmFit(matrix2, design, na.action = na.omit)
fit2 <- eBayes(fit2, trend = TRUE)

# take residuals of sex
res.matrix2 <-residuals.MArrayLM(fit2, matrix2)

dim(res.matrix2 ) # 23058 373

rownames(res.matrix2 ) <-rownames(mvals.var.drop2) # given rownames to new matrix

Now I can run the seqlm DMR analysis using the matrix of residuals:

segments = seqlm(values = res.matrix2, genome_information = methinfo, annotation =  focal.bw2$birthweight) seqlmreport(segments[1:10], values = res.matrix2, genome_information = methinfo, annotation = focal.bw2$birthweight, dir = temp)

This is a separate analysis, but also why I was asking about eBayes(.). I tried to use lm.fit(.), but this only works with no NAs, something I have in my data.

Again, thanks for your help - it's amazing how much a few comments can clarify things!

1

Ideally, you should get the author of the seqlm package to implement support for more complex designs. The underlying functions used by seqlm support arbitrary designs, just like limma and lm, but seqlm only uses them to fit 2-group designs. Alternatively, you could use limma to get a p-value for each probe and then use a method like comb-p to combine the p-values of adjacent probes.

This is my first go at DMR analysis and so will try a few different approaches and try to figure out what the consensus is as to the 'best' one. Hopefully I can avoid the 'false' degrees of freedom issue Steve and yourself reminded me about when 'correcting', possibly by using another DMR analysis package.