Nanostringdiff - defining contrast values for blocking variables in glm
2
0
Entering edit mode
reberya • 0
@95c489ea
Last seen 7 months ago
United States

Hello!

Long time lurker on Bioconductor but it is finally time to make an account and ask a question. I have experience using DESeq2/edgeR/limma in the past and have always found it relatively straight forward to account for variables that are not necessarily of interest - i.e. technical effects. An example of this would be controlling for a mouse cage in an RNA-seq experiment like so:

design <- model.matrix(~treatment + cage)
fit <- lm(~expr + treatment + cage)


In my specific example I wish to control for cycle. So setting up my design as so:

design = model.matrix(~ response + cycle)
(Intercept) B C1LATE C2EARLY C3EARLY C4EARLY C5EARLY SCREEN
1           1 1      0       0       1       0       0      0
2           1 1      0       0       0       0       0      1
3           1 1      0       0       0       0       0      0
4           1 1      0       1       0       0       0      0
5           1 1      0       0       0       1       0      0
6           1 1      0       0       0       0       0      0


I then effectively wish to run a linear model similar to:

fit <- lm(~expr + response + cycle)


However, I'm unsure how to set this up with contrasts in nanostringdiff. My best guess would be to weight the design as so, however I'm not sure if this is accounting for any of the cycle variables since they are given a value of 0 for contrasts.

contrast <- c(-1,1,0,0,0,0,0,0)
result <- glm.LRT(NanoStringData = cur, design.full = design, contrast = contrast)


Any resources or guidance would be greatly appreciated! Ryan

limma edgeR DESeq2 NanoStringDiff • 628 views
1
Entering edit mode
@james-w-macdonald-5106
Last seen 7 hours ago
United States

Technically you would use

contrast <- c(0,1,0,0,0,0,0,0)


Because in your parameterization the 'B' column estimates the B - A difference (presuming the other response is 'A'). But that might not work, so you are probably better off setting it up as a cell means model, and compute the contrast using an actual contrast.

design <- model.matrix(~ 0 + response + cycle)
contrast <- c(-1,1,rep(0,6))


And proceed from there.

0
Entering edit mode

@james-w-macdonald-5106 and @gordon-smyth thank you very much for your prompt replies!

To follow up: From the nanostringdiff documentation, it seems that using this model

design = model.matrix(~ response + cycle)


I can effectively control for cycle using either

# as shown by @james-w-macdonald-5106
contrast <- c(0,1,0,0,0,0,0,0)
result <- glm.LRT(NanoStringData, design, contrast)


or

# As shown by @gordon-smyth
result <- glm.LRT(NanoStringData, design, Beta=2)


As both of these methods will test whether the 2nd column is equal to zero in my model. My remaining questions are:

1. Do either of these methods (which I believe to be equivalent) fully account for the cycle variables? Mathematically I am unsure how this works given that their coefficients are set to zero.
2. How do the above referenced models differ from the 2nd model mentioned by @james-w-macdonald-5106 which has a forced intercept of zero. From my experience, this is typically only desirable when you always want your prediction to be positive (which it would be fair to say that in the case of gene expression this is true)
# forced zero intercept
design <- model.matrix(~ 0 + response + cycle)
contrast <- c(-1,1,rep(0,6))


Again, thank you very much for your help.

Ryan

1
Entering edit mode

You are confusing linear regression and ANOVA. In linear regression, you have a continuous covariate and in that context the intercept is, as you mention, the value at which the continuous covariate is equal to zero. And in that context if you set the intercept to zero, you are forcing the intercept to be zero.

In the context of an ANOVA, and in particular in R, if you have an intercept then you are fitting a model using a parameterization in which the intercept estimates the mean of one group, and all the other coefficients for that factor are then differences between the given factor level and the baseline. If you fit a model without an intercept, then you are fitting a model that simply estimates the mean of each group. So if you fit a model using the former parameterization you don't necessarily need to compute a contrast, because all of the coefficients except the intercept are already comparisons. But if you use the latter model then you do need a contrast because the mean of any one group is uninteresting.

If you add in an additional factor, then you are 'controlling for changes in that factor', which in R means that the intercept is now estimating the mean of a baseline group that includes the 'cycle' factor as well. You can tell which one it is because it won't be a column name for your design. As an example,

> trt <- factor(rep(c("trt","control"), each = 3, times = 6))
> cycle <- factor(rep(paste0("cycle", 1:6), 6))
> d.f <- data.frame(trt, cycle)
> colnames(model.matrix(~trt + cycle, d.f))
[1] "(Intercept)" "trttrt"      "cyclecycle2" "cyclecycle3" "cyclecycle4"
[6] "cyclecycle5" "cyclecycle6"


In that model, the baseline is control, cycle 1. And the second coefficient estimates the difference between treatment and control for cycle 1. And all the other cycle coefficients estimate the difference between cycle N and cycle 1 (so cyclecycle2 is estimating cycle 2 - cycle 1). This implies that you can capture the difference between cycles using the mean (e.g., the variability within each group is comparable), If that assumption holds, then the second coefficient estimates the difference between treatment and control after adjusting for cycle differences.

0
Entering edit mode

Hi James,

Your point about the linear model vs ANOVA is well taken and your explanation of the ANOVA method estimation using the difference between group means make sense. The only thing I'm wondering is whether your points about the ANOVA is applicable to the model that I'm fitting using glm.LRT() which fits a generalized linear model of the negative binomial family. Wouldn't my points about forcing coefficients be applicable since I'm using linear regression?

Also - I briefly looked at your bio and saw you were a UM fan - any mixed feelings with Saturday's game vs UW?

1
Entering edit mode

A GLM is just an extension of classical linear regression, and if you use factor levels for your predictors it's analogous to ANOVA. The interpretation of the coefficients doesn't change, it's just that the null distribution changes from a normal to (in this case) a negative binomial, so the inference can't be based on a normal or t-distribution. And a likelihood ratio test tells you if the model with the coefficient of interest fits the data better than a model without that coefficient (which is equivalent to testing if that coefficient is plausibly zero). You could use a Wald test instead (which is what DESeq2 uses), and that looks exactly like a t-statistic, which we already know tests for evidence that the coefficient is plausibly zero. So from a user's perspective you can just think about what you are doing in the context of an ANOVA.

I do have mixed feelings about the game. I've had season tickets for years now, and just can't bring myself to go sit with a bunch of people (for whom the naive expectation in MI is 49% unvaccinated) who are bellowing at the top of their lungs for ~3 hours straight. Sounds like a super-spreader event to me. And also Giles Jackson went to UW via the transfer portal, and I don't really like rooting against UM players, even those who transferred. I work for UW, but my fan loyalty is to UM, so GO BLUE!

1
Entering edit mode

Err. The null distribution used is a Chi-square. It's the distribution of the residuals that are assumed to be negative binomial...

1
Entering edit mode
@gordon-smyth
Last seen 3 hours ago
WEHI, Melbourne, Australia

The NanoStringDiff syntax and the internal code that deals with contrasts seem very closely modeled on that of edgeR so you can use the same arguments for NanoStringDiff that you would use for edgeR. If you type ?glm.LRT you can see that the descriptions of the arguments are word-for-word identical to the descriptions of the corresponding arguments for the edgeR function glmLRT. In NanoStringDff the Beta argument corresponds to the coef argument in edgeR. In your case, you would use

result <- glm.LRT(NanoStringData, design, Beta=2)


which is closely analogous to the edgeR call

resuls <- glmLRT(Data, design, coef=2)


for the same model.