Question

Understanding voom behavior when model matrix is used with/without intercept for a continuous variable

0

Entering edit mode

atakanekiz ▴ 30

@atakanekiz-15874

Last seen 7 months ago

Turkey

Hello,

I'd like to perform DE analyses using RNAseq/microarray data and I will be using age of the patients as a continuous predictor variable. To better understand how I can do this properly, I was playing with toy data from UC Davis. I am a bit unclear about voom behavior when a design matrix is used. Please see my analysis below:

library(limma)
library(edgeR)



counts <- read.delim("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2018-June-RNA-Seq-Workshop/master/thursday/all_counts.txt")



d0 <- DGEList(counts)

d0 <- calcNormFactors(d0)

# Drop lowly expressed genes
cutoff <- 1
drop <- which(apply(cpm(d0), 1, max) < cutoff)
d <- d0[-drop,] 

# Excerpt from vignette showing how a continuous variable (pH) can be modeled in limma
set.seed(99)
pH <- rnorm(n = 24, mean = 8, sd = 1.5)
pH
#>  [1] 8.320944 8.719487 8.131743 8.665788 7.455743 8.184011 6.704232 8.734436
#>  [9] 7.453825 6.058637 6.881346 9.382326 9.125082 4.237169 3.438599 8.000399
#> [17] 7.408972 5.382459 8.747947 8.406431 9.648382 9.128770 7.910875 7.483147

# MM with intercept
mm1 <- model.matrix(~pH)
v1 <- voom(d, mm1, plot = T)



# No intercept MM
mm2 <- model.matrix(~0+pH)
v2 <- voom(d, mm2, plot = T)



# Mock age data (with intercept)
age <- rep(c(20, 22, 30, 32, 40, 42, 50, 52), each=3)
mm3 <- model.matrix(~age)
v3 <- voom(d, mm3, plot=T)



# Mock age data (without intercept)
mm4 <- model.matrix(~0+age)
v4 <- voom(d, mm4, plot=T)

^{Created on 2019-12-05 by the reprex package (v0.3.0)}

My questions:

1- Why is there a dramatic difference in the voom output when no intercept model is used? 2- In the real-world analysis I would be looking at patient samples originating from male and female individuals with different ages. I would like to see:

a) DE genes as a factor of age regardless of gender. b) DE genes that are only changing as a factor of age in males or females.

I can think of the following approach for scenario a:

mm <- model.matrix(~age)
v <- voom(d, mm)
fit <- lmFit(v, mm)
cont <- contrasts.fit(fit, coef="age")
res <- eBayes(cont)
top.table <- topTable(tmp, sort.by="P", n=Inf)

I'm not sure how I would find DE genes for scenario 2. I could subset the data to include only males or females and employ the same approach as in (a), but I would be losing information from the excluded samples. I would appreciate your insights on how to perform this analysis properly.

limma voom rnaseq model.matrix intercept • 3.9k views

ADD COMMENT • link updated 5.0 years ago by Gordon Smyth 52k • written 5.0 years ago by atakanekiz ▴ 30

1

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 35 minutes ago

WEHI, Melbourne, Australia

Why is there a dramatic difference in the voom output when no intercept model is used?

Why would you not expect a dramatic difference in the output, considering that you have made such a dramatic change to the input?

It would be easier to answer your question more meaningfully if you explained why you would contemplate fitting a regression through the origin (which is what your no-intercept gives). Maybe you are confused why ~0+x gives the same voom results as ~x when x is a factor but not when x is continuous but, if that's what you are wondering, you would need to say so.

In the real-world analysis I would be looking at patient samples originating from male and female individuals with different ages. I would like to see: a) DE genes as a factor of age regardless of gender. b) DE genes that are only changing as a factor of age in males or females.

This second question has nothing to do with your first question or with the title of your post. In future, please ask a separate question in a separate post. Anyway, the model you need is

design <- model.matrix(~gender + gender:age)

ADD COMMENT • link 5.0 years ago Gordon Smyth 52k

0

Entering edit mode

Yes, my main question was about the ~x vs ~0+x difference when x is continuous but not when it is a factor. I'd appreciate if you could shed some light on this. I guess I'm unclear about what intercept means for factors and continuous variables at the end of the day...

ADD REPLY • link 5.0 years ago atakanekiz ▴ 30

0

Entering edit mode

A first step would be to look at the design matrix and count how many columns it has. Does the number of columns change when you add 0+?

A second step would be to do a Google search for "regression through the origin".

It still seems strange to me that you are devoting so much attention to this since none of the posts or documentation about limma and continuous variables have ever recommended removing the intercept.

ADD REPLY • link 5.0 years ago Gordon Smyth 52k

score 2 · Accepted Answer · 2019-12-05

2

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 3 hours ago

United States

Do note that with only a continuous variable and no intercept, you are saying that the intercept is zero. Put another way, you are saying that, for every gene, when pH is zero, the gene expression is zero as well. Same for age - at birth, nobody has any gene expression whatsoever... Which obviously doesn't make sense. I would imagine that is what is causing the weird results, but Gordon or Aaron will be along shortly to say for sure.

As for your question, what you are describing is an interaction (~age*sex), where you are allowing the two groups to have separate intercepts as well as slopes. You can then test the interaction (evidence that gene expression changes at a different rate as we age, depending on sex), and for those genes without a significant interaction you can test for differences due to sex (where you then assume the slopes are the same).

ADD COMMENT • link 5.0 years ago James W. MacDonald 67k

0

Entering edit mode

That's right. A nonsense design matrix will produce nonsense fitted values, and hence nonsense residuals, nonsense variances and a nonsense voom trend.

OP hasn't given any explanation for why they would contemplate fitting a regression through the origin, so I don't feel obliged to write at length about why it is wrong. Possibly OP might be confused why "~0+x" gives the same voom results as "~x" when x is a factor but not when x is continuous, but they haven't said so I'm just guessing.

ADD REPLY • link 5.0 years ago Gordon Smyth 52k

0

Entering edit mode

Hi Gordon,

You are right, my confusion lies in the difference of "~0+x" and "~x" when there is a continuous variable involved, but not a factor. I'm just trying to develop a better understanding of different designs. In my experience, when I was working with RNAseq data and categorical groups, different designs changed how the results are interpreted (e.g. comparing everything against a baseline ("~0+...") or treating one of the samples in the experiment as the baseline (no zero in design). But these approaches didn't change how the data transformation was performed. I have read a lot of threads regarding continuous variables and read the limma vignette multiple times. Somehow, certain design approaches (especially the ones involving interaction) don't seem intuitive to me. I think I'm not well versed in what is going on under the hood, making it difficult to wrap my mind around this. I appreciate the insights.

ADD REPLY • link 5.0 years ago atakanekiz ▴ 30