Hello,
I'd like to perform DE analyses using RNAseq/microarray data and I will be using age of the patients as a continuous predictor variable. To better understand how I can do this properly, I was playing with toy data from UC Davis. I am a bit unclear about voom behavior when a design matrix is used. Please see my analysis below:
library(limma)
library(edgeR)
counts <- read.delim("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2018-June-RNA-Seq-Workshop/master/thursday/all_counts.txt")
d0 <- DGEList(counts)
d0 <- calcNormFactors(d0)
# Drop lowly expressed genes
cutoff <- 1
drop <- which(apply(cpm(d0), 1, max) < cutoff)
d <- d0[-drop,]
# Excerpt from vignette showing how a continuous variable (pH) can be modeled in limma
set.seed(99)
pH <- rnorm(n = 24, mean = 8, sd = 1.5)
pH
#> [1] 8.320944 8.719487 8.131743 8.665788 7.455743 8.184011 6.704232 8.734436
#> [9] 7.453825 6.058637 6.881346 9.382326 9.125082 4.237169 3.438599 8.000399
#> [17] 7.408972 5.382459 8.747947 8.406431 9.648382 9.128770 7.910875 7.483147
# MM with intercept
mm1 <- model.matrix(~pH)
v1 <- voom(d, mm1, plot = T)
# No intercept MM
mm2 <- model.matrix(~0+pH)
v2 <- voom(d, mm2, plot = T)
# Mock age data (with intercept)
age <- rep(c(20, 22, 30, 32, 40, 42, 50, 52), each=3)
mm3 <- model.matrix(~age)
v3 <- voom(d, mm3, plot=T)
# Mock age data (without intercept)
mm4 <- model.matrix(~0+age)
v4 <- voom(d, mm4, plot=T)
Created on 2019-12-05 by the reprex package (v0.3.0)
My questions:
1- Why is there a dramatic difference in the voom output when no intercept model is used? 2- In the real-world analysis I would be looking at patient samples originating from male and female individuals with different ages. I would like to see:
a) DE genes as a factor of age regardless of gender. b) DE genes that are only changing as a factor of age in males or females.
I can think of the following approach for scenario a:
mm <- model.matrix(~age)
v <- voom(d, mm)
fit <- lmFit(v, mm)
cont <- contrasts.fit(fit, coef="age")
res <- eBayes(cont)
top.table <- topTable(tmp, sort.by="P", n=Inf)
I'm not sure how I would find DE genes for scenario 2. I could subset the data to include only males or females and employ the same approach as in (a), but I would be losing information from the excluded samples. I would appreciate your insights on how to perform this analysis properly.
That's right. A nonsense design matrix will produce nonsense fitted values, and hence nonsense residuals, nonsense variances and a nonsense voom trend.
OP hasn't given any explanation for why they would contemplate fitting a regression through the origin, so I don't feel obliged to write at length about why it is wrong. Possibly OP might be confused why "~0+x" gives the same voom results as "~x" when x is a factor but not when x is continuous, but they haven't said so I'm just guessing.
Hi Gordon,
You are right, my confusion lies in the difference of "~0+x" and "~x" when there is a continuous variable involved, but not a factor. I'm just trying to develop a better understanding of different designs. In my experience, when I was working with RNAseq data and categorical groups, different designs changed how the results are interpreted (e.g. comparing everything against a baseline ("~0+...") or treating one of the samples in the experiment as the baseline (no zero in design). But these approaches didn't change how the data transformation was performed. I have read a lot of threads regarding continuous variables and read the limma vignette multiple times. Somehow, certain design approaches (especially the ones involving interaction) don't seem intuitive to me. I think I'm not well versed in what is going on under the hood, making it difficult to wrap my mind around this. I appreciate the insights.