Question

Assessing goodness of fit of a limma model fitted to microarray data

0

Entering edit mode

rubi ▴ 110

@rubi-6462

Last seen 5.7 years ago

Hi,

I have gene expression data from two species, where both were treated similarly at time = 0 and RNA was purified at time points 0 (i.e., before treatment), 0.5, 1, 2, 3, 4, 6, 8, 10, 12, 18, 24, 48, 72, and 168 hours. For both species, there are 4 biological replicates per each time > 0, two from male-derived cell-lines and two from female-derived cell-lines. Altogether, there are 4*14*2 112 samples, excluding the 2 time = 0 samples.

RNA was profiled on Agilent two-color microarrays, where the control channel for each species was its time = 0 RNA.

My question is how do the two species differ in their response to the treatment over time, adjusting for sex. The model I thought would be appropriate is: expression ~ species*time + sex.

Here's the data for the gene with the strongest species:time effect (lowest p-value), according to eBayes fit, after reading in the data and normalizing to controls, within, and between channels:

df <- data.frame(species = c(rep("A",56),rep("B",56)),
                 time = rep(c(0.5,1,2,3,4,6,8,10,12,18,24,48,72,168),2),
                 sex = c(rep("F",28),rep("M",28),rep("M",28),rep("F",28)),
                 y = c(0.0478,-0.2094,0.0189,0.1909,0.0519,-0.0399,0.091,-0.1216,0.1446,0.0703,0.0771,-0.1193,-0.1402,0.3976,0.0819,-0.1283,0.1697,0.1217,-0.0406,-0.2874,-0.14,0.0231,-0.0515,0.1188,-0.5046,-0.1092,0.1278,-0.0126,-0.0899,0.0812,-0.0469,-0.3162,-0.1297,0.0941,-0.1858,-0.036,0.066,0.054,-0.1895,-0.21,-0.1456,-0.0307,0.0466,0.1902,0.0282,0.3124,-0.0802,-0.2426,-0.0835,0.0576,-0.0066,-0.2917,-0.0444,0.002,0.1574,0.1419,-0.0643,0.1302,-0.0309,0.2049,-0.2419,-0.3125,-0.0261,-0.1417,-0.0433,0.0026,0.1545,-0.0297,-0.5969,-1.1599,-0.104,0.16,-0.1138,0.2061,-0.1964,-0.2259,-0.0444,-0.2323,0.0079,0.0451,-0.3934,-0.7692,-1.1443,-1.0439,0.0239,-0.082,-0.0106,-0.2931,-0.173,0.0613,-0.2549,0.0076,-0.0491,0.0045,-0.0248,-0.2804,-0.6369,-1.207,-0.0767,-9e-04,-0.2133,0.0813,0.0249,-0.1925,0.0952,0.0174,0.0715,0.1252,-0.0883,-0.4783,-0.7413,-1.3201))

df$species <- factor(df$species, levels = c("B","A"))
df$sex <- factor(df$sex, levels = c("M","F"))
df$time <- as.numeric(df$time)

Where the design matrix I used is:

design.mat <- model.matrix(y ~ species*time + sex,

                           contrasts = c("contr.treatment",NA,"contr.treatment"),
                           data = df)

The species:time effect size from the eBayes fit is: -0.02381599, and the p-value is: 1.2894e-39.

And here is how the data look:

library(ggplot2)

ggplot(df,aes(x=time,y=y,color=species))+geom_point()+theme_minimal()+stat_smooth(method="lm")

However, what's actually going on is that up to time = 24 the two species' slopes are roughly similar:

ggplot(df %>% dplyr::filter(time <= 24),aes(x=time,y=y,color=species))+geom_point()+theme_minimal()+stat_smooth(method="lm")

Whereas after time = 24 they're diverging:

ggplot(df %>% dplyr::filter(time > 24),aes(x=time,y=y,color=species))+geom_point()+theme_minimal()+stat_smooth(method="lm")

I was thinking that a low goodness of fit for all time points would help indicating that issue (as a means of summarizing this across all genes), but I can't seem to find how to get a summary of a goodness of fit for each gene through limma.

If I just fit a linear model (lm <- y ~ species * time + sex, data = df) to either all time points, or the time <= 24 and time > 24 separately, and then look at the QQ-plots and run a Shapiro test on the residuals, I see a poor fit for all time points, bet good fits for the time points up to 24 and above 24, separately:

All time points:

fit <- lm(y ~ species*time,data=df %>% dplyr::filter(time <= 24))
qqnorm(resid(fit))
qqline(resid(fit))
shapiro.test(resid(fit))$p.value

0.03420831


Up to time = 24:

fit <- lm(y ~ species*time,data=df %>% dplyr::filter(time <= 24))
qqnorm(resid(fit))
qqline(resid(fit))
shapiro.test(resid(fit))$p.value

0.2452327

And for time > 24:

fit <- lm(y ~ species*time,data=df %>% dplyr::filter(time > 24))
qqnorm(resid(fit))
qqline(resid(fit))
shapiro.test(resid(fit))$p.value


0.5827696

So my question is if there's a way to get a measure of goodness of fit through limma?

limma microarray goodness of fit • 816 views

ADD COMMENT • link 5.7 years ago rubi ▴ 110