Assessing goodness of fit of a limma model fitted to microarray data
0
0
Entering edit mode
rubi ▴ 110
@rubi-6462
Last seen 5.7 years ago

Hi,

 

I have gene expression data from two species, where both were treated similarly at time = 0 and RNA was purified at time points 0 (i.e., before treatment), 0.5, 1, 2, 3, 4, 6, 8, 10, 12, 18, 24, 48, 72, and 168 hours. For both species, there are 4 biological replicates per each time > 0, two from male-derived cell-lines and two from female-derived cell-lines. Altogether, there are 4*14*2 112 samples, excluding the 2 time = 0 samples.

RNA was profiled on Agilent two-color microarrays, where the control channel for each species was its time = 0 RNA.

My question is how do the two species differ in their response to the treatment over time, adjusting for sex. The model I thought would be appropriate is: expression ~ species*time + sex

 

Here's the data for the gene with the strongest species:time effect (lowest p-value), according to eBayes fit, after reading in the data and normalizing to controls, within, and between channels:

 

df <- data.frame(species = c(rep("A",56),rep("B",56)),
                 time = rep(c(0.5,1,2,3,4,6,8,10,12,18,24,48,72,168),2),
                 sex = c(rep("F",28),rep("M",28),rep("M",28),rep("F",28)),
                 y = c(0.0478,-0.2094,0.0189,0.1909,0.0519,-0.0399,0.091,-0.1216,0.1446,0.0703,0.0771,-0.1193,-0.1402,0.3976,0.0819,-0.1283,0.1697,0.1217,-0.0406,-0.2874,-0.14,0.0231,-0.0515,0.1188,-0.5046,-0.1092,0.1278,-0.0126,-0.0899,0.0812,-0.0469,-0.3162,-0.1297,0.0941,-0.1858,-0.036,0.066,0.054,-0.1895,-0.21,-0.1456,-0.0307,0.0466,0.1902,0.0282,0.3124,-0.0802,-0.2426,-0.0835,0.0576,-0.0066,-0.2917,-0.0444,0.002,0.1574,0.1419,-0.0643,0.1302,-0.0309,0.2049,-0.2419,-0.3125,-0.0261,-0.1417,-0.0433,0.0026,0.1545,-0.0297,-0.5969,-1.1599,-0.104,0.16,-0.1138,0.2061,-0.1964,-0.2259,-0.0444,-0.2323,0.0079,0.0451,-0.3934,-0.7692,-1.1443,-1.0439,0.0239,-0.082,-0.0106,-0.2931,-0.173,0.0613,-0.2549,0.0076,-0.0491,0.0045,-0.0248,-0.2804,-0.6369,-1.207,-0.0767,-9e-04,-0.2133,0.0813,0.0249,-0.1925,0.0952,0.0174,0.0715,0.1252,-0.0883,-0.4783,-0.7413,-1.3201))

df$species <- factor(df$species, levels = c("B","A"))
df$sex <- factor(df$sex, levels = c("M","F"))
df$time <- as.numeric(df$time)

 

Where the design matrix I used is:

design.mat <- model.matrix(y ~ species*time + sex,

                           contrasts = c("contr.treatment",NA,"contr.treatment"),
                           data = df)

 

 

The species:time effect size from the eBayes fit is: -0.02381599, and the p-value is: 1.2894e-39.

And here is how the data look:

library(ggplot2)

ggplot(df,aes(x=time,y=y,color=species))+geom_point()+theme_minimal()+stat_smooth(method="lm")

However, what's actually going on is that up to time = 24 the two species' slopes are roughly similar:

ggplot(df %>% dplyr::filter(time <= 24),aes(x=time,y=y,color=species))+geom_point()+theme_minimal()+stat_smooth(method="lm")

 

Whereas after time = 24 they're diverging:

ggplot(df %>% dplyr::filter(time > 24),aes(x=time,y=y,color=species))+geom_point()+theme_minimal()+stat_smooth(method="lm")

 

I was thinking that a low goodness of fit for all time points would help indicating that issue (as a means of summarizing this across all genes), but I can't seem to find how to get a summary of a goodness of fit for each gene through limma.

 

If I just fit a linear model (lm <- y ~ species * time + sex, data = df) to either all time points, or the time <= 24 and time > 24 separately, and then look at the QQ-plots and run a Shapiro test on the residuals, I see a poor fit for all time points, bet good fits for the time points up to 24 and above 24, separately:

All time points:

fit <- lm(y ~ species*time,data=df %>% dplyr::filter(time <= 24))
qqnorm(resid(fit))
qqline(resid(fit))
shapiro.test(resid(fit))$p.value

0.03420831


Up to time = 24:
fit <- lm(y ~ species*time,data=df %>% dplyr::filter(time <= 24))
qqnorm(resid(fit))
qqline(resid(fit))
shapiro.test(resid(fit))$p.value

0.2452327


And for time > 24:

fit <- lm(y ~ species*time,data=df %>% dplyr::filter(time > 24))
qqnorm(resid(fit))
qqline(resid(fit))
shapiro.test(resid(fit))$p.value


0.5827696

 

 

So my question is if there's a way to get a measure of goodness of fit through limma?

limma microarray goodness of fit • 816 views
ADD COMMENT

Login before adding your answer.

Traffic: 893 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6