Building linear model (age effect on gene expression)
1
0
Entering edit mode
@detroitdrive-16757
Last seen 5.5 years ago

For the past few weeks I've been posting about devising a way to build linear models and plot them using the ALL dataset.  Specifically a linear model for the effects of age on gene expression.  Up to this point I've only been able to develop a model for 1 individual gene (probe id) but because my R syntax is so nascent (perhaps fledgling is more appropriate), my attempt to construct a linear model for all 12625 rows falls apart.  Thanks to an excellent video posted by Jeff Leek from Johns Hopkins (coursera) I was able to construct something that looks pretty close to what I'm seeking (but just for one gene):

edata=as.data.frame(exprs(ALL))

edata = as.matrix(edata)

age <- pData(ALL)$age #thanks Martin for the shortcut!

lm1 = lm(edata[1, ] ~ age)

​So lets try this for the entire set (12625):

lm12625 = lm(edata[1:12625] ~ age)

But I get an error:

Error in model.frame.default(formula = edata[1:12625] ~ age, drop.unused.levels = TRUE) : 
  variable lengths differ (found for 'age')

I'm also trying to figure out how I would alter the model to build one that specifically assesses the most significant effect on gene expression?  Another model for insignificant age effect?

bioconductor ALL linear model • 1.8k views
ADD COMMENT
2
Entering edit mode
@gordon-smyth
Last seen 20 minutes ago
WEHI, Melbourne, Australia
library(ALL)
library(limma)
data(ALL)
age <- pData(ALL)$age
ALL2 <- ALL[, !is.na(age)]
design <- model.matrix(~age)
fit <- lmFit(ALL2, design)
fit <- eBayes(fit, trend=TRUE)

Then

> topTable(fit, coef="age")
            logFC AveExpr     t  P.Value adj.P.Val      B
40419_at  0.02507    8.37  4.87 3.37e-06    0.0425  2.446
38639_at  0.01279    6.54  4.32 3.10e-05    0.1959  0.301
336_at    0.01208    6.70  4.06 8.61e-05    0.3623 -0.677
38994_at  0.03929    7.61  3.95 1.30e-04    0.3694 -1.073
33700_at  0.02601    6.17  3.87 1.78e-04    0.3694 -1.368
32406_at -0.00922    4.70 -3.85 1.91e-04    0.3694 -1.438
40202_at  0.03820    8.02  3.81 2.17e-04    0.3694 -1.556
34519_at -0.00457    3.83 -3.79 2.34e-04    0.3694 -1.630
39373_at -0.01423    3.91 -3.73 2.92e-04    0.4093 -1.839
38167_at -0.00542    4.86 -3.67 3.61e-04    0.4563 -2.041
ADD COMMENT
0
Entering edit mode

Many thanks Gordon!  Let me try to break down what you've accomplished for my own understanding.  First, for the !is.na(age) snippet - this is what's telling the code to create a variable (ALL2) which represents the entire (12625) matrix?  Next, for the design step, you did the opposite of what I did in my example above for one gene, that is you've converted the age data.frame into a matrix (whereas I converted the exprs(ALL) matrix into a data.frame).

I don't see how the exprs(ALL) data was incorporated - is this inherently part of the limma package? 

The topTable results provides a list of "Top Ten" probe_id's that were effected by age?  I'm also wondering if there is a simple command for "bottom table"?

Thanks again, I very much appreciate you taking the time to show me this example!

EDIT: as a solution for getting the bottom ten, I printed out all the rows to a csv file and just looked them up.  Not sure if this is the best way but it worked.

> topTable(fit, coef="age", sort.by="P", n="Inf")

> limma.res <-topTable(fit, coef="age", sort.by="P", n="Inf")

> write.csv(limma.res,file="B.PregVsLacResults.csv",row.names=TRUE)
ADD REPLY

Login before adding your answer.

Traffic: 579 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6