Question

Building linear model (age effect on gene expression)

0

Entering edit mode

detroit.drive • 0

@detroitdrive-16757

Last seen 4.9 years ago

For the past few weeks I've been posting about devising a way to build linear models and plot them using the ALL dataset. Specifically a linear model for the effects of age on gene expression. Up to this point I've only been able to develop a model for 1 individual gene (probe id) but because my R syntax is so nascent (perhaps fledgling is more appropriate), my attempt to construct a linear model for all 12625 rows falls apart. Thanks to an excellent video posted by Jeff Leek from Johns Hopkins (coursera) I was able to construct something that looks pretty close to what I'm seeking (but just for one gene):

edata=as.data.frame(exprs(ALL))

edata = as.matrix(edata)

age <- pData(ALL)$age #thanks Martin for the shortcut!

lm1 = lm(edata[1, ] ~ age)

So lets try this for the entire set (12625):

lm12625 = lm(edata[1:12625] ~ age)

But I get an error:

Error in model.frame.default(formula = edata[1:12625] ~ age, drop.unused.levels = TRUE) : 
  variable lengths differ (found for 'age')

I'm also trying to figure out how I would alter the model to build one that specifically assesses the most significant effect on gene expression? Another model for insignificant age effect?

bioconductor ALL linear model • 1.4k views

ADD COMMENT • link updated 5.7 years ago by Gordon Smyth 50k • written 5.7 years ago by detroit.drive • 0

score 2 · Accepted Answer · 2018-08-14

2

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

library(ALL)
library(limma)
data(ALL)
age <- pData(ALL)$age
ALL2 <- ALL[, !is.na(age)]
design <- model.matrix(~age)
fit <- lmFit(ALL2, design)
fit <- eBayes(fit, trend=TRUE)

Then

> topTable(fit, coef="age")
            logFC AveExpr     t  P.Value adj.P.Val      B
40419_at  0.02507    8.37  4.87 3.37e-06    0.0425  2.446
38639_at  0.01279    6.54  4.32 3.10e-05    0.1959  0.301
336_at    0.01208    6.70  4.06 8.61e-05    0.3623 -0.677
38994_at  0.03929    7.61  3.95 1.30e-04    0.3694 -1.073
33700_at  0.02601    6.17  3.87 1.78e-04    0.3694 -1.368
32406_at -0.00922    4.70 -3.85 1.91e-04    0.3694 -1.438
40202_at  0.03820    8.02  3.81 2.17e-04    0.3694 -1.556
34519_at -0.00457    3.83 -3.79 2.34e-04    0.3694 -1.630
39373_at -0.01423    3.91 -3.73 2.92e-04    0.4093 -1.839
38167_at -0.00542    4.86 -3.67 3.61e-04    0.4563 -2.041

ADD COMMENT • link 5.7 years ago Gordon Smyth 50k

0

Entering edit mode

Many thanks Gordon! Let me try to break down what you've accomplished for my own understanding. First, for the !is.na(age) snippet - this is what's telling the code to create a variable (ALL2) which represents the entire (12625) matrix? Next, for the design step, you did the opposite of what I did in my example above for one gene, that is you've converted the age data.frame into a matrix (whereas I converted the exprs(ALL) matrix into a data.frame).

I don't see how the exprs(ALL) data was incorporated - is this inherently part of the limma package?

The topTable results provides a list of "Top Ten" probe_id's that were effected by age? I'm also wondering if there is a simple command for "bottom table"?

Thanks again, I very much appreciate you taking the time to show me this example!

EDIT: as a solution for getting the bottom ten, I printed out all the rows to a csv file and just looked them up. Not sure if this is the best way but it worked.

> topTable(fit, coef="age", sort.by="P", n="Inf")

> limma.res <-topTable(fit, coef="age", sort.by="P", n="Inf")

> write.csv(limma.res,file="B.PregVsLacResults.csv",row.names=TRUE)

ADD REPLY • link 5.7 years ago detroit.drive • 0