programming problem: running many ANOVAs

0

Entering edit mode

Arne.Muller@aventis.com ▴ 620

@arnemulleraventiscom-466

Last seen 9.7 years ago

Hello, I've a R/bioconductor programming question. For each gene in a matrix (emat) generated by exprs() I'm running an anova based on a goven model. The p-values for the anova and the coefficients from the lm for each gene are stored in a list and are returned. A typical emat contains 12,000 rows and 40 colums or so. The probelm is that it runs about 10 minutes (P4 2.8GHz under linux with 2GB RAM) to complete. I wonder if there is a more effifient way to program this routine (running the 12,000 lm's and ANOVAs). Is there a way to get rid of the generation of the data.frame that I pass to the lm? I call my routine like this (dose and batch are two factors): > dif.04h <- arnova(emat.04h, list(dose, batch), Value ~ dose + batch) [1] "1000" "genes done ..." [1] "2000" "genes done ..." [1] "3000" "genes done ..." [1] "4000" "genes done ..." [1] "5000" "genes done ..." [1] "6000" "genes done ..." [1] "7000" "genes done ..." [1] "8000" "genes done ..." [1] "9000" "genes done ..." [1] "10000" "genes done ..." [1] "11000" "genes done ..." [1] "12000" "genes done ..." here's the routine: arnova <- function(emat, factors, model, contr=NULL) { genes <- rownames(emat) # gene names l <- length(genes) # number of genes difflabels <- attr(terms(model),"term.labels") # factors from model diff <- list() # stores an anova(lm) result coef <- list() # stores an summary(lm) ### loop over all rows (genes) and calculate the lm & anova, ### the p-values and coefficients for each factor are stored for ( i in 1:l ) { ### the input data frame for this gene d <- data.frame(Value = emat[i, ], factors) fit <- lm(model, data = d, contrasts=contr) genecoef <- summary(fit)$coefficients[-1,4] # coefficients genediff <- as.vector(na.omit(anova(fit)$'Pr(>F)')) # anova p-values names(genediff) <- difflabels # the factor names diff[[genes[i]]] <- genediff coef[[genes[i]]] <- genecoef ###print progress if ( !(i%%1000) ) { print(c(i, 'genes done ...')) } } list(anova=diff, coefs=coef) } A result look slike this: > dif.04h$'anova'[1:3] $"100001_at" dose batch 0.5245583 0.2333894 $"100002_at" dose batch 9.635027e-01 3.193364e-05 $"100003_at" dose batch 7.374520e-01 4.639076e-06 > dif.04h$'coef'[1:3] $"100001_at" dose010mM dose025mM dose050mM dose100mM batchOLD batchPRG 0.6895026 0.2869963 0.7480160 0.4776234 0.1204537 0.1837618 $"100002_at" dose010mM dose025mM dose050mM dose100mM batchOLD batchPRG 6.192953e-01 8.952266e-01 8.113848e-01 9.412216e-01 1.515069e-05 4.904563e-04 $"100003_at" dose010mM dose025mM dose050mM dose100mM batchOLD batchPRG 9.574872e-01 8.735150e-01 4.739319e-01 2.870358e-01 9.635836e-06 1.549058e-05 Any suggestions to speed this up? thanks a lot for your help, +regards, Arne -- Arne Muller, Ph.D. Toxicogenomics, Aventis Pharma arne dot muller domain=aventis com

DOSE DOSE • 995 views

ADD COMMENT • link updated 20.1 years ago by Francois Collin ▴ 130 • written 20.1 years ago by Arne.Muller@aventis.com ▴ 620

0

Entering edit mode

Francois Collin ▴ 130

@francois-collin-470

Last seen 9.7 years ago

Yes, there is a way to avoid all of the overhead in lm(). As you are fitting the same model to the data for all genes, you should only have to get the qr decomposition of your design matrix once. Here are some ideas for code - with someting like this you should get the same results that you would get out of lm() in seconds.: ################################### ## get qr decomposition of design matrix x.mtx <- model.matrix(~ -1 + Visit:Treatment) x.qr<-qr(x.mtx) x.det<-prod(diag(x.qr$qr)) if (x.det == 0) stop("Determinent is Zero!") # get unscaled covariance and degress of freedom x.cov.unscaled<-solve(t(qr.R(x.qr)) %*% qr.R(x.qr)) x.df<-nrow(x.mtx)-x.qr$rank # Fit Model for all genes # Y.mtx has one column for every gene $ rows of Y.mtx are observations with Visit, Treatment attributes Fit.coef<-qr.coef(x.qr, Y.mtx) Fit.res<-qr.resid(x.qr, Y.mtx) Fit.sigma2<-matrix(apply(Fit.res, 2, FUN=function(y) {t(y)%*%y/x.df}), nrow=1) Fit.coef.var<-apply(Fit.sigma2, 2, FUN=function(y) {y*diag(x.cov.unscaled)}) Fit.coef.t<-Fit.coef/sqrt(Fit.coef.var) # # # etc. --- Arne.Muller@aventis.com wrote: > Hello, > > I've a R/bioconductor programming question. For each > gene in a matrix (emat) generated by exprs() I'm > running an anova based on a goven model. The > p-values for the anova and the coefficients from the > lm for each gene are stored in a list and are > returned. > > A typical emat contains 12,000 rows and 40 colums or > so. The probelm is that it runs about 10 minutes (P4 > 2.8GHz under linux with 2GB RAM) to complete. > > I wonder if there is a more effifient way to program > this routine (running the 12,000 lm's and ANOVAs). > Is there a way to get rid of the generation of the > data.frame that I pass to the lm? > > I call my routine like this (dose and batch are two > factors): > > > dif.04h <- arnova(emat.04h, list(dose, batch), > Value ~ dose + batch) > [1] "1000" "genes done ..." > [1] "2000" "genes done ..." > [1] "3000" "genes done ..." > [1] "4000" "genes done ..." > [1] "5000" "genes done ..." > [1] "6000" "genes done ..." > [1] "7000" "genes done ..." > [1] "8000" "genes done ..." > [1] "9000" "genes done ..." > [1] "10000" "genes done ..." > [1] "11000" "genes done ..." > [1] "12000" "genes done ..." > > here's the routine: > > arnova <- function(emat, factors, model, contr=NULL) > { > genes <- rownames(emat) # gene names > l <- length(genes) # number of genes > difflabels <- attr(terms(model),"term.labels") > # factors from model > diff <- list() # stores an anova(lm) result > coef <- list() # stores an summary(lm) > > ### loop over all rows (genes) and calculate > the lm & anova, > ### the p-values and coefficients for each factor > are stored > for ( i in 1:l ) { > > ### the input data frame for this gene > d <- data.frame(Value = emat[i, ], > factors) > fit <- lm(model, data = d, > contrasts=contr) > genecoef <- > summary(fit)$coefficients[-1,4] # coefficients > genediff <- > as.vector(na.omit(anova(fit)$'Pr(>F)')) # anova > p-values > names(genediff) <- difflabels # the factor > names > diff[[genes[i]]] <- genediff > coef[[genes[i]]] <- genecoef > > ###print progress > if ( !(i%%1000) ) { print(c(i, 'genes done > ...')) } > } > > list(anova=diff, coefs=coef) > } > > A result look slike this: > > > dif.04h$'anova'[1:3] > $"100001_at" > dose batch > 0.5245583 0.2333894 > > $"100002_at" > dose batch > 9.635027e-01 3.193364e-05 > > $"100003_at" > dose batch > 7.374520e-01 4.639076e-06 > > > dif.04h$'coef'[1:3] > $"100001_at" > dose010mM dose025mM dose050mM dose100mM batchOLD > batchPRG > 0.6895026 0.2869963 0.7480160 0.4776234 0.1204537 > 0.1837618 > > $"100002_at" > dose010mM dose025mM dose050mM dose100mM > batchOLD batchPRG > 6.192953e-01 8.952266e-01 8.113848e-01 9.412216e-01 > 1.515069e-05 4.904563e-04 > > $"100003_at" > dose010mM dose025mM dose050mM dose100mM > batchOLD batchPRG > 9.574872e-01 8.735150e-01 4.739319e-01 2.870358e-01 > 9.635836e-06 1.549058e-05 > > Any suggestions to speed this up? > > thanks a lot for your help, > +regards, > > Arne > > -- > Arne Muller, Ph.D. > Toxicogenomics, Aventis Pharma > arne dot muller domain=aventis com > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor

ADD COMMENT • link 20.1 years ago Francois Collin ▴ 130

Login before adding your answer.