Analysing Affy data with limma
@january-weiner-4252
Dear all,

I'm trying to analyse a data set deposited in GEO. In theory, I could download the CEL files and do everything from scratch, but I would prefer to use the GEO-deposited, normalized data.

I load the data with:

```r
library(GEOquery)
g <- getGEO("GSE31348")[[1]]
```

So far, so good. The data is supposed to be GC-RMA-normalized. Density plots show that the arrays have a symmetrical distribution around 0, so they are not plain log-transformed signals.

My question is: how should I proceed with the data if I want to use limma to analyse it? Or should I use the original CEL files and a standard approach (as described, for example, in the limma user's guide)?

If I dumbly run limma, I am getting more or less the results I expect, but the p-values seem overly optimistic to me:

```r
g2 <- new("EList", list(targets = as(phenoData(g), "data.frame"),
                        genes   = as(featureData(g), "data.frame"),
                        E       = exprs(g)))
g2$targets$tp <- factor(gsub("time point: Week ", "tp.",
                             g2$targets$characteristics_ch1.2))
d <- model.matrix(~ 0 + g2$targets$tp)
colnames(d) <- levels(g2$targets$tp)
fit <- eBayes(contrasts.fit(lmFit(g2, d),
                            makeContrasts("tp.0-tp.26", levels = d)))
```

I think that this is incorrect.

Kind regards,
j.
limma • GEOquery
@sean-davis-490
On Thu, Dec 12, 2013, January Weiner wrote:

> Density plots show that the arrays have a symmetrical distribution around 0, so they are not plain log-transformed signals.

Actually, according to the metadata associated with the samples, they are: "GC-RMA and baseline normalised to median".

The answer to your question may depend on how much you care about the specifics of the results. If you are going to make costly or important decisions based on your analysis, starting with raw data is usually best. Some good QC and a search for batch effects may also help.

Sean
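One quick way to check this point about the scale of the values (a sketch, not from the original thread; it assumes `g` has been loaded with `getGEO` as above):

```r
# Sanity check (sketch): GC-RMA log2 expression values typically span
# roughly 2-16; after baseline normalisation to the median of all arrays
# they are centred on 0, which matches the density plots described above.
library(Biobase)  # for exprs()
library(limma)

summary(as.vector(exprs(g)))             # range and centre of the values
plotDensities(exprs(g), legend = FALSE)  # one density curve per array
```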
@james-w-macdonald-5106
Hi January,

I think what you are doing is probably fine. From the paper:

"Affymetrix CEL files were imported, summarized using the GC-RMA algorithm, and baseline-normalized to the median of all arrays."

So that's pretty close to what you would get if you just took the CEL files yourself and ran them through affy or gcrma (or xps or frma or SCAN.UPC), except for the location change from shifting the data by the median. And I wouldn't be too surprised by very small p-values, because a) you have 27 samples/group, and b) the patients have tuberculosis, which I would expect to get the various white blood cells really worked up.

However, I believe you are ignoring the fact that these data are paired, and you should be including a 'patient' factor to account for the pairing.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
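A patient factor along the lines Jim describes could be added like this (a sketch, not from the thread; `patient` is a hypothetical factor that would have to be built from the GEO sample characteristics, and `g2`/`tp` are from the original post):

```r
library(limma)

# Hypothetical blocking factor; check the GEO characteristics fields for
# the column that actually identifies the patient.
patient <- factor(g2$targets$characteristics_ch1.1)
tp <- g2$targets$tp

# Patient enters the model as a fixed blocking factor; the time-point
# effects are then estimated within patients.
d <- model.matrix(~ patient + tp)
fit <- eBayes(lmFit(g2, d))
topTable(fit, coef = "tptp.26")  # coefficient name assumes tp.0 is the
                                 # reference level (week 26 vs. baseline)
```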
Hi James,

Yes, we do see a similar response in other data sets (with more samples) that we and others have generated, but it still feels a bit weird. I have omitted the question of pairing here for the sake of simplicity, but I agree that I need to include it in the calculations; I use duplicateCorrelation for that.

@Sean: no, it's not so important -- I just do it for the sake of completeness.

Kind regards,
j.
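The duplicateCorrelation route could look roughly like this (a sketch, not from the thread; it keeps the original contrast formulation, treats patient as a random rather than fixed effect, and again assumes a hypothetical `patient` factor recoverable from the GEO characteristics fields):

```r
library(limma)

# Hypothetical blocking factor, as in the paired-design sketch above.
patient <- factor(g2$targets$characteristics_ch1.1)
tp <- g2$targets$tp

d <- model.matrix(~ 0 + tp)
colnames(d) <- levels(tp)

# Estimate the within-patient correlation, then feed the consensus value
# back into lmFit so that patient is handled as a random effect.
corfit <- duplicateCorrelation(g2, d, block = patient)
fit <- lmFit(g2, d, block = patient,
             correlation = corfit$consensus.correlation)
fit <- eBayes(contrasts.fit(fit, makeContrasts("tp.0-tp.26", levels = d)))
topTable(fit)
```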