Analysing Affy data with limma
@january-weiner-4252
Dear all,

I'm trying to analyse a data set deposited in GEO. In theory, I could download the CEL files and do everything from scratch, but I would prefer to use the GEO-deposited, normalized data.

I load the data with:

```r
library(GEOquery)
g <- getGEO("GSE31348")[[1]]
```

So far, so good. The data is supposed to be GC-RMA-normalized. Density plots show that the arrays have a symmetrical distribution around 0, so they are not plain log-transformed signals.

My question is: how should I proceed with the data if I want to use limma to analyse it? Or should I use the original CEL files and a standard approach (as described, for example, in the limma user's guide)?

If I dumbly run limma, I am getting more or less the results I expect, but the p-values seem overly optimistic to me:

```r
g2 <- new("EList", list(targets = as(phenoData(g), "data.frame"),
                        genes   = as(featureData(g), "data.frame"),
                        E       = exprs(g)))
g2$targets$tp <- factor(gsub("time point: Week ", "tp.",
                             g2$targets$characteristics_ch1.2))
d <- model.matrix(~ 0 + g2$targets$tp)
colnames(d) <- levels(g2$targets$tp)
fit <- eBayes(contrasts.fit(lmFit(g2, d),
                            makeContrasts("tp.0-tp.26", levels = d)))
```

I think that this is incorrect.

Kind regards,
j.
limma • GEOquery
@sean-davis-490
On Thu, Dec 12, 2013, January Weiner wrote:

> Density plots show that the arrays have a symmetrical distribution around 0, so they are not plain log-transformed signals.

Actually, according to the metadata associated with the samples, they are: "GC-RMA and baseline normalised to median".

The answer to your question may depend on how much you care about the specifics of the results. If you are going to make costly or important decisions based on your analysis, starting with raw data is usually best. Some good QC and a search for batch effects may also help.

Sean
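One quick way to check this point about the scale of the values (a sketch, not from the original thread; it assumes `g` has been loaded with `getGEO` as above):

```r
# Sanity check (sketch): GC-RMA log2 expression values typically span
# roughly 2-16; after baseline normalisation to the median of all arrays
# they are centred on 0, which matches the density plots described above.
library(Biobase)  # for exprs()
library(limma)

summary(as.vector(exprs(g)))             # range and centre of the values
plotDensities(exprs(g), legend = FALSE)  # one density curve per array
```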
@james-w-macdonald-5106
Hi January,

I think what you are doing is probably fine. From the paper:

"Affymetrix CEL files were imported, summarized using the GC-RMA algorithm, and baseline-normalized to the median of all arrays."

So that's pretty close to what you would get if you just took the CEL files yourself and ran them through affy or gcrma (or xps or frma or SCAN.UPC), except for the location change from shifting the data by the median. And I wouldn't be too surprised by very small p-values, because a) you have 27 samples/group, and b) the patients have tuberculosis, which I would expect to get the various white blood cells really worked up.

However, I believe you are ignoring the fact that these data are paired, and you should be including a 'patient' factor to account for the pairing.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
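A patient factor along the lines Jim describes could be added like this (a sketch, not from the thread; `patient` is a hypothetical factor that would have to be built from the GEO sample characteristics, and `g2`/`tp` are from the original post):

```r
library(limma)

# Hypothetical blocking factor; check the GEO characteristics fields for
# the column that actually identifies the patient.
patient <- factor(g2$targets$characteristics_ch1.1)
tp <- g2$targets$tp

# Patient enters the model as a fixed blocking factor; the time-point
# effects are then estimated within patients.
d <- model.matrix(~ patient + tp)
fit <- eBayes(lmFit(g2, d))
topTable(fit, coef = "tptp.26")  # coefficient name assumes tp.0 is the
                                 # reference level (week 26 vs. baseline)
```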
Hi James,

Yes, we do see a similar response in other data sets (with more samples) that we and others have generated, but it still feels a bit weird. I have omitted the question of pairing here for the sake of simplicity, but I agree that I need to include it in the calculations; I use duplicateCorrelation for that.

@Sean: no, it's not so important -- I just do it for the sake of completeness.

Kind regards,
j.
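The duplicateCorrelation route could look roughly like this (a sketch, not from the thread; it keeps the original contrast formulation, treats patient as a random rather than fixed effect, and again assumes a hypothetical `patient` factor recoverable from the GEO characteristics fields):

```r
library(limma)

# Hypothetical blocking factor, as in the paired-design sketch above.
patient <- factor(g2$targets$characteristics_ch1.1)
tp <- g2$targets$tp

d <- model.matrix(~ 0 + tp)
colnames(d) <- levels(tp)

# Estimate the within-patient correlation, then feed the consensus value
# back into lmFit so that patient is handled as a random effect.
corfit <- duplicateCorrelation(g2, d, block = patient)
fit <- lmFit(g2, d, block = patient,
             correlation = corfit$consensus.correlation)
fit <- eBayes(contrasts.fit(fit, makeContrasts("tp.0-tp.26", levels = d)))
topTable(fit)
```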