Using limma to identify differentially expressed genes

0

Entering edit mode

David Westergaard ▴ 280

@david-westergaard-5119

Last seen 10.3 years ago

Hello, I've been trying to use limma to identify genes from the following data: http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-21340 - It's a simple control vs. disease experiment # SDRF downloaded from above page SDRF <- read.table(file="E-GEOD-21340.sdrf.txt",header=TRUE,stringsAsF actors=FALSE,sep="\t") # Looking to compare Family-history negativer versus Diabetis controls <- SDRF[grep("Control, Family History Neg",SDRF$Comment..Sample_source_name.),] disease <- SDRF[grep("^DM",SDRF$Characteristics.DiseaseState.),] Batch <- rbind(controls,disease) # Read in CEL files mixture.batch <- ReadAffy(filenames=Batch$Array.Data.File) # Preprocess data mixture.processed <- expresso(mixture.batch, bgcorrect.method = "rma", normalize.method = "quantiles", pmcorrect.method = "pmonly", summary.method = "medianpolish") # Get data in matrix signals <- exprs(mixture.prepared) cl <- ifelse(colnames(signals) %in% disease$Array.Data.File,1,0) # Do design matrix and fit design <- model.matrix(~factor(cl)) fit <- lmFit(signals,design) fit <- eBayes(fit) topTable(fit2,coef=2) Which yields the following: ID logFC AveExpr t P.Value adj.P.Val B 7513 208004_at -0.323 5.43 -4.65 0.00191 0.999 -3.10 11225 211829_s_at 0.340 5.07 4.36 0.00278 0.999 -3.17 5950 206424_at -0.907 6.65 -4.15 0.00363 0.999 -3.23 1354 201826_s_at -0.447 8.37 -4.13 0.00374 0.999 -3.24 19782 220418_at 0.392 5.43 4.02 0.00431 0.999 -3.27 8889 209396_s_at 1.899 7.47 4.01 0.00437 0.999 -3.28 5005 205478_at -0.931 9.22 -3.94 0.00481 0.999 -3.30 9469 209983_s_at 0.412 5.72 3.92 0.00492 0.999 -3.31 2936 203409_at 0.506 6.93 3.87 0.00531 0.999 -3.32 5054 205527_s_at 0.331 6.80 3.84 0.00549 0.999 -3.33 I'm abit puzzled over the adjusted P-values. Can it really be true that ALL of the adjusted P-values are 0.999, or did I make a rookie mistake somewhere? Best Regards, David Westergaard

limma limma • 1.8k views

ADD COMMENT • link updated 12.7 years ago by Voke AO ▴ 760 • written 12.7 years ago by David Westergaard ▴ 280

0

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 3 days ago

United States

Hi David, On 4/10/2012 11:14 AM, David Westergaard wrote: > Hello, > > I've been trying to use limma to identify genes from the following > data: http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-21340 - > It's a simple control vs. disease experiment > > > # SDRF downloaded from above page > SDRF<- read.table(file="E-GEOD-21340.sdrf.txt",header=TRUE,stringsAs Factors=FALSE,sep="\t") > > # Looking to compare Family-history negativer versus Diabetis > controls<- SDRF[grep("Control, Family History > Neg",SDRF$Comment..Sample_source_name.),] > disease<- SDRF[grep("^DM",SDRF$Characteristics.DiseaseState.),] > Batch<- rbind(controls,disease) > > # Read in CEL files > mixture.batch<- ReadAffy(filenames=Batch$Array.Data.File) > > # Preprocess data > mixture.processed<- expresso(mixture.batch, bgcorrect.method = "rma", > normalize.method = "quantiles", pmcorrect.method = "pmonly", > summary.method = "medianpolish") > > # Get data in matrix > signals<- exprs(mixture.prepared) > cl<- ifelse(colnames(signals) %in% disease$Array.Data.File,1,0) > > # Do design matrix and fit > design<- model.matrix(~factor(cl)) > fit<- lmFit(signals,design) > fit<- eBayes(fit) > topTable(fit2,coef=2) > > Which yields the following: > ID logFC AveExpr t P.Value adj.P.Val B > 7513 208004_at -0.323 5.43 -4.65 0.00191 0.999 -3.10 > 11225 211829_s_at 0.340 5.07 4.36 0.00278 0.999 -3.17 > 5950 206424_at -0.907 6.65 -4.15 0.00363 0.999 -3.23 > 1354 201826_s_at -0.447 8.37 -4.13 0.00374 0.999 -3.24 > 19782 220418_at 0.392 5.43 4.02 0.00431 0.999 -3.27 > 8889 209396_s_at 1.899 7.47 4.01 0.00437 0.999 -3.28 > 5005 205478_at -0.931 9.22 -3.94 0.00481 0.999 -3.30 > 9469 209983_s_at 0.412 5.72 3.92 0.00492 0.999 -3.31 > 2936 203409_at 0.506 6.93 3.87 0.00531 0.999 -3.32 > 5054 205527_s_at 0.331 6.80 3.84 0.00549 0.999 -3.33 > > I'm abit puzzled over the adjusted P-values. Can it really be true > that ALL of the adjusted P-values are 0.999, or did I make a rookie > mistake somewhere? Well, there definitely is some weirdness going on. To answer your question, yes it is possible for all the adjusted p-values to be right close to 1, if there isn't any evidence for differential expression. There are any number of reasons this can arise, but the main culprits in my experience are usually very noisy intra-group data with hardly any inter-group differences. A PCA plot can often shed light on this sort of problem. Now to get to the weirdness. You are getting these data by hand, when it might be easier to get from within R, using GEOquery. Note here that I have snipped extraneous output from my R session. > library(GEOquery) > geo <- getGEO("GSE21340") > dat <- getGEOSuppFiles("GSE21340") > setwd("GSE21340") > system("tar xvf GSE21340_RAW.tar") > library(affy) > eset <- rma(ReadAffy()) > design <- model.matrix(~pData(geo[[1]])$source_name_ch1) And just to note that I am using the correct data for the design matrix: > levels(pData(geo[[1]])$source_name_ch1) [1] "Control, Family History Neg" "Control, Family History Pos" [3] "Replicate from 1 patient" "Type 2 DM" > fit <- lmFit(eset, design) > fit <- eBayes(fit) > topTable(fit, 4)[,c(1,6)] ID adj.P.Val 2454 M19483_at 0.005571008 204 D10523_at 0.005571008 2126 L38941_at 0.009749608 2299 M11717_rna1_at 0.015951421 2574 M25077_at 0.015951421 193 D00760_at 0.022937490 824 D87002_cds2_at 0.022937490 3079 M71243_f_at 0.027533357 1467 J02902_at 0.027533357 4802 U59423_at 0.028686408 Sorry about the alignment, but you see here that I do get some differentially expressed genes, but not many. This is probably due in part to the fact that I am using all four sample types in this analysis, so the denominator in any contrast for me will be based on the sums of squares of error for all four samples, whereas you are using only two. This will increase the power to detect differentially expressed genes to some extent. However, note the probeset IDs, as well as the annotation of my eset object: > annotation(eset) [1] "hu6800" If I search for the top probeset ID from your topTable, it appears to be from a hgu133a or hgfocus array, not a hu6800 array (which is also known as a HuGeneFl array). So it looks like your expression data might not be from this experiment. Best, Jim > > Best Regards, > David Westergaard > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 12.7 years ago James W. MacDonald 67k

0

Entering edit mode

Voke AO ▴ 760

@voke-ao-4830

Last seen 10.3 years ago

Hi Jim, I digress a bit (sorry David). But I was looking at your code combining both getGEO and getGEOSuppFiles. The analysis you did I'm guessing is based on the raw files because you had to read in an affybatch object. I'm having a challenge with making my covdesc.txt file to work for me with read.affy(), so I'm wondering if your combination is a way to retrieve the phenotypic data without having to manually create the text file. In other words, my question is: does this combination of getGEO() and getGEOSuppFiles() make it possible to boycott the use of the manually created covdesc.txt file in read.affy()? Thanks. Avoks On Tue, Apr 10, 2012 at 5:14 PM, David Westergaard <david at="" harsk.dk=""> wrote: > Hello, > > I've been trying to use limma to identify genes from the following > data: http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-21340 - > It's a simple control vs. disease experiment > > > # SDRF downloaded from above page > SDRF <- read.table(file="E-GEOD-21340.sdrf.txt",header=TRUE,stringsA sFactors=FALSE,sep="\t") > > # Looking to compare Family-history negativer versus Diabetis > controls <- SDRF[grep("Control, Family History > Neg",SDRF$Comment..Sample_source_name.),] > disease <- SDRF[grep("^DM",SDRF$Characteristics.DiseaseState.),] > Batch <- rbind(controls,disease) > > # Read in CEL files > mixture.batch <- ReadAffy(filenames=Batch$Array.Data.File) > > # Preprocess data > mixture.processed <- expresso(mixture.batch, bgcorrect.method = "rma", > normalize.method = "quantiles", pmcorrect.method = "pmonly", > summary.method = "medianpolish") > > # Get data in matrix > signals <- exprs(mixture.prepared) > cl <- ifelse(colnames(signals) %in% disease$Array.Data.File,1,0) > > # Do design matrix and fit > design <- model.matrix(~factor(cl)) > fit <- lmFit(signals,design) > fit <- eBayes(fit) > topTable(fit2,coef=2) > > Which yields the following: > ? ? ? ? ? ? ? ID ?logFC AveExpr ? ? t P.Value adj.P.Val ? ? B > 7513 ? ?208004_at -0.323 5.43 -4.65 0.00191 ? ? 0.999 -3.10 > 11225 211829_s_at ?0.340 5.07 4.36 0.00278 ? ? 0.999 -3.17 > 5950 ? ?206424_at -0.907 ? ?6.65 -4.15 0.00363 ? ? 0.999 -3.23 > 1354 ?201826_s_at -0.447 8.37 -4.13 0.00374 ? ? 0.999 -3.24 > 19782 ? 220418_at ?0.392 5.43 4.02 0.00431 ? ? 0.999 -3.27 > 8889 ?209396_s_at ?1.899 ? ?7.47 ?4.01 0.00437 ? ? 0.999 -3.28 > 5005 ? ?205478_at -0.931 ? ?9.22 -3.94 0.00481 ? ? 0.999 -3.30 > 9469 ?209983_s_at ?0.412 5.72 3.92 0.00492 ? ? 0.999 -3.31 > 2936 ? ?203409_at ?0.506 ? ?6.93 ?3.87 0.00531 ? ? 0.999 -3.32 > 5054 ?205527_s_at ?0.331 6.80 3.84 0.00549 ? ? 0.999 -3.33 > > I'm abit puzzled over the adjusted P-values. Can it really be true > that ALL of the adjusted P-values are 0.999, or did I make a rookie > mistake somewhere? > > Best Regards, > David Westergaard > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.7 years ago Voke AO ▴ 760

0

Entering edit mode

Hi Avoks, On 4/11/2012 7:28 AM, Ovokeraye Achinike-Oduaran wrote: > Hi Jim, > > I digress a bit (sorry David). But I was looking at your code > combining both getGEO and getGEOSuppFiles. The analysis you did I'm > guessing is based on the raw files because you had to read in an > affybatch object. I'm having a challenge with making my covdesc.txt > file to work for me with read.affy(), so I'm wondering if your > combination is a way to retrieve the phenotypic data without having to > manually create the text file. In other words, my question is: does > this combination of getGEO() and getGEOSuppFiles() make it possible to > boycott the use of the manually created covdesc.txt file in > read.affy()? I assume that a covdesc.txt file is something you want to use for the phenoData slot of your ExpressionSet? If so, note that you don't have to explicitly create or use the phenoData slot; it is there in order to make your ExpressionSet self-descriptive for others. Unless you are planning to give your ExpressionSet to somebody else, I don't see a pressing reason to ever bother with creating and using the phenoData. You already know what is in there, and can easily create any design matrices, etc for further analyses. But to answer your question, I only used getGEO() in order to get the phenoData so I could easily create a design matrix without having to figure out which sample is which. Best, Jim > > Thanks. > > Avoks > > > On Tue, Apr 10, 2012 at 5:14 PM, David Westergaard<david at="" harsk.dk=""> wrote: >> Hello, >> >> I've been trying to use limma to identify genes from the following >> data: http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-21340 - >> It's a simple control vs. disease experiment >> >> >> # SDRF downloaded from above page >> SDRF<- read.table(file="E-GEOD-21340.sdrf.txt",header=TRUE,stringsA sFactors=FALSE,sep="\t") >> >> # Looking to compare Family-history negativer versus Diabetis >> controls<- SDRF[grep("Control, Family History >> Neg",SDRF$Comment..Sample_source_name.),] >> disease<- SDRF[grep("^DM",SDRF$Characteristics.DiseaseState.),] >> Batch<- rbind(controls,disease) >> >> # Read in CEL files >> mixture.batch<- ReadAffy(filenames=Batch$Array.Data.File) >> >> # Preprocess data >> mixture.processed<- expresso(mixture.batch, bgcorrect.method = "rma", >> normalize.method = "quantiles", pmcorrect.method = "pmonly", >> summary.method = "medianpolish") >> >> # Get data in matrix >> signals<- exprs(mixture.prepared) >> cl<- ifelse(colnames(signals) %in% disease$Array.Data.File,1,0) >> >> # Do design matrix and fit >> design<- model.matrix(~factor(cl)) >> fit<- lmFit(signals,design) >> fit<- eBayes(fit) >> topTable(fit2,coef=2) >> >> Which yields the following: >> ID logFC AveExpr t P.Value adj.P.Val B >> 7513 208004_at -0.323 5.43 -4.65 0.00191 0.999 -3.10 >> 11225 211829_s_at 0.340 5.07 4.36 0.00278 0.999 -3.17 >> 5950 206424_at -0.907 6.65 -4.15 0.00363 0.999 -3.23 >> 1354 201826_s_at -0.447 8.37 -4.13 0.00374 0.999 -3.24 >> 19782 220418_at 0.392 5.43 4.02 0.00431 0.999 -3.27 >> 8889 209396_s_at 1.899 7.47 4.01 0.00437 0.999 -3.28 >> 5005 205478_at -0.931 9.22 -3.94 0.00481 0.999 -3.30 >> 9469 209983_s_at 0.412 5.72 3.92 0.00492 0.999 -3.31 >> 2936 203409_at 0.506 6.93 3.87 0.00531 0.999 -3.32 >> 5054 205527_s_at 0.331 6.80 3.84 0.00549 0.999 -3.33 >> >> I'm abit puzzled over the adjusted P-values. Can it really be true >> that ALL of the adjusted P-values are 0.999, or did I make a rookie >> mistake somewhere? >> >> Best Regards, >> David Westergaard >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD REPLY • link 12.7 years ago James W. MacDonald 67k

0

Entering edit mode

Thanks Jim. -Avoks On Wed, Apr 11, 2012 at 2:56 PM, James W. MacDonald <jmacdon at="" uw.edu=""> wrote: > Hi Avoks, > > > On 4/11/2012 7:28 AM, Ovokeraye Achinike-Oduaran wrote: >> >> Hi Jim, >> >> I digress a bit (sorry David). But I was looking at your code >> combining both getGEO and getGEOSuppFiles. The analysis you did I'm >> guessing is based on the raw files because you had to read in an >> affybatch object. I'm having a challenge with making my covdesc.txt >> file to work for me with read.affy(), so I'm wondering if your >> combination is a way to retrieve the phenotypic data without having to >> manually create the text file. In other words, my question is: does >> this combination of getGEO() and getGEOSuppFiles() make it possible to >> boycott the use of the manually created covdesc.txt file in >> read.affy()? > > > I assume that a covdesc.txt file is something you want to use for the > phenoData slot of your ExpressionSet? If so, note that you don't have to > explicitly create or use the phenoData slot; it is there in order to make > your ExpressionSet self-descriptive for others. > > Unless you are planning to give your ExpressionSet to somebody else, I don't > see a pressing reason to ever bother with creating and using the phenoData. > You already know what is in there, and can easily create any design > matrices, etc for further analyses. > > But to answer your question, I only used getGEO() in order to get the > phenoData so I could easily create a design matrix without having to figure > out which sample is which. > > Best, > > Jim > > > >> >> Thanks. >> >> Avoks >> >> >> On Tue, Apr 10, 2012 at 5:14 PM, David Westergaard<david at="" harsk.dk=""> ?wrote: >>> >>> Hello, >>> >>> I've been trying to use limma to identify genes from the following >>> data: http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-21340 - >>> It's a simple control vs. disease experiment >>> >>> >>> # SDRF downloaded from above page >>> SDRF<- >>> read.table(file="E-GEOD-21340.sdrf.txt",header=TRUE,stringsAsFacto rs=FALSE,sep="\t") >>> >>> # Looking to compare Family-history negativer versus Diabetis >>> controls<- SDRF[grep("Control, Family History >>> Neg",SDRF$Comment..Sample_source_name.),] >>> disease<- SDRF[grep("^DM",SDRF$Characteristics.DiseaseState.),] >>> Batch<- rbind(controls,disease) >>> >>> # Read in CEL files >>> mixture.batch<- ReadAffy(filenames=Batch$Array.Data.File) >>> >>> # Preprocess data >>> mixture.processed<- expresso(mixture.batch, bgcorrect.method = "rma", >>> normalize.method = "quantiles", pmcorrect.method = "pmonly", >>> summary.method = "medianpolish") >>> >>> # Get data in matrix >>> signals<- exprs(mixture.prepared) >>> cl<- ifelse(colnames(signals) %in% disease$Array.Data.File,1,0) >>> >>> # Do design matrix and fit >>> design<- model.matrix(~factor(cl)) >>> fit<- lmFit(signals,design) >>> fit<- eBayes(fit) >>> topTable(fit2,coef=2) >>> >>> Which yields the following: >>> ? ? ? ? ? ? ? ID ?logFC AveExpr ? ? t P.Value adj.P.Val ? ? B >>> 7513 ? ?208004_at -0.323 5.43 -4.65 0.00191 ? ? 0.999 -3.10 >>> 11225 211829_s_at ?0.340 5.07 4.36 0.00278 ? ? 0.999 -3.17 >>> 5950 ? ?206424_at -0.907 ? ?6.65 -4.15 0.00363 ? ? 0.999 -3.23 >>> 1354 ?201826_s_at -0.447 8.37 -4.13 0.00374 ? ? 0.999 -3.24 >>> 19782 ? 220418_at ?0.392 5.43 4.02 0.00431 ? ? 0.999 -3.27 >>> 8889 ?209396_s_at ?1.899 ? ?7.47 ?4.01 0.00437 ? ? 0.999 -3.28 >>> 5005 ? ?205478_at -0.931 ? ?9.22 -3.94 0.00481 ? ? 0.999 -3.30 >>> 9469 ?209983_s_at ?0.412 5.72 3.92 0.00492 ? ? 0.999 -3.31 >>> 2936 ? ?203409_at ?0.506 ? ?6.93 ?3.87 0.00531 ? ? 0.999 -3.32 >>> 5054 ?205527_s_at ?0.331 6.80 3.84 0.00549 ? ? 0.999 -3.33 >>> >>> I'm abit puzzled over the adjusted P-values. Can it really be true >>> that ALL of the adjusted P-values are 0.999, or did I make a rookie >>> mistake somewhere? >>> >>> Best Regards, >>> David Westergaard >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 >

ADD REPLY • link 12.7 years ago Voke AO ▴ 760

Login before adding your answer.