Very low P-values in limma
3
0
Entering edit mode
@gordon-smyth
Last seen 9 hours ago
WEHI, Melbourne, Australia
Dear Paul, I'm not quite sure why you're so shocked about getting significant results. Usually people complain to me when they don't get significance :). limma is able to obtain much higher significance levels than a t-test because it (i) leverages information from the within-array replicates and (ii) borrows information across probes. These two approaches in combination increase the effective degrees of freedom dramatically. We would hardly go to so much time and trouble to develop new statistical methods if they didn't improve on ordinary t-tests. If I were you, I'd be looking at the sizes of the fold changes, and whether the results seem biologically sensible and agree with known information, and so on. Best wishes Gordon ---------- original message ----------- [BioC] Very low P-values in limma Paul Geeleher paulgeeleher at gmail.com Thu Oct 22 11:34:01 CEST 2009 The value of df.prior is 3.208318. Hopefully somebody can help me here because I'm still really at a loss on why these values are so low. Here's the script. I'm fairly sure it conforms to the documentation: library(affy) setwd('/home/paul/PhD_Stuff/miRNA_guilleme_project/HERArrays/') # create a color pallete for boxplots cols <- brewer.pal(12, "Set3") # This defines the column name of the mean Cy3 foreground intensites #Cy3 <- "F532 Median" # background corrected value Cy3 <- "F532 Median - B532" # This defines the column name of the mean Cy3 background intensites Cy3b <- "B532 Mean" # Read the targets file (see limmaUsersGuide for info on how to create this) targets <- readTargets("targets.csv") # read values from gpr file # because there is no red channel, read Rb & R to be the same values as G and Gb # this is a type of hack that allows the function to work RG <- read.maimages( targets$FileName, source="genepix",columns=list(R=Cy3,G=Cy3, Rb=Cy3b, Gb=Cy3b)) # remove the extraneous values red channel values RG$R <- NULL RG$Rb <- NULL # this line of code will remove any negative values which cause errors further on RG$G[RG$G<0] <- 0 # create a pData for the data just read # this indicates which population each array belongs to # a are "her-" and b are "her+" (almost certain) pData <- data.frame(population = c('a', 'a', 'a', 'a', 'b', 'b', 'b')) rownames(pData) <- RG$targets$FileName # create design matrix design <- model.matrix(~factor(pData$population)) # In my .gpr files all miRNAs contain the string "-miR-" or "-let-" in their name # so the grep function can be used to extract all of these, removing # all control signals and printing buffers etc. # You need to check your .gpr files to find which signals you should extract. miRs <- c(grep("-miR-", RG$genes$Name), grep("-let-", RG$genes$Name)) RG.final <- RG[miRs, ] # load vsn library, this contains the normalization functions library('vsn') # Do VSN normalization and output as vns object mat <- vsnMatrix(RG.final$G) # insert rownames into matrix with normalized data # this will mean that the gene names will appear in our final output rownames(mat at hx) <- RG.final$genes$Name # my .gpr files contain 4 "within-array replicates" of each miRNA. # We need to make full use of this information by calculating the duplicate correlation # in order to use the duplicateCorrelation() function below, # we need to make sure that the replicates of each gene appear in sequence in this matrix, # so we sort the normalized values so the replicate groups of 4 miRs appear in sequence mat at hx <- mat at hx[order(rownames(mat at hx)), ] # calculate duplicate correlation between the 4 replicates on each array corfit <- duplicateCorrelation(mat, design, ndups=4) # fit the linear model, including info on duplicates and correlation fit <- lmFit(mat, design, ndups=4, correlation=corfit$consensus) # calculate values using ebayes ebayes <- eBayes(fit) # output a list of top differnetially expressed genes... topTable(ebayes, coef = 2, adjust = "BH", n = 10) -Paul. On Thu, Oct 22, 2009 at 12:09 AM, Wei Shi <shi at="" wehi.edu.au=""> wrote: > Dear Paul: > > The low p-values you have got are not surprising to me. I have got even > lower FDR adjusted p-values than yours. This just means you got > significantly differentially expressed genes. But on the other hand, I did > see much higher adjusted p-values in some of my microarray analyses. In that > case, I will explore the data in more depth such as looking at batch effect > etc. > > Cheers, > Wei > > > Paul Geeleher wrote: > >> Hi folks, I'm analyzing microRNA data using limma and I'm wondering about >> the validity of the p-values I'm getting out. Its a simple 'Group A Vs >> Group >> B' experimental design. 4 arrays in one group, 3 in the other and 4 >> duplicate spots for each miRNA on each array. >> >> The lowest adjusted p-values in the differential expression analysis are >> in >> the region of 10^-7. >> >> Its been pointed out to me that plugging the point values from each sample >> into a regular t-test you get p=0.008, which then also needs to take the >> multiple test hit. Can anybody explain why limma is giving me such lower >> values and if they are valid? >> >> I can provide more information if required. >> >> Thanks, >> >> Paul. >> >> >> > -- Paul Geeleher School of Mathematics, Statistics and Applied Mathematics National University of Ireland Galway Ireland
miRNA Microarray Normalization GO vsn limma microRNA miRNA Microarray Normalization GO • 3.0k views
ADD COMMENT
0
Entering edit mode
Paul Geeleher ★ 1.3k
@paul-geeleher-2679
Last seen 9.6 years ago
Yes some of the fold changes are quite large and seeing as my code appears to be correct it looks like these results are right. I guess it comes as a surprise to some people here as the same dataset was analysed using different software which showed no significance. Well done Limma! Paul. On Fri, Oct 23, 2009 at 3:45 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > Dear Paul, > > I'm not quite sure why you're so shocked about getting significant results. > ?Usually people complain to me when they don't get significance :). > > limma is able to obtain much higher significance levels than a t-test > because it (i) leverages information from the within-array replicates and > (ii) borrows information across probes. ?These two approaches in combination > increase the effective degrees of freedom dramatically. > > We would hardly go to so much time and trouble to develop new statistical > methods if they didn't improve on ordinary t-tests. > > If I were you, I'd be looking at the sizes of the fold changes, and whether > the results seem biologically sensible and agree with known information, and > so on. > > Best wishes > Gordon > > > ---------- original message ----------- > [BioC] Very low P-values in limma > Paul Geeleher paulgeeleher at gmail.com > Thu Oct 22 11:34:01 CEST 2009 > > The value of df.prior is 3.208318. Hopefully somebody can help me here > because I'm still really at a loss on why these values are so low. > > Here's the script. I'm fairly sure it conforms to the documentation: > > library(affy) > > setwd('/home/paul/PhD_Stuff/miRNA_guilleme_project/HERArrays/') > > # create a color pallete for boxplots > cols <- brewer.pal(12, "Set3") > > # This defines the column name of the mean Cy3 foreground intensites > #Cy3 <- "F532 Median" > # background corrected value > Cy3 <- "F532 Median - B532" > > # This defines the column name of the mean Cy3 background intensites > Cy3b <- "B532 Mean" > > # Read the targets file (see limmaUsersGuide for info on how to create this) > targets <- readTargets("targets.csv") > > # read values from gpr file > # because there is no red channel, read Rb & R to be the same values as G > and Gb > # this is a type of hack that allows the function to work > RG <- read.maimages( targets$FileName, > source="genepix",columns=list(R=Cy3,G=Cy3, Rb=Cy3b, Gb=Cy3b)) > > # remove the extraneous values red channel values > RG$R <- NULL > RG$Rb <- NULL > > # this line of code will remove any negative values which cause errors > further on > RG$G[RG$G<0] <- 0 > > # create a pData for the data just read > # this indicates which population each array belongs to > # a are "her-" and b are "her+" (almost certain) > pData <- data.frame(population = c('a', 'a', 'a', 'a', 'b', 'b', 'b')) > rownames(pData) <- ?RG$targets$FileName > > # create design matrix > design <- model.matrix(~factor(pData$population)) > > # In my .gpr files all miRNAs contain the string "-miR-" or "-let-" in their > name > # so the grep function can be used to extract all of these, removing > # all control signals and printing buffers etc. > # You need to check your .gpr files to find which signals you should > extract. > miRs <- c(grep("-miR-", RG$genes$Name), grep("-let-", RG$genes$Name)) > RG.final <- RG[miRs, ] > > # load vsn library, this contains the normalization functions > library('vsn') > > # Do VSN normalization and output as vns object > mat <- vsnMatrix(RG.final$G) > > # insert rownames into matrix with normalized data > # this will mean that the gene names will appear in our final output > rownames(mat at hx) <- RG.final$genes$Name > > # my .gpr files contain 4 "within-array replicates" of each miRNA. > # We need to make full use of this information by calculating the duplicate > correlation > # in order to use the duplicateCorrelation() function below, > # we need to make sure that the replicates of each gene appear in sequence > in this matrix, > # so we sort the normalized values so the replicate groups of 4 miRs appear > in sequence > mat at hx <- mat at hx[order(rownames(mat at hx)), ] > > # calculate duplicate correlation between the 4 replicates on each array > corfit <- duplicateCorrelation(mat, design, ndups=4) > > # fit the linear model, including info on duplicates and correlation > fit <- lmFit(mat, design, ndups=4, correlation=corfit$consensus) > > # calculate values using ebayes > ebayes <- eBayes(fit) > > # output a list of top differnetially expressed genes... > topTable(ebayes, coef = 2, adjust = "BH", n = 10) > > > -Paul. > > > On Thu, Oct 22, 2009 at 12:09 AM, Wei Shi <shi at="" wehi.edu.au=""> wrote: > >> Dear Paul: >> >> ?The low p-values you have got are not surprising to me. I have got > > even >> >> lower FDR adjusted p-values than yours. This just means you got >> significantly differentially expressed genes. But on the other hand, I > > did >> >> see much higher adjusted p-values in some of my microarray analyses. In > > that >> >> case, I will explore the data in more depth such as looking at batch > > effect >> >> etc. >> >> Cheers, >> Wei >> >> >> Paul Geeleher wrote: >> >>> Hi folks, I'm analyzing microRNA data using limma and I'm wondering > > about >>> >>> the validity of the p-values I'm getting out. Its a simple 'Group A Vs >>> Group >>> B' experimental design. 4 arrays in one group, 3 in the other and 4 >>> duplicate spots for each miRNA on each array. >>> >>> The lowest adjusted p-values in the differential expression analysis > > are >>> >>> in >>> the region of 10^-7. >>> >>> Its been pointed out to me that plugging the point values from each > > sample >>> >>> into a regular t-test you get p=0.008, which then also needs to take > > the >>> >>> multiple test hit. Can anybody explain why limma is giving me such > > lower >>> >>> values and if they are valid? >>> >>> I can provide more information if required. >>> >>> Thanks, >>> >>> Paul. >>> >>> >>> >> > > > -- > Paul Geeleher > School of Mathematics, Statistics and Applied Mathematics > National University of Ireland > Galway > Ireland > > -- Paul Geeleher School of Mathematics, Statistics and Applied Mathematics National University of Ireland Galway Ireland
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 9 hours ago
WEHI, Melbourne, Australia
> [BioC] Very low P-values in limma > Thomas Hampton Thomas.H.Hampton at Dartmouth.edu > Wed Oct 21 22:04:57 CEST 2009 > > I too find it very odd that a multiple hypothesis correction > apparently reduces P values. It would be odd if it did, but it doesn't. > One possibility is that > the duplicate spot weighting in limma is somehow doing a really > superior job compared to a simple t test > that might view duplicates on different arrays as the same thing as > duplicates on the same array. It doesn't. See the documentation. Best wishes Gordon > Tom
ADD COMMENT
0
Entering edit mode
Very often, when things look to good to be true, it is because they aren't. This thread is all just entertainment for me, but I was under the impression that limma produced corrected P values (from reading the manual): "Limma provides functions topTable() and decideTests() which summarize the results of the linear model, perform hypothesis tests and adjust the p-values for multiple testing. Results include (log) fold changes, standard errors, t-statistics and p- values. The basic statistic used for signi cance analysis is the moderated t-statistic, which is computed for each probe and for each contrast." And this being the case, it is quite reasonable to expect a t test on a single point to yield a much more attractive P value than any mechanism that corrects for multiple hypothesis testing. T On Oct 22, 2009, at 10:51 PM, Gordon K Smyth wrote: >> [BioC] Very low P-values in limma >> Thomas Hampton Thomas.H.Hampton at Dartmouth.edu >> Wed Oct 21 22:04:57 CEST 2009 >> >> I too find it very odd that a multiple hypothesis correction >> apparently reduces P values. > > It would be odd if it did, but it doesn't. > >> One possibility is that >> the duplicate spot weighting in limma is somehow doing a really >> superior job compared to a simple t test >> that might view duplicates on different arrays as the same thing as >> duplicates on the same array. > > It doesn't. See the documentation. > > Best wishes > Gordon > >> Tom > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Paul Geeleher ★ 1.3k
@paul-geeleher-2679
Last seen 9.6 years ago
Dear list, I have a query, I've rerun my analysis by averaging out the within array duplicates using the avedups() function instead of duplicateCorrelation(). aveMat <- avedups(mat at hx, ndups=4, spacing=1, weights=NULL) fit <- lmFit(aveMat, design) ebayes <- eBayes(fit) topTable(ebayes, coef = 2, adjust = "BH", n = 10) When I do this my most significant p-values drop from 10^-7 to .06! This seems dubious? It seems like to get values as low as 10^-7 the duplicateCorrelation() function must be treating the within array replicates in a similar way it would separate samples, this seems suspicious to me... -Paul. On Fri, Oct 23, 2009 at 3:45 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > Dear Paul, > > I'm not quite sure why you're so shocked about getting significant results. > ?Usually people complain to me when they don't get significance :). > > limma is able to obtain much higher significance levels than a t-test > because it (i) leverages information from the within-array replicates and > (ii) borrows information across probes. ?These two approaches in combination > increase the effective degrees of freedom dramatically. > > We would hardly go to so much time and trouble to develop new statistical > methods if they didn't improve on ordinary t-tests. > > If I were you, I'd be looking at the sizes of the fold changes, and whether > the results seem biologically sensible and agree with known information, and > so on. > > Best wishes > Gordon > > > ---------- original message ----------- > [BioC] Very low P-values in limma > Paul Geeleher paulgeeleher at gmail.com > Thu Oct 22 11:34:01 CEST 2009 > > The value of df.prior is 3.208318. Hopefully somebody can help me here > because I'm still really at a loss on why these values are so low. > > Here's the script. I'm fairly sure it conforms to the documentation: > > library(affy) > > setwd('/home/paul/PhD_Stuff/miRNA_guilleme_project/HERArrays/') > > # create a color pallete for boxplots > cols <- brewer.pal(12, "Set3") > > # This defines the column name of the mean Cy3 foreground intensites > #Cy3 <- "F532 Median" > # background corrected value > Cy3 <- "F532 Median - B532" > > # This defines the column name of the mean Cy3 background intensites > Cy3b <- "B532 Mean" > > # Read the targets file (see limmaUsersGuide for info on how to create this) > targets <- readTargets("targets.csv") > > # read values from gpr file > # because there is no red channel, read Rb & R to be the same values as G > and Gb > # this is a type of hack that allows the function to work > RG <- read.maimages( targets$FileName, > source="genepix",columns=list(R=Cy3,G=Cy3, Rb=Cy3b, Gb=Cy3b)) > > # remove the extraneous values red channel values > RG$R <- NULL > RG$Rb <- NULL > > # this line of code will remove any negative values which cause errors > further on > RG$G[RG$G<0] <- 0 > > # create a pData for the data just read > # this indicates which population each array belongs to > # a are "her-" and b are "her+" (almost certain) > pData <- data.frame(population = c('a', 'a', 'a', 'a', 'b', 'b', 'b')) > rownames(pData) <- ?RG$targets$FileName > > # create design matrix > design <- model.matrix(~factor(pData$population)) > > # In my .gpr files all miRNAs contain the string "-miR-" or "-let-" in their > name > # so the grep function can be used to extract all of these, removing > # all control signals and printing buffers etc. > # You need to check your .gpr files to find which signals you should > extract. > miRs <- c(grep("-miR-", RG$genes$Name), grep("-let-", RG$genes$Name)) > RG.final <- RG[miRs, ] > > # load vsn library, this contains the normalization functions > library('vsn') > > # Do VSN normalization and output as vns object > mat <- vsnMatrix(RG.final$G) > > # insert rownames into matrix with normalized data > # this will mean that the gene names will appear in our final output > rownames(mat at hx) <- RG.final$genes$Name > > # my .gpr files contain 4 "within-array replicates" of each miRNA. > # We need to make full use of this information by calculating the duplicate > correlation > # in order to use the duplicateCorrelation() function below, > # we need to make sure that the replicates of each gene appear in sequence > in this matrix, > # so we sort the normalized values so the replicate groups of 4 miRs appear > in sequence > mat at hx <- mat at hx[order(rownames(mat at hx)), ] > > # calculate duplicate correlation between the 4 replicates on each array > corfit <- duplicateCorrelation(mat, design, ndups=4) > > # fit the linear model, including info on duplicates and correlation > fit <- lmFit(mat, design, ndups=4, correlation=corfit$consensus) > > # calculate values using ebayes > ebayes <- eBayes(fit) > > # output a list of top differnetially expressed genes... > topTable(ebayes, coef = 2, adjust = "BH", n = 10) > > > -Paul. > > > On Thu, Oct 22, 2009 at 12:09 AM, Wei Shi <shi at="" wehi.edu.au=""> wrote: > >> Dear Paul: >> >> ?The low p-values you have got are not surprising to me. I have got > > even >> >> lower FDR adjusted p-values than yours. This just means you got >> significantly differentially expressed genes. But on the other hand, I > > did >> >> see much higher adjusted p-values in some of my microarray analyses. In > > that >> >> case, I will explore the data in more depth such as looking at batch > > effect >> >> etc. >> >> Cheers, >> Wei >> >> >> Paul Geeleher wrote: >> >>> Hi folks, I'm analyzing microRNA data using limma and I'm wondering > > about >>> >>> the validity of the p-values I'm getting out. Its a simple 'Group A Vs >>> Group >>> B' experimental design. 4 arrays in one group, 3 in the other and 4 >>> duplicate spots for each miRNA on each array. >>> >>> The lowest adjusted p-values in the differential expression analysis > > are >>> >>> in >>> the region of 10^-7. >>> >>> Its been pointed out to me that plugging the point values from each > > sample >>> >>> into a regular t-test you get p=0.008, which then also needs to take > > the >>> >>> multiple test hit. Can anybody explain why limma is giving me such > > lower >>> >>> values and if they are valid? >>> >>> I can provide more information if required. >>> >>> Thanks, >>> >>> Paul. >>> >>> >>> >> > > > -- > Paul Geeleher > School of Mathematics, Statistics and Applied Mathematics > National University of Ireland > Galway > Ireland > > -- Paul Geeleher School of Mathematics, Statistics and Applied Mathematics National University of Ireland Galway Ireland
ADD COMMENT
0
Entering edit mode
Dear Paul! What is the corfit$consensus value? If it is close to 1, using limma with duplicateCorrelation or averaging (or indeed just using any of the duplicates) should give you very similar results. If however corfit$consensus is close to 0, limma would indeed treat the duplicates similar to independent replicates. Obviously a corfit$consensus value close to 0 would be quite worrying as that would indicate a very high technical variability (or some subtle mistake in your code). Anyway, checking that value might help to solve the puzzle. Claus > -----Original Message----- > From: bioconductor-bounces at stat.math.ethz.ch [mailto:bioconductor- > bounces at stat.math.ethz.ch] On Behalf Of Paul Geeleher > Sent: 23 October 2009 15:13 > To: Gordon K Smyth > Cc: Bioconductor mailing list > Subject: Re: [BioC] Very low P-values in limma > > Dear list, > > I have a query, I've rerun my analysis by averaging out the within > array duplicates using the avedups() function instead of > duplicateCorrelation(). > > aveMat <- avedups(mat at hx, ndups=4, spacing=1, weights=NULL) > fit <- lmFit(aveMat, design) > ebayes <- eBayes(fit) > topTable(ebayes, coef = 2, adjust = "BH", n = 10) > > When I do this my most significant p-values drop from 10^-7 to .06! > > This seems dubious? It seems like to get values as low as 10^-7 the > duplicateCorrelation() function must be treating the within array > replicates in a similar way it would separate samples, this seems > suspicious to me... > > -Paul. > > > > On Fri, Oct 23, 2009 at 3:45 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > > Dear Paul, > > > > I'm not quite sure why you're so shocked about getting significant > results. > > Usually people complain to me when they don't get significance :). > > > > limma is able to obtain much higher significance levels than a t-test > > because it (i) leverages information from the within-array replicates > and > > (ii) borrows information across probes. These two approaches in > combination > > increase the effective degrees of freedom dramatically. > > > > We would hardly go to so much time and trouble to develop new > statistical > > methods if they didn't improve on ordinary t-tests. > > > > If I were you, I'd be looking at the sizes of the fold changes, and > whether > > the results seem biologically sensible and agree with known information, > and > > so on. > > > > Best wishes > > Gordon > > > > > > ---------- original message ----------- > > [BioC] Very low P-values in limma > > Paul Geeleher paulgeeleher at gmail.com > > Thu Oct 22 11:34:01 CEST 2009 > > > > The value of df.prior is 3.208318. Hopefully somebody can help me here > > because I'm still really at a loss on why these values are so low. > > > > Here's the script. I'm fairly sure it conforms to the documentation: > > > > library(affy) > > > > setwd('/home/paul/PhD_Stuff/miRNA_guilleme_project/HERArrays/') > > > > # create a color pallete for boxplots > > cols <- brewer.pal(12, "Set3") > > > > # This defines the column name of the mean Cy3 foreground intensites > > #Cy3 <- "F532 Median" > > # background corrected value > > Cy3 <- "F532 Median - B532" > > > > # This defines the column name of the mean Cy3 background intensites > > Cy3b <- "B532 Mean" > > > > # Read the targets file (see limmaUsersGuide for info on how to create > this) > > targets <- readTargets("targets.csv") > > > > # read values from gpr file > > # because there is no red channel, read Rb & R to be the same values as > G > > and Gb > > # this is a type of hack that allows the function to work > > RG <- read.maimages( targets$FileName, > > source="genepix",columns=list(R=Cy3,G=Cy3, Rb=Cy3b, Gb=Cy3b)) > > > > # remove the extraneous values red channel values > > RG$R <- NULL > > RG$Rb <- NULL > > > > # this line of code will remove any negative values which cause errors > > further on > > RG$G[RG$G<0] <- 0 > > > > # create a pData for the data just read > > # this indicates which population each array belongs to > > # a are "her-" and b are "her+" (almost certain) > > pData <- data.frame(population = c('a', 'a', 'a', 'a', 'b', 'b', 'b')) > > rownames(pData) <- RG$targets$FileName > > > > # create design matrix > > design <- model.matrix(~factor(pData$population)) > > > > # In my .gpr files all miRNAs contain the string "-miR-" or "-let-" in > their > > name > > # so the grep function can be used to extract all of these, removing > > # all control signals and printing buffers etc. > > # You need to check your .gpr files to find which signals you should > > extract. > > miRs <- c(grep("-miR-", RG$genes$Name), grep("-let-", RG$genes$Name)) > > RG.final <- RG[miRs, ] > > > > # load vsn library, this contains the normalization functions > > library('vsn') > > > > # Do VSN normalization and output as vns object > > mat <- vsnMatrix(RG.final$G) > > > > # insert rownames into matrix with normalized data > > # this will mean that the gene names will appear in our final output > > rownames(mat at hx) <- RG.final$genes$Name > > > > # my .gpr files contain 4 "within-array replicates" of each miRNA. > > # We need to make full use of this information by calculating the > duplicate > > correlation > > # in order to use the duplicateCorrelation() function below, > > # we need to make sure that the replicates of each gene appear in > sequence > > in this matrix, > > # so we sort the normalized values so the replicate groups of 4 miRs > appear > > in sequence > > mat at hx <- mat at hx[order(rownames(mat at hx)), ] > > > > # calculate duplicate correlation between the 4 replicates on each array > > corfit <- duplicateCorrelation(mat, design, ndups=4) > > > > # fit the linear model, including info on duplicates and correlation > > fit <- lmFit(mat, design, ndups=4, correlation=corfit$consensus) > > > > # calculate values using ebayes > > ebayes <- eBayes(fit) > > > > # output a list of top differnetially expressed genes... > > topTable(ebayes, coef = 2, adjust = "BH", n = 10) > > > > > > -Paul. > > > > > > On Thu, Oct 22, 2009 at 12:09 AM, Wei Shi <shi at="" wehi.edu.au=""> wrote: > > > >> Dear Paul: > >> > >> The low p-values you have got are not surprising to me. I have got > > > > even > >> > >> lower FDR adjusted p-values than yours. This just means you got > >> significantly differentially expressed genes. But on the other hand, I > > > > did > >> > >> see much higher adjusted p-values in some of my microarray analyses. In > > > > that > >> > >> case, I will explore the data in more depth such as looking at batch > > > > effect > >> > >> etc. > >> > >> Cheers, > >> Wei > >> > >> > >> Paul Geeleher wrote: > >> > >>> Hi folks, I'm analyzing microRNA data using limma and I'm wondering > > > > about > >>> > >>> the validity of the p-values I'm getting out. Its a simple 'Group A Vs > >>> Group > >>> B' experimental design. 4 arrays in one group, 3 in the other and 4 > >>> duplicate spots for each miRNA on each array. > >>> > >>> The lowest adjusted p-values in the differential expression analysis > > > > are > >>> > >>> in > >>> the region of 10^-7. > >>> > >>> Its been pointed out to me that plugging the point values from each > > > > sample > >>> > >>> into a regular t-test you get p=0.008, which then also needs to take > > > > the > >>> > >>> multiple test hit. Can anybody explain why limma is giving me such > > > > lower > >>> > >>> values and if they are valid? > >>> > >>> I can provide more information if required. > >>> > >>> Thanks, > >>> > >>> Paul. > >>> > >>> > >>> > >> > > > > > > -- > > Paul Geeleher > > School of Mathematics, Statistics and Applied Mathematics > > National University of Ireland > > Galway > > Ireland > > > > > > > > -- > Paul Geeleher > School of Mathematics, Statistics and Applied Mathematics > National University of Ireland > Galway > Ireland > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor The University of Aberdeen is a charity registered in Scotland, No SC013683.
ADD REPLY
0
Entering edit mode
Hi Claus, corfit$consensus = .81 which I guess is pretty close to 1. The only differences in the code are that with the duplicateCorrelation method I finish with: corfit <- duplicateCorrelation(mat, design, ndups=4) fit <- lmFit(mat at hx, design, ndups=4, correlation=corfit$consensus) ebayes <- eBayes(fit) topTable(ebayes, coef = 2, adjust = "BH", n = 10) The huge discrepancy in p-values is strange indeed. Paul. On Fri, Oct 23, 2009 at 4:25 PM, Mayer, Claus-Dieter <c.mayer at="" abdn.ac.uk=""> wrote: > Dear Paul! > > What is the corfit$consensus value? If it is close to 1, using limma with duplicateCorrelation or averaging (or indeed just using any of the duplicates) should give you very similar results. > If however corfit$consensus is close to 0, limma would indeed treat the duplicates similar to independent replicates. Obviously a corfit$consensus value close to 0 would be quite worrying as that would indicate a very high technical variability (or some subtle mistake in your code). Anyway, checking that value might help to solve the puzzle. > > Claus > >> -----Original Message----- >> From: bioconductor-bounces at stat.math.ethz.ch [mailto:bioconductor- >> bounces at stat.math.ethz.ch] On Behalf Of Paul Geeleher >> Sent: 23 October 2009 15:13 >> To: Gordon K Smyth >> Cc: Bioconductor mailing list >> Subject: Re: [BioC] Very low P-values in limma >> >> Dear list, >> >> I have a query, I've rerun my analysis by averaging out the within >> array duplicates using the avedups() function instead of >> duplicateCorrelation(). >> >> aveMat <- avedups(mat at hx, ndups=4, spacing=1, weights=NULL) >> fit <- lmFit(aveMat, design) >> ebayes <- eBayes(fit) >> topTable(ebayes, coef = 2, adjust = "BH", n = 10) >> >> When I do this my most significant p-values drop from 10^-7 to .06! >> >> This seems dubious? It seems like to get values as low as 10^-7 the >> duplicateCorrelation() function must be treating the within array >> replicates in a similar way it would separate samples, this seems >> suspicious to me... >> >> -Paul. >> >> >> >> On Fri, Oct 23, 2009 at 3:45 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: >> > Dear Paul, >> > >> > I'm not quite sure why you're so shocked about getting significant >> results. >> > ?Usually people complain to me when they don't get significance :). >> > >> > limma is able to obtain much higher significance levels than a t-test >> > because it (i) leverages information from the within-array replicates >> and >> > (ii) borrows information across probes. ?These two approaches in >> combination >> > increase the effective degrees of freedom dramatically. >> > >> > We would hardly go to so much time and trouble to develop new >> statistical >> > methods if they didn't improve on ordinary t-tests. >> > >> > If I were you, I'd be looking at the sizes of the fold changes, and >> whether >> > the results seem biologically sensible and agree with known information, >> and >> > so on. >> > >> > Best wishes >> > Gordon >> > >> > >> > ---------- original message ----------- >> > [BioC] Very low P-values in limma >> > Paul Geeleher paulgeeleher at gmail.com >> > Thu Oct 22 11:34:01 CEST 2009 >> > >> > The value of df.prior is 3.208318. Hopefully somebody can help me here >> > because I'm still really at a loss on why these values are so low. >> > >> > Here's the script. I'm fairly sure it conforms to the documentation: >> > >> > library(affy) >> > >> > setwd('/home/paul/PhD_Stuff/miRNA_guilleme_project/HERArrays/') >> > >> > # create a color pallete for boxplots >> > cols <- brewer.pal(12, "Set3") >> > >> > # This defines the column name of the mean Cy3 foreground intensites >> > #Cy3 <- "F532 Median" >> > # background corrected value >> > Cy3 <- "F532 Median - B532" >> > >> > # This defines the column name of the mean Cy3 background intensites >> > Cy3b <- "B532 Mean" >> > >> > # Read the targets file (see limmaUsersGuide for info on how to create >> this) >> > targets <- readTargets("targets.csv") >> > >> > # read values from gpr file >> > # because there is no red channel, read Rb & R to be the same values as >> G >> > and Gb >> > # this is a type of hack that allows the function to work >> > RG <- read.maimages( targets$FileName, >> > source="genepix",columns=list(R=Cy3,G=Cy3, Rb=Cy3b, Gb=Cy3b)) >> > >> > # remove the extraneous values red channel values >> > RG$R <- NULL >> > RG$Rb <- NULL >> > >> > # this line of code will remove any negative values which cause errors >> > further on >> > RG$G[RG$G<0] <- 0 >> > >> > # create a pData for the data just read >> > # this indicates which population each array belongs to >> > # a are "her-" and b are "her+" (almost certain) >> > pData <- data.frame(population = c('a', 'a', 'a', 'a', 'b', 'b', 'b')) >> > rownames(pData) <- ?RG$targets$FileName >> > >> > # create design matrix >> > design <- model.matrix(~factor(pData$population)) >> > >> > # In my .gpr files all miRNAs contain the string "-miR-" or "-let-" in >> their >> > name >> > # so the grep function can be used to extract all of these, removing >> > # all control signals and printing buffers etc. >> > # You need to check your .gpr files to find which signals you should >> > extract. >> > miRs <- c(grep("-miR-", RG$genes$Name), grep("-let-", RG$genes$Name)) >> > RG.final <- RG[miRs, ] >> > >> > # load vsn library, this contains the normalization functions >> > library('vsn') >> > >> > # Do VSN normalization and output as vns object >> > mat <- vsnMatrix(RG.final$G) >> > >> > # insert rownames into matrix with normalized data >> > # this will mean that the gene names will appear in our final output >> > rownames(mat at hx) <- RG.final$genes$Name >> > >> > # my .gpr files contain 4 "within-array replicates" of each miRNA. >> > # We need to make full use of this information by calculating the >> duplicate >> > correlation >> > # in order to use the duplicateCorrelation() function below, >> > # we need to make sure that the replicates of each gene appear in >> sequence >> > in this matrix, >> > # so we sort the normalized values so the replicate groups of 4 miRs >> appear >> > in sequence >> > mat at hx <- mat at hx[order(rownames(mat at hx)), ] >> > >> > # calculate duplicate correlation between the 4 replicates on each array >> > corfit <- duplicateCorrelation(mat, design, ndups=4) >> > >> > # fit the linear model, including info on duplicates and correlation >> > fit <- lmFit(mat, design, ndups=4, correlation=corfit$consensus) >> > >> > # calculate values using ebayes >> > ebayes <- eBayes(fit) >> > >> > # output a list of top differnetially expressed genes... >> > topTable(ebayes, coef = 2, adjust = "BH", n = 10) >> > >> > >> > -Paul. >> > >> > >> > On Thu, Oct 22, 2009 at 12:09 AM, Wei Shi <shi at="" wehi.edu.au=""> wrote: >> > >> >> Dear Paul: >> >> >> >> ?The low p-values you have got are not surprising to me. I have got >> > >> > even >> >> >> >> lower FDR adjusted p-values than yours. This just means you got >> >> significantly differentially expressed genes. But on the other hand, I >> > >> > did >> >> >> >> see much higher adjusted p-values in some of my microarray analyses. In >> > >> > that >> >> >> >> case, I will explore the data in more depth such as looking at batch >> > >> > effect >> >> >> >> etc. >> >> >> >> Cheers, >> >> Wei >> >> >> >> >> >> Paul Geeleher wrote: >> >> >> >>> Hi folks, I'm analyzing microRNA data using limma and I'm wondering >> > >> > about >> >>> >> >>> the validity of the p-values I'm getting out. Its a simple 'Group A Vs >> >>> Group >> >>> B' experimental design. 4 arrays in one group, 3 in the other and 4 >> >>> duplicate spots for each miRNA on each array. >> >>> >> >>> The lowest adjusted p-values in the differential expression analysis >> > >> > are >> >>> >> >>> in >> >>> the region of 10^-7. >> >>> >> >>> Its been pointed out to me that plugging the point values from each >> > >> > sample >> >>> >> >>> into a regular t-test you get p=0.008, which then also needs to take >> > >> > the >> >>> >> >>> multiple test hit. Can anybody explain why limma is giving me such >> > >> > lower >> >>> >> >>> values and if they are valid? >> >>> >> >>> I can provide more information if required. >> >>> >> >>> Thanks, >> >>> >> >>> Paul. >> >>> >> >>> >> >>> >> >> >> > >> > >> > -- >> > Paul Geeleher >> > School of Mathematics, Statistics and Applied Mathematics >> > National University of Ireland >> > Galway >> > Ireland >> > >> > >> >> >> >> -- >> Paul Geeleher >> School of Mathematics, Statistics and Applied Mathematics >> National University of Ireland >> Galway >> Ireland >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > The University of Aberdeen is a charity registered in Scotland, No SC013683. > -- Paul Geeleher School of Mathematics, Statistics and Applied Mathematics National University of Ireland Galway Ireland
ADD REPLY
0
Entering edit mode
Dear Paul, Give your consensus correlation value, limma is treating your within- array replicates as worth about 1/3 as much as replicates on independent arrays (because 1-0.81^2 is about 1/3). This is how it is designed to work. There are some caveats. Firstly, the limma duplicateCorrelation method assumes replicates to be equally spaced, and yours are not. You've actually re-ordered your data to fit it into limma. So the within- array correlation will be underestimated, and significance over-estimated, for transcripts for which the replicates are unusually close on the array. Secondly, the within-array correlation is assumed to be the same for all transcripts, which is never actually true. The approximation has proved worthwhile when ndups=2 or 3, but it will yield over-optimistic results when the number of within-replicates is large. In your case, you get promising results (FDR=0.06), even if you average over the within-array replicates. So there seems something genuine in your data which is not just an artefact of the within-array analysis. Given data like yours, I might personally use the within-array replicate analysis, because it takes into account within-array as well as between-array variation and usually gives a good ranking of transcripts, but would treat the p-values as somewhat optimistic. However, microarray analysis should never be just a plug-in analysis. You need to plot your data in various ways to check quality and to check assumptions. You should be doing some MA and MDS plots of your data pre model fitting, then an MA-plot of the estimated fold changes, and perhaps a sigma vs Amean plot as well (plotAS). We list members can't tell you whether your results are reliable or not just from the p-values or the number of DE genes. To me, your results are not impossible, but some exploratory data analysis is needed. Hope this helps Best wishes Gordon On Fri, 23 Oct 2009, Paul Geeleher wrote: > Hi Claus, > > corfit$consensus = .81 which I guess is pretty close to 1. > > The only differences in the code are that with the > duplicateCorrelation method I finish with: > > corfit <- duplicateCorrelation(mat, design, ndups=4) > fit <- lmFit(mat at hx, design, ndups=4, correlation=corfit$consensus) > ebayes <- eBayes(fit) > topTable(ebayes, coef = 2, adjust = "BH", n = 10) > > The huge discrepancy in p-values is strange indeed. > > Paul. > > On Fri, Oct 23, 2009 at 4:25 PM, Mayer, Claus-Dieter <c.mayer at="" abdn.ac.uk=""> wrote: >> Dear Paul! >> >> What is the corfit$consensus value? If it is close to 1, using limma >> with duplicateCorrelation or averaging (or indeed just using any of the >> duplicates) should give you very similar results. If however >> corfit$consensus is close to 0, limma would indeed treat the duplicates >> similar to independent replicates. Obviously a corfit$consensus value >> close to 0 would be quite worrying as that would indicate a very high >> technical variability (or some subtle mistake in your code). Anyway, >> checking that value might help to solve the puzzle. >> >> Claus >> >>> -----Original Message----- >>> From: bioconductor-bounces at stat.math.ethz.ch [mailto:bioconductor- >>> bounces at stat.math.ethz.ch] On Behalf Of Paul Geeleher >>> Sent: 23 October 2009 15:13 >>> To: Gordon K Smyth >>> Cc: Bioconductor mailing list >>> Subject: Re: [BioC] Very low P-values in limma >>> >>> Dear list, >>> >>> I have a query, I've rerun my analysis by averaging out the within >>> array duplicates using the avedups() function instead of >>> duplicateCorrelation(). >>> >>> aveMat <- avedups(mat at hx, ndups=4, spacing=1, weights=NULL) >>> fit <- lmFit(aveMat, design) >>> ebayes <- eBayes(fit) >>> topTable(ebayes, coef = 2, adjust = "BH", n = 10) >>> >>> When I do this my most significant p-values drop from 10^-7 to .06! >>> >>> This seems dubious? It seems like to get values as low as 10^-7 the >>> duplicateCorrelation() function must be treating the within array >>> replicates in a similar way it would separate samples, this seems >>> suspicious to me... >>> >>> -Paul. >>> >>> >>> >>> On Fri, Oct 23, 2009 at 3:45 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: >>>> Dear Paul, >>>> >>>> I'm not quite sure why you're so shocked about getting significant >>> results. >>>> ?Usually people complain to me when they don't get significance :). >>>> >>>> limma is able to obtain much higher significance levels than a t-test >>>> because it (i) leverages information from the within-array replicates >>> and >>>> (ii) borrows information across probes. ?These two approaches in >>> combination >>>> increase the effective degrees of freedom dramatically. >>>> >>>> We would hardly go to so much time and trouble to develop new >>> statistical >>>> methods if they didn't improve on ordinary t-tests. >>>> >>>> If I were you, I'd be looking at the sizes of the fold changes, and >>> whether >>>> the results seem biologically sensible and agree with known information, >>> and >>>> so on. >>>> >>>> Best wishes >>>> Gordon >>>> >>>> >>>> ---------- original message ----------- >>>> [BioC] Very low P-values in limma >>>> Paul Geeleher paulgeeleher at gmail.com >>>> Thu Oct 22 11:34:01 CEST 2009 >>>> >>>> The value of df.prior is 3.208318. Hopefully somebody can help me here >>>> because I'm still really at a loss on why these values are so low. >>>> >>>> Here's the script. I'm fairly sure it conforms to the documentation: >>>> >>>> library(affy) >>>> >>>> setwd('/home/paul/PhD_Stuff/miRNA_guilleme_project/HERArrays/') >>>> >>>> # create a color pallete for boxplots >>>> cols <- brewer.pal(12, "Set3") >>>> >>>> # This defines the column name of the mean Cy3 foreground intensites >>>> #Cy3 <- "F532 Median" >>>> # background corrected value >>>> Cy3 <- "F532 Median - B532" >>>> >>>> # This defines the column name of the mean Cy3 background intensites >>>> Cy3b <- "B532 Mean" >>>> >>>> # Read the targets file (see limmaUsersGuide for info on how to create >>> this) >>>> targets <- readTargets("targets.csv") >>>> >>>> # read values from gpr file >>>> # because there is no red channel, read Rb & R to be the same values as >>> G >>>> and Gb >>>> # this is a type of hack that allows the function to work >>>> RG <- read.maimages( targets$FileName, >>>> source="genepix",columns=list(R=Cy3,G=Cy3, Rb=Cy3b, Gb=Cy3b)) >>>> >>>> # remove the extraneous values red channel values >>>> RG$R <- NULL >>>> RG$Rb <- NULL >>>> >>>> # this line of code will remove any negative values which cause errors >>>> further on >>>> RG$G[RG$G<0] <- 0 >>>> >>>> # create a pData for the data just read >>>> # this indicates which population each array belongs to >>>> # a are "her-" and b are "her+" (almost certain) >>>> pData <- data.frame(population = c('a', 'a', 'a', 'a', 'b', 'b', 'b')) >>>> rownames(pData) <- ?RG$targets$FileName >>>> >>>> # create design matrix >>>> design <- model.matrix(~factor(pData$population)) >>>> >>>> # In my .gpr files all miRNAs contain the string "-miR-" or "-let-" in >>> their >>>> name >>>> # so the grep function can be used to extract all of these, removing >>>> # all control signals and printing buffers etc. >>>> # You need to check your .gpr files to find which signals you should >>>> extract. >>>> miRs <- c(grep("-miR-", RG$genes$Name), grep("-let-", RG$genes$Name)) >>>> RG.final <- RG[miRs, ] >>>> >>>> # load vsn library, this contains the normalization functions >>>> library('vsn') >>>> >>>> # Do VSN normalization and output as vns object >>>> mat <- vsnMatrix(RG.final$G) >>>> >>>> # insert rownames into matrix with normalized data >>>> # this will mean that the gene names will appear in our final output >>>> rownames(mat at hx) <- RG.final$genes$Name >>>> >>>> # my .gpr files contain 4 "within-array replicates" of each miRNA. >>>> # We need to make full use of this information by calculating the >>> duplicate >>>> correlation >>>> # in order to use the duplicateCorrelation() function below, >>>> # we need to make sure that the replicates of each gene appear in >>> sequence >>>> in this matrix, >>>> # so we sort the normalized values so the replicate groups of 4 miRs >>> appear >>>> in sequence >>>> mat at hx <- mat at hx[order(rownames(mat at hx)), ] >>>> >>>> # calculate duplicate correlation between the 4 replicates on each array >>>> corfit <- duplicateCorrelation(mat, design, ndups=4) >>>> >>>> # fit the linear model, including info on duplicates and correlation >>>> fit <- lmFit(mat, design, ndups=4, correlation=corfit$consensus) >>>> >>>> # calculate values using ebayes >>>> ebayes <- eBayes(fit) >>>> >>>> # output a list of top differnetially expressed genes... >>>> topTable(ebayes, coef = 2, adjust = "BH", n = 10) >>>> >>>> >>>> -Paul. >>>> >>>> >>>> On Thu, Oct 22, 2009 at 12:09 AM, Wei Shi <shi at="" wehi.edu.au=""> wrote: >>>> >>>>> Dear Paul: >>>>> >>>>> ?The low p-values you have got are not surprising to me. I have got >>>> >>>> even >>>>> >>>>> lower FDR adjusted p-values than yours. This just means you got >>>>> significantly differentially expressed genes. But on the other hand, I >>>> >>>> did >>>>> >>>>> see much higher adjusted p-values in some of my microarray analyses. In >>>> >>>> that >>>>> >>>>> case, I will explore the data in more depth such as looking at batch >>>> >>>> effect >>>>> >>>>> etc. >>>>> >>>>> Cheers, >>>>> Wei >>>>> >>>>> >>>>> Paul Geeleher wrote: >>>>> >>>>>> Hi folks, I'm analyzing microRNA data using limma and I'm wondering >>>> >>>> about >>>>>> >>>>>> the validity of the p-values I'm getting out. Its a simple 'Group A Vs >>>>>> Group >>>>>> B' experimental design. 4 arrays in one group, 3 in the other and 4 >>>>>> duplicate spots for each miRNA on each array. >>>>>> >>>>>> The lowest adjusted p-values in the differential expression analysis >>>> >>>> are >>>>>> >>>>>> in >>>>>> the region of 10^-7. >>>>>> >>>>>> Its been pointed out to me that plugging the point values from each >>>> >>>> sample >>>>>> >>>>>> into a regular t-test you get p=0.008, which then also needs to take >>>> >>>> the >>>>>> >>>>>> multiple test hit. Can anybody explain why limma is giving me such >>>> >>>> lower >>>>>> >>>>>> values and if they are valid? >>>>>> >>>>>> I can provide more information if required. >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Paul. >>>>>> >>>>>> >>>>>> >>>>> >>>> >>>> >>>> -- >>>> Paul Geeleher >>>> School of Mathematics, Statistics and Applied Mathematics >>>> National University of Ireland >>>> Galway >>>> Ireland >>>> >>>> >>> >>> >>> >>> -- >>> Paul Geeleher >>> School of Mathematics, Statistics and Applied Mathematics >>> National University of Ireland >>> Galway >>> Ireland >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> The University of Aberdeen is a charity registered in Scotland, No SC013683. >> > > > > -- > Paul Geeleher > School of Mathematics, Statistics and Applied Mathematics > National University of Ireland > Galway > Ireland >
ADD REPLY
0
Entering edit mode
On Sat, 24 Oct 2009, Gordon K Smyth wrote: > Dear Paul, > > Give your consensus correlation value, limma is treating your within-array > replicates as worth about 1/3 as much as replicates on independent arrays > (because 1-0.81^2 is about 1/3). Sorry, my maths is wrong. The effective weight of the within-array replicates is quite a bit less than 1/3, given ndups=4 and cor=0.81. Best wishes Gordon
ADD REPLY
0
Entering edit mode
Dear list, The following are the words of a professor in my department: I still don't get why the 'real' p-values could be better than p-values you get with the assumption of zero measurement error. By averaging over within array replicates you are not ignoring the within array replicates, instead you are acting as though there were infinitely many of them, so that the standard error of the expression level within array is zero. Stats is about making inferences about populations from finite samples. The population you are making inferences about is the population of all late-stage breast cancers. The data are from 7 individuals. The within-array replicates give an indication of measurement error of the expression levels but don't give you a handle on the variability of the quantity of interest in the population. Paul On Sat, Oct 24, 2009 at 2:44 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > > > On Sat, 24 Oct 2009, Gordon K Smyth wrote: > >> Dear Paul, >> >> Give your consensus correlation value, limma is treating your within-array >> replicates as worth about 1/3 as much as replicates on independent arrays >> (because 1-0.81^2 is about 1/3). > > Sorry, my maths is wrong. ?The effective weight of the within-array > replicates is quite a bit less than 1/3, given ndups=4 and cor=0.81. > > Best wishes > Gordon > -- Paul Geeleher School of Mathematics, Statistics and Applied Mathematics National University of Ireland Galway Ireland
ADD REPLY
0
Entering edit mode
Hi Paul! What can I say? Your professor is right! In one of his e-mails Gordon wrote the following: " There are some caveats. Firstly, the limma duplicateCorrelation method assumes replicates to be equally spaced, and yours are not. You've actually re-ordered your data to fit it into limma. So the within-array correlation will be underestimated, and significance over-estimated, for transcripts for which the replicates are unusually close on the array. Secondly, the within-array correlation is assumed to be the same for all transcripts, which is never actually true. The approximation has proved worthwhile when ndups=2 or 3, but it will yield over-optimistic results when the number of within-replicates is large." That probably gives you a hint, how using limma in this situation might give you "too good" results. As much as it might be an interesting exercise to find out exactly at which point your data are violating the assumptions limma makes (or what else is strange her), I would choose the pragmatic solution to simply average across duplicates first and then apply limma. As your colleague says you give up information about technical variability here, but as far as the comparison between the two groups is concerned this is an absolutely valid approach and as your p-values seem to be less significant you are on the "safe side" regarding control of FDR etc... Claus > -----Original Message----- > From: Paul Geeleher [mailto:paulgeeleher at gmail.com] > Sent: 28 October 2009 13:24 > To: Gordon K Smyth > Cc: Mayer, Claus-Dieter; Bioconductor mailing list > Subject: Re: [BioC] Very low P-values in limma > > Dear list, > > The following are the words of a professor in my department: > > I still don't get why the 'real' p-values could be better than > p-values you get with the assumption of zero measurement error. By > averaging over within array replicates you are not ignoring the within > array replicates, instead you are acting as though there were > infinitely many of them, so that the standard error of the expression > level within array is zero. Stats is about making inferences about > populations from finite samples. The population you are making > inferences about is the population of all late-stage breast cancers. > The data are from 7 individuals. The within-array replicates give an > indication of measurement error of the expression levels but don't > give you a handle on the variability of the quantity of interest in > the population. > > Paul > > On Sat, Oct 24, 2009 at 2:44 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > > > > > > On Sat, 24 Oct 2009, Gordon K Smyth wrote: > > > >> Dear Paul, > >> > >> Give your consensus correlation value, limma is treating your within- > array > >> replicates as worth about 1/3 as much as replicates on independent > arrays > >> (because 1-0.81^2 is about 1/3). > > > > Sorry, my maths is wrong. The effective weight of the within- array > > replicates is quite a bit less than 1/3, given ndups=4 and cor=0.81. > > > > Best wishes > > Gordon > > > > > > -- > Paul Geeleher > School of Mathematics, Statistics and Applied Mathematics > National University of Ireland > Galway > Ireland The University of Aberdeen is a charity registered in Scotland, No SC013683.
ADD REPLY
0
Entering edit mode
Dear Paul, Is it possible that you haven't quoted your professor verbatim?, because these comments don't make sense as they stand. I really know what he might mean by real p-values or the assumption of zero measurement error. Measurement error obviously can't be zero. Nor can there be an infinite number of replicates. None of the alternative analysis methods we have discussed makes either of these assumptions. I wasn't the one arguing for averaging within-array replicates, but if that method did assume what you say, then it would have to be an invalid method. On the other hand, your professor is quite right to say that within- array replicates measure technical rather than biological variability. In a univariate analysis, one would simply average the technical replicates. This would give a summary reponse variable, with a variance made up of both biological and technical components, with replicates that you could reasonably treat as independent. In a genewise microarray analysis, averaging the within-replicates has a disadvantage in that it fails to penalize (lower the rank of) genes which have high within-array variability. If biological variability is high compared to technical, and you have a enough array replicates to get a decent estimate of between-array variability, then averaging the within-array replicates is likely still the way to go, just as in a univariate analysis. On the other hand, if technical variability (within and between arrays) is relatively large compared to biological, and the number of array replicates is very small, then the information in the within-array variances can be too valuable to ignore. duplicateCorrelation uses the fact that the between-array variance has a technical as well as a biological component, and the between and within technical components tend to be associated across probes for many microarray platforms. It is this last assumption which allows us to make use of within-array standard deviations when making inferences about between sample comparisons. If your priority is to get reliable p-values, and you think you have enough array replication to do this, then average the within-array replicates. If your array replication is limited, technical variability is high, and your priority is to rank the genes, then duplicateCorrelation may help. I would add that microarray p-values should always be taken with a grain of salt, as it's impossible to verify all assumptions in small experiments, and it's useful instead to think in terms of independent verification of the results. This is really as far as I want to debate it. Obviously it's your analysis and you should use your own judgement. As a maths graduate student, you would be able to read the duplicateCorrelation published paper if you want to check the reasoning in detail. Best wishes Gordon On Wed, 28 Oct 2009, Paul Geeleher wrote: > Dear list, > > The following are the words of a professor in my department: > > I still don't get why the 'real' p-values could be better than > p-values you get with the assumption of zero measurement error. By > averaging over within array replicates you are not ignoring the within > array replicates, instead you are acting as though there were > infinitely many of them, so that the standard error of the expression > level within array is zero. Stats is about making inferences about > populations from finite samples. The population you are making > inferences about is the population of all late-stage breast cancers. > The data are from 7 individuals. The within-array replicates give an > indication of measurement error of the expression levels but don't > give you a handle on the variability of the quantity of interest in > the population. > > Paul > > On Sat, Oct 24, 2009 at 2:44 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: >> >> >> On Sat, 24 Oct 2009, Gordon K Smyth wrote: >> >>> Dear Paul, >>> >>> Give your consensus correlation value, limma is treating your within-array >>> replicates as worth about 1/3 as much as replicates on independent arrays >>> (because 1-0.81^2 is about 1/3). >> >> Sorry, my maths is wrong. ?The effective weight of the within-array >> replicates is quite a bit less than 1/3, given ndups=4 and cor=0.81. >> >> Best wishes >> Gordon >> > > > > -- > Paul Geeleher > School of Mathematics, Statistics and Applied Mathematics > National University of Ireland > Galway > Ireland >
ADD REPLY

Login before adding your answer.

Traffic: 838 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6