limma vs t-test

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 9.6 years ago

library(RCurl) library(genefilter) library(limma) setwd(tempdir()) ##https://drive.google.com/file/d/0B__nP63GoFhMd042V09GcHFvVFE/edit?us p=sharing x = getBinaryURL("https://docs.google.com/uc?export=download&id=0B__nP 63GoFhMd042V09GcHFvVFE", followlocation = TRUE, ssl.verifypeer = FALSE) writeBin(x, "std_var.txt", useBytes = TRUE) std_var = as.matrix(read.table("std_var.txt", header = FALSE, sep = "t", quote=""", comment.char = "" )) ##https://drive.google.com/file/d/0B__nP63GoFhMc2tFT3hXVW52Q0E/edit?us p=sharing x = getBinaryURL("https://docs.google.com/uc?export=download&id=0B__nP 63GoFhMc2tFT3hXVW52Q0E", followlocation = TRUE, ssl.verifypeer = FALSE) writeBin(x, "mean_var.txt", useBytes = TRUE) mean_var = as.matrix(read.table("mean_var.txt", header = FALSE, sep = "t", quote=""", comment.char = "" )) NO_OF_GROUPS = 2 NO_OF_SAMP_PER_GROUP_GROUND_TRUTH = 100 NO_OF_GENES = NROW(mean_var) gxexprs_GroundTruth = matrix(0, NO_OF_GENES, NO_OF_GROUPS*NO_OF_SAMP_PER_GROUP_GROUND_TRUTH) std_var_list = list(std_var, std_var,seq(.01,.2, length.out= NO_OF_GENES)) simulated_mean = matrix(c(8,9),NO_OF_GENES,NO_OF_GROUPS, byrow=TRUE) mean_var_list = list(mean_var, simulated_mean, simulated_mean) main_list = list("micorarray mean and std", "means 8 and 9 and micorarray std", "means 8 and 9 and uniform std from 0.01 to 0.2") for(count_i in 1:3) { mean_var = mean_var_list[[count_i]] std_var = std_var_list[[count_i]] for(i in 1:NO_OF_GENES) { if ((i%%1000)==1) { cat(i, "n") } for(j in 1:NO_OF_GROUPS) { gxexprs_GroundTruth[i, NO_OF_SAMP_PER_GROUP_GROUND_TRUTH*(j-1) + 1:NO_OF_SAMP_PER_GROUP_GROUND_TRUTH]= rnorm(NO_OF_SAMP_PER_GROUP_GROUND_TRUTH, mean_var[i,j], std_var[i]) } } final_choice = c(1,2) Group = factor(sprintf("S%02d", rep(final_choice,each=NO_OF_SAMP_PER_GROUP_GROUND_TRUTH) )) t_test_pvals = rowttests(gxexprs_GroundTruth, fac=Group)$p.value design <- model.matrix(~0+Group) colnames(design) <- gsub("Group","",colnames(design)) contrastsX = sprintf("S01 - S%02d",2) #fit <- lmFit(gxexprs_GroundTruth[, final_col_dest], design) fit <- lmFit(gxexprs_GroundTruth, design) contrast.matrix <- makeContrasts(contrasts=contrastsX,levels=design) fit2 <- contrasts.fit(fit, contrast.matrix) fit2 <- eBayes(fit2) TTable = topTable(fit2, coef=1, adjust="fdr", number = nrow(gxexprs_GroundTruth)) limma_pvals = TTable$P.Value[as.numeric(rownames(TTable))] dev.new() plot(-log10(limma_pvals), -log10(t_test_pvals), main=main_list[[count_i]]) } -- output of sessionInfo(): R version 2.15.2 (2012-10-26) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] grDevices datasets tcltk splines graphics utils stats [8] grid methods base other attached packages: [1] limma_3.14.4 genefilter_1.40.0 Biobase_2.18.0 BiocGenerics_0.4.0 [5] RCurl_1.95-4.1 bitops_1.0-6 reshape2_1.2.2 Hmisc_3.14-3 [9] Formula_1.1-1 survival_2.37-7 lattice_0.20-29 loaded via a namespace (and not attached): [1] annotate_1.36.0 AnnotationDbi_1.20.7 cluster_1.15.2 [4] DBI_0.2-7 IRanges_1.16.6 latticeExtra_0.6-26 [7] parallel_2.15.2 plyr_1.8 RColorBrewer_1.0-5 [10] RSQLite_0.11.4 stats4_2.15.2 stringr_0.6.2 [13] XML_3.98-1.1 xtable_1.7-3 -- Sent via the guest posting facility at bioconductor.org.

• 1.8k views

ADD COMMENT • link updated 15 months ago by Gordon Smyth 50k • written 9.9 years ago by Guest User ★ 13k

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 9.6 years ago

Hi everybody, I have a question regarding comparing limma to the t-test. I compared the p values obtained from both algorithms using 100 samples in each condition. Given the large number of observations my expectation was to see high correlation between the p values. As shown below, I ran the same code in three different conditions for the mean and standard deviation. The mean and standard deviation loaded from the google docs files are from a real micro array experiment. 1. mean and std from microarray exeperiment: no correlation between p values 2. constant fold change and std from microarray exeperiment: very small correlation 3. constant fold change and uniform std from 0.01 to 0.2: high correlation I understand that limma uses information across genes, but shouldn't this information be weighed with the number of observations for each condition? Thank you, Giovanni library(RCurl) library(genefilter) library(limma) setwd(tempdir()) ##https://drive.google.com/file/d/0B__nP63GoFhMd042V09GcHFvVFE/edit?us p=sharing x = getBinaryURL("https://docs.google.com/uc?export=download&id=0B__nP 63GoFhMd042V09GcHFvVFE", followlocation = TRUE, ssl.verifypeer = FALSE) writeBin(x, "std_var.txt", useBytes = TRUE) std_var = as.matrix(read.table("std_var.txt", header = FALSE, sep = "t", quote=""", comment.char = "" )) ##https://drive.google.com/file/d/0B__nP63GoFhMc2tFT3hXVW52Q0E/edit?us p=sharing x = getBinaryURL("https://docs.google.com/uc?export=download&id=0B__nP 63GoFhMc2tFT3hXVW52Q0E", followlocation = TRUE, ssl.verifypeer = FALSE) writeBin(x, "mean_var.txt", useBytes = TRUE) mean_var = as.matrix(read.table("mean_var.txt", header = FALSE, sep = "t", quote=""", comment.char = "" )) NO_OF_GROUPS = 2 NO_OF_SAMP_PER_GROUP_GROUND_TRUTH = 100 NO_OF_GENES = NROW(mean_var) gxexprs_GroundTruth = matrix(0, NO_OF_GENES, NO_OF_GROUPS*NO_OF_SAMP_PER_GROUP_GROUND_TRUTH) std_var_list = list(std_var, std_var,seq(.01,.2, length.out= NO_OF_GENES)) simulated_mean = matrix(c(8,9),NO_OF_GENES,NO_OF_GROUPS, byrow=TRUE) mean_var_list = list(mean_var, simulated_mean, simulated_mean) main_list = list("micorarray mean and std", "means 8 and 9 and micorarray std", "means 8 and 9 and uniform std from 0.01 to 0.2") for(count_i in 1:3) { mean_var = mean_var_list[[count_i]] std_var = std_var_list[[count_i]] for(i in 1:NO_OF_GENES) { if ((i%%1000)==1) { cat(i, "n") } for(j in 1:NO_OF_GROUPS) { gxexprs_GroundTruth[i, NO_OF_SAMP_PER_GROUP_GROUND_TRUTH*(j-1) + 1:NO_OF_SAMP_PER_GROUP_GROUND_TRUTH]= rnorm(NO_OF_SAMP_PER_GROUP_GROUND_TRUTH, mean_var[i,j], std_var[i]) } } final_choice = c(1,2) Group = factor(sprintf("S%02d", rep(final_choice,each=NO_OF_SAMP_PER_GROUP_GROUND_TRUTH) )) t_test_pvals = rowttests(gxexprs_GroundTruth, fac=Group)$p.value design <- model.matrix(~0+Group) colnames(design) <- gsub("Group","",colnames(design)) contrastsX = sprintf("S01 - S%02d",2) #fit <- lmFit(gxexprs_GroundTruth[, final_col_dest], design) fit <- lmFit(gxexprs_GroundTruth, design) contrast.matrix <- makeContrasts(contrasts=contrastsX,levels=design) fit2 <- contrasts.fit(fit, contrast.matrix) fit2 <- eBayes(fit2) TTable = topTable(fit2, coef=1, adjust="fdr", number = nrow(gxexprs_GroundTruth)) limma_pvals = TTable$P.Value[as.numeric(rownames(TTable))] dev.new() plot(-log10(limma_pvals), -log10(t_test_pvals), main=main_list[[count_i]]) } -- output of sessionInfo(): R version 2.15.2 (2012-10-26) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] grDevices datasets tcltk splines graphics utils stats [8] grid methods base other attached packages: [1] limma_3.14.4 genefilter_1.40.0 Biobase_2.18.0 BiocGenerics_0.4.0 [5] RCurl_1.95-4.1 bitops_1.0-6 reshape2_1.2.2 Hmisc_3.14-3 [9] Formula_1.1-1 survival_2.37-7 lattice_0.20-29 loaded via a namespace (and not attached): [1] annotate_1.36.0 AnnotationDbi_1.20.7 cluster_1.15.2 [4] DBI_0.2-7 IRanges_1.16.6 latticeExtra_0.6-26 [7] parallel_2.15.2 plyr_1.8 RColorBrewer_1.0-5 [10] RSQLite_0.11.4 stats4_2.15.2 stringr_0.6.2 [13] XML_3.98-1.1 xtable_1.7-3 -- Sent via the guest posting facility at bioconductor.org.

ADD COMMENT • link 9.9 years ago Guest User ★ 13k

0

Entering edit mode

Giovanni Bucci ▴ 60

@giovanni-bucci-6524

Last seen 9.6 years ago

Hi everybody, I have a question regarding comparing limma to the t-test. I compared the p values obtained from both algorithms using 100 samples in each condition. Given the large number of observations my expectation was to see high correlation between the p values. As shown below, I ran the same code in three different conditions for the mean and standard deviation. The mean and standard deviation loaded from the google docs files are from a real micro array experiment. 1. mean and std from microarray exeperiment: no correlation between p values 2. constant fold change and std from microarray exeperiment: very small correlation 3. constant fold change and uniform std from 0.01 to 0.2: high correlation I understand that limma uses information across genes, but shouldn't this information be weighed with the number of observations for each condition? Thank you, Giovanni library(RCurl) library(genefilter) library(limma) setwd(tempdir()) ## https://drive.google.com/file/d/0B__nP63GoFhMd042V09GcHFvVFE/edit?usp= sharing x = getBinaryURL(" https://docs.google.com/uc?export=download&id=0B__nP63GoFhMd042V09GcHF vVFE", followlocation = TRUE, ssl.verifypeer = FALSE) writeBin(x, "std_var.txt", useBytes = TRUE) std_var = as.matrix(read.table("std_var.txt", header = FALSE, sep = "\t", quote="\"", comment.char = "" )) ## https://drive.google.com/file/d/0B__nP63GoFhMc2tFT3hXVW52Q0E/edit?usp= sharing x = getBinaryURL(" https://docs.google.com/uc?export=download&id=0B__nP63GoFhMc2tFT3hXVW5 2Q0E", followlocation = TRUE, ssl.verifypeer = FALSE) writeBin(x, "mean_var.txt", useBytes = TRUE) mean_var = as.matrix(read.table("mean_var.txt", header = FALSE, sep = "\t", quote="\"", comment.char = "" )) NO_OF_GROUPS = 2 NO_OF_SAMP_PER_GROUP_GROUND_TRUTH = 100 NO_OF_GENES = NROW(mean_var) gxexprs_GroundTruth = matrix(0, NO_OF_GENES, NO_OF_GROUPS*NO_OF_SAMP_PER_GROUP_GROUND_TRUTH) std_var_list = list(std_var, std_var,seq(.01,.2, length.out= NO_OF_GENES)) simulated_mean = matrix(c(8,9),NO_OF_GENES,NO_OF_GROUPS, byrow=TRUE) mean_var_list = list(mean_var, simulated_mean, simulated_mean) main_list = list("micorarray mean and std", "means 8 and 9 and micorarray std", "means 8 and 9 and uniform std from 0.01 to 0.2") for(count_i in 1:3) { mean_var = mean_var_list[[count_i]] std_var = std_var_list[[count_i]] for(i in 1:NO_OF_GENES) { if ((i%%1000)==1) { cat(i, "\n") } for(j in 1:NO_OF_GROUPS) { gxexprs_GroundTruth[i, NO_OF_SAMP_PER_GROUP_GROUND_TRUTH*(j-1) + 1:NO_OF_SAMP_PER_GROUP_GROUND_TRUTH]= rnorm(NO_OF_SAMP_PER_GROUP_GROUND_TRUTH, mean_var[i,j], std_var[i]) } } final_choice = c(1,2) Group = factor(sprintf("S%02d", rep(final_choice,each=NO_OF_SAMP_PER_GROUP_GROUND_TRUTH) )) t_test_pvals = rowttests(gxexprs_GroundTruth, fac=Group)$p.value design <- model.matrix(~0+Group) colnames(design) <- gsub("Group","",colnames(design)) contrastsX = sprintf("S01 - S%02d",2) fit <- lmFit(gxexprs_GroundTruth, design) contrast.matrix <- makeContrasts(contrasts=contrastsX,levels=design) fit2 <- contrasts.fit(fit, contrast.matrix) fit2 <- eBayes(fit2) TTable = topTable(fit2, coef=1, adjust="fdr", number = nrow(gxexprs_GroundTruth)) limma_pvals = TTable$P.Value[as.numeric(rownames(TTable))] dev.new() plot(-log10(limma_pvals), -log10(t_test_pvals), main=main_list[[count_i]]) } [[alternative HTML version deleted]]

ADD COMMENT • link 9.9 years ago Giovanni Bucci ▴ 60

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 13 months ago

United States

Hi Giovanni, Is there a question here somewhere that you wanted help with, or are you just posting your script here as a type of temporal backup solution, or something? -steve On Thu, May 22, 2014 at 3:07 PM, Giovanni Bucci [guest] <guest at="" bioconductor.org=""> wrote: > library(RCurl) > library(genefilter) > library(limma) > > setwd(tempdir()) > > > ##https://drive.google.com/file/d/0B__nP63GoFhMd042V09GcHFvVFE/edit? usp=sharing > x = getBinaryURL("https://docs.google.com/uc?export=download&id=0B__ nP63GoFhMd042V09GcHFvVFE", followlocation = TRUE, ssl.verifypeer = FALSE) > writeBin(x, "std_var.txt", useBytes = TRUE) > std_var = as.matrix(read.table("std_var.txt", > header = FALSE, sep = "t", quote=""", comment.char = "" > )) > > ##https://drive.google.com/file/d/0B__nP63GoFhMc2tFT3hXVW52Q0E/edit? usp=sharing > x = getBinaryURL("https://docs.google.com/uc?export=download&id=0B__ nP63GoFhMc2tFT3hXVW52Q0E", followlocation = TRUE, ssl.verifypeer = FALSE) > writeBin(x, "mean_var.txt", useBytes = TRUE) > mean_var = as.matrix(read.table("mean_var.txt", > header = FALSE, sep = "t", quote=""", comment.char = "" > )) > > > NO_OF_GROUPS = 2 > > NO_OF_SAMP_PER_GROUP_GROUND_TRUTH = 100 > > NO_OF_GENES = NROW(mean_var) > > gxexprs_GroundTruth = matrix(0, NO_OF_GENES, NO_OF_GROUPS*NO_OF_SAMP_PER_GROUP_GROUND_TRUTH) > > > std_var_list = list(std_var, std_var,seq(.01,.2, length.out= NO_OF_GENES)) > > simulated_mean = matrix(c(8,9),NO_OF_GENES,NO_OF_GROUPS, byrow=TRUE) > > mean_var_list = list(mean_var, simulated_mean, simulated_mean) > > main_list = list("micorarray mean and std", "means 8 and 9 and micorarray std", "means 8 and 9 and uniform std from 0.01 to 0.2") > > for(count_i in 1:3) > { > > mean_var = mean_var_list[[count_i]] > std_var = std_var_list[[count_i]] > > > for(i in 1:NO_OF_GENES) > { > if ((i%%1000)==1) > { > cat(i, "n") > } > for(j in 1:NO_OF_GROUPS) > { > gxexprs_GroundTruth[i, NO_OF_SAMP_PER_GROUP_GROUND_TRUTH*(j-1) + 1:NO_OF_SAMP_PER_GROUP_GROUND_TRUTH]= > rnorm(NO_OF_SAMP_PER_GROUP_GROUND_TRUTH, mean_var[i,j], std_var[i]) > } > } > > final_choice = c(1,2) > > Group = factor(sprintf("S%02d", > rep(final_choice,each=NO_OF_SAMP_PER_GROUP_GROUND_TRUTH) > )) > > t_test_pvals = rowttests(gxexprs_GroundTruth, fac=Group)$p.value > > > > design <- model.matrix(~0+Group) > colnames(design) <- gsub("Group","",colnames(design)) > contrastsX = sprintf("S01 - S%02d",2) > > #fit <- lmFit(gxexprs_GroundTruth[, final_col_dest], design) > fit <- lmFit(gxexprs_GroundTruth, design) > > contrast.matrix <- makeContrasts(contrasts=contrastsX,levels=design) > fit2 <- contrasts.fit(fit, contrast.matrix) > fit2 <- eBayes(fit2) > > TTable = topTable(fit2, coef=1, adjust="fdr", number = nrow(gxexprs_GroundTruth)) > limma_pvals = TTable$P.Value[as.numeric(rownames(TTable))] > > > dev.new() > plot(-log10(limma_pvals), -log10(t_test_pvals), main=main_list[[count_i]]) > > } > > > -- output of sessionInfo(): > > R version 2.15.2 (2012-10-26) > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 > [2] LC_CTYPE=English_United States.1252 > [3] LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] grDevices datasets tcltk splines graphics utils stats > [8] grid methods base > > other attached packages: > [1] limma_3.14.4 genefilter_1.40.0 Biobase_2.18.0 BiocGenerics_0.4.0 > [5] RCurl_1.95-4.1 bitops_1.0-6 reshape2_1.2.2 Hmisc_3.14-3 > [9] Formula_1.1-1 survival_2.37-7 lattice_0.20-29 > > loaded via a namespace (and not attached): > [1] annotate_1.36.0 AnnotationDbi_1.20.7 cluster_1.15.2 > [4] DBI_0.2-7 IRanges_1.16.6 latticeExtra_0.6-26 > [7] parallel_2.15.2 plyr_1.8 RColorBrewer_1.0-5 > [10] RSQLite_0.11.4 stats4_2.15.2 stringr_0.6.2 > [13] XML_3.98-1.1 xtable_1.7-3 > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Steve Lianoglou Computational Biologist Genentech

ADD COMMENT • link 9.9 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 13 hours ago

WEHI, Melbourne, Australia

limma does take account of the number of observations in each condition.

Gordon

ADD COMMENT • link 9.9 years ago • updated 15 months ago Gordon Smyth 50k

0

Entering edit mode

Hi Gordon,

I have two questions regarding applying a t-test and limma to microarray data.

For large number of samples per condition (for example 100 samples used in the R code sent previously) do you expect a high correlation between the 2 sets of p values? The R code shows that there is no correlation between the p values when the distribution of the standard deviation is taken from a real microarray experiment. However, the correlation is high when the standard deviation distribution is uniform. This suggests to me that the eBayes step in limma "adjusts" the standard deviation significantly when its distribution is similar to a microrarray experiment even though each condition has a large number of samples that already provides a good estimate of the standard deviation. I would appreciate your thoughts on this since I might be missing something.
Most likely the number of microarray experiments that have 100 samples per condition is very small. The number of experiments that could used a pooled variance is much higher. I would like to run some simulations that show the more conditions I have in an experiment the closer I get to the ground truth. The idea here is similar to what happens to the law of large numbers. The more conditions you have in your experiment the better the estimate of the variance, the closer you get to the ground truth. Only that I do not know what the ground truth is. I thought that I could use the t-test with 100 samples in each condition, but since the ranking of genes is different than what limma produces, I'm not sure.

So the question is, is it possible to produce a simulation which shows in a "law of large numbers" kind of fashion that the more conditions are included in the model the closer you get to the ground truth? Also it would be nice to show that limma is closer to the ground truth than the equivalent t-test.

Thank you,

Giovanni

ADD REPLY • link updated 15 months ago by Gordon Smyth 50k • written 9.9 years ago by Giovanni Bucci ▴ 60

0

Entering edit mode

I am not following your simulation or your conclusions but the comparison of limma vs t-test is easy to summarize and has been demonstrated in many studies.

limma always improves on the t-test regardless of the number of groups or the number of samples. However the t-test and the limma moderated t-test become asymptotically the same as the sample sizes become large, so the benefit of the empirical Bayes approach, while never zero, becomes relatively less as the sample size increases.

ADD REPLY • link 15 months ago Gordon Smyth 50k

Login before adding your answer.