Entering edit mode
Guest User
★
13k
@guest-user-4897
Last seen 10.6 years ago
library(RCurl)
library(genefilter)
library(limma)
setwd(tempdir())
##https://drive.google.com/file/d/0B__nP63GoFhMd042V09GcHFvVFE/edit?us
p=sharing
x = getBinaryURL("https://docs.google.com/uc?export=download&id=0B__nP
63GoFhMd042V09GcHFvVFE", followlocation = TRUE, ssl.verifypeer =
FALSE)
writeBin(x, "std_var.txt", useBytes = TRUE)
std_var = as.matrix(read.table("std_var.txt",
header = FALSE, sep = "t", quote=""", comment.char = ""
))
##https://drive.google.com/file/d/0B__nP63GoFhMc2tFT3hXVW52Q0E/edit?us
p=sharing
x = getBinaryURL("https://docs.google.com/uc?export=download&id=0B__nP
63GoFhMc2tFT3hXVW52Q0E", followlocation = TRUE, ssl.verifypeer =
FALSE)
writeBin(x, "mean_var.txt", useBytes = TRUE)
mean_var = as.matrix(read.table("mean_var.txt",
header = FALSE, sep = "t", quote=""", comment.char = ""
))
NO_OF_GROUPS = 2
NO_OF_SAMP_PER_GROUP_GROUND_TRUTH = 100
NO_OF_GENES = NROW(mean_var)
gxexprs_GroundTruth = matrix(0, NO_OF_GENES,
NO_OF_GROUPS*NO_OF_SAMP_PER_GROUP_GROUND_TRUTH)
std_var_list = list(std_var, std_var,seq(.01,.2, length.out=
NO_OF_GENES))
simulated_mean = matrix(c(8,9),NO_OF_GENES,NO_OF_GROUPS, byrow=TRUE)
mean_var_list = list(mean_var, simulated_mean, simulated_mean)
main_list = list("micorarray mean and std", "means 8 and 9 and
micorarray std", "means 8 and 9 and uniform std from 0.01 to 0.2")
for(count_i in 1:3)
{
mean_var = mean_var_list[[count_i]]
std_var = std_var_list[[count_i]]
for(i in 1:NO_OF_GENES)
{
if ((i%%1000)==1)
{
cat(i, "n")
}
for(j in 1:NO_OF_GROUPS)
{
gxexprs_GroundTruth[i,
NO_OF_SAMP_PER_GROUP_GROUND_TRUTH*(j-1) +
1:NO_OF_SAMP_PER_GROUP_GROUND_TRUTH]=
rnorm(NO_OF_SAMP_PER_GROUP_GROUND_TRUTH,
mean_var[i,j], std_var[i])
}
}
final_choice = c(1,2)
Group = factor(sprintf("S%02d",
rep(final_choice,each=NO_OF_SAMP_PER_GROUP_GROUND_TRUTH)
))
t_test_pvals = rowttests(gxexprs_GroundTruth, fac=Group)$p.value
design <- model.matrix(~0+Group)
colnames(design) <- gsub("Group","",colnames(design))
contrastsX = sprintf("S01 - S%02d",2)
#fit <- lmFit(gxexprs_GroundTruth[, final_col_dest], design)
fit <- lmFit(gxexprs_GroundTruth, design)
contrast.matrix <- makeContrasts(contrasts=contrastsX,levels=design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
TTable = topTable(fit2, coef=1, adjust="fdr", number =
nrow(gxexprs_GroundTruth))
limma_pvals = TTable$P.Value[as.numeric(rownames(TTable))]
dev.new()
plot(-log10(limma_pvals), -log10(t_test_pvals),
main=main_list[[count_i]])
}
-- output of sessionInfo():
R version 2.15.2 (2012-10-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] grDevices datasets tcltk splines graphics utils stats
[8] grid methods base
other attached packages:
[1] limma_3.14.4 genefilter_1.40.0 Biobase_2.18.0
BiocGenerics_0.4.0
[5] RCurl_1.95-4.1 bitops_1.0-6 reshape2_1.2.2
Hmisc_3.14-3
[9] Formula_1.1-1 survival_2.37-7 lattice_0.20-29
loaded via a namespace (and not attached):
[1] annotate_1.36.0 AnnotationDbi_1.20.7 cluster_1.15.2
[4] DBI_0.2-7 IRanges_1.16.6 latticeExtra_0.6-26
[7] parallel_2.15.2 plyr_1.8 RColorBrewer_1.0-5
[10] RSQLite_0.11.4 stats4_2.15.2 stringr_0.6.2
[13] XML_3.98-1.1 xtable_1.7-3
--
Sent via the guest posting facility at bioconductor.org.
Hi Gordon,
I have two questions regarding applying a t-test and limma to microarray data.
For large number of samples per condition (for example 100 samples used in the R code sent previously) do you expect a high correlation between the 2 sets of p values? The R code shows that there is no correlation between the p values when the distribution of the standard deviation is taken from a real microarray experiment. However, the correlation is high when the standard deviation distribution is uniform. This suggests to me that the eBayes step in limma "adjusts" the standard deviation significantly when its distribution is similar to a microrarray experiment even though each condition has a large number of samples that already provides a good estimate of the standard deviation. I would appreciate your thoughts on this since I might be missing something.
Most likely the number of microarray experiments that have 100 samples per condition is very small. The number of experiments that could used a pooled variance is much higher. I would like to run some simulations that show the more conditions I have in an experiment the closer I get to the ground truth. The idea here is similar to what happens to the law of large numbers. The more conditions you have in your experiment the better the estimate of the variance, the closer you get to the ground truth. Only that I do not know what the ground truth is. I thought that I could use the t-test with 100 samples in each condition, but since the ranking of genes is different than what limma produces, I'm not sure.
So the question is, is it possible to produce a simulation which shows in a "law of large numbers" kind of fashion that the more conditions are included in the model the closer you get to the ground truth? Also it would be nice to show that limma is closer to the ground truth than the equivalent t-test.
Thank you,
Giovanni
I am not following your simulation or your conclusions but the comparison of limma vs t-test is easy to summarize and has been demonstrated in many studies.
limma always improves on the t-test regardless of the number of groups or the number of samples. However the t-test and the limma moderated t-test become asymptotically the same as the sample sizes become large, so the benefit of the empirical Bayes approach, while never zero, becomes relatively less as the sample size increases.