I use R for bioinformatics analysis of metabolomics and gene expression data. One of my colleagues wrote a (nearly published) R program that bundles the features of many informatics packages, including limma and plsgenomics. I have come to rely on this complex program primarily for something much simpler than it was intended for: extracting significance measures from limma and plsgenomics. From limma I get a file that assigns a p-value to each feature (row); from plsgenomics I get a file that assigns a VIP score to each feature (row). When I read the manual for either package, I have a difficult time finding any examples that describe even approximately what I am attempting to do. If it is not already clear, I want to generate a p-value or a VIP score from either package for each feature in a given data matrix.
I have done some basic coding in R, which I will show below; the most advanced piece is a "for" loop that collects t.test results in the same fashion I would like to use limma and plsgenomics. If there is a way to do something very similar with limma and plsgenomics to get the results I want, I would greatly appreciate help in writing such code.
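For the VIP side, the closest I can sketch is to transpose the matrix (using the matrix_data object built in the code below), fit a PLS model with the group labels as the response, and compute VIP from the loading weights and scores. The sketch uses the pls package and a hand-rolled VIP() helper rather than plsgenomics, because I cannot tell whether plsgenomics reports VIP directly; please treat the package choice, the ncomp value, and the 0/1 coding of the groups as placeholders.
#Sketch only: VIP scores via the pls package (plsgenomics may do this differently)
library(pls)
#VIP for a single-response PLS model: sqrt(p * sum_a SS_a * (w_ja/||w_a||)^2 / sum_a SS_a)
VIP <- function(object) {
  if (nrow(object$Yloadings) > 1) stop("only single-response models are supported")
  W <- object$loading.weights              #p x ncomp loading weights
  T <- object$scores                       #n x ncomp score vectors
  Q <- object$Yloadings                    #1 x ncomp Y-loadings
  SS <- c(Q)^2 * colSums(T^2)              #variance of y explained per component
  sqrt(nrow(W) * (W^2 %*% (SS / colSums(W^2))) / sum(SS))
}
#Samples in rows, features in columns; group coded 0/1 (placeholder coding)
pls_data <- data.frame(y = ifelse(colnames(matrix_data) == "Cd", 1, 0))
pls_data$X <- t(matrix_data)
pls_fit <- plsr(y ~ X, ncomp = 2, data = pls_data)   #ncomp = 2 is a placeholder
vip_scores <- VIP(pls_fit)               #one VIP score per feature (row of matrix_data)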
t.test Code:
#Read in your file
data <- read.table ("expression-data-for-tstat-2.txt", fill=TRUE, header=TRUE, sep="\t", stringsAsFactors=FALSE)
#Assign rownames
rownames(data) <- data$TCID
data <- data [-1]
#Convert to matrix
matrix_data <- data.matrix(data)
#Group samples
colnames(matrix_data) <- c("Con", "Con", "Con", "Cd", "Cd", "Cd")
#Variable declaration
ttest_result <- NULL
tt_pval <- NULL
ttest <- NULL
wilcox_result <- NULL
wx_pval <- NULL
wxtest <- NULL
foldchange <- NULL
#Initialization
count <- (1:nrow(matrix_data))
#"for" loop: Performs t.test
for (k in count){
index1 <- grep ("Con", colnames(matrix_data))
index2 <- grep ("Cd", colnames(matrix_data))
#check if columns exist
if(length(index1) == 0) next
if(length(index2) == 0) next
#Welch Two Sample t-test (t.test() drops NA values itself; na.rm is not a t.test argument)
ttest_result <- t.test(as.numeric(matrix_data[k, index1]), as.numeric(matrix_data[k, index2]), paired = FALSE)
tt_pval <- rbind (tt_pval, ttest_result$p.value)
ttest <- rbind (ttest, ttest_result$statistic)
#Wilcoxon Test
#wilcox_result <- wilcox.test (as.numeric(matrix_data[k,][index1]), na.rm = TRUE, as.numeric(matrix_data[k,][index2]), na.rm = TRUE, paired = FALSE)
#wx_pval <- rbind (wx_pval, wilcox_result$p.value)
#wxtest <- rbind (wxtest, wilcox_result$statistic)
#Fold change
#pair_fold <- mean(as.numeric(matrix_data[k, index2]), na.rm = TRUE)/mean(as.numeric(matrix_data[k, index1]), na.rm = TRUE)
#foldchange <- rbind (foldchange, pair_fold)
}
#Probably don't need this; puts row names on ttest, will fix with cbind function
#rownames(ttest) <- rownames(data)
#Bind object columns (so results are presented with original data)
results <- cbind (data, ttest, tt_pval)
#Write the new file
write.table(results, file="expressiondata-withtstat.txt", sep="\t", col.names=colnames(results), row.names=TRUE)
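(As an aside, I realize the same per-row Welch test could also be written without the explicit loop; a rough sketch, assuming the matrix_data object above and the same Con/Cd column names:)
index1 <- grep("Con", colnames(matrix_data))
index2 <- grep("Cd", colnames(matrix_data))
#apply() runs t.test() on each row and returns a list of htest objects
tt_list <- apply(matrix_data, 1, function(x) t.test(x[index1], x[index2], paired = FALSE))
tt_pval <- sapply(tt_list, function(res) res$p.value)
ttest <- sapply(tt_list, function(res) res$statistic)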
My Affy data is already log2 transformed, but are you suggesting it is conventional to loess normalize the array data as well?
I attempted to place the suggested limma changes into my original code structure. I'm running it now but it seems to be stuck or taking a long time.
#Read in your file
data <- read.table ("expression-data-for-tstat-2.txt", fill=TRUE, header=TRUE, sep="\t", stringsAsFactors=FALSE)
#Assign rownames
rownames(data) <- data$TCID
data <- data [-1]
#Convert to matrix
matrix_data <- data.matrix(data)
#Group samples
colnames(matrix_data) <- c("Con", "Con", "Con", "Cd", "Cd", "Cd")
#Load Limma
library(limma)
#Variable declaration
y <- NULL
fit <- NULL
res <- NULL
lm_p <- NULL
#Loess normalize
matrix_data <- normalizeBetweenArrays(matrix_data, method="cyclicloess")
#Initialization
count <- (1:nrow(matrix_data))
#"for" loop
for (k in count){
index1 <- grep ("Con", colnames(matrix_data))
index2 <- grep ("Cd", colnames(matrix_data))
#check if columns exist
if(length(index1) == 0) next
if(length(index2) == 0) next
#Limma p-val
y <- new("EList", list(E=matrix_data))
design <- model.matrix(~factor(colnames(matrix_data)))
fit <- lmFit(y, design)
fit <- eBayes(fit)
res <- topTable(fit, coef=2, sort.by="none", n=Inf)
lm_p <- rbind (lm_p, res$P.Value)  #topTable() names the p-value column P.Value, not p.value
}
#Bind object columns (so results are presented with original data)
results <- cbind (data, lm_p)
#Write the new file
write.table(results, file="expressiondata-withtstat.txt", sep="\t", col.names=colnames(results), row.names=TRUE)
As Gordon has written, the limma code block shouldn't be inside the for loop. If you remove the looping code around it, you should find that it runs a lot faster. As for normalization: if you're getting log-values from Affy data, it should already be normalized, so you shouldn't need to do anything extra.
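In other words, something along these lines (an untested sketch based on your code above; the design and the coef = 2 choice assume your Con/Cd column names, and the output file name is just a placeholder):
library(limma)
#Group factor: "Con" as the reference level, so coefficient 2 is Cd vs Con
group <- factor(colnames(matrix_data), levels = c("Con", "Cd"))
design <- model.matrix(~ group)
#lmFit() fits one linear model per row of the matrix in a single call; no loop is needed
fit <- lmFit(matrix_data, design)
fit <- eBayes(fit)
#sort.by = "none" keeps rows in the original order so cbind() with data lines up
res <- topTable(fit, coef = 2, sort.by = "none", n = Inf)
results <- cbind(data, res[, c("logFC", "P.Value", "adj.P.Val")])
write.table(results, file = "expressiondata-with-limma-pvals.txt", sep = "\t", col.names = colnames(results), row.names = TRUE)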