Question

Simple significance test functions from R packages limma and plsgenomics

1

Entering edit mode

chandlerjd58 ▴ 10

@chandlerjd58-7257

Last seen 9.2 years ago

United States

I use R in bioinformatic analysis of metabolomics and gene expression data. One of my colleagues wrote a (nearly published) R program to bundle the features of many R packages for informatics, including limma and plsgenomics packages. I have come to rely on the complex program primarily for something much simpler than intended, which is extraction of significance testing from limma and plsgenomics. In the case of limma, I take a file which assigns a p-value to each feature (row). In the case of plsgenomics, I take a file which assigns a VIP score to each feature (row). When I read the manual for either package I have a difficult time finding any examples that describe even approximately what I am attempting to do. If it's not already clear, I want to generate a p-value or VIP score from either package for features in a given data matrix.

I have done some basic coding in R which I will show, the most advanced is a "for" loop to get t.test results in the same fashion I am looking to use limma and plsgenomics. If there is a way to do something very similar with limma and plsgenomics to get the results I want I would greatly appreciate the assistance in generating such a code.

t.test Code:

#Read in your file
data <- read.table ("expression-data-for-tstat-2.txt", fill=TRUE, header=TRUE, sep="\t", stringsAsFactors=FALSE)

#Assign rownames
rownames(data) <- data$TCID
data <- data [-1]

#Convert to matrix
matrix_data <- data.matrix(data)

#Group samples
colnames(matrix_data) <- c("Con", "Con", "Con", "Cd", "Cd", "Cd")

#Variable declaration
ttest_result <- NULL
tt_pval <- NULL
ttest <- NULL
wilcox_result <- NULL
wx_pval <- NULL
wxtest <- NULL
foldchange <- NULL


#Initialization
count <- (1:nrow(matrix_data))

#"for" loop: Performs t.test
for (k in count){
    index1 <- grep ("Con", colnames(matrix_data))
    index2 <- grep ("Cd", colnames(matrix_data))

    #check if columns exist
    if(length(index1) == 0) next
    if(length(index2) == 0) next

#Welch Two Sample t-test
    ttest_result <- t.test (as.numeric(matrix_data[k,][index1]), na.rm = TRUE, as.numeric(matrix_data[k,][index2]), na.rm = TRUE, paired = FALSE)
    tt_pval <- rbind (tt_pval, ttest_result$p.value)
    ttest <- rbind (ttest, ttest_result$statistic)

#Wilcoxon Test
    #wilcox_result <- wilcox.test (as.numeric(matrix_data[k,][index1]), na.rm = TRUE, as.numeric(matrix_data[k,][index2]), na.rm = TRUE, paired = FALSE)
    #wx_pval <- rbind (wx_pval, wilcox_result$p.value)
    #wxtest <- rbind (wxtest, wilcox_result$statistic)

#Fold change
    #pair_fold <- mean(as.numeric(matrix_data[k,][index2], na.rm = TRUE))/mean(as.numeric(matrix_data[k,][index1], na.rm = TRUE))
    #foldchange <- rbind (foldchange, pair_fold)
}

#Probably don't need this; puts row names on ttest, will fix with cbind function
#rownames(ttest) <- rownames(data)

#Bind object columns (so results are presented with original data)
results <- cbind (data, ttest, tt_pval)

#Write the new file
write.table(results, file="expressiondata-withtstat.txt", sep="\t", col.names=colnames(results), row.names=TRUE)

limma plsgenomics microarray differential gene expression • 4.9k views

ADD COMMENT • link updated 9.3 years ago by Gordon Smyth 50k • written 9.3 years ago by chandlerjd58 ▴ 10

score 2 · Answer 1 · 2015-01-17

2

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 10 hours ago

The city by the bay

Well, at least for limma, you should be able to do something like:

y <- new("EList", list(E=matrix_data))
design <- model.matrix(~factor(colnames(matrix_data)))
fit <- lmFit(y, design)
fit <- eBayes(fit)
res <- topTable(fit, coef=2, sort.by="none", n=Inf)

Then, calling res$P.Value will give you a numeric vector containing a p-value for each row. Note that each p-value is the result of running a moderated t-test on the data from each row. This differs from the standard t-test, as moderation shares information between rows to improve the precision of the variance estimates (and thus, the power of the testing procedure).

In addition, normalization of expression values is often desirable for standard applications of the limma pipeline. For example, loess-based normalization can be performed by adding

matrix_data <- normalizeBetweenArrays(matrix_data, method="cyclicloess")

prior to the construction of the y object.

ADD COMMENT • link 9.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

My Affy data is already log2 transformed, but are you suggesting it is conventional to loess normalize the array data as well?

I attempted to place the suggested limma changes into my original code structure. I'm running it now but it seems to be stuck or taking a long time.

#Read in your file
data <- read.table ("expression-data-for-tstat-2.txt", fill=TRUE, header=TRUE, sep="\t", stringsAsFactors=FALSE)

#Assign rownames
rownames(data) <- data$TCID
data <- data [-1]

#Convert to matrix
matrix_data <- data.matrix(data)

#Group samples
colnames(matrix_data) <- c("Con", "Con", "Con", "Cd", "Cd", "Cd")

#Load Limma
library(limma)

#Variable declaration
y <- NULL
fit <- NULL
res <- NULL
lm_p <- NULL

#Loess normalize
matrix_data <- normalizeBetweenArrays(matrix_data, method="cyclicloess")

#Initialization
count <- (1:nrow(matrix_data))

#"for" loop
for (k in count){
index1 <- grep ("Con", colnames(matrix_data))
index2 <- grep ("Cd", colnames(matrix_data))

   #check if columns exist
   if(length(index1) == 0) next
   if(length(index2) == 0) next

#Limma p-val
   y <- new("EList", list(E=matrix_data))
   design <- model.matrix(~factor(colnames(matrix_data)))
   fit <- lmFit(y, design)
   fit <- eBayes(fit)
   res <- topTable(fit, coef=2, sort.by="none", n=Inf)
   lm_p <- rbind (lm_p, res$p.value)
}

#Bind object columns (so results are presented with original data)
results <- cbind (data, lm_p)

#Write the new file
write.table(results, file="expressiondata-withtstat.txt", sep="\t", col.names=colnames(results), row.names=TRUE)

ADD REPLY • link 9.3 years ago chandlerjd58 ▴ 10

1

Entering edit mode

As Gordon has written, the limma code block shouldn't be inside the for loop. If you remove the looping code around it, you should find that it runs a lot faster. As for normalization; if you're getting log-values from Affy data, it should already be normalized, so you shouldn't need to do anything extra.

ADD REPLY • link 9.3 years ago Aaron Lun ★ 28k

score 2 · Answer 2 · 2015-01-18

I think you are somehow assuming that limma computes a univariate test statistic that you will call in a loop. This isn't how it works. limma uses the entire matrix of expression values at once to compute improved t-tests for each row. limma computes all the p-values for you at once -- you don't need a for loop. It also collates the results for you into a data.frame with the log-fold-changes and so on.

limma can compute both empirical Bayes and ordinary t-tests for you in much less time than it would take to run a for loop.

score 2 · Answer 3 · 2015-01-18

2

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 3 hours ago

WEHI, Melbourne, Australia

I am amazed by your comment that you couldn't find any advice in the limma User's Guide. There are heaps of examples of how to compute p-values for two group comparison. The "quick start" code on page 13 for example tells you what you need to know.

ADD COMMENT • link 9.3 years ago Gordon Smyth 50k

0

Entering edit mode

It could just be that I don't understand. I am very new to bioinformatics (biologist by training), but I am trying to understand some of the basics on my own to not have to rely as much waiting on others to do the work for me. One thing that frustrates me about the manual is that it seems to be written from the perspective that users will always read in raw array data but I prefer to start from the log2 transformed data matrix. However, on p. 13 you are asked to read in CEL files. Why must I have to do this every time? Yes, I have those files, but I'd rather start from a certain amount of pre-process. In addition, the nature of the design matrix (2 columns) is outside my preferred method of a column for every sample and row for each feature/transcript cluster.

ADD REPLY • link 9.3 years ago chandlerjd58 ▴ 10

1

Entering edit mode

Page 13 is just one example. There is a dedicated section in the User's Guide that explains how to analyze two group comparisons just like yours. It is called "Two Groups". It explains how to do the DE analysis starting from the log2 transformed data matrix. The DE analysis is always the same anyway regardless of how the data was pre-processed.

Unfortunately your "preferred method of a column for every sample and a row for each feature" for the design matrix is just wrong. You may possibly be confusing the design matrix with the data matrix.

If you are very new to bioinformatics, reading the advice provided with an open mind will surely be the quickest to make progress. Having a lot of preconceptions about how the analysis should be done may limit opportunities to benefit from all the work done in this area.

ADD REPLY • link 9.3 years ago Gordon Smyth 50k

0

Entering edit mode

I think there is some mis-communication. Essentially, I am trying to emulate coding I have seen done previously in my lab. This is an example based on a 7x24,589 feature data matrix (6 samples and an identifier column):

design <- model.matrix (~factor (c("Con", "Con", "Con", "Cd", "Cd", "Cd")))
fit <- lmFit (lm.data, design)
fit <- eBayes (fit)
tab <- topTable(ebayes, coef=2, adjust.method="fdr", number = 24589, sort.by="p")
write.table (tab, file="limma-toptable.txt", sep="\t", col.names=colnames(tab), row.name=TRUE)

EDIT: There looks to be something like this described on p.37 of the manual.

ADD REPLY • link 9.2 years ago chandlerjd58 ▴ 10

0

Entering edit mode

A corrected and improved version of one of the lines:

tab <- topTable(fit, coef=2, adjust.method="fdr", number = Inf, sort.by="p")

Otherwise, that code block looks good. If you're starting from a log2-expression matrix, then you can do the analysis by setting that matrix as lm.data.

With respect to the design matrix, it's important to remember that the columns of design correspond to the linear model coefficients, rather than the samples. Each sample is actually represented by a row of the design matrix. The entries in that row indicates the linear combination of coefficients used to model expression in that sample.

ADD REPLY • link 9.2 years ago Aaron Lun ★ 28k

score 0 · Answer 4 · 2015-01-20

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 3 hours ago

WEHI, Melbourne, Australia

Duplicate comment deleted

ADD COMMENT • link 9.3 years ago Gordon Smyth 50k