Question

Extremely small p-values using Limma for proteomic data

0

Entering edit mode

Abbey • 0

@01d3028e

Last seen 11 weeks ago

United States

Hi all,

I am trying to use Limma for proteomic analysis on a sample size of n=8 controls and n=8 experimental. The code runs without errors, but the results show that nearly all 700 proteins are significant and the p-values are tiny (ex. 3.02e-10). I feel like something must be wrong with the data preprocessing because these results don't seem right. Also, the resulting volcano plot has a linear shape rather than looking like a usual volcano plot. Any help troubleshooting this would be greatly appreciated!

This is my code:


# Load necessary libraries
library("limma")
library("readxl")

# Import dataset
masterProteinData <- read_excel("C:\\Users\\abbey\\Downloads\\MasterProteins2.xlsx")
sampleInfo <- read_excel("C:\\Users\\abbey\\Downloads\\SampleInformation.xlsx")

# Extract protein abundance data
master_abundance <- as.matrix(masterProteinData[, -1])  # Assuming the first column contains protein IDs

# Perform normalization
#master_normalized <- normalizeBetweenArrays(master_abundance, method = "quantile")

# Create binary design matrix
design_matrix <- model.matrix(~ 0 + Condition, data = sampleInfo)

# Fit linear model
#fit <- lmFit(master_normalized, design_matrix)
fit <- lmFit(master_abundance, design_matrix)

# Perform statistical testing for differential abundance 
fit <- eBayes(fit)

# Extract fitted values from the fit object using the fitted function
fitted_values <- fitted(fit)

# Plot the fitted values against the observed values
#plot(fitted_values, master_normalized, xlab="Fitted Values", ylab="Observed Values", main="Fit Plot")

# Extract results (including p-values; adjustment method Benjamini-Hochberg used)
master_results <- topTable(fit, coef = "ConditionExperiment", adjust.method = "BH", number = Inf)

# Add protein names as a column in master_results
protein_names <- masterProteinData[, 1]  # Assuming the first column contains protein names
master_results <- cbind(`Master Protein` = protein_names, master_results)

# Rename the title of the Master Protein column
colnames(master_results)[1] <- "Master Protein"

# Inspect master_results with protein names
#View(master_results)

# Inspect top differentially abundant proteins
top_proteins <- subset(master_results, abs(logFC) > 1 & adj.P.Val < 0.05)
#View(top_proteins)



## Visualization of top differentially abundant proteins

# Apply symmetrical log transformation to fold changes
fold_changes <- fit$coefficients[, "ConditionExperiment"] - fit$coefficients[, "ConditionControl"]
log2_fold_changes <- log2(abs(fold_changes) + 1) * sign(fold_changes)

# Define significance criteria
significant_indices <- which(master_results$adj.P.Val < 0.05 & abs(log2_fold_changes) > 1)
non_significant_indices <- which(master_results$adj.P.Val >= 0.05 & abs(log2_fold_changes) > 1)

# Create volcano plot with corrected fold changes
plot(log2_fold_changes, -log10(master_results$adj.P.Val), pch = 20, 
     col = ifelse(1:length(log2_fold_changes) %in% significant_indices, "magenta", 
                  ifelse(1:length(log2_fold_changes) %in% non_significant_indices, "black", "darkgreen")), 
     main = "Volcano Plot", 
     xlab = "Log2 Fold Change", ylab = "-log10(Adjusted P-value)",
     xlim = c(-max(abs(log2_fold_changes)), max(abs(log2_fold_changes))), 
     ylim = c(0, max(-log10(master_results$adj.P.Val)) + 1))

# Add significance threshold line
abline(h = -log10(0.05), col = "red", lty = 2)

# Add legend for highlighted proteins
legend("topleft", legend = c("Adjusted p-value < 0.05", "Fold change < 2"), 
       pch = 20, col = c("magenta", "darkgreen"))

Results from running the code:

Volcano Plot after running the code above

This is what my data structure for masterProteinData looks like:

Master Protein Data

And this is the sampleInfo data frame:

Sample Info Data Frame

Proteomics limma • 734 views

ADD COMMENT • link updated 11 weeks ago by James W. MacDonald 66k • written 11 weeks ago by Abbey • 0

score 2 · Answer 1 · 2024-05-07

2

Entering edit mode

James W. MacDonald 66k

@james-w-macdonald-5106

Last seen 1 hour ago

United States

If you fit a cell means model (without an intercept), you will have to construct contrasts using makeContrasts and fit them using contrasts.fit. Otherwise you are testing that the mean protein abundance for a given group is greater than zero, which will be significant for most proteins.

ADD COMMENT • link 11 weeks ago James W. MacDonald 66k

0

Entering edit mode

I'd also add that plotting adjusted P-values in the y-axis of the volcano plot is not the recommended way to produce a volcano plot. This has been previously discussed in this forum, see for instance the answers in this post.

ADD REPLY • link 11 weeks ago Robert Castelo ★ 3.3k

0

Entering edit mode

I'd argue that most users, especially the general reader of a paper is interested in a visual representation of the data and applied cutoffs, and that is FDR and not nominal p-value. These sorts of plots, unless used for diagnostic purposes during analysis (rather than in a results section) should imo directly allow to see the two populations of significant vs insignificant genes. In this context I see no point plotting nominal p-values as it can visually overestimate the number of DEGs.

ADD REPLY • link 11 weeks ago ATpoint ★ 4.2k

0

Entering edit mode

For visualization purposes you can draw an horizontal line at the p-value cutoff that meets the experiment-wide FDR cutoff. For instance, if a data.frame object tt would store the results of a call to topTable() you can draw such a line as follows:

abline(h=-log10(max(tt$P.Value[tt$adj.P.Val <= FDRcutoff])), col="gray", lwd=2, lty=2)

where FDRcutoff would store the FDR cutoff you want to use for selecting DE genes.

ADD REPLY • link 11 weeks ago Robert Castelo ★ 3.3k

0

Entering edit mode

It's not an implementation problem I am discussing (it's trivial obviously) -- more that for most purposes the y-axis metric should be the one productively used for analysis and downstream and this is almost always the FDR. Different opinions, that's fine.

ADD REPLY • link 11 weeks ago ATpoint ★ 4.2k

0

Entering edit mode

From the point of view of visualization, I find handy the fact that raw p-values give more resolution than adjusted ones, which allows one to more easily highlight genes or proteins of interest; see for instance panels (a) and (c) in Figure 3 in this publication of mine. If dots overlap each other because different raw p-values correspond to the identical or very similar FDR values, then highlighting genes or proteins with names or colors becomes more difficult.

ADD REPLY • link 11 weeks ago Robert Castelo ★ 3.3k

0

Entering edit mode

Thank you James! This worked to fix the p-values. However, the plot still looks one-sided... any idea what's going on with the left side? Did I calculate the fold change incorrectly? Volcano Plot

ADD REPLY • link 11 weeks ago Abbey • 0

0

Entering edit mode

Probably. I would not expect such large fold changes. But without seeing your code, who knows?

ADD REPLY • link 11 weeks ago James W. MacDonald 66k

0

Entering edit mode

This is the code. The only part I changed was adding the contrasts.fit as you suggested. Thanks for taking a look!


# Load necessary libraries
library("limma")
library("readxl")

# Import dataset (non-normalized data from CORE)
masterProteinData <- read_excel("C:\\Users\\abbey\\Downloads\\MasterProteins2.xlsx")
sampleInfo <- read_excel("C:\\Users\\abbey\\Downloads\\SampleInformation.xlsx")

# Extract protein abundance data
master_abundance <- as.matrix(masterProteinData[, -1])  # Assuming the first column contains protein IDs

# Perform normalization
#master_normalized <- normalizeBetweenArrays(master_abundance, method = "quantile")

# Create binary design matrix
design_matrix <- model.matrix(~ 0 + Condition, data = sampleInfo)

# Fit linear model
#fit <- lmFit(master_normalized, design_matrix)
fit <- lmFit(master_abundance, design_matrix)
#fit <- lmFit(master_abundance, design_matrix, method="robust") #gives convergence error

# Define contrasts using makeContrasts
contrasts <- makeContrasts(ConditionExperiment - ConditionControl, levels = design_matrix)

# Fit contrasts
fit_contrasts <- contrasts.fit(fit, contrasts)

# Perform statistical testing for differential abundance on contrasts
fit_contrasts <- eBayes(fit_contrasts)

# Extract fitted values from the fit object using the fitted function
fitted_values <- fitted(fit)

# Check the lengths of fitted_values and master_normalized
#length_fitted <- length(fitted_values)
#length_normalized <- length(master_normalized)

# Plot the fitted values against the observed values
#plot(fitted_values, master_normalized, xlab="Fitted Values", ylab="Observed Values", main="Fit Plot")

# Extract results (including p-values; adjustment method Benjamini-Hochberg used)
master_results_contrasts <- topTable(fit_contrasts, coef = "ConditionExperiment - ConditionControl", adjust.method = "BH", number = Inf)

# Add protein names as a column in master_results
protein_names <- masterProteinData[, 1]  # Assuming the first column contains protein names
master_results_contrasts <- cbind(`Master Protein` = protein_names, master_results_contrasts)

# Rename the title of the Master Protein column
colnames(master_results_contrasts)[1] <- "Master Protein"

# Inspect master_results with protein names
#View(master_results_contrasts)

# Inspect top differentially abundant proteins
top_proteins_contrasts <- subset(master_results_contrasts, abs(logFC) > 1 & adj.P.Val < 0.05)
#View(top_proteins)




## Visualization of top differentially abundant proteins based on contrasts

# Accessing coefficients
coefficients_contrasts <- fit_contrasts@.Data[[1]]

# Convert coefficients_contrasts into a numeric vector
coefficients_contrasts <- as.vector(coefficients_contrasts)

# Apply symmetrical log transformation to fold changes from contrasts
fold_changes_contrasts <- coefficients_contrasts

# Calculate log2 fold changes for contrasts
log2_fold_changes_contrasts <- log2(abs(fold_changes_contrasts) + 1) * sign(fold_changes_contrasts)

# Define significance criteria based on contrasts
significant_indices_contrasts <- which(master_results_contrasts$P.Value < 0.05 & abs(log2_fold_changes_contrasts) > 1)
non_significant_indices_contrasts <- which(master_results_contrasts$P.Value >= 0.05 & abs(log2_fold_changes_contrasts) > 1)

# Create volcano plot with corrected fold changes from contrasts
plot(log2_fold_changes_contrasts, -log10(master_results_contrasts$P.Value), pch = 20, 
     col = ifelse(1:length(log2_fold_changes_contrasts) %in% significant_indices_contrasts, "magenta", 
                  ifelse(1:length(log2_fold_changes_contrasts) %in% non_significant_indices_contrasts, "black", "darkgreen")), 
     main = "Volcano Plot (Contrasts)",
     xlab = "Log2 Fold Change", ylab = "-log10(P-value)",
     xlim = c(-max(abs(log2_fold_changes_contrasts)), max(abs(log2_fold_changes_contrasts))), 
     ylim = c(0, max(-log10(master_results_contrasts$P.Value)) + 1))

# Add significance threshold line
abline(h = -log10(0.05), col = "red", lty = 2)

# Add legend for highlighted proteins
legend("topleft", legend = c("P-value < 0.05", "Fold change < 2"), 
       pch = 20, col = c("magenta", "darkgreen"))

ADD REPLY • link 11 weeks ago Abbey • 0

1

Entering edit mode

If you ever find yourself using the @ function, you should reconsider what you are doing. There are vanishingly small instances when an end user should use that accessor, and moreso for the limma package, which barely makes use of the S4 paradigm. It's mostly just a simple list, and you can extract things using the $ function instead.

Also, you appear not to be normalizing your data at all, which is probably suboptimal. At the very least you should be doing a shift normalization, and you should be taking logs prior to that. If you do not take logs, then your coefficient will be the difference in protein abundance estimates rather than the log fold change, which is a much more interpretable measure. Put another way, assume for Protein X that you have a protein abundance (likely area under the curve of m/z intensities for a set of peptides) of say 1000 in one sample and 2000 in another. The difference is 1000, which seems large. Say you have another protein with 5000 in one sample and 10,000 in another. The difference there is 5000, which is bigger.

But both changes represent a simple doubling of the apparent protein concentration. It's of no interest that the 'baseline' area under the curve is 1000 for the first and 5000 for the second - those are derived quantities that are likely proportional to the amount of protein in a sample, but not quantitatively so. And this doesn't even consider the right skew inherent in this sort of data that you want to control for. So take logs, base 2.

You also most likely want to use trend = TRUE and possibly robust = TRUE to account for an expected dependence of the variance on the mean abundance. Also there is volcanoplot in the limma package, as well as the EnhancedVolcano package, both of which shield you from having to directly extract things, and make your life easier. All things equal, I use existing functionality rather than generating my own.

ADD REPLY • link 11 weeks ago James W. MacDonald 66k