Question

Volcano plot labeling troubles

0

Entering edit mode

mccallionc403 • 0

@mccallionc403-6871

Last seen 9.4 years ago

United States

Hello,

I am trying to make a volcano plot, but of Log Fold Change vs FDR adjusted p-values. So far, I've managed to make a decent plot, but I'm having trouble figuring out how to label the genes that are differentially expressed, wether as the top, say 30, genes, or all of the genes above a certain threshold, and identify them on the graph. My progress so far goes thusly (Also, if anyone knows how to get limma to plot a graph with these axis instead of log fold change vs log odds, that would be greatly appreciated as well):

source ("http://bioconductor.org/biocLite.R")#f o r
#downloading biocLite
biocLite()#for installing bioconductor packages
biocLite("BiocUpgrade")#use latest version
biocLite("limma")#linear fits
biocLite("GEOquery")#downloading data sets
biocLite("statmod")#
biocLite("affyPLM")#normalizing ExpressionSets

#begin loading libraries
library("limma")
library("GEOquery")
library("statmod")
library("affyPLM")
#libraries loaded

gse44563 <- getGEO("GSE44563", GSEMatrix = TRUE)
#downloading series record
#44563 with any attached data
eset44563 = gse44563[[1]]#extract attached eSet (data)

y <- log2(exprs(eset44563))
design <- model.matrix(~source_name_ch1,data=pData(eset44563))
preFit <- lmFit(y, design)

eset <- normalize.ExpressionSet.quantiles(eset44563, transfn="none" )
#quantile normalization of eset44563
#8
y <- log2(exprs(eset))#log2 transformation of eset
design <- model.matrix(~source_name_ch1,data=pData(eset44563))#making
#design matrix
#describes what array looked like
#allows for a linear fit , using matrix algebra
fit <- lmFit(y, design)#linear fit

plotMA(fit)

IsExpressed <- fit$Amean>3#is mean expression
#of each gene above 3?
efit <- eBayes(fit[IsExpressed,], trend=TRUE, robust=TRUE)#Empi r ical
#Bayes adjustment

plotMA(efit)

write.csv((topTable(efit, adjust = "fdr", n = 54675)), "possible5.csv")
#write table of genes,
#sorted according to False Discovery Rate adjusted
#p-value to file

tableTop <- topTable(efit, adjust = "fdr", n = 42471)
plot(tableTop$logFC, -log10(tableTop$adj.P.Val), xlim = c(-1, 2), ylim = c(0, 6), xlab = "log2 fold change", ylab = "FDR adjusted p-value")

My issue right now is trying to figure out how to label the points that are meaningfully foldchanged, for now, without using limma, since it appears that limma only uses logodds as its measure of statistical significance.

Any help, or even just pointers to pre-existing explanations would be immensly appreciated. Thank you

Also, here's my SessionInfo():

R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] splines   parallel  stats     graphics 
[5] grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] BiocInstaller_1.16.0  affyPLM_1.40.1       
 [3] preprocessCore_1.26.1 gcrma_2.36.0         
 [5] statmod_1.4.20        GEOquery_2.30.1      
 [7] limma_3.20.9          affy_1.42.3          
 [9] Biobase_2.24.0        BiocGenerics_0.10.0  

loaded via a namespace (and not attached):
[1] affyio_1.32.0     Biostrings_2.32.1
[3] IRanges_1.22.10   RCurl_1.95-4.3   
[5] stats4_3.1.1      tools_3.1.1      
[7] XML_3.98-1.1      XVector_0.4.0    
[9] zlibbioc_1.10.0

limma volcanoplot • 6.8k views

ADD COMMENT • link updated 6.8 years ago by Gordon Smyth 50k • written 9.5 years ago by mccallionc403 • 0

score 3 · Answer 1 · 2014-10-28

I understand why you might be tempted to plot logFC vs FDR but, in my opinion, this isn't the right way to make a volcano plot.

First, if a volcano plot is made with p-values, then it should use raw p-values rather than FDR adjusted values. This is because FDR adjusting loses some information (e.g., distinct p-values may be mapped to the same FDR value).

Second, there is a reason why limma plots log-odds instead of log-p-values. The log-odds are exactly equivalent to p-values for your analysis (same ordering) but on a scale that is arguably more meaningful than log-p. The log-odds also has the effect of global adjustment, like FDR, but without losing information.

Instead of doing your own coding, you might consider using the plot that limma provides

Later edit

By popular demand, volcanoplot() now offers either p-values or log-odds as options for the y-axis and the default has been changed to p-values.

score 1 · Answer 2 · 2014-10-28

1

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 day ago

United States

ind <- tableTop$adj.P.Val < 0.05
points(tableTop$logFC[ind], -log10(tableTop$adj.P.Val)[ind], pch = 19, col = "red")

ADD COMMENT • link 9.5 years ago James W. MacDonald 65k

score 1 · Answer 3 · 2014-10-28

The quantile normalization step you are doing is not necessary. This data has already been RMA processed, according to the GEO webpage, meaning that it can already been quantile normalized at the probe level. By re-normalizing at the probe-set level, you are replacing the normalization already done with something not quite as good.

score 0 · Answer 4 · 2014-10-28

0

Entering edit mode

mccallionc403 • 0

@mccallionc403-6871

Last seen 9.4 years ago

United States

Thank you so much

ADD COMMENT • link 9.5 years ago mccallionc403 • 0