Question

Are the p-values supposed to be on the order of 10^-120?

0

Entering edit mode

mccallionc403 • 0

@mccallionc403-6871

Last seen 10.1 years ago

United States

I have been working on a project for a while now. For my data, I have tried converting the GDS file into an eSet object, fitting it with lmFit, from the limma package (with and without a design matrix as a parameter), and getting a hold of the raw CEL files, running those through justRMA, from the affy package, using the sdrf file as the phenodata file and using pData(GDS converted to expression set) as the phenodata AnnotatedDataFrame, then running that through lmFit, with a design matrix, then running the result through eBayes. However, when I call topTable on the resulting MArrayLM object, with the parameters: method = "fdr", I either get a constant value for the adj.p.value column or I get values on the order of 10^-121. If anyone could help, or at least point me in the right direction, that would be greatly appreciated. Thank you in advance

limma affy bioconductor affymetrix • 3.0k views

ADD COMMENT • link updated 10.2 years ago by Gordon Smyth 52k • written 10.2 years ago by mccallionc403 • 0

0

Entering edit mode

Forgot to put in the commands and such

the commands:

source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite("limma")
biocLite("GEOquery")

library("limma")
library("GEOquery")

#gds4523 <- getGEO("GDS4523")
#eset4523 <- GDS2eSet(gds4523)

gse17612 <- getGEO("GSE17612", GSEMatrix = TRUE)

eset17612 = gse17612[[1]]

design <- model.matrix(~0 + pData(eset17612)$source_name_ch1)

#data <- pData(eset4523)

#Disease.state <- data$disease.state

#design <- model.matrix(~0 + Disease.state)

fit <- lmFit(eset17612, design)

efit <- eBayes(fit)

write.csv((topTable(efit, adjust="fdr", n=54675)), "C:/users/mccallionc403/Dropbox/Research/ACPHS/Research/project1/possible.csv")

sessionInfo():

R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices
[5] utils     datasets  methods   base     

other attached packages:
 [1] hgu133plus2cdf_2.12.0               
 [2] BiocInstaller_1.12.1                
 [3] hugene10sttranscriptcluster.db_8.0.1
 [4] hugene10stprobeset.db_8.0.1         
 [5] org.Hs.eg.db_2.9.0                  
 [6] RSQLite_0.11.4                      
 [7] DBI_0.3.1                           
 [8] hugene10stv1probe_2.12.0            
 [9] hugene10stv1cdf_2.12.0              
[10] AnnotationDbi_1.22.6                
[11] gcrma_2.32.0                        
[12] affy_1.38.1                         
[13] GEOquery_2.26.2                     
[14] Biobase_2.20.1                      
[15] BiocGenerics_0.6.0                  
[16] limma_3.16.8                        

loaded via a namespace (and not attached):
 [1] affyio_1.28.0        
 [2] Biostrings_2.28.0    
 [3] IRanges_1.18.4       
 [4] preprocessCore_1.22.0
 [5] RCurl_1.95-4.3       
 [6] splines_3.0.1        
 [7] stats4_3.0.1         
 [8] tools_3.0.1          
 [9] XML_3.98-1.1         
[10] zlibbioc_1.6.0

ADD REPLY • link 10.2 years ago mccallionc403 • 0

score 2 · Answer 1 · 2014-10-17

Another major consideration is that you need to explore the data properly before testing for differential expression. This dataset has a large number of explanatory factors other than disease status:

 colnames(pData(eset17612))
 [1] "title"                   "geo_accession"           "status"                 
 [4] "submission_date"         "last_update_date"        "type"                   
 [7] "channel_count"           "source_name_ch1"         "organism_ch1"           
[10] "characteristics_ch1"     "characteristics_ch1.1"   "characteristics_ch1.2"  
[13] "characteristics_ch1.3"   "molecule_ch1"            "extract_protocol_ch1"   
[16] "label_ch1"               "label_protocol_ch1"      "taxid_ch1"              
[19] "hyb_protocol"            "scan_protocol"           "description"            
[22] "data_processing"         "platform_id"             "contact_name"           
[25] "contact_department"      "contact_institute"       "contact_address"        
[28] "contact_city"            "contact_state"           "contact_zip/postal_code"
[31] "contact_country"         "supplementary_file"      "data_row_count"

A little bit of playing around with plots such as plotMDS(y) very quickly shows there are huge expression differences between the male and female subjects, and probably there are other important factors as well. These effects need to be incorporated into the linear model before you will have any chance of detecting the relatively subtle differences between the disease states.

score 1 · Answer 2 · 2014-10-16

1

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 10 hours ago

WEHI, Melbourne, Australia

You've made a few classic mistakes.

First, you've downloaded expression values which are on the raw scale and haven't converted them to log-values.

Second, you haven't filtering out unexpressed genes or allowed for a mean-variance trend.

Third, you have tested whether all the expression values are zero instead of testing whether they are different between disease states.

A more correct analysis would go something like this:

> y <- log2(exprs(eset17612))
> design <- model.matrix(~source_name_ch1, data=pData(eset17612))
> fit <- lmFit(y,design)
> IsExpressed <- fit$Amean>3
> efit <- eBayes(fit[IsExpressed,], trend=TRUE, robust=TRUE)
> topTable(efit)
Removing intercept from test coefficients
               logFC AveExpr       t    P.Value adj.P.Val       B
222617_s_at -0.32912  7.9830 -4.1661 0.00010884   0.99986 -1.8241
243245_at   -0.93968  3.6870 -4.0804 0.00014468   0.99986 -1.9249
203698_s_at -0.73853  6.4350 -4.0334 0.00016894   0.99986 -1.9799
222722_at   -1.48148  6.1147 -4.0060 0.00019023   0.99986 -2.0266
229931_at    0.89540  4.9939  3.9302 0.00023680   0.99986 -2.1003
203535_at    1.61516  6.7329  3.9478 0.00023754   0.99986 -2.1104
221581_s_at  0.57926  6.4632  3.8839 0.00027511   0.99986 -2.1539
230746_s_at  1.06638  4.1059  3.8699 0.00028787   0.99986 -2.1702
1558727_at   0.92696  4.5805  3.8642 0.00029323   0.99986 -2.1768
235867_at    0.32057  8.6997  3.7223 0.00046153   0.99986 -2.3398

I am using R 3.1.1 and limma 3.22.0.

ADD COMMENT • link 10.2 years ago Gordon Smyth 52k

0

Entering edit mode

Thank you so much. However, I notice that your code would still give constant adj.P.Vals

ADD REPLY • link 10.2 years ago mccallionc403 • 0

0

Entering edit mode

Runs of identical adj.P.Val's are not caused by any software error. Rather, they are present due to enforced monotonicity in the Benjamini-Hochberg method for controlling the FDR. This monotonicity ensures that a gene with a lower p-value (relative to other genes) will not have a higher adjusted p-value.

ADD REPLY • link 10.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Okay, but then how do I know which genes have significant fold changes if the P.Values are so small?

ADD REPLY • link 10.2 years ago mccallionc403 • 0

0

Entering edit mode

The adjusted p-values are large, so none of the genes in the above analysis would be considered to have significant differential expression at any reasonable FDR threshold. If you run:

plotMDS(y, col=ifelse(design[,2]==1, "blue", "red"))

you will see that there is no clear separation between the samples from your two conditions. The small size of the P.Value's has little meaning, as they have not been corrected for multiple testing.

ADD REPLY • link 10.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Additionally, when I run your code, I get the following error:

Error in eBayes(fit[IsExpressed, ], trend = TRUE, robust = TRUE) : 
  Need Amean component in fit to estimate trend
In addition: Warning messages:
1: installed directory not writable, cannot update packages 'boot',
  'class', 'cluster', 'codetools', 'foreign', 'KernSmooth',
  'lattice', 'MASS', 'Matrix', 'mgcv', 'nlme', 'nnet', 'rpart',
  'spatial', 'survival' 
2: installed directory not writable, cannot update packages 'boot',
  'class', 'cluster', 'codetools', 'foreign', 'KernSmooth',
  'lattice', 'MASS', 'Matrix', 'mgcv', 'nlme', 'nnet', 'rpart',
  'spatial', 'survival' 
3: installed directory not writable, cannot update packages 'boot',
  'class', 'cluster', 'codetools', 'foreign', 'KernSmooth',
  'lattice', 'MASS', 'Matrix', 'mgcv', 'nlme', 'nnet', 'rpart',
  'spatial', 'survival'

ADD REPLY • link 10.2 years ago mccallionc403 • 0

0

Entering edit mode

That is because you are running a version of limma that is three Bioconductor releases out of date. With your old version you could try

IsExpressed <- rowMeans(y)>3

instead, but in general we assume people are running the current version.

ADD REPLY • link 10.2 years ago Gordon Smyth 52k

0

Entering edit mode

Thank you. I updated my software. However, I still get the following error message:

Error in ebayes(fit = fit, proportion = proportion, stdev.coef.lim = stdev.coef.lim,  : 
  No residual degrees of freedom in linear model fits
In addition: Warning messages:
1: installed directory not writable, cannot update packages 'boot', 'class', 'cluster', 'codetools', 'KernSmooth', 'MASS',
  'mgcv', 'nlme' 
2: installed directory not writable, cannot update packages 'boot', 'class', 'cluster', 'codetools', 'KernSmooth', 'MASS',
  'mgcv', 'nlme' 
3: installed directory not writable, cannot update packages 'boot', 'class', 'cluster', 'codetools', 'KernSmooth', 'MASS',
  'mgcv', 'nlme' 
4: In download.file(myurl, destfile, mode = mode, quiet = TRUE, method = getOption("download.file.method.GEOquery")) :
  downloaded length 64700445 != reported length 200

I am using R 3.1.1 and limma 3.22.0 this time

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GEOquery_2.32.0      Biobase_2.26.0       BiocGenerics_0.12.0  limma_3.22.0         BiocInstaller_1.16.0

loaded via a namespace (and not attached):
[1] RCurl_1.95-4.3 tools_3.1.1    XML_3.98-1.1

ADD REPLY • link 10.2 years ago mccallionc403 • 0

0

Entering edit mode

To be clear, you should be running:

library("limma")
library("GEOquery")
gse17612 <- getGEO("GSE17612", GSEMatrix = TRUE)
eset17612 = gse17612[[1]]
y <- log2(exprs(eset17612))
design <- model.matrix(~source_name_ch1, data=pData(eset17612))
fit <- lmFit(y,design)
IsExpressed <- fit$Amean>3
efit <- eBayes(fit[IsExpressed,], trend=TRUE, robust=TRUE)
topTable(efit)

When I run this, I don't get any errors, nor do I get the 4th warning in your previous post (where downloaded length != reported length). Perhaps the download may be incomplete. What do you get when you run dim(y) or dim(design)? I get a 54675-by-51 matrix for the first one, and a 51-by-2 matrix for the second.

ADD REPLY • link 10.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

I do apologize. I made a typo. However, in the line:

IsExpressed <- fit$Amean>3

I would think that that means that a gene is considered to be expressed if the average expression value is greater than 3. Why does this have to be greater than 3?

ADD REPLY • link 10.2 years ago mccallionc403 • 0

0

Entering edit mode

It's arbitrary. You could pick any number and use it instead of 3. However, you'll end up with no genes if the threshold is too high. Conversely, you'll fail to remove lowly expressed genes if the threshold is too low (and if these genes aren't well expressed, they probably won't be differentially expressed, so testing would be moot). Experience suggests that, in most cases, a value of 3 is a suitable compromise.

ADD REPLY • link 10.2 years ago Aaron Lun ★ 28k