Are the p-values supposed to be on the order of 10^-120?
2
0
Entering edit mode
@mccallionc403-6871
Last seen 9.3 years ago
United States

I have been working on a project for a while now.  For my data, I have tried converting the GDS file into an eSet object, fitting it with lmFit, from the limma package (with and without a design matrix as a parameter), and getting a hold of the raw CEL files, running those through justRMA, from the affy package, using the sdrf file as the phenodata file and using pData(GDS converted to expression set) as the phenodata AnnotatedDataFrame, then running that through lmFit, with a design matrix, then running the result through eBayes.  However, when I call topTable on the resulting MArrayLM object, with the parameters: method = "fdr", I either get a constant value for the adj.p.value column or I get values on the order of 10^-121.  If anyone could help, or at least point me in the right direction, that would be greatly appreciated.  Thank you in advance

limma affy bioconductor affymetrix • 2.7k views
ADD COMMENT
0
Entering edit mode

Forgot to put in the commands and such

the commands:

source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite("limma")
biocLite("GEOquery")

library("limma")
library("GEOquery")

#gds4523 <- getGEO("GDS4523")
#eset4523 <- GDS2eSet(gds4523)

gse17612 <- getGEO("GSE17612", GSEMatrix = TRUE)

eset17612 = gse17612[[1]]

design <- model.matrix(~0 + pData(eset17612)$source_name_ch1)

#data <- pData(eset4523)

#Disease.state <- data$disease.state

#design <- model.matrix(~0 + Disease.state)

fit <- lmFit(eset17612, design)

efit <- eBayes(fit)

write.csv((topTable(efit, adjust="fdr", n=54675)), "C:/users/mccallionc403/Dropbox/Research/ACPHS/Research/project1/possible.csv")

sessionInfo(): 
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices
[5] utils     datasets  methods   base     

other attached packages:
 [1] hgu133plus2cdf_2.12.0               
 [2] BiocInstaller_1.12.1                
 [3] hugene10sttranscriptcluster.db_8.0.1
 [4] hugene10stprobeset.db_8.0.1         
 [5] org.Hs.eg.db_2.9.0                  
 [6] RSQLite_0.11.4                      
 [7] DBI_0.3.1                           
 [8] hugene10stv1probe_2.12.0            
 [9] hugene10stv1cdf_2.12.0              
[10] AnnotationDbi_1.22.6                
[11] gcrma_2.32.0                        
[12] affy_1.38.1                         
[13] GEOquery_2.26.2                     
[14] Biobase_2.20.1                      
[15] BiocGenerics_0.6.0                  
[16] limma_3.16.8                        

loaded via a namespace (and not attached):
 [1] affyio_1.28.0        
 [2] Biostrings_2.28.0    
 [3] IRanges_1.18.4       
 [4] preprocessCore_1.22.0
 [5] RCurl_1.95-4.3       
 [6] splines_3.0.1        
 [7] stats4_3.0.1         
 [8] tools_3.0.1          
 [9] XML_3.98-1.1         
[10] zlibbioc_1.6.0       

 

ADD REPLY
2
Entering edit mode
@gordon-smyth
Last seen 3 hours ago
WEHI, Melbourne, Australia

Another major consideration is that you need to explore the data properly before testing for differential expression. This dataset has a large number of explanatory factors other than disease status:

 colnames(pData(eset17612))
 [1] "title"                   "geo_accession"           "status"                 
 [4] "submission_date"         "last_update_date"        "type"                   
 [7] "channel_count"           "source_name_ch1"         "organism_ch1"           
[10] "characteristics_ch1"     "characteristics_ch1.1"   "characteristics_ch1.2"  
[13] "characteristics_ch1.3"   "molecule_ch1"            "extract_protocol_ch1"   
[16] "label_ch1"               "label_protocol_ch1"      "taxid_ch1"              
[19] "hyb_protocol"            "scan_protocol"           "description"            
[22] "data_processing"         "platform_id"             "contact_name"           
[25] "contact_department"      "contact_institute"       "contact_address"        
[28] "contact_city"            "contact_state"           "contact_zip/postal_code"
[31] "contact_country"         "supplementary_file"      "data_row_count"     

A little bit of playing around with plots such as plotMDS(y) very quickly shows there are huge expression differences between the male and female subjects, and probably there are other important factors as well. These effects need to be incorporated into the linear model before you will have any chance of detecting the relatively subtle differences between the disease states.

ADD COMMENT
0
Entering edit mode

Okay.  Possibly obvious question:  How would I go about doing that, or where would I start at least?

ADD REPLY
0
Entering edit mode

Formal model selection is tricky, and requires use of procedures like AIC or LASSO. We usually use a more ad hoc approach, whereby we make a MDS plot and check if any of the factors are driving major differences between libraries. If so, it suggests that the offending factor should be included in the model fit. Another approach is to try out a bunch of different models, and see which one gives you more DE genes. This balances the greater explanatory ability of larger models against the loss of residual degrees of freedom and associated decrease in power.

ADD REPLY
1
Entering edit mode
@gordon-smyth
Last seen 3 hours ago
WEHI, Melbourne, Australia

You've made a few classic mistakes.

First, you've downloaded expression values which are on the raw scale and haven't converted them to log-values.

Second, you haven't filtering out unexpressed genes or allowed for a mean-variance trend.

Third, you have tested whether all the expression values are zero instead of testing whether they are different between disease states.

A more correct analysis would go something like this:

> y <- log2(exprs(eset17612))
> design <- model.matrix(~source_name_ch1, data=pData(eset17612))
> fit <- lmFit(y,design)
> IsExpressed <- fit$Amean>3
> efit <- eBayes(fit[IsExpressed,], trend=TRUE, robust=TRUE)
> topTable(efit)
Removing intercept from test coefficients
               logFC AveExpr       t    P.Value adj.P.Val       B
222617_s_at -0.32912  7.9830 -4.1661 0.00010884   0.99986 -1.8241
243245_at   -0.93968  3.6870 -4.0804 0.00014468   0.99986 -1.9249
203698_s_at -0.73853  6.4350 -4.0334 0.00016894   0.99986 -1.9799
222722_at   -1.48148  6.1147 -4.0060 0.00019023   0.99986 -2.0266
229931_at    0.89540  4.9939  3.9302 0.00023680   0.99986 -2.1003
203535_at    1.61516  6.7329  3.9478 0.00023754   0.99986 -2.1104
221581_s_at  0.57926  6.4632  3.8839 0.00027511   0.99986 -2.1539
230746_s_at  1.06638  4.1059  3.8699 0.00028787   0.99986 -2.1702
1558727_at   0.92696  4.5805  3.8642 0.00029323   0.99986 -2.1768
235867_at    0.32057  8.6997  3.7223 0.00046153   0.99986 -2.3398

I am using R 3.1.1 and limma 3.22.0.

ADD COMMENT
0
Entering edit mode

Thank you so much.  However, I notice that your code would still give constant adj.P.Vals

ADD REPLY
0
Entering edit mode

Runs of identical adj.P.Val's are not caused by any software error. Rather, they are present due to enforced monotonicity in the Benjamini-Hochberg method for controlling the FDR. This monotonicity ensures that a gene with a lower p-value (relative to other genes) will not have a higher adjusted p-value.

ADD REPLY
0
Entering edit mode

Okay, but then how do I know which genes have significant fold changes if the P.Values are so small?

ADD REPLY
0
Entering edit mode

The adjusted p-values are large, so none of the genes in the above analysis would be considered to have significant differential expression at any reasonable FDR threshold. If you run:

plotMDS(y, col=ifelse(design[,2]==1, "blue", "red"))

you will see that there is no clear separation between the samples from your two conditions. The small size of the P.Value's has little meaning, as they have not been corrected for multiple testing.

ADD REPLY
0
Entering edit mode

Additionally, when I run your code, I get the following error:

Error in eBayes(fit[IsExpressed, ], trend = TRUE, robust = TRUE) : 
  Need Amean component in fit to estimate trend
In addition: Warning messages:
1: installed directory not writable, cannot update packages 'boot',
  'class', 'cluster', 'codetools', 'foreign', 'KernSmooth',
  'lattice', 'MASS', 'Matrix', 'mgcv', 'nlme', 'nnet', 'rpart',
  'spatial', 'survival' 
2: installed directory not writable, cannot update packages 'boot',
  'class', 'cluster', 'codetools', 'foreign', 'KernSmooth',
  'lattice', 'MASS', 'Matrix', 'mgcv', 'nlme', 'nnet', 'rpart',
  'spatial', 'survival' 
3: installed directory not writable, cannot update packages 'boot',
  'class', 'cluster', 'codetools', 'foreign', 'KernSmooth',
  'lattice', 'MASS', 'Matrix', 'mgcv', 'nlme', 'nnet', 'rpart',
  'spatial', 'survival' 
ADD REPLY
0
Entering edit mode

That is because you are running a version of limma that is three Bioconductor releases out of date. With your old version you could try

IsExpressed <- rowMeans(y)>3

instead, but in general we assume people are running the current version.

ADD REPLY
0
Entering edit mode

Thank you.  I updated my software.  However, I still get the following error message:

Error in ebayes(fit = fit, proportion = proportion, stdev.coef.lim = stdev.coef.lim,  : 
  No residual degrees of freedom in linear model fits
In addition: Warning messages:
1: installed directory not writable, cannot update packages 'boot', 'class', 'cluster', 'codetools', 'KernSmooth', 'MASS',
  'mgcv', 'nlme' 
2: installed directory not writable, cannot update packages 'boot', 'class', 'cluster', 'codetools', 'KernSmooth', 'MASS',
  'mgcv', 'nlme' 
3: installed directory not writable, cannot update packages 'boot', 'class', 'cluster', 'codetools', 'KernSmooth', 'MASS',
  'mgcv', 'nlme' 
4: In download.file(myurl, destfile, mode = mode, quiet = TRUE, method = getOption("download.file.method.GEOquery")) :
  downloaded length 64700445 != reported length 200

I am using R 3.1.1 and limma 3.22.0 this time

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GEOquery_2.32.0      Biobase_2.26.0       BiocGenerics_0.12.0  limma_3.22.0         BiocInstaller_1.16.0

loaded via a namespace (and not attached):
[1] RCurl_1.95-4.3 tools_3.1.1    XML_3.98-1.1  
ADD REPLY
0
Entering edit mode

To be clear, you should be running:

library("limma")
library("GEOquery")
gse17612 <- getGEO("GSE17612", GSEMatrix = TRUE)
eset17612 = gse17612[[1]]
y <- log2(exprs(eset17612))
design <- model.matrix(~source_name_ch1, data=pData(eset17612))
fit <- lmFit(y,design)
IsExpressed <- fit$Amean>3
efit <- eBayes(fit[IsExpressed,], trend=TRUE, robust=TRUE)
topTable(efit)

When I run this, I don't get any errors, nor do I get the 4th warning in your previous post (where downloaded length != reported length). Perhaps the download may be incomplete. What do you get when you run dim(y) or dim(design)? I get a 54675-by-51 matrix for the first one, and a 51-by-2 matrix for the second.

ADD REPLY
0
Entering edit mode

I do apologize.  I made a typo.  However, in the line:

IsExpressed <- fit$Amean>3

I would think that that means that a gene is considered to be expressed if the average expression value is greater than 3.  Why does this have to be greater than 3?

ADD REPLY
0
Entering edit mode

It's arbitrary. You could pick any number and use it instead of 3. However, you'll end up with no genes if the threshold is too high. Conversely, you'll fail to remove lowly expressed genes if the threshold is too low (and if these genes aren't well expressed, they probably won't be differentially expressed, so testing would be moot). Experience suggests that, in most cases, a value of 3 is a suitable compromise.

ADD REPLY

Login before adding your answer.

Traffic: 883 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6