Are the p-values supposed to be on the order of 10^-120?
2
0
Entering edit mode
@mccallionc403-6871
Last seen 7.0 years ago
United States

I have been working on a project for a while now.  For my data, I have tried converting the GDS file into an eSet object, fitting it with lmFit, from the limma package (with and without a design matrix as a parameter), and getting a hold of the raw CEL files, running those through justRMA, from the affy package, using the sdrf file as the phenodata file and using pData(GDS converted to expression set) as the phenodata AnnotatedDataFrame, then running that through lmFit, with a design matrix, then running the result through eBayes.  However, when I call topTable on the resulting MArrayLM object, with the parameters: method = "fdr", I either get a constant value for the adj.p.value column or I get values on the order of 10^-121.  If anyone could help, or at least point me in the right direction, that would be greatly appreciated.  Thank you in advance

limma affy bioconductor affymetrix • 1.5k views
0
Entering edit mode

Forgot to put in the commands and such

the commands:

source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite("limma")
biocLite("GEOquery")

library("limma")
library("GEOquery")

#gds4523 <- getGEO("GDS4523")
#eset4523 <- GDS2eSet(gds4523)

gse17612 <- getGEO("GSE17612", GSEMatrix = TRUE)

eset17612 = gse17612[[1]]

design <- model.matrix(~0 + pData(eset17612)$source_name_ch1) #data <- pData(eset4523) #Disease.state <- data$disease.state

#design <- model.matrix(~0 + Disease.state)

fit <- lmFit(eset17612, design)

efit <- eBayes(fit)

sessionInfo(): 
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel  stats     graphics  grDevices
[5] utils     datasets  methods   base

other attached packages:
[1] hgu133plus2cdf_2.12.0
[2] BiocInstaller_1.12.1
[3] hugene10sttranscriptcluster.db_8.0.1
[4] hugene10stprobeset.db_8.0.1
[5] org.Hs.eg.db_2.9.0
[6] RSQLite_0.11.4
[7] DBI_0.3.1
[8] hugene10stv1probe_2.12.0
[9] hugene10stv1cdf_2.12.0
[10] AnnotationDbi_1.22.6
[11] gcrma_2.32.0
[12] affy_1.38.1
[13] GEOquery_2.26.2
[14] Biobase_2.20.1
[15] BiocGenerics_0.6.0
[16] limma_3.16.8

loaded via a namespace (and not attached):
[1] affyio_1.28.0
[2] Biostrings_2.28.0
[3] IRanges_1.18.4
[4] preprocessCore_1.22.0
[5] RCurl_1.95-4.3
[6] splines_3.0.1
[7] stats4_3.0.1
[8] tools_3.0.1
[9] XML_3.98-1.1
[10] zlibbioc_1.6.0       

2
Entering edit mode
@gordon-smyth
Last seen 6 hours ago
WEHI, Melbourne, Australia

Another major consideration is that you need to explore the data properly before testing for differential expression. This dataset has a large number of explanatory factors other than disease status:

 colnames(pData(eset17612))
[1] "title"                   "geo_accession"           "status"
[4] "submission_date"         "last_update_date"        "type"
[7] "channel_count"           "source_name_ch1"         "organism_ch1"
[10] "characteristics_ch1"     "characteristics_ch1.1"   "characteristics_ch1.2"
[13] "characteristics_ch1.3"   "molecule_ch1"            "extract_protocol_ch1"
[16] "label_ch1"               "label_protocol_ch1"      "taxid_ch1"
[19] "hyb_protocol"            "scan_protocol"           "description"
[22] "data_processing"         "platform_id"             "contact_name"
[28] "contact_city"            "contact_state"           "contact_zip/postal_code"
[31] "contact_country"         "supplementary_file"      "data_row_count"     

A little bit of playing around with plots such as plotMDS(y) very quickly shows there are huge expression differences between the male and female subjects, and probably there are other important factors as well. These effects need to be incorporated into the linear model before you will have any chance of detecting the relatively subtle differences between the disease states.

0
Entering edit mode

Okay.  Possibly obvious question:  How would I go about doing that, or where would I start at least?

0
Entering edit mode

Formal model selection is tricky, and requires use of procedures like AIC or LASSO. We usually use a more ad hoc approach, whereby we make a MDS plot and check if any of the factors are driving major differences between libraries. If so, it suggests that the offending factor should be included in the model fit. Another approach is to try out a bunch of different models, and see which one gives you more DE genes. This balances the greater explanatory ability of larger models against the loss of residual degrees of freedom and associated decrease in power.

1
Entering edit mode
@gordon-smyth
Last seen 6 hours ago
WEHI, Melbourne, Australia

You've made a few classic mistakes.

First, you've downloaded expression values which are on the raw scale and haven't converted them to log-values.

Second, you haven't filtering out unexpressed genes or allowed for a mean-variance trend.

Third, you have tested whether all the expression values are zero instead of testing whether they are different between disease states.

A more correct analysis would go something like this:

> y <- log2(exprs(eset17612))
> design <- model.matrix(~source_name_ch1, data=pData(eset17612))
> fit <- lmFit(y,design)
> IsExpressed <- fit$Amean>3 > efit <- eBayes(fit[IsExpressed,], trend=TRUE, robust=TRUE) > topTable(efit) Removing intercept from test coefficients logFC AveExpr t P.Value adj.P.Val B 222617_s_at -0.32912 7.9830 -4.1661 0.00010884 0.99986 -1.8241 243245_at -0.93968 3.6870 -4.0804 0.00014468 0.99986 -1.9249 203698_s_at -0.73853 6.4350 -4.0334 0.00016894 0.99986 -1.9799 222722_at -1.48148 6.1147 -4.0060 0.00019023 0.99986 -2.0266 229931_at 0.89540 4.9939 3.9302 0.00023680 0.99986 -2.1003 203535_at 1.61516 6.7329 3.9478 0.00023754 0.99986 -2.1104 221581_s_at 0.57926 6.4632 3.8839 0.00027511 0.99986 -2.1539 230746_s_at 1.06638 4.1059 3.8699 0.00028787 0.99986 -2.1702 1558727_at 0.92696 4.5805 3.8642 0.00029323 0.99986 -2.1768 235867_at 0.32057 8.6997 3.7223 0.00046153 0.99986 -2.3398  I am using R 3.1.1 and limma 3.22.0. ADD COMMENT 0 Entering edit mode Thank you so much. However, I notice that your code would still give constant adj.P.Vals ADD REPLY 0 Entering edit mode Runs of identical adj.P.Val's are not caused by any software error. Rather, they are present due to enforced monotonicity in the Benjamini-Hochberg method for controlling the FDR. This monotonicity ensures that a gene with a lower p-value (relative to other genes) will not have a higher adjusted p-value. ADD REPLY 0 Entering edit mode Okay, but then how do I know which genes have significant fold changes if the P.Values are so small? ADD REPLY 0 Entering edit mode The adjusted p-values are large, so none of the genes in the above analysis would be considered to have significant differential expression at any reasonable FDR threshold. If you run: plotMDS(y, col=ifelse(design[,2]==1, "blue", "red")) you will see that there is no clear separation between the samples from your two conditions. The small size of the P.Value's has little meaning, as they have not been corrected for multiple testing. ADD REPLY 0 Entering edit mode Additionally, when I run your code, I get the following error: Error in eBayes(fit[IsExpressed, ], trend = TRUE, robust = TRUE) : Need Amean component in fit to estimate trend In addition: Warning messages: 1: installed directory not writable, cannot update packages 'boot', 'class', 'cluster', 'codetools', 'foreign', 'KernSmooth', 'lattice', 'MASS', 'Matrix', 'mgcv', 'nlme', 'nnet', 'rpart', 'spatial', 'survival' 2: installed directory not writable, cannot update packages 'boot', 'class', 'cluster', 'codetools', 'foreign', 'KernSmooth', 'lattice', 'MASS', 'Matrix', 'mgcv', 'nlme', 'nnet', 'rpart', 'spatial', 'survival' 3: installed directory not writable, cannot update packages 'boot', 'class', 'cluster', 'codetools', 'foreign', 'KernSmooth', 'lattice', 'MASS', 'Matrix', 'mgcv', 'nlme', 'nnet', 'rpart', 'spatial', 'survival'  ADD REPLY 0 Entering edit mode That is because you are running a version of limma that is three Bioconductor releases out of date. With your old version you could try IsExpressed <- rowMeans(y)>3 instead, but in general we assume people are running the current version. ADD REPLY 0 Entering edit mode Thank you. I updated my software. However, I still get the following error message: Error in ebayes(fit = fit, proportion = proportion, stdev.coef.lim = stdev.coef.lim, : No residual degrees of freedom in linear model fits In addition: Warning messages: 1: installed directory not writable, cannot update packages 'boot', 'class', 'cluster', 'codetools', 'KernSmooth', 'MASS', 'mgcv', 'nlme' 2: installed directory not writable, cannot update packages 'boot', 'class', 'cluster', 'codetools', 'KernSmooth', 'MASS', 'mgcv', 'nlme' 3: installed directory not writable, cannot update packages 'boot', 'class', 'cluster', 'codetools', 'KernSmooth', 'MASS', 'mgcv', 'nlme' 4: In download.file(myurl, destfile, mode = mode, quiet = TRUE, method = getOption("download.file.method.GEOquery")) : downloaded length 64700445 != reported length 200 I am using R 3.1.1 and limma 3.22.0 this time > sessionInfo() R version 3.1.1 (2014-07-10) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] GEOquery_2.32.0 Biobase_2.26.0 BiocGenerics_0.12.0 limma_3.22.0 BiocInstaller_1.16.0 loaded via a namespace (and not attached): [1] RCurl_1.95-4.3 tools_3.1.1 XML_3.98-1.1  ADD REPLY 0 Entering edit mode To be clear, you should be running: library("limma") library("GEOquery") gse17612 <- getGEO("GSE17612", GSEMatrix = TRUE) eset17612 = gse17612[[1]] y <- log2(exprs(eset17612)) design <- model.matrix(~source_name_ch1, data=pData(eset17612)) fit <- lmFit(y,design) IsExpressed <- fit$Amean>3
efit <- eBayes(fit[IsExpressed,], trend=TRUE, robust=TRUE)
topTable(efit)

When I run this, I don't get any errors, nor do I get the 4th warning in your previous post (where downloaded length != reported length). Perhaps the download may be incomplete. What do you get when you run dim(y) or dim(design)? I get a 54675-by-51 matrix for the first one, and a 51-by-2 matrix for the second.

0
Entering edit mode

I do apologize.  I made a typo.  However, in the line:

IsExpressed <- fit\$Amean>3

I would think that that means that a gene is considered to be expressed if the average expression value is greater than 3.  Why does this have to be greater than 3?

0
Entering edit mode

It's arbitrary. You could pick any number and use it instead of 3. However, you'll end up with no genes if the threshold is too high. Conversely, you'll fail to remove lowly expressed genes if the threshold is too low (and if these genes aren't well expressed, they probably won't be differentially expressed, so testing would be moot). Experience suggests that, in most cases, a value of 3 is a suitable compromise.