I have been working on a project for a while now. For my data, I have tried converting the GDS file into an eSet object, fitting it with lmFit, from the limma package (with and without a design matrix as a parameter), and getting a hold of the raw CEL files, running those through justRMA, from the affy package, using the sdrf file as the phenodata file and using pData(GDS converted to expression set) as the phenodata AnnotatedDataFrame, then running that through lmFit, with a design matrix, then running the result through eBayes. However, when I call topTable on the resulting MArrayLM object, with the parameters: method = "fdr", I either get a constant value for the adj.p.value column or I get values on the order of 10^-121. If anyone could help, or at least point me in the right direction, that would be greatly appreciated. Thank you in advance
Another major consideration is that you need to explore the data properly before testing for differential expression. This dataset has a large number of explanatory factors other than disease status:
A little bit of playing around with plots such as plotMDS(y) very quickly shows there are huge expression differences between the male and female subjects, and probably there are other important factors as well. These effects need to be incorporated into the linear model before you will have any chance of detecting the relatively subtle differences between the disease states.
Formal model selection is tricky, and requires use of procedures like AIC or LASSO. We usually use a more ad hoc approach, whereby we make a MDS plot and check if any of the factors are driving major differences between libraries. If so, it suggests that the offending factor should be included in the model fit. Another approach is to try out a bunch of different models, and see which one gives you more DE genes. This balances the greater explanatory ability of larger models against the loss of residual degrees of freedom and associated decrease in power.
Runs of identical adj.P.Val's are not caused by any software error. Rather, they are present due to enforced monotonicity in the Benjamini-Hochberg method for controlling the FDR. This monotonicity ensures that a gene with a lower p-value (relative to other genes) will not have a higher adjusted p-value.
The adjusted p-values are large, so none of the genes in the above analysis would be considered to have significant differential expression at any reasonable FDR threshold. If you run:
you will see that there is no clear separation between the samples from your two conditions. The small size of the P.Value's has little meaning, as they have not been corrected for multiple testing.
When I run this, I don't get any errors, nor do I get the 4th warning in your previous post (where downloaded length != reported length). Perhaps the download may be incomplete. What do you get when you run dim(y) or dim(design)? I get a 54675-by-51 matrix for the first one, and a 51-by-2 matrix for the second.
I do apologize. I made a typo. However, in the line:
IsExpressed <- fit$Amean>3
I would think that that means that a gene is considered to be expressed if the average expression value is greater than 3. Why does this have to be greater than 3?
It's arbitrary. You could pick any number and use it instead of 3. However, you'll end up with no genes if the threshold is too high. Conversely, you'll fail to remove lowly expressed genes if the threshold is too low (and if these genes aren't well expressed, they probably won't be differentially expressed, so testing would be moot). Experience suggests that, in most cases, a value of 3 is a suitable compromise.
Forgot to put in the commands and such
the commands:
R version 3.0.1 (2013-05-16) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] parallel stats graphics grDevices [5] utils datasets methods base other attached packages: [1] hgu133plus2cdf_2.12.0 [2] BiocInstaller_1.12.1 [3] hugene10sttranscriptcluster.db_8.0.1 [4] hugene10stprobeset.db_8.0.1 [5] org.Hs.eg.db_2.9.0 [6] RSQLite_0.11.4 [7] DBI_0.3.1 [8] hugene10stv1probe_2.12.0 [9] hugene10stv1cdf_2.12.0 [10] AnnotationDbi_1.22.6 [11] gcrma_2.32.0 [12] affy_1.38.1 [13] GEOquery_2.26.2 [14] Biobase_2.20.1 [15] BiocGenerics_0.6.0 [16] limma_3.16.8 loaded via a namespace (and not attached): [1] affyio_1.28.0 [2] Biostrings_2.28.0 [3] IRanges_1.18.4 [4] preprocessCore_1.22.0 [5] RCurl_1.95-4.3 [6] splines_3.0.1 [7] stats4_3.0.1 [8] tools_3.0.1 [9] XML_3.98-1.1 [10] zlibbioc_1.6.0