Hello,
I am new at using the packages limma, and GEOquery.
I have read the documentation available for both, and I am still a bit lost.
This is what I have done:
I downloaded a GDS data set and stored in my R session using GEOquery (i.e., GDS <- getGEO("GDSxxx"), where xxx is the identifier).
Examined Meta(GDS) entries and can see clear information about its contents (i.e: Table(gds)[1:5, ] for example).
I know there are 44 columns of sampled data with about 54K rows which correspond to genes. About 20 of these columns indicate that they are normal samples, and the rest are not.
I can also read the disease.state information and also in the description which columns are normal and which ones are NOT normal.
I need to identify the differential genes between the normal and abnormal. Therefore I needed to start by creating a design matrix.
I created the design matrix, and ran lmFit, and eBayes. They run fine, no errors.
The p-values of my output after eBayes is too small (in the order of 10^-67), and I am sure I am doing something wrong when getting the data ready for the analysis.
I have tried the following command:
eset <- GDS2eSet(GDS, do.log = TRUE) setting do.log to both TRUE and FALSE. With FALSE, the numbers are around 10^-54, with true around 10^-67. I also used topTreat, and that gives even smaller p-values in the order of 10^110.
These are some of the entries from my source GDS file:
ID_REF IDENTIFIER GSM549121 GSM549102 GSM549104
1 1007_s_at MIR4640 10.8591 1 1.1874 9.33047
2 1053_at RFC2 6.53584 6.33768 6.26199
3 117_at HSPA6 5.72289 5.79694 5.64208
4 121_at PAX8 9.28742 8.38915 8.19872
5 1255_g_at GUCA1A 3.65037 3.77146 3.63181
Any tips are appreciated.
Hello Gordon, here is the code I ran:
source("http://bioconductor.org/biocLite.R")
library(limma)
library(preprocessCore)
library(GEOquery)
GDS4718 <- getGEO("GDS4718")
Meta(GDS4718)
Table(GDS4718)[1:10]
eset <- GDS2eSet(GDS4718, do.log = FALSE)
pData(eset)
sampleNames(eset)
design_matrix <- model.matrix(~ -1+factor(c(1,1,1,1,1,2,2,1,1,1,1,1,2,2,2,1,1,2,2,2,1,1,1,1,1,2,1,2,1,2,2,2,1,1,2,2,2,1,1,1,1,2,2,2)))
colnames(design_matrix) <-c("normal", "cancer")
design_matrix
fit_matrix <- lmFit(eset, design_matrix)
fit_Bayes <- eBayes(fit_matrix)
The design matrix entries are based on the value 'normal' or 'cancer' in pData(eset). I suspect therein lies the problem. Maybe I need to have a different design_matrix using the disease.state entries. The task was to find the differential genes between the normal and cancer samples.
Hints are appreciated.
Just remove the "-1" from the model.matrix() line and don't reset the design colnames. Then it will run correctly.
The only reason for suppressing the intercept from the design matrix would be define your own contrasts. Since you haven't defined any contrasts, you shouldn't do that.
Hello Gordon,
Thank you for the information, it really helped a lot.
I do have one more question, however, if I want to perform the analysis to determine which genes are differentially expressed between normal and cancer samples, is what I am doing the most appropriate way ? Would defining contrasts make these genes standout more ?
No. The results are already the same as a contrast would give you.
You only have two groups after all, there's only one comparison that can possibly be made.