model.matrix from GDS data
1
0
Entering edit mode
darthratus • 0
@darthratus-15044
Last seen 6.9 years ago

Hello,

I am new at using the packages limma, and GEOquery.

I have read the documentation available for both, and I am still a bit lost.

This is what I have done:

I downloaded a GDS data set and stored in my R session using GEOquery (i.e., GDS <- getGEO("GDSxxx"), where xxx is the identifier).

Examined Meta(GDS) entries and can see clear information about its contents (i.e: Table(gds)[1:5, ] for example).

I know there are 44 columns of sampled data with about 54K rows which correspond to genes. About 20 of these columns indicate that they are normal samples, and the rest are not.

I can also read the disease.state information and also in the description which columns are normal and which ones are NOT normal.

I need to identify the differential genes between the normal and abnormal. Therefore I needed to start by creating a design matrix.

I created the design matrix, and ran lmFit, and eBayes. They run fine, no errors.

The p-values of my output after eBayes is too small (in the order of 10^-67), and I am sure I am doing something wrong when getting the data ready for the analysis.

I have tried the following command:

eset <- GDS2eSet(GDS, do.log = TRUE) setting do.log to both TRUE and FALSE. With FALSE, the numbers are around 10^-54, with true around 10^-67. I also used topTreat, and that gives even smaller p-values in the order of 10^110.

These are some of the entries from my source GDS file:

ID_REF IDENTIFIER         GSM549121    GSM549102      GSM549104

1 1007_s_at MIR4640        10.8591 1         1.1874                9.33047

2 1053_at RFC2                    6.53584         6.33768              6.26199

3 117_at HSPA6                    5.72289         5.79694              5.64208

4 121_at PAX8                       9.28742         8.38915              8.19872

5 1255_g_at GUCA1A           3.65037         3.77146              3.63181

Any tips are appreciated.

model.matrix lmFit eBayes GEOquery limma • 1.9k views
ADD COMMENT
1
Entering edit mode
@gordon-smyth
Last seen 13 hours ago
WEHI, Melbourne, Australia

You certainly want do.log=FALSE, because (as GEO tells you) the data has been gcrma normalized and log2 transformed already.

I don't see how you can be running topTreat() since you haven't run treat(). Anyway, it is certainly not true that topTreat() gives smaller p-values than topTable().

Apart from that, I'm not quite sure what the problem is. You created the design matrix, ran lmFit, and eBayes and everything ran fine.

Why would it be a problem that some of the p-values are small? You have quite large samples sizes (or a microarray dataset), so very small p-values are quite possible.

If you have made a mistake, in model.matrix for example, how do you expect us to identify it if you don't show us the code that you ran?

ADD COMMENT
0
Entering edit mode

Hello Gordon, here is the code I ran:

source("http://bioconductor.org/biocLite.R")
library(limma)
library(preprocessCore)
library(GEOquery)


GDS4718 <- getGEO("GDS4718")
Meta(GDS4718)
Table(GDS4718)[1:10]
eset <- GDS2eSet(GDS4718, do.log = FALSE)
pData(eset)
sampleNames(eset)

design_matrix <- model.matrix(~ -1+factor(c(1,1,1,1,1,2,2,1,1,1,1,1,2,2,2,1,1,2,2,2,1,1,1,1,1,2,1,2,1,2,2,2,1,1,2,2,2,1,1,1,1,2,2,2)))
colnames(design_matrix) <-c("normal", "cancer")
design_matrix
fit_matrix <- lmFit(eset, design_matrix)
fit_Bayes <- eBayes(fit_matrix)

The design matrix entries are based on the value 'normal' or 'cancer' in pData(eset). I suspect therein lies the problem. Maybe I need to have a different design_matrix using the disease.state entries. The task was to find the differential genes between the normal and cancer samples.

Hints are appreciated.

ADD REPLY
0
Entering edit mode

Just remove the "-1" from the model.matrix() line and don't reset the design colnames. Then it will run correctly.

The only reason for suppressing the intercept from the design matrix would be define your own contrasts. Since you haven't defined any contrasts, you shouldn't do that.

ADD REPLY
0
Entering edit mode

Hello Gordon,

Thank you for the information, it really helped a lot.

I do have one more question, however, if I want to perform the analysis to determine which genes are differentially expressed between normal and cancer samples, is what I am doing the most appropriate way ? Would defining contrasts make these genes standout more ?

ADD REPLY
0
Entering edit mode

No. The results are already the same as a contrast would give you.

You only have two groups after all, there's only one comparison that can possibly be made.

ADD REPLY

Login before adding your answer.

Traffic: 753 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6