Question: model.matrix from GDS data
0
gravatar for darthratus
21 months ago by
darthratus0
darthratus0 wrote:

Hello,

I am new at using the packages limma, and GEOquery.

I have read the documentation available for both, and I am still a bit lost.

This is what I have done:

I downloaded a GDS data set and stored in my R session using GEOquery (i.e., GDS <- getGEO("GDSxxx"), where xxx is the identifier).

Examined Meta(GDS) entries and can see clear information about its contents (i.e: Table(gds)[1:5, ] for example).

I know there are 44 columns of sampled data with about 54K rows which correspond to genes. About 20 of these columns indicate that they are normal samples, and the rest are not.

I can also read the disease.state information and also in the description which columns are normal and which ones are NOT normal.

I need to identify the differential genes between the normal and abnormal. Therefore I needed to start by creating a design matrix.

I created the design matrix, and ran lmFit, and eBayes. They run fine, no errors.

The p-values of my output after eBayes is too small (in the order of 10^-67), and I am sure I am doing something wrong when getting the data ready for the analysis.

I have tried the following command:

eset <- GDS2eSet(GDS, do.log = TRUE) setting do.log to both TRUE and FALSE. With FALSE, the numbers are around 10^-54, with true around 10^-67. I also used topTreat, and that gives even smaller p-values in the order of 10^110.

These are some of the entries from my source GDS file:

ID_REF IDENTIFIER         GSM549121    GSM549102      GSM549104

1 1007_s_at MIR4640        10.8591 1         1.1874                9.33047

2 1053_at RFC2                    6.53584         6.33768              6.26199

3 117_at HSPA6                    5.72289         5.79694              5.64208

4 121_at PAX8                       9.28742         8.38915              8.19872

5 1255_g_at GUCA1A           3.65037         3.77146              3.63181

Any tips are appreciated.

ADD COMMENTlink modified 21 months ago • written 21 months ago by darthratus0
Answer: model.matrix from GDS data
1
gravatar for Gordon Smyth
21 months ago by
Gordon Smyth39k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth39k wrote:

You certainly want do.log=FALSE, because (as GEO tells you) the data has been gcrma normalized and log2 transformed already.

I don't see how you can be running topTreat() since you haven't run treat(). Anyway, it is certainly not true that topTreat() gives smaller p-values than topTable().

Apart from that, I'm not quite sure what the problem is. You created the design matrix, ran lmFit, and eBayes and everything ran fine.

Why would it be a problem that some of the p-values are small? You have quite large samples sizes (or a microarray dataset), so very small p-values are quite possible.

If you have made a mistake, in model.matrix for example, how do you expect us to identify it if you don't show us the code that you ran?

ADD COMMENTlink modified 21 months ago • written 21 months ago by Gordon Smyth39k

Hello Gordon, here is the code I ran:

source("http://bioconductor.org/biocLite.R")
library(limma)
library(preprocessCore)
library(GEOquery)


GDS4718 <- getGEO("GDS4718")
Meta(GDS4718)
Table(GDS4718)[1:10]
eset <- GDS2eSet(GDS4718, do.log = FALSE)
pData(eset)
sampleNames(eset)

design_matrix <- model.matrix(~ -1+factor(c(1,1,1,1,1,2,2,1,1,1,1,1,2,2,2,1,1,2,2,2,1,1,1,1,1,2,1,2,1,2,2,2,1,1,2,2,2,1,1,1,1,2,2,2)))
colnames(design_matrix) <-c("normal", "cancer")
design_matrix
fit_matrix <- lmFit(eset, design_matrix)
fit_Bayes <- eBayes(fit_matrix)

The design matrix entries are based on the value 'normal' or 'cancer' in pData(eset). I suspect therein lies the problem. Maybe I need to have a different design_matrix using the disease.state entries. The task was to find the differential genes between the normal and cancer samples.

Hints are appreciated.

ADD REPLYlink written 21 months ago by darthratus0

Just remove the "-1" from the model.matrix() line and don't reset the design colnames. Then it will run correctly.

The only reason for suppressing the intercept from the design matrix would be define your own contrasts. Since you haven't defined any contrasts, you shouldn't do that.

ADD REPLYlink written 21 months ago by Gordon Smyth39k

Hello Gordon,

Thank you for the information, it really helped a lot.

I do have one more question, however, if I want to perform the analysis to determine which genes are differentially expressed between normal and cancer samples, is what I am doing the most appropriate way ? Would defining contrasts make these genes standout more ?

ADD REPLYlink written 21 months ago by darthratus0

No. The results are already the same as a contrast would give you.

You only have two groups after all, there's only one comparison that can possibly be made.

ADD REPLYlink modified 21 months ago • written 21 months ago by Gordon Smyth39k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 229 users visited in the last hour