Question

model.matrix from GDS data

0

Entering edit mode

darthratus • 0

@darthratus-15044

Last seen 6.2 years ago

Hello,

I am new at using the packages limma, and GEOquery.

I have read the documentation available for both, and I am still a bit lost.

This is what I have done:

I downloaded a GDS data set and stored in my R session using GEOquery (i.e., GDS <- getGEO("GDSxxx"), where xxx is the identifier).

Examined Meta(GDS) entries and can see clear information about its contents (i.e: Table(gds)[1:5, ] for example).

I know there are 44 columns of sampled data with about 54K rows which correspond to genes. About 20 of these columns indicate that they are normal samples, and the rest are not.

I can also read the disease.state information and also in the description which columns are normal and which ones are NOT normal.

I need to identify the differential genes between the normal and abnormal. Therefore I needed to start by creating a design matrix.

I created the design matrix, and ran lmFit, and eBayes. They run fine, no errors.

The p-values of my output after eBayes is too small (in the order of 10^-67), and I am sure I am doing something wrong when getting the data ready for the analysis.

I have tried the following command:

eset <- GDS2eSet(GDS, do.log = TRUE) setting do.log to both TRUE and FALSE. With FALSE, the numbers are around 10^-54, with true around 10^-67. I also used topTreat, and that gives even smaller p-values in the order of 10^110.

These are some of the entries from my source GDS file:

ID_REF IDENTIFIER GSM549121 GSM549102 GSM549104

1 1007_s_at MIR4640 10.8591 1 1.1874 9.33047

2 1053_at RFC2 6.53584 6.33768 6.26199

3 117_at HSPA6 5.72289 5.79694 5.64208

4 121_at PAX8 9.28742 8.38915 8.19872

5 1255_g_at GUCA1A 3.65037 3.77146 3.63181

Any tips are appreciated.

model.matrix lmFit eBayes GEOquery limma • 1.6k views

ADD COMMENT • link 6.2 years ago darthratus • 0

score 1 · Answer 1 · 2018-02-19

1

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 3 hours ago

WEHI, Melbourne, Australia

You certainly want do.log=FALSE, because (as GEO tells you) the data has been gcrma normalized and log2 transformed already.

I don't see how you can be running topTreat() since you haven't run treat(). Anyway, it is certainly not true that topTreat() gives smaller p-values than topTable().

Apart from that, I'm not quite sure what the problem is. You created the design matrix, ran lmFit, and eBayes and everything ran fine.

Why would it be a problem that some of the p-values are small? You have quite large samples sizes (or a microarray dataset), so very small p-values are quite possible.

If you have made a mistake, in model.matrix for example, how do you expect us to identify it if you don't show us the code that you ran?

ADD COMMENT • link 6.2 years ago Gordon Smyth 50k

0

Entering edit mode

Hello Gordon, here is the code I ran:

source("http://bioconductor.org/biocLite.R")
library(limma)
library(preprocessCore)
library(GEOquery)

GDS4718 <- getGEO("GDS4718")
Meta(GDS4718)
Table(GDS4718)[1:10]
eset <- GDS2eSet(GDS4718, do.log = FALSE)
pData(eset)
sampleNames(eset)

design_matrix <- model.matrix(~ -1+factor(c(1,1,1,1,1,2,2,1,1,1,1,1,2,2,2,1,1,2,2,2,1,1,1,1,1,2,1,2,1,2,2,2,1,1,2,2,2,1,1,1,1,2,2,2)))
colnames(design_matrix) <-c("normal", "cancer")
design_matrix
fit_matrix <- lmFit(eset, design_matrix)
fit_Bayes <- eBayes(fit_matrix)

The design matrix entries are based on the value 'normal' or 'cancer' in pData(eset). I suspect therein lies the problem. Maybe I need to have a different design_matrix using the disease.state entries. The task was to find the differential genes between the normal and cancer samples.

Hints are appreciated.

ADD REPLY • link 6.2 years ago darthratus • 0

0

Entering edit mode

Just remove the "-1" from the model.matrix() line and don't reset the design colnames. Then it will run correctly.

The only reason for suppressing the intercept from the design matrix would be define your own contrasts. Since you haven't defined any contrasts, you shouldn't do that.

ADD REPLY • link 6.2 years ago Gordon Smyth 50k

0

Entering edit mode

Hello Gordon,

Thank you for the information, it really helped a lot.

I do have one more question, however, if I want to perform the analysis to determine which genes are differentially expressed between normal and cancer samples, is what I am doing the most appropriate way ? Would defining contrasts make these genes standout more ?

ADD REPLY • link 6.2 years ago darthratus • 0

0

Entering edit mode

No. The results are already the same as a contrast would give you.

You only have two groups after all, there's only one comparison that can possibly be made.

ADD REPLY • link 6.2 years ago Gordon Smyth 50k