Or:

Question

problem about (design matrix)

0

Entering edit mode

libya.tahani • 0

@libyatahani-9206

Last seen 7.6 years ago

Libyan Arab Jamahiriya

Hi all,

I'm analyzing the microarray data using affymetrix , my data has downloaded from GEO, Series_geo_accession "GSE7700".

They used microarray to look at the targte genes regulated by Yes-associated protein (YAP) in normal breast luminal cell lines and breast cancer cell lines.

I would like to analysis my data using a linear model "limma", but I have some problem about (design matrix) .

Can any one explain to me which one of the following is right ?

The samples were :

GSM187268   MDA MB231 control cell pmig / pRS-IRES-GFP rep1
GSM187269   MDA MB231 control cell pmig / pRS-IRES-GFP rep2
GSM187271   MDA MB231 YAP knock down cell pmig-YAP/pRS-IRES-GFP-YAP rep1
GSM187272   MDA MB231 YAP knock down cell pmig-YAP/pRS-IRES-GFP-YAP rep2
GSM187273   Normal luminal control cell pmig / pRS-IRES-GFP rep1
GSM187274   Normal luminal control cell pmig / pRS-IRES-GFP rep2
GSM187275   Normal luminal control cell pmig / pRS-IRES-GFP rep3
GSM187276   Normal luminal YAP knock down cell pmig-YAP/pRS-IRES-GFP-YAP rep1
GSM187277   Normal luminal YAP knock down cell pmig-YAP/pRS-IRES-GFP-YAP rep2
GSM187278   Normal luminal YAP knock down cell pmig-YAP/pRS-IRES-GFP-YAP rep3

As it shown above ,I think we have 2 type of cell lines (normal & cancer breast cells) and 4 groups of them (cancer control cells, cancer YAP knock down cells. normal control cells and normal YAP knock down cells)

........................................................................................................................................................................................

> groups<-c("CC", "CC", "CT", "CT", "NC", "NC", "NC", "NT", "NT", "NT")

> groups<-as.factor(groups)

> design<-model.matrix(~groups)

> design

   (Intercept) groupsCT groupsNC groupsNT

1            1        0        0        0

2            1        0        0        0

3            1        1        0        0

4            1        1        0        0

5            1        0        1        0

6            1        0        1        0

7            1        0        1        0

8            1        0        0        1

9            1        0        0        1

10           1        0        0        1

attr(,"assign")

[1] 0 1 1 1

attr(,"contrasts")

attr(,"contrasts")$groups

[1] "contr.treatment"

> fit<-lmFit(data.f, design)

> fit<-eBayes(fit)

> toptable(fit, coef=2)

                logFC         t      P.Value    adj.P.Val        B

202350_s_at -1.564580 -14.70096 1.327425e-08 0.0005127857 9.043119

202238_s_at -1.282140 -14.05648 2.128833e-08 0.0005127857 8.727672

228450_at   -1.505202 -13.68802 2.813639e-08 0.0005127857 8.536645

224895_at   -1.590910 -13.21231 4.074283e-08 0.0005569036 8.277724

223395_at    1.852093  12.44041 7.624968e-08 0.0008337903 7.825825

213342_at   -1.572131 -11.74764 1.378619e-07 0.0012562665 7.383662

219327_s_at  1.201714  11.42378 1.837460e-07 0.0014351875 7.164063

224894_at   -1.752598 -11.18234 2.286805e-07 0.0015628883 6.994676

206067_s_at -1.158799 -10.79151 3.286868e-07 0.0019967725 6.709761

222486_s_at -1.152186 -10.48365 4.408662e-07 0.0022142264 6.475568

.............................................................................................

Or:

> design<-model.matrix(~0+factor(c("CC", "CC", "CT", "CT", "NC", "NC", "NC", "NT", "NT", "NT")))

> colnames(design)<-c("CC","CT","NC","NT")

> colnames(design)

[1] "CC" "CT" "NC" "NT"

> design

   CC CT NC NT

1   1  0  0  0

2   1  0  0  0

3   0  1  0  0

4   0  1  0  0

5   0  0  1  0

6   0  0  1  0

7   0  0  1  0

8   0  0  0  1

9   0  0  0  1

10  0  0  0  1

attr(,"assign")

[1] 1 1 1 1

attr(,"contrasts")

attr(,"contrasts")$`factor(c("CC", "CC", "CT", "CT", "NC", "NC", "NC", "NT", "NT", "NT"))`

[1] "contr.treatment"

> fit<-lmFit(data.f, design)

> fit<-eBayes(fit)

> toptable(fit, coef=2)

                   logFC        t      P.Value    adj.P.Val        B

AFFX-hum_alu_at 13.84261 222.2422 1.563089e-21 2.535286e-18 34.98391

201492_s_at     13.88974 220.7521 1.683748e-21 2.535286e-18 34.95969

1553588_at      13.85768 220.3090 1.721551e-21 2.535286e-18 34.95241

208856_x_at     13.59860 217.8449 1.949441e-21 2.535286e-18 34.91130

201257_x_at     13.60074 217.6522 1.968609e-21 2.535286e-18 34.90803

200095_x_at     13.62670 217.6191 1.971917e-21 2.535286e-18 34.90747

201429_s_at     13.87166 217.0458 2.030260e-21 2.535286e-18 34.89772

1553570_x_at    13.64131 216.8168 2.054088e-21 2.535286e-18 34.89381

200834_s_at     13.49999 216.4295 2.095081e-21 2.535286e-18 34.88717

213084_x_at     13.91086 216.1968 2.120133e-21 2.535286e-18 34.88317

........................................................................................................................................................................................

Which of the above was right ?
were there any significantly differentially expressed genes from the topTable?

Cheers,

Tahani.H

limma limma design matrix design matrix • 1.9k views

ADD COMMENT • link updated 8.2 years ago by Aaron Lun ★ 28k • written 8.2 years ago by libya.tahani • 0

score 1 · Answer 1 · 2016-02-15

1

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 13 hours ago

The city by the bay

Both approaches will test whether the second coefficient is significantly non-zero. In the first approach, the second coefficient represents the log-fold change of CT over CC (i.e., the intercept). Thus, dropping it will identify DE genes between the CC and CT lines, which is sensible enough. In the second approach, the second coefficient represents the average log-expression of the CT group. Dropping it will identify genes where the average log-expression in the CT group is significantly different from zero. This is a silly hypothesis to test against; obviously there will be many genes where the log-expression is non-zero.

In summary, the first approach is much more sensible, and is the "right" approach if you want to detect DE between CT and CC. If you want other contrasts, then you'll have to use a different coef or go through contrasts.fit. As for whether there's any DE genes; you can figure that out yourself, based on whether there are any genes with adjusted p-values below a pre-specified threshold for the false discovery rate (usually 5%). Check out decideTests if you want to get a summary of the numbers and direction of DE for each contrast.

ADD COMMENT • link 8.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks a lot dear Aaron Lun for your answer . So I'll use the first approach because I really want to detect DE between CT and CC .

ADD REPLY • link 8.2 years ago libya.tahani • 0

0

Entering edit mode

Here from the topTable above , is the first column (without a name) mean the names of the genes or ?

because I read it in some papers that the first column is the row
number in the original data!!!

ADD REPLY • link 8.2 years ago libya.tahani • 0

1

Entering edit mode

The unnamed "first column" represents the row names of the table. In this case, they're probe names, as the *_at suffixes suggest. Clearly they're character strings, not numbers, so whatever papers you're reading don't apply here. If you want to convert from probe names to gene names, you need the relevant annotation package depending on the chip being used, e.g., hgu133a.db. You can then select for symbol or Entrez ID using the probe names as keys.

ADD REPLY • link 8.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks so much sir,

It's clear to me now .

Regards,

Tahani.H

ADD REPLY • link 8.2 years ago libya.tahani • 0

0

Entering edit mode

Hi Aaron Lun

While I was trying to performing the gene set test for KEGG pathways I had a problem with (globaltest):

> library(hgu133plus2.db)

> pathway2probe<-get("hgu133plus2PATH2PROBE")

> kegg<-as.list(pathway2probe)

> library(globaltest)

> groups<-c(0,0,1,1,0,0,0,1,1,1)

> test.kegg<-globaltest(as.matrix(data.m), groups, kegg)

Error: could not find function "globaltest"

Also I tried:

> test.kegg<-gtKEGG(as.matrix(data.m), groups, kegg)

Error in gtKEGG(as.matrix(data.m), groups, kegg) :

argument "annotation" is missing with no default.

where was the wrong? Any advice please?

ADD REPLY • link 8.2 years ago libya.tahani • 0

0

Entering edit mode

There's no globaltest function in limma.

ADD REPLY • link 8.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Is there any method for performing the gene set test for KEGG pathways ??

Which function could I use?

ADD REPLY • link 8.2 years ago libya.tahani • 0

0

Entering edit mode

?kegga. Or ?mroast.

ADD REPLY • link 8.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you .

ADD REPLY • link 8.2 years ago libya.tahani • 0

0

Entering edit mode

Dear Aaron Lun :

I'm trying to Visualizing the K-means clustering of the same data above, the optimal number of clusters was about 49.

I use the following code :

> max.data.m<-max(data.m)
>  min.data.m<-min(data.m)
>  par(mfrow=c(ceiling(sqrt(k)), ceiling(sqrt(k))))
>  for(i in 1:k) {
+  matplot(t(data.m[km$cluster==i,]), type="l",
+  main=paste("cluster:", i), ylab="log expression",
+  col=1, lty=1, ylim=c(min.data.m, max.data.m))
+  }
Error in plot.new() : figure margins too large

where was the wrong?

Cheers.

Tahani.H

ADD REPLY • link 8.2 years ago libya.tahani • 0

0

Entering edit mode

Well, the error message explains itself. You're probably trying to stuff too many plots onto a single device with the mfrow setting, such that the margins of the plots take up more space than the plotting area itself. You either need to increase with size of the image, or put each plot onto a separate page. Personally, I would wrap the code above in a pdf call without setting par(mfrow) to anything:

pdf("blah.pdf")
# ... stuff (don't set mfrow) ...
dev.off()

... so each plot gets its own page in the PDF.

ADD REPLY • link 8.2 years ago Aaron Lun ★ 28k

0

Entering edit mode

Yes I got it . Thank you so much

ADD REPLY • link 8.2 years ago libya.tahani • 0