0
3.7 years ago by
Libyan Arab Jamahiriya
libya.tahani0 wrote:

Hi all,

I'm analyzing the microarray data  using affymetrix , my data has downloaded from GEO, Series_geo_accession    "GSE7700".

They used microarray to look at the targte genes regulated by Yes-associated protein (YAP) in normal breast luminal cell lines and breast cancer cell lines.

I would like to analysis my data using a linear model "limma", but I have some problem about (design matrix) .

Can any one explain to me which one of the following is right ?

The samples were :

GSM187268    MDA MB231           control cell  pmig / pRS-IRES-GFP rep1
GSM187269    MDA MB231           control cell pmig / pRS-IRES-GFP rep2
GSM187271    MDA MB231           YAP knock down cell pmig-YAP/pRS-IRES-GFP-YAP rep1
GSM187272    MDA MB231           YAP knock down cell pmig-YAP/pRS-IRES-GFP-YAP rep2
GSM187273    Normal luminal        control cell pmig / pRS-IRES-GFP rep1
GSM187274    Normal luminal        control cell pmig / pRS-IRES-GFP rep2
GSM187275    Normal luminal        control cell pmig / pRS-IRES-GFP rep3
GSM187276    Normal luminal        YAP knock down cell pmig-YAP/pRS-IRES-GFP-YAP rep1
GSM187277    Normal luminal        YAP knock down cell pmig-YAP/pRS-IRES-GFP-YAP rep2
GSM187278    Normal luminal        YAP knock down cell pmig-YAP/pRS-IRES-GFP-YAP rep3

As it shown above ,I think we have 2 type of cell lines (normal & cancer breast cells) and 4 groups of them (cancer control cells, cancer YAP knock down cells. normal control cells and normal YAP knock down cells)

........................................................................................................................................................................................

> groups<-c("CC", "CC", "CT", "CT", "NC", "NC", "NC", "NT", "NT", "NT")

> groups<-as.factor(groups)

> design<-model.matrix(~groups)

> design

(Intercept) groupsCT groupsNC groupsNT

1            1        0        0        0

2            1        0        0        0

3            1        1        0        0

4            1        1        0        0

5            1        0        1        0

6            1        0        1        0

7            1        0        1        0

8            1        0        0        1

9            1        0        0        1

10           1        0        0        1

attr(,"assign")

[1] 0 1 1 1

attr(,"contrasts")

attr(,"contrasts")$groups [1] "contr.treatment" > fit<-lmFit(data.f, design) > fit<-eBayes(fit) > toptable(fit, coef=2) logFC t P.Value adj.P.Val B 202350_s_at -1.564580 -14.70096 1.327425e-08 0.0005127857 9.043119 202238_s_at -1.282140 -14.05648 2.128833e-08 0.0005127857 8.727672 228450_at -1.505202 -13.68802 2.813639e-08 0.0005127857 8.536645 224895_at -1.590910 -13.21231 4.074283e-08 0.0005569036 8.277724 223395_at 1.852093 12.44041 7.624968e-08 0.0008337903 7.825825 213342_at -1.572131 -11.74764 1.378619e-07 0.0012562665 7.383662 219327_s_at 1.201714 11.42378 1.837460e-07 0.0014351875 7.164063 224894_at -1.752598 -11.18234 2.286805e-07 0.0015628883 6.994676 206067_s_at -1.158799 -10.79151 3.286868e-07 0.0019967725 6.709761 222486_s_at -1.152186 -10.48365 4.408662e-07 0.0022142264 6.475568 ............................................................................................. # Or: > design<-model.matrix(~0+factor(c("CC", "CC", "CT", "CT", "NC", "NC", "NC", "NT", "NT", "NT"))) > colnames(design)<-c("CC","CT","NC","NT") > colnames(design) [1] "CC" "CT" "NC" "NT" > design CC CT NC NT 1 1 0 0 0 2 1 0 0 0 3 0 1 0 0 4 0 1 0 0 5 0 0 1 0 6 0 0 1 0 7 0 0 1 0 8 0 0 0 1 9 0 0 0 1 10 0 0 0 1 attr(,"assign") [1] 1 1 1 1 attr(,"contrasts") attr(,"contrasts")$factor(c("CC", "CC", "CT", "CT", "NC", "NC", "NC", "NT", "NT", "NT"))

[1] "contr.treatment"

> fit<-lmFit(data.f, design)

> fit<-eBayes(fit)

> toptable(fit, coef=2)

AFFX-hum_alu_at 13.84261 222.2422 1.563089e-21 2.535286e-18 34.98391

201492_s_at     13.88974 220.7521 1.683748e-21 2.535286e-18 34.95969

1553588_at      13.85768 220.3090 1.721551e-21 2.535286e-18 34.95241

208856_x_at     13.59860 217.8449 1.949441e-21 2.535286e-18 34.91130

201257_x_at     13.60074 217.6522 1.968609e-21 2.535286e-18 34.90803

200095_x_at     13.62670 217.6191 1.971917e-21 2.535286e-18 34.90747

201429_s_at     13.87166 217.0458 2.030260e-21 2.535286e-18 34.89772

1553570_x_at    13.64131 216.8168 2.054088e-21 2.535286e-18 34.89381

200834_s_at     13.49999 216.4295 2.095081e-21 2.535286e-18 34.88717

213084_x_at     13.91086 216.1968 2.120133e-21 2.535286e-18 34.88317

........................................................................................................................................................................................

Which of the above was right ?
were there any significantly differentially expressed genes from the topTable?

Cheers,

Tahani.H

modified 3.7 years ago by Aaron Lun25k • written 3.7 years ago by libya.tahani0
1
3.7 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

Both approaches will test whether the second coefficient is significantly non-zero. In the first approach, the second coefficient represents the log-fold change of CT over CC (i.e., the intercept). Thus, dropping it will identify DE genes between the CC and CT lines, which is sensible enough. In the second approach, the second coefficient represents the average log-expression of the CT group. Dropping it will identify genes where the average log-expression in the CT group is significantly different from zero. This is a silly hypothesis to test against; obviously there will be many genes where the log-expression is non-zero.

In summary, the first approach is much more sensible, and is the "right" approach if you want to detect DE between CT and CC. If you want other contrasts, then you'll have to use a different coef or go through contrasts.fit. As for whether there's any DE genes; you can figure that out yourself, based on whether there are any genes with adjusted p-values below a pre-specified threshold for the false discovery rate (usually 5%). Check out decideTests if you want to get a summary of the numbers and direction of DE for each contrast.

Thanks a lot dear Aaron Lun for your answer . So I'll use the first approach because I really want to detect DE between CT and CC .

Here from the topTable above , is the first column (without a name) mean the names of the genes or ?

because I read it in some papers that the first column is the row
number in the original data!!!

1

The unnamed "first column" represents the row names of the table. In this case, they're probe names, as the *_at suffixes suggest. Clearly they're character strings, not numbers, so whatever papers you're reading don't apply here. If you want to convert from probe names to gene names, you need the relevant annotation package depending on the chip being used, e.g., hgu133a.db. You can then select for symbol or Entrez ID using the probe names as keys.

Thanks so much sir,

It's clear to me now .

Regards,

Tahani.H

While I was trying to performing the gene set test for KEGG pathways I had  a problem with (globaltest):

> library(hgu133plus2.db)
> pathway2probe<-get("hgu133plus2PATH2PROBE")
> kegg<-as.list(pathway2probe)

> library(globaltest)
> groups<-c(0,0,1,1,0,0,0,1,1,1)

> test.kegg<-globaltest(as.matrix(data.m), groups, kegg)
Error: could not find function "globaltest"

Also I tried:

> test.kegg<-gtKEGG(as.matrix(data.m), groups, kegg)
Error in gtKEGG(as.matrix(data.m), groups, kegg) :
argument "annotation" is missing with no default.

There's no globaltest function in limma.

Is there any method for performing the gene set test for KEGG pathways ??

Which function could I use?

?kegga. Or ?mroast.

Thank you .

Dear  Aaron Lun :

I'm  trying to Visualizing the K-means clustering of the same data above, the optimal number of clusters was about 49.

I use the following code :

> max.data.m<-max(data.m)
>  min.data.m<-min(data.m)
>  par(mfrow=c(ceiling(sqrt(k)), ceiling(sqrt(k))))
>  for(i in 1:k) {
+  matplot(t(data.m[km\$cluster==i,]), type="l",
+  main=paste("cluster:", i), ylab="log expression",
+  col=1, lty=1, ylim=c(min.data.m, max.data.m))
+  }
Error in plot.new() : figure margins too large

where was the wrong?

Cheers.

Tahani.H

Well, the error message explains itself. You're probably trying to stuff too many plots onto a single device with the mfrow setting, such that the margins of the plots take up more space than the plotting area itself. You either need to increase with size of the image, or put each plot onto a separate page. Personally, I would wrap the code above in a pdf call without setting par(mfrow) to anything:

pdf("blah.pdf")
# ... stuff (don't set mfrow) ...
dev.off()

... so each plot gets its own page in the PDF.