Question: Right use of model.matrix?
0
2.6 years ago by
Maximilian10
Maximilian10 wrote:

Dear bioconductor forum,

I am using the Agilent RAW .txt microarray data from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=+GSE59408. My goal is to determine a presence/absence (on/off) pattern for the respective genes on the microarrays.

It would be very kind if someone experienced with limma, especially the functions lmfit and model.matrix , could help me with my questions. Thanks in advance.

I have read the limma userguide (https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf) and tried to do it like 17.4 in the userguide.

First I have read in the platform design file from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL18948:

> library(limma)

> platform <- read.table(file="GPL18948_platform_design_file2.txt", sep="\t")

Then I have read in the different RAW files with help of the taget.txt file. Target .txt has the following format:

SampleNumber FileName Condition
1 GSM1436498_Parent_replicate1.txt NONE
2 GSM1436499_Parent_replicate2.txt NONE
3 GSM1436500_CPZ1.txt CPZ
4 GSM1436501_CPZ2.txt CPZ

I read in the data and process it:

> x <- read.maimages(targets, source="agilent.median",green.only=TRUE)

> y <- backgroundCorrect(x, method="normexp")

> y <- normalizeBetweenArrays(y,method="quantile")

Then I filter the data according to the limma userguide 17.4:

> neg95 <- apply(y$E[y$genes$ControlType==-1,],2,function(x) quantile(x,p=0.95)) > cutoff <- matrix(1.1*neg95,nrow(y),ncol(y),byrow=TRUE) > isexpr <- rowSums(y$E > cutoff) >= 2

> table(isexpr)
isexpr
FALSE  TRUE
3489 59478

> y0 <- y[y$genes$ControlType==0 & isexpr,]

The different conditions are the different antibiotica treatments of the 42 strains/microarrays (NONE, CPZ, CFI, AMK, NM, DOXY, CP, AZM, TP, ENX, CPFX => 11, NONE=2 replicates, Rest=4 replicates, altogether=42).

I then create the model.matrix also according to limma userguide 17.4:

> Treatment <- factor(targets$Condition, levels = unique(targets$Condition))

> design <- model.matrix(~Treatment)

> colnames(design) <- levels(Treatment)

And then get the results:

> fit <- lmFit(y0,design)

> fit <- eBayes(fit,trend=TRUE)

> plotSA(fit, main="Probe-level")

> summary(decideTests(fit[,-1]))
CPZ   CFI   AMK    NM  DOXY    CP   AZM    TP   ENX  CPFX
-1  2504  1924   586  1116  1662  2851   878  1896  1963  1258
0  54929 56244 57860 56986 56412 54460 57527 55701 56305 56886
1   1132   397   119   463   491  1254   160   968   297   421

Same analysis, but now averaging probes for each gene:

> yave <- avereps(y0,ID=y0$genes[,"SystematicName"])​ > fit <- lmFit(yave,design) > fit <- eBayes(fit,trend=TRUE) > summary(decideTests(fit[,-1])) CPZ CFI AMK NM DOXY CP AZM TP ENX CPFX -1 2431 1870 552 1080 1628 2819 843 1838 1938 1177 0 53447 54702 56298 55462 54861 52907 55967 54172 54733 55383 1 1091 397 119 427 480 1243 159 959 298 409 As one can see, this is way too big. The problem in my case is that the yave$genes$ProbeNames are the same as the yave$genes$SytematicNames, so there can be no average. I use the GeneSymbol from the platform design file and swap it with the SystematicNames to get a better average value: > x$genes$SystematicName <- platform$GeneSymbol

> y <- backgroundCorrect(x, method="normexp")

> y <- normalizeBetweenArrays(y,method="quantile")

> y0 <- y[y$genes$ControlType==0 & isexpr,]

> yave <- avereps(y0,ID=y0$genes[,"SystematicName"]) > fit <- lmFit(yave,design) > fit <- eBayes(fit,trend=TRUE) > summary(decideTests(fit[,-1])) CPZ CFI AMK NM DOXY CP AZM TP ENX CPFX -1 184 143 38 72 128 189 93 170 141 115 0 4204 4307 4426 4372 4310 4185 4364 4208 4316 4320 1 83 21 7 27 33 97 14 93 14 36 This looks way better. My questions are now: 1) Is my contrast.matrix design right in terms of my goal of finding a presence/absence pattern for the genes? 2) Can I swap GeneSymbol with SystematicNames to get a better average value for the respective genes? 3) Can I use the results of the last table as a presence/absence pattern for the genes? Thanks in advance, Max ADD COMMENTlink modified 2.6 years ago by Gordon Smyth37k • written 2.6 years ago by Maximilian10 Answer: Right use of model.matrix? 0 2.6 years ago by Gordon Smyth37k Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia Gordon Smyth37k wrote: You write: My goal is to determine a presence/absence (on/off) pattern for the respective genes on the microarrays. Well, you could do that simply by Present <- (y$E > cutoff)

It's not clear to me what you're trying the achieve with the rest of the analysis. limma does differential expression analyses, which are concerned with the relative expression of genes, not just with whether they are present or not.

Are you really sure that you just want to determine presence/absence of genes? That would be too crude an analysis for any biological problem that I've seen over the years.

Dear Mr. Smyth,

You are right, with the posted code above I ﬁnd the diﬀerentially expressed genes.

However, my goal is to find a present/absent pattern for the genes of the mentioned dataset and then use the COBRA toolbox to create tissue specific models (function createTissueSpecificModel).

In a tutorial of this COBRA function limma was used to process the data and then the function mas5calls was used to get the presence/absence calls of the probes on the microarrays. But I was told that this function doesn't work for Agilent single-color microarrays.

As you wrote above, to distinguish the present genes I could just use the computed cutoff value. But do you think that I could still use the results of summary(decideTests(fit[,-1])) ​from my code above and use the regular expressed genes (with value 0) to create tissue specific models?

Kind regards, Max