Right use of model.matrix?
1
0
Entering edit mode
Maximilian ▴ 10
@maximilian-11742
Last seen 7.6 years ago

Dear bioconductor forum,

I am using the Agilent RAW .txt microarray data from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=+GSE59408. My goal is to determine a presence/absence (on/off) pattern for the respective genes on the microarrays.

It would be very kind if someone experienced with limma, especially the functions lmfit and model.matrix , could help me with my questions. Thanks in advance.

I have read the limma userguide (https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf) and tried to do it like 17.4 in the userguide.

First I have read in the platform design file from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL18948:

> library(limma)

> platform <- read.table(file="GPL18948_platform_design_file2.txt", sep="\t")

Then I have read in the different RAW files with help of the taget.txt file. Target .txt has the following format:

SampleNumber FileName Condition 
1 GSM1436498_Parent_replicate1.txt NONE
2 GSM1436499_Parent_replicate2.txt NONE
3 GSM1436500_CPZ1.txt CPZ 
4 GSM1436501_CPZ2.txt CPZ

I read in the data and process it:

> x <- read.maimages(targets, source="agilent.median",green.only=TRUE)

> y <- backgroundCorrect(x, method="normexp") 

> y <- normalizeBetweenArrays(y,method="quantile")

Then I filter the data according to the limma userguide 17.4:

> neg95 <- apply(y$E[y$genes$ControlType==-1,],2,function(x) quantile(x,p=0.95))

> cutoff <- matrix(1.1*neg95,nrow(y),ncol(y),byrow=TRUE)

> isexpr <- rowSums(y$E > cutoff) >= 2

> table(isexpr)
isexpr
FALSE  TRUE
3489 59478

> y0 <- y[y$genes$ControlType==0 & isexpr,]

The different conditions are the different antibiotica treatments of the 42 strains/microarrays (NONE, CPZ, CFI, AMK, NM, DOXY, CP, AZM, TP, ENX, CPFX => 11, NONE=2 replicates, Rest=4 replicates, altogether=42).

I then create the model.matrix also according to limma userguide 17.4:

> Treatment <- factor(targets$Condition, levels = unique(targets$Condition))

> design <- model.matrix(~Treatment)

> colnames(design) <- levels(Treatment)

And then get the results:

> fit <- lmFit(y0,design)

> fit <- eBayes(fit,trend=TRUE)

> plotSA(fit, main="Probe-level")

> summary(decideTests(fit[,-1]))
     CPZ   CFI   AMK    NM  DOXY    CP   AZM    TP   ENX  CPFX
-1  2504  1924   586  1116  1662  2851   878  1896  1963  1258
0  54929 56244 57860 56986 56412 54460 57527 55701 56305 56886
1   1132   397   119   463   491  1254   160   968   297   421

Same analysis, but now averaging probes for each gene:

> yave <- avereps(y0,ID=y0$genes[,"SystematicName"])​

> fit <- lmFit(yave,design)

> fit <- eBayes(fit,trend=TRUE)

> summary(decideTests(fit[,-1]))
     CPZ   CFI   AMK    NM  DOXY    CP   AZM    TP   ENX  CPFX
-1  2431  1870   552  1080  1628  2819   843  1838  1938  1177
0  53447 54702 56298 55462 54861 52907 55967 54172 54733 55383
1   1091   397   119   427   480  1243   159   959   298   409

As one can see, this is way too big. The problem in my case is that the yave$genes$ProbeNames are the same as the     yave$genes$SytematicNames, so there can be no average. I use the GeneSymbol from the platform design file and swap it with the SystematicNames to get a better average value:

> x$genes$SystematicName <- platform$GeneSymbol

> y <- backgroundCorrect(x, method="normexp")

> y <- normalizeBetweenArrays(y,method="quantile")

> y0 <- y[y$genes$ControlType==0 & isexpr,]

> yave <- avereps(y0,ID=y0$genes[,"SystematicName"])

> fit <- lmFit(yave,design)

> fit <- eBayes(fit,trend=TRUE)

> summary(decideTests(fit[,-1]))
    CPZ  CFI  AMK   NM DOXY   CP  AZM   TP  ENX CPFX
-1  184  143   38   72  128  189   93  170  141  115
0  4204 4307 4426 4372 4310 4185 4364 4208 4316 4320
1    83   21    7   27   33   97   14   93   14   36

This looks way better. My questions are now:

1) Is my contrast.matrix design right in terms of my goal of finding a presence/absence pattern for the genes?

2) Can I swap GeneSymbol with SystematicNames to get a better average value for the respective genes?

3) Can I use the results of the last table as a presence/absence pattern for the genes?

Thanks in advance, Max

limma agilent microarray lmfit model.matrix • 1.5k views
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 14 hours ago
WEHI, Melbourne, Australia

You write:

My goal is to determine a presence/absence (on/off) pattern for the respective genes on the microarrays.

Well, you could do that simply by

Present <- (y$E > cutoff)

It's not clear to me what you're trying the achieve with the rest of the analysis. limma does differential expression analyses, which are concerned with the relative expression of genes, not just with whether they are present or not.

Are you really sure that you just want to determine presence/absence of genes? That would be too crude an analysis for any biological problem that I've seen over the years.

ADD COMMENT
0
Entering edit mode

Dear Mr. Smyth,

thank you for your answer. I apologize for my unclear question.

You are right, with the posted code above I find the differentially expressed genes.

However, my goal is to find a present/absent pattern for the genes of the mentioned dataset and then use the COBRA toolbox to create tissue specific models (function createTissueSpecificModel).

In a tutorial of this COBRA function limma was used to process the data and then the function mas5calls was used to get the presence/absence calls of the probes on the microarrays. But I was told that this function doesn't work for Agilent single-color microarrays.

As you wrote above, to distinguish the present genes I could just use the computed cutoff value. But do you think that I could still use the results of summary(decideTests(fit[,-1])) ​from my code above and use the regular expressed genes (with value 0) to create tissue specific models?

Kind regards, Max

 

ADD REPLY

Login before adding your answer.

Traffic: 645 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6