summarization for G4110B two colors agilent array: how to?
Entering edit mode
Andrea Grilli ▴ 240
Last seen 6.3 years ago
Italy, Bologna, Rizzoli Orthopaedic Ins…

Dear BioC list,

I'm analyzing for the first time old data from Agilent two colors arrays (array version G4110B). As experimental design, I have 3 time points x 2 replicates at each time point in a treated vs control experiment. Arrays were hybridized at 2 different times, so I decided to reduce an evident batch effect using also the "NormalizeBetweenArrays" function after "NormalizeWithinArrays".

I analyzed data according to chapter 4 of Gentleman's et al book using marray package. Only at the ending point of detecting differentially expressed genes, I discovered that no summarization of identical probes was done; checking platform design, I've seen that only a minority of transcritps are detected by multiple and identical probes, instead majority are not.

I googled a lot but I cannot find an easy way to perform summarization within marray package or alternatives. I've seen that this step exists in "Agi4x44PreProcess" package (function "summarize.probe"), but I have some problem to make it working (function was not found after library is loaded).

So my questions:

- I'm puzzled that summarization does not exist in marray package: maybe I'm wrong in my googling and reading?

- If I can make "summarize.probe" working, is it correct to do it after NormalizeWithin?

Below there are script and sessionInfo. As general consideration, I did a very simple design matrix because I had some problem in setting a more complex design type.

I'm less familiar with gene expression Agilent arrays, so any comment is really appreciated.
Thank you in advance,

############### CODE ###############
files <- c("1.txt", "2.txt", "3.txt", "4.txt","5.txt", "6.txt")
# change array names
names <- c("30m_rep1", "30m_rep2", "60m_rep1", "60m_rep2", "120m_rep1", "120m_rep2")
RG <- read.maimages(files, source = "agilent", names = c(names))
RGb <- backgroundCorrect(RG, method = "subtract")

MA.median <- normalizeWithinArrays(RGb, method = "median")
    # NORMALIZE BETWEEN # <- normalizeBetweenArrays(MA.median, method = "quantile")
sel <- ($genes$ControlType)== 0
normdata <-[sel,]

# ---->>>> ADD SUMMARIZATION??? ###

 # ANALYSIS at single time point level #
array <- c(rep((1), 6))
time30 <- c(rep(c(0, 1, 1), each = 2)); time60 <- c(rep(c(1, 0, 1), each = 2)); time120 <- c(rep(c(1, 1, 0), each = 2))

design30 <- model.matrix(~0 + array + time30); design60 <- model.matrix(~0 + array + time60); design120 <- model.matrix(~0 + array + time120)

fit30 <- lmFit(normdata, design30); fit2.30 <- eBayes(fit30)
fit60 <- lmFit(normdata, design60); fit2.60 <- eBayes(fit60)
fit120 <- lmFit(normdata, design120); fit2.120 <- eBayes(fit120)

library(hgug4110b.db); library(annotate); library(annaffy)

fit30$genes$ID <- (getSYMBOL(fit30$genes$ProbeName, "hgug4110b.db"))
geneID <- cbind(ProbeID=c(fit30$genes$ProbeName), geneID=c(fit30$genes$ID), geneDescription=c(fit30$genes$Description))

fit2.30$genes$ID <- (getSYMBOL(fit2.30$genes$ProbeName, "hgug4110b.db")); fit2.60$genes$ID <- (getSYMBOL(fit2.60$genes$ProbeName, "hgug4110b.db"));

it2.120$genes$ID <- (getSYMBOL(fit2.120$genes$ProbeName, "hgug4110b.db"))

# set parameters
corr <- c("fdr"); p <- 0.05; len <- length(fit30$genes[,1]); log <- 0.5849625

# get significant genes
top_sign.30 <- toptable(fit2.30, adjust=corr, p.value=p, genelist=cbind(geneID), number=len, lfc=log)
top_sign.60 <- toptable(fit2.60, adjust=corr, p.value=p, genelist=cbind(geneID), number=len, lfc=log)
top_sign.120 <- toptable(fit2.120, adjust=corr, p.value=p, genelist=cbind(geneID), number=len, lfc=log)

# save d.e. lists
if (Save == "yes"){
    write.table(top_sign.30, file = "./EWS_6647.0662_vs_6647_MedQuantNorm_log2_Time30.txt", sep = "\t", col.names = NA)
    write.table(top_sign.60, file = "./EWS_6647.0662_vs_6647_MedQuantNorm_log2_Time60.txt", sep = "\t", col.names = NA)
    write.table(top_sign.120, file = "./EWS_6647.0662_vs_6647_MedQuantNorm_log2_Time120.txt", sep = "\t", col.names = NA)

############### SESSION INFO ###############
> sessionInfo()
R version 2.14.2 (2012-02-29)
Platform: i386-pc-mingw32/i386 (32-bit)

[1] LC_COLLATE=Italian_Italy.1252  LC_CTYPE=Italian_Italy.1252    LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C                  
[5] LC_TIME=Italian_Italy.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biobase_2.14.0      BiocInstaller_1.2.1 marray_1.32.0       limma_3.10.3       

loaded via a namespace (and not attached):
[1] BiocGenerics_0.4.0 DBI_0.2-5          parallel_2.14.2    RSQLite_0.11.2     stats4_2.14.2      tools_2.14.2       xtable_1.7-1
marray agilent twochannel normalization limma • 1.4k views
Entering edit mode
Last seen 7 hours ago
WEHI, Melbourne, Australia

Actually you are not using the marray package at all. All the code you give uses the limma package, which is automatically loaded when you type library(marray).

You haven't followed the advice of Chapter 4 of the Bioconductor book in detail. That chapter recommends loess normalization, whereas you have used the far more primitive median normalization. Why? The marray and limma authors all recommend loess normalization for Agilent data. There is no need for a normalizeBetweenArrays() step. We recommend that you background correct and normalize like this:

RGb <- backgroundCorrect(RG, method="normexp")
MA <- normalizeBetweenArrays(RGb, method="loess")

As Axel has pointed out, you are using out-of-date software. Please update to the current version of limma and Bioconductor so that we can help you properly. Having a look through a recent version of the limma User's Guide will I think be more effective than doing google searches.

Probe summarization is readily available though the avereps() function of limma, but I question whether you really need to do this. Agilent arrays do not contain a lot of repeated probes.

You didn't ask about the DE analysis, but your DE analysis isn't correct and getting this right is more important than whether or not to average the duplicate probes. You shouldn't be creating a different design matrix for each comparison. From your description, the experimental design seems to be like this:

> targets
  FileName     Cy3  Cy5
1    1.txt Control  30m
2    2.txt Control  30m
3    3.txt Control  60m
4    4.txt Control  60m
5    5.txt Control 120m
6    6.txt Control 120m

You can test for DE between each of the three times vs control like this:

> design <- modelMatrix(targets,ref="Control")
Found unique target names:
 120m 30m 60m Control 
> fit <- lmFit(MA, design)
> fit <- eBayes(fit)
> topTable(fit, coef="30m")
> topTable(fit, coef="60m")
> topTable(fit, coef="120m")
Entering edit mode
Axel Klenk ▴ 980
Last seen 11 hours ago

Dear Andrea,

I cannot comment on packages marray and Agi4x44PreProcess since I have never used them but package limma has a function avereps() that you could use. And yes, I would do the summarization after the normalization steps as you propose.

Some comments you have not asked for:

your annotation code is pretty old-fashioned -- you could/should use the select interface as described in the vignette of package AnnotationDbi. And, by the way, your R and Bioc versions are pretty old, too.

And you could/should address your batch effect by including hybridization batches in your limma model (or using function ComBat() from package sva) rather than additional normalization.


 - axel



Login before adding your answer.

Traffic: 267 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6