Dear Community,
I'm relatively new to R and i have recently processed and analyzed through limma an Agilent-014850 Whole Human Genome Microarray 4x44K G4112F (Probe Name version) dataset, concerning some statistical comparisons. My basic code is presented below:
SDRF <- read.delim("GSE12435.sdrf.txt",check.names=FALSE,stringsAsFactors=FALSE)
files <- list.files(pattern = "GSM")
dat <- read.maimages(files, source="agilent", green.only=T)
pdat <- read.table("GSE12435.sdrf.txt", sep="\t",header=TRUE)
pdat2 <- pdat[,2:3]
dat$targets <- pdat2
y <- backgroundCorrect(dat,method="normexp")
y <- normalizeBetweenArrays(y,method="quantile")
rownames(y)<-y$genes$ProbeName
control.index <- which(y$genes$ControlType %in% c("1", "-1"))
y2 <- y[-control.index,] # to remove control probes
head(rownames(y2))
[1] "A_24_P66027" "A_32_P77178" "A_23_P212522" "A_24_P9671" "A_24_P801451" "A_32_P30710"
ind <- rowSums(y2$E > 6) >=4
y2 <- y2[ind,] # some naive basic intensity filtering
group <- factor(y2$targets$Sample.and.Data.Relationship.Format,levels=c("control","bystander 4h","irradiated 4h"))
batch <- factor(y2$targets$Batch)
design <- model.matrix(~group + batch)
fit <- lmFit(y2, design)
fit2 <- eBayes(fit, trend=TRUE)
top2 <- toptable(fit2, coef=2, number=nrow(fit2), adjust.method = "fdr", sort="none")
head(top2)
ID logFC t P.Value adj.P.Val B
1 A_24_P66027 0.18871738 1.0400444 0.32212991 0.6150887 -5.778509
2 A_32_P77178 -0.17539127 -0.6735214 0.51545197 0.7570036 -6.088395
3 A_23_P212522 0.41854183 2.5192718 0.02981977 0.2639606 -3.720464
4 A_24_P9671 0.32246171 2.3800601 0.03794132 0.2854077 -3.945223
5 A_24_P801451 -0.58180526 -3.1227178 0.01046536 0.2044593 -2.726698
6 A_32_P30710 -0.02033664 -0.1193851 0.90726804 0.9595242 -6.316518
prob.names <- as.character(top2$ID)
gns <- select(hgug4112a.db, prob.names,columns=c("SYMBOL"))
'select()' returned many:1 mapping between keys and columns
dim(gns)
[1] 28592 2
# Thus, there is not something wrong here ? in detail, when i type
sum(duplicated(gns$PROBEID))
[1] 2007
# whereas dim(y2)
[1] 28592 12
In other words, with the many:1 mapping the resulted gns dataframe should not have more rows that the length of my probe.sets-probeIDs of my expression set ?
Thank you,
Konstantinos

Dear Gordon,
sorry to reply with delay and thank you for your response !! but the annotation frame after using the function select() should not have more rows ? due to the duplicate PROBEIDs i posted above-- ? l
like also in the following example in the vignette
(http://bioconductor.org/packages/release/bioc/vignettes/affycoretools/inst/doc/RefactoredAffycoretools.pdf)
Thank you !!
Konstadinos
You are incorrectly interpreting the message. You have duplicated probes on the array, and when you ask for the HUGO symbol, the select function is telling you that you are doing a many:1 mapping. But this doesn't mean that you should have different numbers of rows. In fact, for a many:1 mapping, you are guaranteed to have the same number of rows, as well as the same order. As an example:
> z <- read.maimages(dir(".", "txt.gz"), "agilent.median", green.only = TRUE) Read GSM312299.txt.gz Read GSM312300.txt.gz Read GSM312301.txt.gz Read GSM312302.txt.gz Read GSM312303.txt.gz Read GSM312304.txt.gz Read GSM312305.txt.gz Read GSM312306.txt.gz Read GSM312307.txt.gz Read GSM312308.txt.gz Read GSM312309.txt.gz Read GSM312310.txt.gz > sum(duplicated(z$genes$ProbeName)) [1] 3922 > z <- z[!z$genes$ControlType %in% c("1","-1"),] > sum(duplicated(z$genes$ProbeName)) [1] 2376 > library(hgug4112a.db) > gns <- select(hgug4112a.db, z$genes$ProbeName, "SYMBOL") 'select()' returned many:1 mapping between keys and columns > all.equal(gns$PROBEID, z$genes$ProbeName) [1] TRUEAnd to further illustrate my point:
> select(hgug4112a.db, rep(z$genes$ProbeName[4321], 5), "SYMBOL") 'select()' returned many:1 mapping between keys and columns PROBEID SYMBOL 1 A_23_P309973 PPP1R36 2 A_23_P309973 PPP1R36 3 A_23_P309973 PPP1R36 4 A_23_P309973 PPP1R36 5 A_23_P309973 PPP1R36So if I ask for the symbol for the same probe ID five times, I am returned a five-row data.frame, with a message that I have a many:1 mapping (five probe IDs -> one HUGO symbol).