Question: Annotation problem while quering hgug4112a.db with function select and duplicates for agilent microarray dataset
1
3.6 years ago by
University of Salerno, Salerno, Italy
Konstantinos Yeles20 wrote:

Dear Community,

I'm relatively new to R and i have recently processed and analyzed through limma an Agilent-014850 Whole Human Genome Microarray 4x44K G4112F (Probe Name version) dataset, concerning some statistical comparisons.  My basic code is presented below:

SDRF <- read.delim("GSE12435.sdrf.txt",check.names=FALSE,stringsAsFactors=FALSE)
files <- list.files(pattern = "GSM")

pdat2 <- pdat[,2:3]
dat$targets <- pdat2 y <- backgroundCorrect(dat,method="normexp") y <- normalizeBetweenArrays(y,method="quantile") rownames(y)<-y$genes$ProbeName control.index <- which(y$genes$ControlType %in% c("1", "-1")) y2 <- y[-control.index,] # to remove control probes head(rownames(y2)) [1] "A_24_P66027" "A_32_P77178" "A_23_P212522" "A_24_P9671" "A_24_P801451" "A_32_P30710" ind <- rowSums(y2$E > 6) >=4
y2 <- y2[ind,] # some naive basic intensity filtering

group <- factor(y2$targets$Sample.and.Data.Relationship.Format,levels=c("control","bystander 4h","irradiated 4h"))
batch <- factor(y2$targets$Batch)
design <- model.matrix(~group + batch)
fit <- lmFit(y2, design)
fit2 <- eBayes(fit, trend=TRUE)
top2 <- toptable(fit2, coef=2, number=nrow(fit2), adjust.method = "fdr", sort="none")
ID       logFC          t    P.Value adj.P.Val         B
1  A_24_P66027  0.18871738  1.0400444 0.32212991 0.6150887 -5.778509
2  A_32_P77178 -0.17539127 -0.6735214 0.51545197 0.7570036 -6.088395
3 A_23_P212522  0.41854183  2.5192718 0.02981977 0.2639606 -3.720464
4   A_24_P9671  0.32246171  2.3800601 0.03794132 0.2854077 -3.945223
5 A_24_P801451 -0.58180526 -3.1227178 0.01046536 0.2044593 -2.726698
6  A_32_P30710 -0.02033664 -0.1193851 0.90726804 0.9595242 -6.316518

prob.names <- as.character(top2$ID) gns <- select(hgug4112a.db, prob.names,columns=c("SYMBOL")) 'select()' returned many:1 mapping between keys and columns dim(gns) [1] 28592 2 # Thus, there is not something wrong here ? in detail, when i type sum(duplicated(gns$PROBEID))
[1] 2007

# whereas dim(y2)
[1] 28592    12

In other words, with the many:1 mapping the resulted gns dataframe should not have more rows that the length of my probe.sets-probeIDs of my expression set ?

Thank you,

Konstantinos

modified 3.6 years ago • written 3.6 years ago by Konstantinos Yeles20
Answer: Annotation problem while quering hgug4112a.db with function select and duplicate
0
3.6 years ago by
Gordon Smyth38k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth38k wrote:

There is no problem here. Your data set and the annotation frame gns both have 28592 rows.

Dear Gordon,

sorry to reply with delay and thank you for your response !! but the annotation frame after using the function select() should not have more rows ? due to the duplicate PROBEIDs i posted above-- ? l

sum(duplicated(gns$PROBEID)) [1] 2007  like also in the following example in the vignette Thank you !! Konstadinos ADD REPLYlink written 3.6 years ago by Konstantinos Yeles20 You are incorrectly interpreting the message. You have duplicated probes on the array, and when you ask for the HUGO symbol, the select function is telling you that you are doing a many:1 mapping. But this doesn't mean that you should have different numbers of rows. In fact, for a many:1 mapping, you are guaranteed to have the same number of rows, as well as the same order. As an example: > z <- read.maimages(dir(".", "txt.gz"), "agilent.median", green.only = TRUE) Read GSM312299.txt.gz Read GSM312300.txt.gz Read GSM312301.txt.gz Read GSM312302.txt.gz Read GSM312303.txt.gz Read GSM312304.txt.gz Read GSM312305.txt.gz Read GSM312306.txt.gz Read GSM312307.txt.gz Read GSM312308.txt.gz Read GSM312309.txt.gz Read GSM312310.txt.gz > sum(duplicated(z$genes$ProbeName)) [1] 3922 > z <- z[!z$genes$ControlType %in% c("1","-1"),] > sum(duplicated(z$genes$ProbeName)) [1] 2376 > library(hgug4112a.db) > gns <- select(hgug4112a.db, z$genes$ProbeName, "SYMBOL") 'select()' returned many:1 mapping between keys and columns > all.equal(gns$PROBEID, z$genes$ProbeName)
[1] TRUE

And to further illustrate my point:

> select(hgug4112a.db, rep(z$genes$ProbeName[4321], 5), "SYMBOL")
'select()' returned many:1 mapping between keys and columns
PROBEID  SYMBOL
1 A_23_P309973 PPP1R36
2 A_23_P309973 PPP1R36
3 A_23_P309973 PPP1R36
4 A_23_P309973 PPP1R36
5 A_23_P309973 PPP1R36

So if I ask for the symbol for the same probe ID five times, I am returned a five-row data.frame, with a message that I have a many:1 mapping (five probe IDs -> one HUGO symbol).

Answer: Annotation problem while quering hgug4112a.db with function select and duplicate
0
3.6 years ago by
University of Salerno, Salerno, Italy
Konstantinos Yeles20 wrote:

Dear James,

Thank you for your very hepful answer !! Just to make sure that I get fully your notion:

1) I was confused with the above tutorial regarding the Affymetrix dataset, because after select() a bigger number of rows was returned—with various probesets to return more than one gene symbols!! Thus, if I understand correct (without having knowledge on how the Agilent chips are created) these probes are replicated from the beginning in the chip and exist from the start of my analysis right? That’s why you pinpointed  sum(duplicated(z$genes$ProbeName))

[1] 3922

?

2) Secondly, because I also have possible gene symbol duplicates, how should I continue my annotation steps from the part I stopped above, after the select step, in order to get a final list of unique ProbeIDs with unique gene symbols ?

1

1. Yes. Affymetrix arrays have no duplicated probeset IDs, whereas Agilent arrays do (at least sometimes).
2. If you want unique ProbeIDs, use avereps() first. This won't necessarily result in unique HUGO symbols, because it's always possible that two different ProbeIDs are intended to measure the same gene (or different transcript variants or whatever). I sort of doubt that's true, as you should have got a message about many:many mappings in that case. But if there are still duplicated gene symbols, you could either use avereps() on the gene symbol column, or you could just sort the data by p-value and remove the duplicates (which will then choose the gene symbols with the most evidence for differential expression).

Dear James,

thank you again for your updated answers---regarding the duplicate gene symbols, i understand that different probeIDs could account still for the same gene symbol, so i will follow your advice with p-value, or other options. But most importantly, regarding the duplicate probeIDs: the function avereps() you have mentioned, in which part of the analysis i should implement it ? like the tutorial of limma, prior of statistical inference, with the following line :

final <- avereps(y2,ID=y2$genes$ProbeName) ? # y2 from my code above

and then proceed with limma and my DE list ?

Best,

Konstantinos

Following the limma User's Guide is always a good plan.

Usually there is no harm in leaving the duplicated probes as they are. You should only use avereps() when you have a particular need for unique probes or unique genes (and nothing in your posts suggests that you do).

Dear Gordon, actually my initial goal of creating this post (and excuse me if I wasn’t clear), is to acquire through Limma and annotation from R, a gene list with unique probesets-gene symbols, as I will like to use it for functional enrichment, as well as to compare for common differentially expressed genes from other gene lists. Thus, to conclude:

1. For this purpose, my above line of code before implementing Limma would be fine, concerning the duplicate ProbeIDs, correct?
2. Moreover, I saw from the Limma vignette in a similar example, that :

yave <- avereps(y0,ID=y0\$genes[,"SystematicName"])

So, it has any significant difference or it is more appropriate to use instead of ProbeName for the downstream annotation process?

3. Finally, if I want to perform some diagnostic plots before statistical inference, I should use my dataset after the removal of duplicate ProbeIDs, or it would not matter?

Best Regards,

Konstantinos