Question

Annotation problem while quering hgug4112a.db with function select and duplicates for agilent microarray dataset

1

Entering edit mode

Konstantinos Yeles ▴ 80

@konstantinos-yeles-8961

Last seen 16 months ago

Italy

Dear Community,

I'm relatively new to R and i have recently processed and analyzed through limma an Agilent-014850 Whole Human Genome Microarray 4x44K G4112F (Probe Name version) dataset, concerning some statistical comparisons. My basic code is presented below:

SDRF <- read.delim("GSE12435.sdrf.txt",check.names=FALSE,stringsAsFactors=FALSE)
files <- list.files(pattern = "GSM")
dat <- read.maimages(files, source="agilent", green.only=T)
pdat <- read.table("GSE12435.sdrf.txt", sep="\t",header=TRUE)

pdat2 <- pdat[,2:3]
dat$targets <- pdat2

y <- backgroundCorrect(dat,method="normexp")

y <- normalizeBetweenArrays(y,method="quantile")

rownames(y)<-y$genes$ProbeName

control.index <- which(y$genes$ControlType %in% c("1", "-1"))
y2 <- y[-control.index,] # to remove control probes

head(rownames(y2))
[1] "A_24_P66027"  "A_32_P77178"  "A_23_P212522" "A_24_P9671"   "A_24_P801451" "A_32_P30710"

ind <- rowSums(y2$E > 6) >=4
y2 <- y2[ind,] # some naive basic intensity filtering

group <- factor(y2$targets$Sample.and.Data.Relationship.Format,levels=c("control","bystander 4h","irradiated 4h"))
batch <- factor(y2$targets$Batch)
design <- model.matrix(~group + batch)
fit <- lmFit(y2, design)
fit2 <- eBayes(fit, trend=TRUE)
top2 <- toptable(fit2, coef=2, number=nrow(fit2), adjust.method = "fdr", sort="none")
head(top2)
            ID       logFC          t    P.Value adj.P.Val         B
1  A_24_P66027  0.18871738  1.0400444 0.32212991 0.6150887 -5.778509
2  A_32_P77178 -0.17539127 -0.6735214 0.51545197 0.7570036 -6.088395
3 A_23_P212522  0.41854183  2.5192718 0.02981977 0.2639606 -3.720464
4   A_24_P9671  0.32246171  2.3800601 0.03794132 0.2854077 -3.945223
5 A_24_P801451 -0.58180526 -3.1227178 0.01046536 0.2044593 -2.726698
6  A_32_P30710 -0.02033664 -0.1193851 0.90726804 0.9595242 -6.316518

prob.names <- as.character(top2$ID)
gns <- select(hgug4112a.db, prob.names,columns=c("SYMBOL"))
'select()' returned many:1 mapping between keys and columns
dim(gns)
[1] 28592     2

# Thus, there is not something wrong here ? in detail, when i type

sum(duplicated(gns$PROBEID))
[1] 2007

# whereas dim(y2)
[1] 28592    12

In other words, with the many:1 mapping the resulted gns dataframe should not have more rows that the length of my probe.sets-probeIDs of my expression set ?

Thank you,

Konstantinos

limma annotation agilent hgug4112a microarray • 2.9k views

ADD COMMENT • link 9.2 years ago Konstantinos Yeles ▴ 80

0

Entering edit mode

Konstantinos Yeles ▴ 80

@konstantinos-yeles-8961

Last seen 16 months ago

Italy

Dear James,

Thank you for your very hepful answer !! Just to make sure that I get fully your notion:

1) I was confused with the above tutorial regarding the Affymetrix dataset, because after select() a bigger number of rows was returned—with various probesets to return more than one gene symbols!! Thus, if I understand correct (without having knowledge on how the Agilent chips are created) these probes are replicated from the beginning in the chip and exist from the start of my analysis right? That’s why you pinpointed sum(duplicated(z$genes$ProbeName))

[1] 3922

?

2) Secondly, because I also have possible gene symbol duplicates, how should I continue my annotation steps from the part I stopped above, after the select step, in order to get a final list of unique ProbeIDs with unique gene symbols ?

ADD COMMENT • link 9.2 years ago Konstantinos Yeles ▴ 80

1

Entering edit mode

First off, don't use the 'Add your answer' box to reply to a post. Instead, click the 'ADD COMMENT' link at the bottom of the post you are responding to and then write in the box that comes up.

Yes. Affymetrix arrays have no duplicated probeset IDs, whereas Agilent arrays do (at least sometimes).
If you want unique ProbeIDs, use avereps() first. This won't necessarily result in unique HUGO symbols, because it's always possible that two different ProbeIDs are intended to measure the same gene (or different transcript variants or whatever). I sort of doubt that's true, as you should have got a message about many:many mappings in that case. But if there are still duplicated gene symbols, you could either use avereps() on the gene symbol column, or you could just sort the data by p-value and remove the duplicates (which will then choose the gene symbols with the most evidence for differential expression).

ADD REPLY • link 9.2 years ago James W. MacDonald 68k

0

Entering edit mode

Dear James,

thank you again for your updated answers---regarding the duplicate gene symbols, i understand that different probeIDs could account still for the same gene symbol, so i will follow your advice with p-value, or other options. But most importantly, regarding the duplicate probeIDs: the function avereps() you have mentioned, in which part of the analysis i should implement it ? like the tutorial of limma, prior of statistical inference, with the following line :

final <- avereps(y2,ID=y2$genes$ProbeName) ? # y2 from my code above

and then proceed with limma and my DE list ?

Best,

Konstantinos

ADD REPLY • link 9.2 years ago Konstantinos Yeles ▴ 80

0

Entering edit mode

Following the limma User's Guide is always a good plan.

ADD REPLY • link 9.2 years ago James W. MacDonald 68k

0

Entering edit mode

Usually there is no harm in leaving the duplicated probes as they are. You should only use avereps() when you have a particular need for unique probes or unique genes (and nothing in your posts suggests that you do).

ADD REPLY • link 9.2 years ago Gordon Smyth 52k

0

Entering edit mode

Dear Gordon, actually my initial goal of creating this post (and excuse me if I wasn’t clear), is to acquire through Limma and annotation from R, a gene list with unique probesets-gene symbols, as I will like to use it for functional enrichment, as well as to compare for common differentially expressed genes from other gene lists. Thus, to conclude:

For this purpose, my above line of code before implementing Limma would be fine, concerning the duplicate ProbeIDs, correct?
Moreover, I saw from the Limma vignette in a similar example, that :

yave <- avereps(y0,ID=y0$genes[,"SystematicName"])

So, it has any significant difference or it is more appropriate to use instead of ProbeName for the downstream annotation process?

3. Finally, if I want to perform some diagnostic plots before statistical inference, I should use my dataset after the removal of duplicate ProbeIDs, or it would not matter?

Best Regards,

Konstantinos

ADD REPLY • link 9.2 years ago Konstantinos Yeles ▴ 80

score 1 · Accepted Answer · 2016-01-26

1

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 7 hours ago

WEHI, Melbourne, Australia

There is no problem here. Your data set and the annotation frame gns both have 28592 rows.

ADD COMMENT • link 9.2 years ago Gordon Smyth 52k

0

Entering edit mode

Dear Gordon,

sorry to reply with delay and thank you for your response !! but the annotation frame after using the function select() should not have more rows ? due to the duplicate PROBEIDs i posted above-- ? l

sum(duplicated(gns$PROBEID))
[1] 2007

like also in the following example in the vignette

(http://bioconductor.org/packages/release/bioc/vignettes/affycoretools/inst/doc/RefactoredAffycoretools.pdf)

Thank you !!

Konstadinos

ADD REPLY • link 9.2 years ago Konstantinos Yeles ▴ 80

0

Entering edit mode

You are incorrectly interpreting the message. You have duplicated probes on the array, and when you ask for the HUGO symbol, the select function is telling you that you are doing a many:1 mapping. But this doesn't mean that you should have different numbers of rows. In fact, for a many:1 mapping, you are guaranteed to have the same number of rows, as well as the same order. As an example:

> z <- read.maimages(dir(".", "txt.gz"), "agilent.median", green.only = TRUE)
Read GSM312299.txt.gz
Read GSM312300.txt.gz
Read GSM312301.txt.gz
Read GSM312302.txt.gz
Read GSM312303.txt.gz
Read GSM312304.txt.gz
Read GSM312305.txt.gz
Read GSM312306.txt.gz
Read GSM312307.txt.gz
Read GSM312308.txt.gz
Read GSM312309.txt.gz
Read GSM312310.txt.gz
> sum(duplicated(z$genes$ProbeName))
[1] 3922
> z <- z[!z$genes$ControlType %in% c("1","-1"),]
> sum(duplicated(z$genes$ProbeName))
[1] 2376
> library(hgug4112a.db)

> gns <- select(hgug4112a.db, z$genes$ProbeName, "SYMBOL")
'select()' returned many:1 mapping between keys and columns
> all.equal(gns$PROBEID, z$genes$ProbeName)
[1] TRUE

And to further illustrate my point:

> select(hgug4112a.db, rep(z$genes$ProbeName[4321], 5), "SYMBOL")
'select()' returned many:1 mapping between keys and columns
       PROBEID  SYMBOL
1 A_23_P309973 PPP1R36
2 A_23_P309973 PPP1R36
3 A_23_P309973 PPP1R36
4 A_23_P309973 PPP1R36
5 A_23_P309973 PPP1R36

So if I ask for the symbol for the same probe ID five times, I am returned a five-row data.frame, with a message that I have a many:1 mapping (five probe IDs -> one HUGO symbol).

ADD REPLY • link 9.2 years ago James W. MacDonald 68k