Question

Do I map probes to gene symbols BEFORE or AFTER differentially expressed gene analysis?

0

Entering edit mode

JackieMe • 0

@jackieme-14035

Last seen 5.6 years ago

I am learning to do differentially expressed gene analysis on microarray data(typically Affaymetrix GeneChip Human Genome U133 Plus 2.0 Array) of tumor vs normal. I've read some tutorials and know the basic steps. But I never saw people talk about when exactly to annotate the probes to gene symbols.
I now do DEG analysis on probe level and map probes to gene symbols after I get a list of differentially expressed probes. However, many differentially expressed probes map to the same gene, which resluts in much less DEGs than differentially expressed probes. I've also seen poeple map probes to gene symbols before DEG analysis, exactly after they get eset( eset <- exprs(rma(affyBatch)) ) . I am really confused.
I give the briefing of my code below:

library(GEOquery)
library(affycoretools)
library(limma)
library(dplyr)

gse54129 <- getGEO('GSE54129', destdir = '../GSE/series_matrix', getGPL = FALSE)
pdata <- pData(phenoData(gse54129[[1]]))
group <- ifelse(pdata$characteristics_ch1 == 'tissue: normal', 'Normal', 'Tumor')
group <- factor(group, levels=c('Normal', 'Tumor'))
table(group)
# Normal  Tumor 
#     21    111
###### load CEL files #####
library(affy)
files <- paste('../GSE/CEL/GSE54129/', list.files('../GSE/CEL/GSE54129/', pattern = 'CEL.gz'), sep = '')
affyBatch <- ReadAffy(filenames=files)
sampleNames(affyBatch)
sampleNames(affyBatch) <- substr(sampleNames(affyBatch), 1, 10)
sampleNames(affyBatch)

data.rma <- rma(affyBatch)

eset <- exprs(data.rma)

plotPCA(eset, addtext=colnames(eset), groups=factor(pdata$characteristics_ch1), legend=FALSE)

##### DEG #######
mode(eset)
# [1] "numeric"

design <- model.matrix(~group)

fit <- lmFit(eset, design)

fit2 <- eBayes(fit)
topTable(fit2, coef = 2, adjust.method = 'BH')
res <- decideTests(fit2, lfc = 2)
summary(res)
#    (Intercept) groupTumor
# -1           0        554
# 0            0      53609
# 1        54675        512

deg.54129 <- topTable(fit2, coef = 2, number = Inf, adjust.method = 'BH')
deg.54129$gene <- rownames(deg.54129)

deg.54129.sig2 <- filter(deg.54129, adj.P.Val < 0.05 & abs(logFC) >= 2)
dim(deg.54129.sig2)
# [1] 1066    7

library(hgu133plus2.db)
probes <- deg.54129.sig2$gene
gene.symbol <- as.character(as.list(hgu133plus2SYMBOL[probes]))
gene.entrez <- as.character(as.list(hgu133plus2ENTREZID[probes]))

deg.54129.sig2 <- cbind(PROBE=deg.54129.sig2$gene, SYMBOL=gene.symbol, ENTREZID=gene.entrez,
                        deg.54129.sig2[, -ncol(deg.54129.sig2)])
deg.54129.sig2 <- arrange(deg.54129.sig2, SYMBOL, desc(abs(logFC)))
deg.54129.sig2 <- deg.54129.sig2[!duplicated(deg.54129.sig2$SYMBOL), ]
dim(deg.54129.sig2)
# [1] 729   9
save(deg.54129.sig2, file = 'gse54129.deg.probe_level.rda')

limma deg microarray • 1.8k views

ADD COMMENT • link updated 7.4 years ago by Gordon Smyth 52k • written 7.4 years ago by JackieMe • 0

score 0 · Answer 1 · 2017-10-12

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 55 minutes ago

WEHI, Melbourne, Australia

It doesn't make any difference whether you map to probes before the DE analysis or after, although I myself would always do it before.

You could follow the example case study in Section 17.2 of the limma User's Guide. The example is for hgu95 affy arrays instead of u133plus2 but the code should be pretty straightforward to adapt.

ADD COMMENT • link 7.4 years ago Gordon Smyth 52k

0

Entering edit mode

Thank you Gordon! I am sorry if I am asking some strange questions.

In addition to this, multiple mapping ( dual probes mapping to a gene ) is common in hgu133plus array. I've seen some posts online said that to replace the expression values of the probes with thier mean/max expression values.

Is there a recommended way to deal with this ? And again, should I deal this this BEFORE or AFTER differentially expressed gene analysis? If possible, can you show me some example code or recommend a package to do this?

ADD REPLY • link 7.4 years ago JackieMe • 0

0

Entering edit mode

I usually don't see any need to do anything about multiple probes for a given gene. If however there really is a need to keep just one probe for each gene, then I do it like described here: Duplicate gene ID's returned from limma microarray analysis