Search
Question: Do I map probes to gene symbols BEFORE or AFTER differentially expressed gene analysis?
0
gravatar for JackieMe
12 days ago by
JackieMe0
JackieMe0 wrote:

I am learning to do differentially expressed gene analysis on microarray data(typically Affaymetrix GeneChip Human Genome U133 Plus 2.0 Array) of tumor vs normal. I've read some tutorials and know the basic steps. But I never saw people talk about when exactly to annotate the probes to gene symbols.
I now do DEG analysis on probe level and map probes to gene symbols after I get a list of differentially expressed probes. However, many differentially expressed probes map to the same gene, which resluts in much less DEGs than differentially expressed probes. I've also seen poeple map probes to gene symbols before DEG analysis, exactly after they get eset( eset <- exprs(rma(affyBatch)) ) . I am really confused. 
I give the briefing of my code below:

library(GEOquery)
library(affycoretools)
library(limma)
library(dplyr)

gse54129 <- getGEO('GSE54129', destdir = '../GSE/series_matrix', getGPL = FALSE)
pdata <- pData(phenoData(gse54129[[1]]))
group <- ifelse(pdata$characteristics_ch1 == 'tissue: normal', 'Normal', 'Tumor')
group <- factor(group, levels=c('Normal', 'Tumor'))
table(group)
# Normal  Tumor 
#     21    111
###### load CEL files #####
library(affy)
files <- paste('../GSE/CEL/GSE54129/', list.files('../GSE/CEL/GSE54129/', pattern = 'CEL.gz'), sep = '')
affyBatch <- ReadAffy(filenames=files)
sampleNames(affyBatch)
sampleNames(affyBatch) <- substr(sampleNames(affyBatch), 1, 10)
sampleNames(affyBatch)

data.rma <- rma(affyBatch)

eset <- exprs(data.rma)

plotPCA(eset, addtext=colnames(eset), groups=factor(pdata$characteristics_ch1), legend=FALSE)

##### DEG #######
mode(eset)
# [1] "numeric"

design <- model.matrix(~group)

fit <- lmFit(eset, design)

fit2 <- eBayes(fit)
topTable(fit2, coef = 2, adjust.method = 'BH')
res <- decideTests(fit2, lfc = 2)
summary(res)
#    (Intercept) groupTumor
# -1           0        554
# 0            0      53609
# 1        54675        512

deg.54129 <- topTable(fit2, coef = 2, number = Inf, adjust.method = 'BH')
deg.54129$gene <- rownames(deg.54129)

deg.54129.sig2 <- filter(deg.54129, adj.P.Val < 0.05 & abs(logFC) >= 2)
dim(deg.54129.sig2)
# [1] 1066    7

library(hgu133plus2.db)
probes <- deg.54129.sig2$gene
gene.symbol <- as.character(as.list(hgu133plus2SYMBOL[probes]))
gene.entrez <- as.character(as.list(hgu133plus2ENTREZID[probes]))

deg.54129.sig2 <- cbind(PROBE=deg.54129.sig2$gene, SYMBOL=gene.symbol, ENTREZID=gene.entrez,
                        deg.54129.sig2[, -ncol(deg.54129.sig2)])
deg.54129.sig2 <- arrange(deg.54129.sig2, SYMBOL, desc(abs(logFC)))
deg.54129.sig2 <- deg.54129.sig2[!duplicated(deg.54129.sig2$SYMBOL), ]
dim(deg.54129.sig2)
# [1] 729   9
save(deg.54129.sig2, file = 'gse54129.deg.probe_level.rda')
ADD COMMENTlink modified 11 days ago by Gordon Smyth32k • written 12 days ago by JackieMe0
0
gravatar for Gordon Smyth
11 days ago by
Gordon Smyth32k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth32k wrote:

It doesn't make any difference whether you map to probes before the DE analysis or after, although I myself would always do it before.

You could follow the example case study in Section 17.2 of the limma User's Guide. The example is for hgu95 affy arrays instead of u133plus2 but the code should be pretty straightforward to adapt.

ADD COMMENTlink modified 11 days ago • written 11 days ago by Gordon Smyth32k

Thank you Gordon! I am sorry if I am asking some strange questions.

In addition to this, multiple mapping ( dual probes mapping to a gene ) is common in hgu133plus array. I've seen some posts online said that to replace the expression values of the probes with thier mean/max expression values.    

Is there a recommended way to deal with this ? And again, should I deal this this  BEFORE or AFTER differentially expressed gene analysis?  If possible, can you show me some example code or recommend a package to do this? 
 

ADD REPLYlink written 11 days ago by JackieMe0

I usually don't see any need to do anything about multiple probes for a given gene. If however there really is a need to keep just one probe for each gene, then I do it like described here: Duplicate gene ID's returned from limma microarray analysis

ADD REPLYlink modified 11 days ago • written 11 days ago by Gordon Smyth32k

Thanks! That's exactly what I am looking for.

ADD REPLYlink written 11 days ago by JackieMe0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 192 users visited in the last hour