I am learning to do differentially expressed gene analysis on microarray data(typically Affaymetrix GeneChip Human Genome U133 Plus 2.0 Array) of tumor vs normal. I've read some tutorials and know the basic steps. But I never saw people talk about when exactly to annotate the probes to gene symbols.
I now do DEG analysis on probe level and map probes to gene symbols after I get a list of differentially expressed probes. However, many differentially expressed probes map to the same gene, which resluts in much less DEGs than differentially expressed probes. I've also seen poeple map probes to gene symbols before DEG analysis, exactly after they get eset( eset <- exprs(rma(affyBatch))
) . I am really confused.
I give the briefing of my code below:
library(GEOquery) library(affycoretools) library(limma) library(dplyr) gse54129 <- getGEO('GSE54129', destdir = '../GSE/series_matrix', getGPL = FALSE) pdata <- pData(phenoData(gse54129[[1]])) group <- ifelse(pdata$characteristics_ch1 == 'tissue: normal', 'Normal', 'Tumor') group <- factor(group, levels=c('Normal', 'Tumor')) table(group) # Normal Tumor # 21 111 ###### load CEL files ##### library(affy) files <- paste('../GSE/CEL/GSE54129/', list.files('../GSE/CEL/GSE54129/', pattern = 'CEL.gz'), sep = '') affyBatch <- ReadAffy(filenames=files) sampleNames(affyBatch) sampleNames(affyBatch) <- substr(sampleNames(affyBatch), 1, 10) sampleNames(affyBatch) data.rma <- rma(affyBatch) eset <- exprs(data.rma) plotPCA(eset, addtext=colnames(eset), groups=factor(pdata$characteristics_ch1), legend=FALSE) ##### DEG ####### mode(eset) # [1] "numeric" design <- model.matrix(~group) fit <- lmFit(eset, design) fit2 <- eBayes(fit) topTable(fit2, coef = 2, adjust.method = 'BH') res <- decideTests(fit2, lfc = 2) summary(res) # (Intercept) groupTumor # -1 0 554 # 0 0 53609 # 1 54675 512 deg.54129 <- topTable(fit2, coef = 2, number = Inf, adjust.method = 'BH') deg.54129$gene <- rownames(deg.54129) deg.54129.sig2 <- filter(deg.54129, adj.P.Val < 0.05 & abs(logFC) >= 2) dim(deg.54129.sig2) # [1] 1066 7 library(hgu133plus2.db) probes <- deg.54129.sig2$gene gene.symbol <- as.character(as.list(hgu133plus2SYMBOL[probes])) gene.entrez <- as.character(as.list(hgu133plus2ENTREZID[probes])) deg.54129.sig2 <- cbind(PROBE=deg.54129.sig2$gene, SYMBOL=gene.symbol, ENTREZID=gene.entrez, deg.54129.sig2[, -ncol(deg.54129.sig2)]) deg.54129.sig2 <- arrange(deg.54129.sig2, SYMBOL, desc(abs(logFC))) deg.54129.sig2 <- deg.54129.sig2[!duplicated(deg.54129.sig2$SYMBOL), ] dim(deg.54129.sig2) # [1] 729 9 save(deg.54129.sig2, file = 'gse54129.deg.probe_level.rda')
Thank you Gordon! I am sorry if I am asking some strange questions.
In addition to this, multiple mapping ( dual probes mapping to a gene ) is common in hgu133plus array. I've seen some posts online said that to replace the expression values of the probes with thier mean/max expression values.
Is there a recommended way to deal with this ? And again, should I deal this this BEFORE or AFTER differentially expressed gene analysis? If possible, can you show me some example code or recommend a package to do this?
I usually don't see any need to do anything about multiple probes for a given gene. If however there really is a need to keep just one probe for each gene, then I do it like described here: Duplicate gene ID's returned from limma microarray analysis
Thanks! That's exactly what I am looking for.