One question that puzzles me is that I use the same R package as the author. Why don't you get the same conclusion as the author?. Specifically, in the literature, the authors used Affy and limma packages to select differential genes, |logFC| > 1, P < 0.05 as threshold, and screened 752 differentially expressed genes. I used the same R package and threshold, but I screened only 625 differential genes. What causes this?. Why do the same data use the same analysis method, and the result will be different?
This is what the author mentioned in the literature about data processing :All the raw data and annotation files were downloaded for further analysis.The raw data were firstly preprocessed using the Affy package in R language [ 12 , 13 ] . Then, the DEGs in glomeruli tissues between patients with DN and healthy controls were analyzed by limma package in R [ 14 ] . The multiple testing correction was performed using a beniamini-Hochberg false discovery rate (HB FDR) [ 15 ] . Fold change (FC) of the expression of individual gene was also observed for diff erential expression test. The DEGs with FDR < 0.01 and |log FC| > 1 were considered to be signifi cant.The set of data used is GSE1009.
This is the code that I use :
library(affy)
library(tcltk)
dir <- tk_choose.dir(caption = "Select folder")
cel.files <- list.files(path = dir, pattern = ".+\\.cel$", ignore.case = TRUE,full.names = TRUE, recursive = TRUE)
basename(cel.files)
data.raw <- ReadAffy(filenames = cel.files)
sampleNames(data.raw) <- paste("CHIP", 1:length(cel.files), sep = "-")
sampleNames(data.raw)
pData(data.raw)$treatment <- rep(c("N", "T"), each=3)
pData(data.raw)
eset.rma <- rma(data.raw)
emat.rma.log2 <- exprs(eset.rma)
head(emat.rma.log2, 1)
results.rma <- data.frame((emat.rma.log2[,c(1,4)] + emat.rma.log2[,c(2,5)] + emat.rma.log2[,c(3,6)])/3)
results.rma$fc <- results.rma[,2] - results.rma[,1]
head(results.rma, 2)
data.mas5calls <- mas5calls(data.raw)
eset.mas5calls <- exprs(data.mas5calls)
head(eset.mas5calls)
AP <- apply(eset.mas5calls, 1, function(x)any(x=="P"))
present.probes <- names(AP[AP])
paste(length(present.probes),"/",length(AP))
results.present <- results.rma[present.probes,]
library(limma)
treatment <- factor(pData(eset.rma)$treatment)
design <- model.matrix(~ 0 + treatment)
colnames(design) <- c("N", "T")
design
contrast.matrix <- makeContrasts(T-N, levels=design)
contrast.matrix
fit <- lmFit(eset.rma[present.probes,], design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
results.lim <- topTable(fit2, coef=1, adjust.method="fdr", p.value=0.05, lfc=1, number=30000)
Thangk you
After re-reading your post I see that you are not trying to reproduce the original author's results, but somebody else's result who used those data. Unless the author states exactly what was done, or supplies the code (and you use the same exact versions of the packages), it isn't surprising that you get different results. For instance, you are filtering using MAS5.0 calls. Is that consistent with what the other person did?
Thank you very much for your reply, and I checked the data later. I think the reason for the different results may be different ways of data preprocessing. Although the number of differential genes is different, I feel that my approach seems to be all right. What do you think?
It seems OK, except for the part where you use lfc in your call to
topTable
. If you want to include a fold-change criterion you are better off using thetreat
function.