Question

Why can't I use the same R package to get the same conclusions as the author?

0

Entering edit mode

llkxiaolan ▴ 10

@llkxiaolan-13767

Last seen 6.6 years ago

One question that puzzles me is that I use the same R package as the author. Why don't you get the same conclusion as the author?. Specifically, in the literature, the authors used Affy and limma packages to select differential genes, |logFC| > 1, P < 0.05 as threshold, and screened 752 differentially expressed genes. I used the same R package and threshold, but I screened only 625 differential genes. What causes this?. Why do the same data use the same analysis method, and the result will be different?

This is what the author mentioned in the literature about data processing :All the raw data and annotation files were downloaded for further analysis.The raw data were firstly preprocessed using the Affy package in R language [ 12 , 13 ] . Then, the DEGs in glomeruli tissues between patients with DN and healthy controls were analyzed by limma package in R [ 14 ] . The multiple testing correction was performed using a beniamini-Hochberg false discovery rate (HB FDR) [ 15 ] . Fold change (FC) of the expression of individual gene was also observed for diff erential expression test. The DEGs with FDR < 0.01 and |log FC| > 1 were considered to be signifi cant.The set of data used is GSE1009.

This is the code that I use :

library(affy)

library(tcltk)

dir <- tk_choose.dir(caption = "Select folder")

cel.files <- list.files(path = dir, pattern = ".+\\.cel$", ignore.case = TRUE,full.names = TRUE, recursive = TRUE)

basename(cel.files)

data.raw <- ReadAffy(filenames = cel.files)

sampleNames(data.raw) <- paste("CHIP", 1:length(cel.files), sep = "-")

sampleNames(data.raw)

pData(data.raw)$treatment <- rep(c("N", "T"), each=3)

pData(data.raw)

eset.rma <- rma(data.raw)

emat.rma.log2 <- exprs(eset.rma)

head(emat.rma.log2, 1)

results.rma <- data.frame((emat.rma.log2[,c(1,4)] + emat.rma.log2[,c(2,5)] + emat.rma.log2[,c(3,6)])/3)

results.rma$fc <- results.rma[,2] - results.rma[,1]

head(results.rma, 2)

data.mas5calls <- mas5calls(data.raw)

eset.mas5calls <- exprs(data.mas5calls)

head(eset.mas5calls)

AP <- apply(eset.mas5calls, 1, function(x)any(x=="P"))

present.probes <- names(AP[AP])

paste(length(present.probes),"/",length(AP))

results.present <- results.rma[present.probes,]

library(limma)

treatment <- factor(pData(eset.rma)$treatment)

design <- model.matrix(~ 0 + treatment)

colnames(design) <- c("N", "T")

design

contrast.matrix <- makeContrasts(T-N, levels=design)

contrast.matrix

fit <- lmFit(eset.rma[present.probes,], design)

fit2 <- contrasts.fit(fit, contrast.matrix)

fit2 <- eBayes(fit2)

results.lim <- topTable(fit2, coef=1, adjust.method="fdr", p.value=0.05, lfc=1, number=30000)

Thangk you

limma affy • 1.5k views

ADD COMMENT • link updated 7.3 years ago by James W. MacDonald 67k • written 7.3 years ago by llkxiaolan ▴ 10

score 1 · Answer 1 · 2017-08-17

1

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 3 minutes ago

United States

You aren't using the same package as the authors. If you read the paper they link to in their GEO submission, you will find that they used Affy's MAS5.0 software, not the Bioconductor affy package, and certainly not RMA, as you have. If you use RMA and they use MAS5.0, is it surprising that you get different results?

ADD COMMENT • link 7.3 years ago James W. MacDonald 67k

0

Entering edit mode

After re-reading your post I see that you are not trying to reproduce the original author's results, but somebody else's result who used those data. Unless the author states exactly what was done, or supplies the code (and you use the same exact versions of the packages), it isn't surprising that you get different results. For instance, you are filtering using MAS5.0 calls. Is that consistent with what the other person did?

ADD REPLY • link 7.3 years ago James W. MacDonald 67k

0

Entering edit mode

Thank you very much for your reply, and I checked the data later. I think the reason for the different results may be different ways of data preprocessing. Although the number of differential genes is different, I feel that my approach seems to be all right. What do you think?

ADD REPLY • link 7.3 years ago llkxiaolan ▴ 10

0

Entering edit mode

It seems OK, except for the part where you use lfc in your call to topTable. If you want to include a fold-change criterion you are better off using the treat function.

ADD REPLY • link 7.3 years ago James W. MacDonald 67k

score 0 · Answer 2 · 2017-08-17

0

Entering edit mode

Aleksandra • 0

@aleksandra-13756

Last seen 7.3 years ago

Did the Author describe which method of data preprocessing was used? If he only wrote that "The raw data were firstly preprocessed using the Affy package in R language [ 12 , 13 ]", it is very imprecise. Did he use rma or mas? Meybe he just used a different preprocessing method than you.

ADD COMMENT • link 7.3 years ago Aleksandra • 0

score 0 · Answer 3 · 2017-08-17

0

Entering edit mode

Axel Klenk ★ 1.0k

@axel-klenk-3224

Last seen 3 minutes ago

UPF, Barcelona, Spain

The package version(s) are important, too, as the software evolves.

System libraries such as BLAS and LAPACK may cause differences.

Exactly reproducing such an analysis is not easy.

ADD COMMENT • link 7.3 years ago Axel Klenk ★ 1.0k