Search
Question: Not able to reproduce limma results from affy chip with or without CEL file from 2 GSE datasets
0
19 months ago by
carine.genet0 wrote:

Dear all,

I need your help because I was not able to obtain a list of DEGs with limma analysis from two GSE datasets.

Basically, I want to compare two sets of data generated on the same platform (GPL2112; single channel experiment) and corresponding to two GSEs (10 samples from GSE39589 (GSM972638 to GSM972647) and 10 samples from GSE49505 (GSM1200157 to GSM1200166)). There is no GDS for these datasets. In fact, I want to compare 10 granulosa cells obtain from small follicles to 10 thecal cells from small follicles.

I downloaded the two ExpressionSets (values were already RMA normalized and log2 transformed) from GEOQuery, combine then and directly performs limma analysis (see code below)

However the results of my limma analysis was quite surprising because high number of DEGs was obtained. I supposed that something went wrong with normalization (see boxplots) so I decided to use the CEL file from GEO, I untared , and used the affy package to read the CEL files. I performed rma normalization and limma analysis and all genes are differentially expressed….

I supposed that I missed something in the pipeline and any help will be greatly appreciate…

Regards

Carine

###############################################################
#  Differential expression Limma
rm(list=ls())
library(Biobase)
library(GEOquery)
library(limma)
workDir <- "C:/Users/cagenet/Documents/B/Bovin"
setwd(workDir)
Granu <- getGEO("GSE39589")[[1]]
Theca <- getGEO("GSE49505")[[1]]

## combine ExpressionSets, granulosa versus theca
eset <- combine(Granu[,1:10], Theca[,6:15])
dim(eset)
## get rid of extra featureSet columns that we don't need
fData(eset) <- fData(eset)[,c(1,18,8)]
## design matrix accounting for two different tissues
tissue <- c("Granulosa","Granulosa","Granulosa","Granulosa","Granulosa","Granulosa","Granulosa","Granulosa","Granulosa","Granulosa",
"Theca","Theca","Theca","Theca","Theca","Theca","Theca","Theca","Theca","Theca")
tissue
## design matrix accounting for two different tissues
design <- model.matrix(~factor(tissue))
design
colnames(design)<-c("Gran","thequevsGran")
## filter out and take off AFFX probes
eset <- eset[-grep("AFFX", featureNames(eset)),]
## fit model
fit <- lmFit(eset, design)
dim(eset)
fit2 <- eBayes(fit)
#save all results
tT <- topTable(fit2,sort="none", n=Inf)
write.table(tT, file="all.bov.txt",sep="\t")

########################################################
# downloaded CEL files corresponding to the 20 samples and put it in workdir
library("affy", lib.loc="~/R/win-library/3.3")
estbov<-rma(databov)
boxplot(databov)
design<-model.matrix(~0+factor(c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2)))
colnames(design)<-c("theque","granu")
fit<-lmFit(estbov,design)
dim(estbov)
fit2 <- eBayes(fit)
write.exprs(estbov[1:24128,],file="esetbov.txt")
tT <- topTable(fit2,sort="none", n=Inf)
write.table(tT, file="affybov.txt",sep="\t")
> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252    LC_MONETARY=French_France.1252 LC_NUMERIC=C
[5] LC_TIME=French_France.1252

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] bovinecdf_2.18.0    limma_3.30.12       GEOquery_2.40.0     affy_1.52.0         Biobase_2.34.0      BiocGenerics_0.20.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.10          IRanges_2.8.2         XML_3.98-1.5          digest_0.6.12         bitops_1.0-6
[6] R6_2.2.0              DBI_0.6               stats4_3.3.3          RSQLite_1.1-2         BiocInstaller_1.24.0
[11] httr_1.2.1            zlibbioc_1.20.0       affyio_1.44.0         S4Vectors_0.12.2      preprocessCore_1.36.0
[16] tools_3.3.3           RCurl_1.95-4.8        rsconnect_0.7         AnnotationDbi_1.36.2  memoise_1.0.0  
modified 19 months ago by James W. MacDonald48k • written 19 months ago by carine.genet0
2
19 months ago by
United States
James W. MacDonald48k wrote:

You are fitting two different model types. The first is what I was taught to call a 'factor effects' model, in which one of your tissues (Granulosa) is the baseline, and the second coefficient is the difference between Theca and Granulosa. In that situation you want to call topTable with coef = 2, which is to say you want to test that the difference between Theca and Granulosa is different from zero. What you ARE doing is testing to see if the mean expression of Granulosa OR the difference between Theca and Granulosa are different from zero, and obviously that will be true for most if not all genes.

In the second case you are fitting a 'cell means' model where you are estimating the mean expression of the Theca and Granulosa samples. And then you are testing to see if the mean expression of either is different from zero, which again will be true for all genes. In this situation you need a contrasts matrix where you make the comparison between the two cell types (or you could use the factor effects model you used above), but either way you need to make comparisons between the two cell types.

But do note that this is almost surely a fool's errand. The two different cell types were processed by different people in different labs using different reagents and different lots of chips. The technical differences between the two groups almost surely swamps any biological difference. And even if it doesn't do so for some genes, you can't distinguish between technical differences and real biological differences, so there is no way to find the genes that might really be changing.

Hello James,

Thanks again for your help and explanation.I 'm not fluent with statistical models so your lights are very welcome.

In fact these two cells type were obtained from the same cows, in the same lab, by the same people, with the same lot of chips ...and was published in 2015 in Plos one. The authors identified cells markers for the two cell types. They used one way ANOVA using method of moment estimation. And as they only provided DEGs with 4<FC >4, and I need the whole list of cell markers, that’s why I want to analyse again their data. I want to obtain a whole genome view of the possible contamination between the two tissues which are close to each others.

So in light of your explanation, I can used  :

The factor effects model and the coding will be :

tissue <- c("G","G","G","G","G","G","G","G","G","G","T","T","T","T","T","T","T","T","T","T"),

## design matrix accounting for two different tissues

design <- model.matrix(~factor(tissue))

colnames(design)<-c("Gran","thequevsGran")

## fit model

fit <- lmFit(eset, design)

dim(eset)

fit2 <- eBayes(fit)

#save all results

tT <- topTable(fit2,sort="none", n=Inf)

Or the contrast matrix effect and the code will be :

tissue <- c("G","G","G","G","G","G","G","G","G","G","T","T","T","T","T","T","T","T","T","T"),

design <- model.matrix(~factor(tissue))

colnames(design)<-c("Gran","theque")

Contrast.matrix<-makeContrasts(theque-Gran, levels=design)

Fit2<-contrasts.fit(fit, contrast.matrix)

Fit2<-eBayes(fit2)

topTable(fit2,sort="none", n=Inf)

Do you validate the code ?

Thanks again and have a nice day

Carine

I think this example from the lmFit doc might help you. As stated by James, simply add the coef=2 argument to the topTable call in your first code. Hope I am right ;-)

# Simulate gene expression data for 100 probes and 6 microarrays
# Microarray are in two groups
# First two probes are differentially expressed in second group
# Std deviations vary between genes with prior df=4
sd <- 0.3*sqrt(4/rchisq(100,df=4))
y <- matrix(rnorm(100*6,sd=sd),100,6)
rownames(y) <- paste("Gene",1:100)
y[1:2,4:6] <- y[1:2,4:6] + 2
design <- cbind(Grp1=1,Grp2vs1=c(0,0,0,1,1,1))
options(digits=3)

# Ordinary fit
fit <- lmFit(y,design)
fit <- eBayes(fit)
topTable(fit,coef=2)
dim(fit)
colnames(fit)
rownames(fit)[1:10]
names(fit)

The underlying cause is that the OP has renamed the columns of the design matrix after calling model.matrix. If the column names were not changed, topTable would automatically detect the intercept and avoid using it for testing.

Aaron, please, could you tell me what "OP" means and what it does?

Thanks for the tip. The topTable is clearly looking for "(Intercept)" in the coefficients, and removes it making the computation straightforward.

Original poster, see https://en.wikipedia.org/wiki/OP#Internet.

Thank you SamGG and Aaron Lun. I Finally obtained the DEG list ...which is not at all the same find in the article...so maybe I will discard these data.

Regards

Carine