Hello there,
I am relatively new to R and I am analysing a small set of microarrays (Affymetrix Human Gene 2.0.ST) in order to find differentially expressed genes (DEG) among my samples.
There are a control and a treatment group, each in duplicates; so we are talking about four samples/arrays:
- control group 1
- sample or treatment group1
- control group 2 (duplicate)
- sample or treatment group 2 (duplicate)
I am starting out with the oligo package and at the end of the analysis (so far) I am using the limma package (on linux/debian, most recent version of R). However, I am not able to find any DEGs.
The problem is that these exact cel.files have been analysed before using the Windows based software package provide by Affymetrix - and in that case several DEGs have been identified.
So, something is not right but I do not really know what is going on. I have tried several things but nothing changed the outcome and I am running out of explanations. Below is the (current) script I am using for the microarray analysis and the session info.
I would be very thankful for any kind of ideas or suggestions.
Thanks a lot!
# loading some libraries library(oligo) library(limma) library(RSQLite) library(DBI) library(pd.hugene.2.0.st) library(GenomeInfoDb) library(AnnotationDbi) library(genefilter) # set working directory setwd("directory") # read cel files celFiles <- list.celfiles() affyRaw <- read.celfiles(celFiles) Platform design info loaded. Reading in : HuGene2_050715H_JP1_Ctrl-1.CEL Reading in : HuGene2_050715H_JP2_Sample-1.CEL Reading in : HuGene2_050715H_JP3_Ctrl-2.CEL Reading in : HuGene2_050715H_JP4_Sample-2.CEL # pre-processing using RMA eset <- rma(affyRaw) Background correcting Normalizing Calculating Expression write.exprs(eset,file="data_rma_processed.txt", sep="\t") # some quick statistics: MA plot (looks good to me) m <- rowMeans(e[,c(2,4)])-rowMeans(e[,c(1,3)]) a <- rowMeans(e) smoothScatter(a, m) abline(h=c(-1,1), col="red") # little modification on pData pData(eset) index HuGene2_050715H_JP1_Ctrl-1.CEL 1 HuGene2_050715H_JP2_Sample-1.CEL 2 HuGene2_050715H_JP3_Ctrl-2.CEL 3 HuGene2_050715H_JP4_Sample-2.CEL 4 pd <- data.frame(experiment = c(1,2,1,2), replicate = c(1,1,2,2)) rownames(pd) <- sampleNames(eset) pData(eset) <- pd pData(eset) experiment replicate HuGene2_050715H_JP1_Ctrl-1.CEL 1 1 HuGene2_050715H_JP2_Sample-1.CEL 2 1 HuGene2_050715H_JP3_Ctrl-2.CEL 1 2 HuGene2_050715H_JP4_Sample-2.CEL 2 2 # some quick statistics: t-test without multiple corrections and volcano plot tt <- rowttests(e, factor(eset$experiment)) lod <- -log10(tt$p.value) smoothScatter(m, lod, cex=0.25) abline(h=2, v=c(-1,1), col="red") # preliminary results (seems to be OK) table(tt$p.value <= 0.01) FALSE TRUE 52637 978
So far so good ...
# limma: create design matrix, contrast matrix and linear model fitting using eBayes design <- model.matrix(~0+factor(c(1,2,1,2))) colnames(design) <-c("control", "sample") fit <- lmFit(eset, design) contmatrix <- makeContrasts(sample-control, levels=design) fit2 <- contrasts.fit(fit, contmatrix) ebayes <- eBayes(fit2) Warning message: Zero sample variances detected, have been offset
And now things start to look weird
# show me the data results=decideTests(ebayes) summary(results) sample - control -1 0 0 53617 1 0 > vennDiagram(results) # multiple testing correction using “Benjamini and Hochberg” tab <- topTable(ebayes, coef=1, adjust.method="BH", n=Inf) write.table(tab, file="Hits_BH.csv", sep="\t") # no candidates for differential expressed genes …?!? sum(tab$adj.P.Val < 0.01) [1] 0 volcanoplot(ebayes, coef=1, cex=0.25) abline(h=2, v=c(-1,1), col="red")
Here is the session info
R version 3.2.1 (2015-06-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.1 LTS locale: [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 [7] LC_PAPER=en_CA.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] genefilter_1.50.0 hugene20sttranscriptcluster.db_8.3.1 [3] org.Hs.eg.db_3.1.2 GO.db_3.1.2 [5] AnnotationDbi_1.30.1 GenomeInfoDb_1.4.1 [7] pd.hugene.2.0.st_3.14.1 RSQLite_1.0.0 [9] DBI_0.3.1 oligo_1.32.0 [11] Biostrings_2.36.1 XVector_0.8.0 [13] IRanges_2.2.5 S4Vectors_0.6.1 [15] Biobase_2.28.0 BiocGenerics_0.14.0 [17] limma_3.24.12 oligoClasses_1.30.0 loaded via a namespace (and not attached): [1] affxparser_1.40.0 GenomicRanges_1.20.5 splines_3.2.1 [4] zlibbioc_1.14.0 bit_1.1-12 xtable_1.7-4 [7] foreach_1.4.2 tools_3.2.1 ff_2.2-13 [10] KernSmooth_2.23-15 iterators_1.0.7 survival_2.38-3 [13] preprocessCore_1.30.0 affyio_1.36.0 codetools_0.2-11 [16] BiocInstaller_1.18.3 XML_3.98-1.3 annotate_1.46.0
Thanks for your suggestions. Regarding the adjusted p-value, I tried several versions and thresholds but all the values are larger than 0.5 (about 0.56 and up). But I'll try the filtering.
Dear Dennis,
i havent used the specific affymetrix Plaform and thus i dont know if can use some Detection p-value threshold to characterize a probeset as present or absent. Alternatively, you can use any of the following procedures to apply a kind of non-specific intensity filtering:
1) You can use the package genefilter and apply the following function i have slightly modified from the vignette
filter1<- function(ExpressionSet, threshold1, threshold2){
if(require(genefilter)){
set <- ExpressionSet
f1 <- kOverA(threshold1,threshold2)
ffun <- filterfun(f1)
filter <- genefilter(exprs(set),ffun)
filtered <- set[filter,]
return(filtered)
}
}
and where threshold 1 is the number of the samples(elements) to exceed threshold1
and threshold2 the value you want to exceed-For instance, if im not mistaken and you have 4 samples, you can use as threshold1=2(the half) and as threshold2 a value of log2(100)~6.64. But this is also very general and you have to also check via some histogram plots the intensity distribution of your probesets.
2. And secondly, you could apply the function shorth from the package genefilter, which calculates the midpoint of the shorth(the shortest interval containing the half of the data) & in a lot of examples is a reliable estimator of the peak of the distribution:
row.mean <- esApply(eset,1,mean) # or you could possibly use median instead of mean
sh <- shorth(row.mean)
hist(eset)
abline(v=sh) # to see the peak
filtered.set <- eset[row.mean >=sh,]
dim(filtered.set)
and again check your intensity distribution after the filtering
Hope it helps