Trouble identifying differentially expressed genes on Affymetrix Human Gene 2.0.ST using R with oligo and limma package
3
1
Entering edit mode
@dennisoezcelik-8335
Last seen 5.3 years ago

Hello there,

I am relatively new to R and I am analysing a small set of microarrays (Affymetrix Human Gene 2.0.ST) in order to find differentially expressed genes (DEG) among my samples.

There are a control and a treatment group, each in duplicates; so we are talking about four samples/arrays:

1. control group 1
2. sample or treatment group1
3. control group 2 (duplicate)
4. sample or treatment group 2 (duplicate)

I am starting out with the oligo package and at the end of the analysis (so far) I am using the limma package (on linux/debian, most recent version of R). However, I am not able to find any DEGs.

The problem is that these exact cel.files have been analysed before using the Windows based software package provide by Affymetrix - and in that case several DEGs have been identified.

So, something is not right but I do not really know what is going on. I have tried several things but nothing changed the outcome and I am running out of explanations. Below is the (current) script I am using for the microarray analysis and the session info.

I would be very thankful for any kind of ideas or suggestions.

Thanks a lot!

# loading some libraries
library(oligo)
library(limma)
library(RSQLite)
library(DBI)
library(pd.hugene.2.0.st)
library(GenomeInfoDb)
library(AnnotationDbi)
library(genefilter)

# set working directory
setwd("directory")

celFiles <- list.celfiles()

# pre-processing using RMA
eset <- rma(affyRaw)
Background correcting
Normalizing
Calculating Expression

write.exprs(eset,file="data_rma_processed.txt", sep="\t")

# some quick statistics: MA plot (looks good to me)
m <- rowMeans(e[,c(2,4)])-rowMeans(e[,c(1,3)])
a <- rowMeans(e)
smoothScatter(a, m)
abline(h=c(-1,1), col="red")

# little modification on pData
pData(eset)
index
HuGene2_050715H_JP1_Ctrl-1.CEL       1
HuGene2_050715H_JP2_Sample-1.CEL     2
HuGene2_050715H_JP3_Ctrl-2.CEL       3
HuGene2_050715H_JP4_Sample-2.CEL     4

pd <- data.frame(experiment = c(1,2,1,2), replicate = c(1,1,2,2))
rownames(pd) <- sampleNames(eset)
pData(eset) <- pd
pData(eset)                                                                                                                  experiment    replicate
HuGene2_050715H_JP1_Ctrl-1.CEL         1             1
HuGene2_050715H_JP2_Sample-1.CEL       2             1
HuGene2_050715H_JP3_Ctrl-2.CEL         1             2
HuGene2_050715H_JP4_Sample-2.CEL       2             2

# some quick statistics: t-test without multiple corrections and volcano plot
tt <- rowttests(e, factor(eset$experiment)) lod <- -log10(tt$p.value)
smoothScatter(m, lod, cex=0.25)
abline(h=2, v=c(-1,1), col="red")

# preliminary results (seems to be OK)
table(tt$p.value <= 0.01) FALSE TRUE 52637 978 So far so good ... # limma: create design matrix, contrast matrix and linear model fitting using eBayes design <- model.matrix(~0+factor(c(1,2,1,2))) colnames(design) <-c("control", "sample") fit <- lmFit(eset, design) contmatrix <- makeContrasts(sample-control, levels=design) fit2 <- contrasts.fit(fit, contmatrix) ebayes <- eBayes(fit2) Warning message: Zero sample variances detected, have been offset And now things start to look weird # show me the data results=decideTests(ebayes) summary(results) sample - control -1 0 0 53617 1 0 > vennDiagram(results) # multiple testing correction using “Benjamini and Hochberg” tab <- topTable(ebayes, coef=1, adjust.method="BH", n=Inf) write.table(tab, file="Hits_BH.csv", sep="\t") # no candidates for differential expressed genes …?!? sum(tab$adj.P.Val < 0.01)
[1] 0
volcanoplot(ebayes, coef=1, cex=0.25)
abline(h=2, v=c(-1,1), col="red")

Here is the session info

R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.1 LTS

locale:
[1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] genefilter_1.50.0                    hugene20sttranscriptcluster.db_8.3.1
[3] org.Hs.eg.db_3.1.2                   GO.db_3.1.2
[5] AnnotationDbi_1.30.1                 GenomeInfoDb_1.4.1
[7] pd.hugene.2.0.st_3.14.1              RSQLite_1.0.0
[9] DBI_0.3.1                            oligo_1.32.0
[11] Biostrings_2.36.1                    XVector_0.8.0
[13] IRanges_2.2.5                        S4Vectors_0.6.1
[15] Biobase_2.28.0                       BiocGenerics_0.14.0
[17] limma_3.24.12                        oligoClasses_1.30.0

loaded via a namespace (and not attached):
[1] affxparser_1.40.0     GenomicRanges_1.20.5  splines_3.2.1
[4] zlibbioc_1.14.0       bit_1.1-12            xtable_1.7-4
[7] foreach_1.4.2         tools_3.2.1           ff_2.2-13
[10] KernSmooth_2.23-15    iterators_1.0.7       survival_2.38-3
[13] preprocessCore_1.30.0 affyio_1.36.0         codetools_0.2-11
[16] BiocInstaller_1.18.3  XML_3.98-1.3          annotate_1.46.0  
0
Entering edit mode

Thanks for your suggestions. Regarding the adjusted p-value, I tried several versions and thresholds but all the values are larger than 0.5 (about 0.56 and up). But I'll try the filtering.

0
Entering edit mode

Dear Dennis,

i havent used the specific affymetrix Plaform and thus i dont know if can use some Detection p-value threshold to characterize a probeset as present or absent. Alternatively,  you can use any of the following procedures to apply a kind of non-specific intensity filtering:

1) You can use the package genefilter and apply the following function i have slightly modified from the vignette

filter1<- function(ExpressionSet, threshold1, threshold2){
if(require(genefilter)){
set <- ExpressionSet
f1 <- kOverA(threshold1,threshold2)
ffun <- filterfun(f1)
filter <- genefilter(exprs(set),ffun)
filtered <- set[filter,]
return(filtered)

}
}

and where threshold 1 is the number of the samples(elements) to exceed threshold1

and threshold2 the value you want to exceed-For instance, if im not mistaken and you have 4 samples, you can use as threshold1=2(the half) and as threshold2 a value of log2(100)~6.64. But this is also very general and you have to also check via some histogram plots the intensity distribution of your probesets.

2. And secondly, you could apply the function shorth from the package genefilter, which calculates the midpoint of the shorth(the shortest interval containing the half of the data) & in a lot of examples is a reliable estimator of the peak of the distribution:

row.mean <- esApply(eset,1,mean) # or you could possibly use median instead of mean

sh <- shorth(row.mean)

hist(eset)

abline(v=sh) # to see the peak

filtered.set <- eset[row.mean >=sh,]

dim(filtered.set)

and again check your intensity distribution after the filtering

Hope it helps

4
Entering edit mode
@gordon-smyth
Last seen 6 hours ago
WEHI, Melbourne, Australia

If you want to compare t-tests to limma moderated t-tests, you should compare like quantities with like. In the code you have given, you have compared raw p-values from the t-tests with FDR values from limma. Also keep in mind that, for a small experiment like this, t-tests are likely to give small p-values to probe-sets with small fold changes and even smaller standard deviations -- these may be false positives.

There are many things you can do to increase power to detect DE genes in limma. Two things that will probably make a big difference are:

1. Filter out probes that aren't expressed in your experiment, of which there are likely to be many.

2. Turn on robust eBayes estimation, because the rma() algorithm is returning identical normalized values for some of the probe sets.

For example,

design <- model.matrix(~factor(c(1,2,1,2)))
fit <- lmFit(eset, design)
keep <- fit$Amean > median(fit$Amean)
fit <- eBayes(fit[keep,], robust=TRUE, trend=TRUE)
topTable(fit, coef=2)

I've suggested here keeping just the 50% of the probe-sets with largest average log-expression. This may work well in this case because the Affymetrix platform that you are using tends to contain a large number of poorly annotated probe-sets that are often not highly expressed and because you don't have any DE genes to start with. However this filtering is just a quick example of what you might do. The use of fit$Amean > median(fit$Amean) is not offered as a general recommendation.

0
Entering edit mode

Dear Mr Gordon Smyth,

about  your very interesting second proposal about "robust eBayes estimation" : firstly, can be used to reduce more the probesets tested for differential expression after fitting a linear model to each gene ? Furthermore,  can be used along with an initial non-specific intensity filtering(i.e. detection p-value in Illumina ?)

Moreover, my final question is if it is appropriate to use for other normalization algorithms, like neqc ?

Thank you for your consideration on this matter !!

0
Entering edit mode

Dear Mr. Smyth,

Thank you very much for your helpful suggestions. I filtered the list of genes by the mean log-expression (as you have suggested) using the genefilter package (as svlachavas has suggested - thanks for that!) and have now been able to identify several differentially expressed genes, which is great.
Interestingly, the robust eBayes estimation did not make a difference in my case.

As a follow-up question - maybe a silly one - I was wondering about quality checks or statistic controls to be applied after the (mean log expression) filter. Basically, the filter removed about 40 % of the genes in the list and I am not sure if that affects subsequent data analysis. What do I have to consider?

Thank you very much.

2
Entering edit mode

The only consideration is that you might have filtered out some genes that are DE and interesting despite being lowly expressed. That seemed to me to be a very low risk in an experiment where you were finding nothing anyway.

You should always be using the usual quality check plots (plotMDS, plotMA, plotSA). However filtering generally improves the assumptions of the empirical Bayes linear modeling.

0
Entering edit mode

Dear Dennis,

firstly you could check via a histogram plot the intensity distribution after your filtering to inspect the intensity levels of your samples and see whether are still low expression values and also access the normality of your data-the second could be also checked via a Q-Q plot. Moreover, you could perform a PCA plot and/or a correlation dendrogram before and after the filtering to see how your samples group together

0
Entering edit mode

Sounds good - that's what I had in mind as well. Thank you both for your help.

0
Entering edit mode
@sean-davis-490
Last seen 20 days ago
United States

Your adjusted p-value is actually a False Discovery Rate (FDR), so using 0.01 is very stringent.  You could check at 0.1 or even 0.5 to see if there are any genes identified at that level.  Finally, consider filtering genes prior to differential expression analysis:

http://www.ncbi.nlm.nih.gov/pubmed/20460310

2
Entering edit mode

Beware that the variance filtering methods recommended by the article linked to must only be used with ordinary t-tests. They should not be used with limma. In a limma pipeline, filtering by mean log-expression is simpler and better.

0
Entering edit mode

Thanks, Gordon.  I always learn something from your comments.  I'm showing my naivete, but do you have a quick reference (so I can stop sharing incomplete/wrong information)?

1
Entering edit mode

The Bourgon et al article itself says that variance filtering is not appropriate with limma, but unfortunately not very prominently so many people miss it. Personally, I think that variance filtering is problematic in many or most pipelines, not just with limma. I don't think it would be a good idea with any expression platform that produces a decreasing mean-variance relationship, as most do. Nor is it appropropriate with any of the leading microarray statistical tests (limma, SAM, maanova, CyberT etc). To me, the Affymetrix-rma-ttest pipeline is one of the very few where variance filtering is beneficial, because it unusually produces an increasing mean-variance relationship at low intensities. Yet even here I would hope to get more benefit out of limma instead.

As far as what one should do, most people agree that filtering low expressed or non-expressed probes is a good idea -- it makes intuitive sense from both statistical and biological points of view. I must admit I never thought that a journal article was necessary on this, but maybe I was wrong.