Question: LIMMA producing too many significant results
Asked 4.1 years ago by akridgerunner (United States):

Hello Everyone,

I'm running limma on a study based on HGU133plus2 chips that I dug out of GEO. I normalized using RMA, GCRMA and MAS5 as usual, and ran AffyQC to look for outliers, of which I found none. There are 20 normals and 61 patients, and they cluster well in PCA. I decided to move forward with the GCRMA-normalized set. I removed the 68 Affy control probes and discarded the 13,875 of 54,613 probes (25.4% of all probes) whose average log2 intensity was < 2.27. Intensities across all arrays range from 2.268579 to 16.02096, so my belief is that these probes are uninformative. I finally call limma as follows:

eset_filtered$cohort <- factor(eset_filtered$cohort)
design <- model.matrix(~0 + eset_filtered$cohort)
colnames(design) <- levels(eset_filtered$cohort)
fit <- lmFit(eset_filtered, design)
contrast.matrix <- makeContrasts(SLE-CTL, levels=design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
topTable.filtered <- topTable(fit2, coef=1, adjust="fdr", n=nrow(eset_filtered) )
rm( contrast.matrix, design, fit, fit2)
index <- which(topTable.filtered$adj.P.Val <= 0.05)  # 11,763
nrow(eset_filtered)  # 32,054

As you can see, I'm getting 11,763 significant probes (out of a total of 32,054) at an FDR of 0.05. The topmost gene has an adjusted p-value of 5.360915e-19 (the last gene is 0.9999355). What am I missing here? Why am I getting so many significant results? Everything normalized well, the MA plots look good, there weren't any wild outliers, and the cohorts must be labeled properly because they cluster together in PCA. I'm really racking my brain on this one, as I believe I've used limma successfully in the past. Any help our community can provide is PROFOUNDLY appreciated.

Best wishes,
Robert D. Robl

Answer: LIMMA producing too many significant results

Aaron Lun (Cambridge, United Kingdom) wrote:

Most simply, the presence of a large number of DE genes might represent genuine biology in your system. It's not uncommon to see major differences in the transcriptional profiles of diseased and normal cells. If you want to reduce the number of DE genes (e.g., to get a smaller set for validation), you could consider the TREAT method. This tests each gene for a log-fold change that is significantly different from some threshold. In this manner, you get the subset of genes that are changing noticeably between your conditions, rather than all the genes that are changing regardless of magnitude.

akridgerunner replied:

Thanks for the idea! I found the write-up on TREAT (t-tests relative to a threshold) at http://bioinformatics.oxfordjournals.org/content/25/6/765.long. I'll apply the approach tomorrow and report back.
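For reference, the pattern described in the limma documentation calls treat() on the contrast fit (in place of eBayes()) and then summarizes with topTreat(). A minimal sketch, assuming fit2 is the contrasts.fit() object from the original script and choosing a 1.5-fold threshold:

```r
# Sketch of the TREAT workflow (assumes 'fit2' from the script above):
# treat() replaces eBayes(), testing each gene's log-fold change against a
# threshold rather than against zero.
library(limma)

tfit <- treat(fit2, lfc = log2(1.5))            # require change beyond 1.5-fold
res  <- topTreat(tfit, coef = 1, number = Inf)  # all probes, sorted by p-value
sig  <- res[res$adj.P.Val <= 0.05, ]            # probes passing the threshold at FDR 0.05
```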
akridgerunner replied:

I incorporated the following, which includes the results. That aside, is my original script free of errors? Thanks!

tt <- topTreat(fit2, coef=1, lfc=1, adjust="BH", n=nrow(eset_filtered))     # 1,301 probes
tt <- topTreat(fit2, coef=1, lfc=1.25, adjust="BH", n=nrow(eset_filtered))  # 680 probes
tt <- topTreat(fit2, coef=1, lfc=1.5, adjust="BH", n=nrow(eset_filtered))   # 375 probes
tt <- topTreat(fit2, coef=1, lfc=2, adjust="BH", n=nrow(eset_filtered))     # 149 probes

Aaron Lun replied:

I would suggest a less extreme value of lfc. Remember, TREAT tests for significant differences from lfc, so a gene with a log-fold change of 1 would probably not be called significant if you were also to set lfc=1. I often use a value of log2(1.5), or even 0.5, i.e., requiring a fold change greater than the square root of 2. Other than that, the code looks okay.

Answer: LIMMA producing too many significant results

Peter Langfelder (United States) wrote:

I'm not sure why you specified an intercept-free model (if I'm not mistaken, the zero in ~0 + ... forces an intercept-free model). You may want to re-run the analysis with

design <- model.matrix(~eset_filtered$cohort)

Aside from that, the clue is in your statement that the cohorts cluster together in the PCA plot(s): that means there are strong global differences between the samples' expression profiles, and finding 11k differentially expressed probes is not entirely unreasonable.


Aaron Lun replied:

The lack of an intercept shouldn't matter: after refitting with contrasts.fit, testing coef=1 will yield the desired DE comparison between the SLE and CTL groups. I generally find an intercept-free model easier to interpret, especially for a one-way layout where each coefficient represents a group mean. On the flip side, if you use an intercept in this particular example, then you don't need to run contrasts.fit at all; you can just test coef=2 to do the comparison (assuming there are only two levels in eset_filtered$cohort).

akridgerunner replied:

Hi Aaron, I'm honestly only using that script because I lifted it from a tutorial. I tried running design <- model.matrix(~eset_filtered$cohort) but receive "Error: $ operator is invalid for atomic vectors". eset_filtered$cohort is currently a factor containing the designations "SLE" and "CTL". As I understand it, I'm not interested in the intercept term, since that would only indicate genes whose expression is not 0, correct? So, how would I simplify my script to compare the two cohorts only? Thanks!


The error you see is usually what you get when you try to subset a matrix as if it were a data.frame:

> z <- matrix(rnorm(100), 10)
> colnames(z) <- LETTERS[1:10]
> model.matrix(~z$A)
Error in z$A : $ operator is invalid for atomic vectors

Or, more to the point:

> z$A
Error in z$A : $ operator is invalid for atomic vectors

So it appears that at some point you converted your eset_filtered data.frame into a matrix. You will get a more informative message if you specify your design matrix a different way:

> model.matrix(~A, z)
Error in model.frame.default(object, data, xlev = xlev) :
'data' must be a data.frame, not a matrix or an array

And note that you can always subset your matrix using '[', which will work for both matrices and data.frames.

> model.matrix(~z[,"A"])
(Intercept)    z[, "A"]
1            1  0.88826017
2            1  0.55770878
3            1 -0.19293917
4            1  0.31596009
5            1  0.49786183
6            1 -0.30770279
7            1 -0.01445348
8            1  0.64846244
9            1 -0.24469273
10           1  1.32096021
attr(,"assign")
[1] 0 1

But that's all just 'inside baseball' for using R.

To your point: you can fit an ANOVA model in two ways. Either as a 'cell means model', where you compute the mean of each group, or as a 'factor effects model', where the intercept term is the mean of a 'baseline' group and the other coefficients are the differences between each remaining group and that baseline. So if you do

design <- model.matrix(~ 0 + eset_filtered[,"cohort"])

then you are computing the means of the SLE and CTL groups, and if you then want to compare them you have to use contrasts.fit() to make that comparison. But if you do

design <- model.matrix(~eset_filtered[,"cohort"])

then the second coefficient is already the comparison you want (SLE - CTL): the intercept is the mean of the CTL samples, and the second coefficient is the difference between SLE and CTL. In this case you don't have to make the comparison explicitly with contrasts.fit(); you just run lmFit() followed by eBayes().
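Spelled out, the factor-effects version of the original analysis might look like the sketch below. This is hedged: 'pheno' is a hypothetical data.frame of sample annotations standing in for however the cohort labels are stored in your object, and CTL is assumed to sort first and so become the baseline level.

```r
# Sketch of the intercept (factor effects) version of the original analysis.
library(limma)

cohort <- factor(pheno$cohort)      # 'pheno' is hypothetical; levels CTL, SLE, with CTL the baseline
design <- model.matrix(~cohort)     # column 2 is cohortSLE, i.e. the SLE - CTL difference
fit    <- lmFit(eset_filtered, design)
fit    <- eBayes(fit)               # no contrasts.fit() step needed
res    <- topTable(fit, coef = 2, adjust = "fdr", number = nrow(eset_filtered))
```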

But as Aaron notes, the factor effects model can get unwieldy really quickly if you have lots of comparisons to make, because you then have to do all the algebra to figure out what contrasts you need to get comparisons that don't already exist as coefficients. In those cases it is better to fit a cell means model and make the comparisons you want using contrasts.fit().
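To illustrate that last point, here is a sketch of a hypothetical three-group (A/B/C) cell means design, where none of the pairwise comparisons exists as a coefficient and each must be written out as a contrast:

```r
library(limma)

# Hypothetical three-group layout; 'y' stands in for the expression object.
grp    <- factor(c("A", "A", "B", "B", "C", "C"))
design <- model.matrix(~0 + grp)    # one column (group mean) per level
colnames(design) <- levels(grp)

# Each pairwise comparison is specified explicitly as a contrast.
con  <- makeContrasts(B - A, C - A, C - B, levels = design)
# fit  <- lmFit(y, design)
# fit2 <- eBayes(contrasts.fit(fit, con))
```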