Limma, SAM and Fold Change Calculation
1
1
Entering edit mode
Peter White ▴ 80
@peter-white-2263
Last seen 7.1 years ago

I wondered if anyone could comment on what is generally considered the most appropriate method of calculating fold change values from Affy data. I have a data set from a test vs. control experiment (n=3 in each group) performed on the MOE430_2 array. I used the Affy package to read in the CEL file data and GCRMA to normalize. Finally I used limma to analyze (lmFit and eBayes). Now the FC coefficients I get back from limma appear to be log2(2^(mean(test replicates in log2))/(2^mean(control replicates in log2)). For example:

> exprs(eset)[18731,1:6]
TA2.CEL  TA5.CEL  TA7.CEL  TA1.CEL  TA8.CEL  TA9.CEL
3.294215 4.851795 4.403851 4.782934 7.716893 9.284909

> fit$coefficients[18731,1:2] #returns the mean of the above log2 values Fed_MT Fed_WT 4.183287 7.261578 > fit2$coefficients[18731,1]
[1] -3.078291

> -1/2^fit2\$coefficients[18731,1]
[1] -8.446136

So we have a probe that is reporting a -8.4 fold downregulation.

The problem is that SAM does not calculate the fold change values in the same manner. It appears to take the average of the unlogged data and then use those values to calculate fold changes. Thus:

> 2^exprs(eset)[18731,1:6]
TA2.CEL    TA5.CEL    TA7.CEL    TA1.CEL    TA8.CEL    TA9.CEL
9.809738  28.875927  21.168555  27.530016 210.385700 623.786694

> tmp <- c(mean(2^exprs(eset)[18731,1:3]),mean(2^exprs(eset)[18731,4:6]))
> tmp
[1]  19.95141 287.23414

> tmp[1]/tmp[2]
[1] 0.06946043

> -1/(tmp[1]/tmp[2])
[1] -14.39669

So now we have -14.4 fold downregulation.

The example given is of course an extreme, but using the limma method there are 282 probes with a fold change >2, while using the sam method there are only 159, with 141 probes in common. The probes that were called uniquely by limma were almost all downregulated (-2 to -5 fold).

MY QUESTION IS WHICH METHOD IS THE CORRECT WAY?

Thanks,

Peter

Peter White, Ph.D.
Technical Director
Functional Genomics Core
Department of Genetics
University of Pennsylvania
570 Clinical Research Building
415 Curie Boulevard
Tel: +1 (215) 898-0773
Fax: +1 (215) 573-2326
E-mail: pwhite at mail.med.upenn.edu
http://www.med.upenn.edu/pdc/cores_fgc.html
http://www.med.upenn.edu/kaestnerlab/
http://www.betacell.org/microarrays
http://www.cbil.upenn.edu/EPConDB/

affy limma gcrma siggenes • 5.5k views
0
Entering edit mode
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

Dear Peter,

Since I'm the limma author, it will come as no surprise to you that I view the limma fold change (diff of log-intensities) as the more natural, consistent and useful.

I don't know why SAM computes fold change as it does. It seems to me somewhat inconsistent, because SAM uses the limma log-fold-change as numerator in its own test statistic for computing statistical significant. So SAM reports a different fold change measure to that which it uses internally for determining statistical significance.

Best wishes
Gordon

BTW, here's an amusing example of the potential contradictions introduced by the SAM fold change definition. In this example, SAM finds the gene to be significantly up-regulated but the reported fold change is in the opposite direction!

 > y <- matrix(rnorm(100*20),100,20)
> y[1,] <- rep(c(-2,2),each=10); y[1,1] <- y[1,1]-0.1
> rownames(y) <- paste("Gene",1:100)
> y[1,1] <- 6
> out <- sam(y,cl)
> summary(out,0.4)
SAM Analysis for the Two-Class Unpaired Case Assuming Unequal
Variances

s0 = 0.8  (The 100 % quantile of the s values.)

Number of permutations: 100

MEAN number of falsely called variables is computed.

Delta: 0.4
cutlow: -Inf
cutup: 2
p0: 0.866
Identified Genes: 1
Falsely Called Genes: 0
FDR: 0

Identified Genes (using Delta = 0.4):

Row d.value stdev  rawp q.value R.fold   Name
1   1       2   0.8 1e-04 0.00866  0.604 Gene 1

Note that Gene1 is found to be significantly up-regulated (as indicated by a positive d.value=regularized t) but the reported fold change is less than one, i.e., indicating down-regulation.

0
Entering edit mode

Hi Peter, hi Gordon,

the reason for the fold change computation in SAM is simply that Tusher et al. also use these fold changes. In 2002, SAM was part of my diploma thesis and I tried to reproduce the results of Tusher et al. and the results that are provided by the Excel SAM version. Since then many things in sam have changed, mainly to make sam faster, more user-friendly and more flexible. But I didn't change the computation of the fold change to actually avoid the question why the fold changes differ between the siggenes SAM and the Excel SAM.

Admittedly, "because Tusher et al. do so" is not a very good reason, but I have stopped counting how often I was ask why this and that differs between Excel SAM and siggenes SAM.

Would it help you if I add another slot to the SAM object that contains the numerator of the test statistics? Note that you can compute these numerators by

> sam.out at d * (sam.out at s0 + sam.out at s)

if sam.out is the output of sam.

Best,
Holger

0
Entering edit mode

Dear Holger and Gordon,

Thanks a lot for your responses - that is very helpful. It is likely that both ways of calculating fold change are correct, but I agree with Gordon that it is most appropriate to report the fold change that was used to generate the significance statistic.

I was also thinking I could take the fold changes calculated by both methods and go with the largest (arrays seem to underestimate the magnitude of the fold change), but this would have the issue that merging them would not maintain the same confidence. Ultimately going with the fold change used for the statistic will be the simplest approach.

Best wishes,

Peter

0
Entering edit mode

In my opinion, mixing the two definitions would be truly horrible. It would be best to be consistent, even if you stick to SAM.

In general, you cannot take two different statistical techniques and play them off one against the other on an item by item basis, choosing whichever result you like best, and still expect sensible
results.

See my email to Holger for an example showing that the SAM fold change can even be in a different direction to the statistical significance.

Another potential confusions with the SAM fold change definition are that it doesn't generalise to linear models or to two colour arrays (because you can't compute it from log-ratios).

Best wishes
Gordon

0
Entering edit mode

Hi Peter,

I have just updated the siggenes package. In this version (1.11.11) which should be available soon in the developmental branch of Bioconductor, you can choose between 2^(mean log2 intensity group 1 - mean log2 intensity group 2) and the ratio of the mean intensities by setting use.dm to TRUE or to FALSE, respectively. Please note that the default is still the latter.

Best,
Holger

0
Entering edit mode

CSC1             csc2       CSC3             CSC4           csc5        csc6       csc7           csc8        csc9           csc10
5.140240291    8.095123098    6.685277883    7.527515356    6.715706306    7.248408216    7.996779468    7.185392937    6.89933254    6.939669394

nm    nm    nm    nm    nm    nm    nm    nm    nm    nm    nm    nm    nm    nm    nm    nm    nm    nm    nm    nm    nm    nm    nm    nm    nm    nm
5.544369347    4.908420525    5.555157151    4.905209135    6.366645906    4.598507724    6.303687164    4.013324959    6.050688235    6.047912488    7.137065771    5.099746299    6.153114387    4.85765068    5.951370585    4.803372917    5.138119875    5.734398524    5.725509213    5.798874389    6.404508766    5.154381751    5.642503138    5.332413138    4.57302405    4.943309708

Limma calculate the FC : 0.660284

but if I calculate by formula

avgcsc: 7.043345

avg NM: 5.490

diff: 1.553218

POWERcalculation: power(2,1.553218)= 2.93471

FC= -1/2,93471=0.340749

My questiion is where I did a mistake or the FC value by limma is wrongly calculated. Thanks  in advance for the response.

Sibun

0
Entering edit mode

Dear Sibun,

I can see that this is the first time you've posted a question, so welcome to Bioconductor. However this isn't the way to do it. First, you should post your own question rather than adding a comment to a discussion from 10 years ago. Second, you need to give the limma code and output you used rather than just claiming what limma does. Third you need to explain what the numbers you give actually are. Are they log expression values? Are they from a public microarrray dataset? No one can help you with your question as it is.

Also, just gently, which do you think is more likely? That the limma fold change calculation has been wrong for the past 15 years or that you've made a mistake?