Interpretation of results for treat & topTreat functions from limma regarding microarray abalysis
1
2
Entering edit mode
svlachavas ▴ 830
@svlachavas-7225
Last seen 13 months ago
Germany/Heidelberg/German Cancer Resear…

Dear Community,

i recently implemented treat & topTreat functions in a part of my specific microarray analysis, in order to use simultaneously a fold change cutoff with a corrected p-value. Here's a small part of the code used:

...

design1 <- model.matrix(~condition +pairs)

aw <- arrayWeights(eset.filtered, design1)

fit <- lmFit(eset.filtered, design1, weights=aw)
fit2 <- treat(fit,lfc=log2(1.1),trend=TRUE) # a fold-change cutoff as proposed in the TREAT paper
top <- topTreat(fit2, coef=2, adjust.method="fdr",p.value=0.05, confint=0.95, number=nrow(fit2))

1) I noticed, that when i used the above cutoff, i get a greatly increased number of DE genes with a subsequent cutoff of interest after topTreat (absolute log2-fold change bigger than 0.5), than when i use for istance lfc=0.3 in treat -which makes sense as the latter will require a stronger significance than the predifined cutoff. Thus, it is "correct" in my approach to use the notion of lfc as the cutoff in which "below anything would be no interesting", and then apply a further cutoff after topTreat(i.e abs(logFC)>0.5), in order to narrow my DE list for subsequent analyses, like functional enrichment ?

2) Moreover, i also saw that when i utilize treat & topTreat instead of eBayes & topTable, (the logFCs are the same as expected), but the confidence intervals have changed (not to such a great extend, but differ), as also the t statistic ( i did not mention the p-values, which are explained in the paper). Thus, this is a concequence due to the different null hypothesis TREAT tests?

limma treat topTreat microarray • 4.8k views
ADD COMMENT
2
Entering edit mode
@gordon-smyth
Last seen 59 minutes ago
WEHI, Melbourne, Australia

it is "correct" in my approach to use the notion of lfc as the cutoff in which "below anything would be no interesting", and then apply a further cutoff after topTreat (i.e abs(logFC)>0.5), in order to narrow my DE list for subsequent analyses, like functional enrichment?

No, that is unnecessarily complicated and not correct. If you have too many DE genes from treat, then just increase the treat lfc value. Please don't apply multiple thresholds -- they mess things up and defeat the purpose of treat.

Moreover, i also saw that when i utilize treat & topTreat instead of eBayes & topTable, (the logFCs are the same as expected), but the confidence intervals have changed (not to such a great extend, but differ),

No, that is not correct. The confidence intervals from treat() and topTreat() are identical to those from eBayes() and topTable().

also the t statistic. Thus, this is a concequence due to the different null hypothesis TREAT tests?

Yes. The t-statistic from treat is  (coef - lfc) / se  instead of  coef / se.

ADD COMMENT
0
Entering edit mode

Dear Gordon,

thank you for your comments and corrections !! However, regarding the "issue" of confidence interval values returned, i will post a quick example below to see my point:

In detail, after the basic step of lmFit, i use the two above mentioned "approaches": eBayes + topTable & treat + topTreat

 

fit <- lmFit(eset.filtered, design1, weights=aw)

# eBayes + topTable

fit2 <- eBayes(fit, trend=TRUE)

top2_non_treat <- topTable(fit2, coef=2, number=nrow(fit2), adjust.method="fdr", sort.by="none", p.value=0.05, lfc=0.5, confint=0.95)

top2 <- top2_non_treat[order(abs(top2_non_treat$logFC),decreasing=TRUE),]

treat + topTreat

fit_treat <- treat(fit, lfc=0.5, trend=TRUE)
top_treat <- topTreat(fit_treat,coef=2, number=nrow(fit_treat), adjust.method="fdr",p.value=0.05, confint=0.95)

and then, for a random example-gene picked--like AQP8- below:

head(top_treat[,c(2,4,5,6,8,10)])

            SYMBOL     logFC       CI.L        CI.R             t        adj.P.Val
343_at     AQP8   -4.913977     -5.017819  -4.810135     -11.535770      2.446537e-09
759_at      CA1   -4.462299   -4.603302    -4.321296      -10.423899     1.513851e-08
9429_at   ABCG2   -3.456660   -3.573870    -3.339449    -10.289565      1.513851e-08
2980_at  GUCA2A   -4.273073     -4.357430    -4.188716     -9.650087      5.513691e-08
54860_at MS4A12   -4.772296     -5.056184    -4.488408     -9.461384      6.139835e-08
762_at    CA4    -3.852401      -3.918334    -3.786468    -9.441665      6.139835e-08

 

head(top2[,c(2,4,5,6,8,10)])
             SYMBOL      logFC         CI.L       CI.R              t      adj.P.Val
343_at      AQP8     -4.913977      -5.058881  -4.769074     -12.842502    2.783136e-10
22802_at   CLCA4     -4.782187      -4.892338   -4.672036    -9.836663     4.183457e-09
54860_at  MS4A12     -4.772296      -5.481094   -4.063497    -10.568679    1.782925e-09
759_at       CA1      -4.462299     -4.554312   -4.370286     -11.739284   7.697939e-10
2980_at   GUCA2A     -4.273073      -4.378214   -4.167931    -10.928897    1.337343e-09
653808_at   ZG16     -4.079303      -4.234654   -3.923952     -9.242182    1.014976e-08

So, this "small" deviation in the CI values, is negligible ? Or there is something else here ?

 

ADD REPLY
0
Entering edit mode

Also the sessionInfo (which could be included above):

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Greek_Greece.1253  LC_CTYPE=Greek_Greece.1253   
[3] LC_MONETARY=Greek_Greece.1253 LC_NUMERIC=C                 
[5] LC_TIME=Greek_Greece.1253    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] FactoMineR_1.32     limma_3.26.9        affy_1.48.0        
[4] Biobase_2.30.0      BiocGenerics_0.16.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4           knitr_1.12.3          cluster_2.0.4        
 [4] magrittr_1.5          splines_3.2.2         zlibbioc_1.16.0      
 [7] MASS_7.3-45           leaps_2.9             scatterplot3d_0.3-36 
[10] lattice_0.20-33       R6_2.1.2              dplyr_0.4.3          
[13] tools_3.2.2           grid_3.2.2            a4Base_1.18.0        
[16] data.table_1.9.6      DBI_0.3.1             assertthat_0.1       
[19] preprocessCore_1.32.0 affyio_1.40.0         BiocInstaller_1.20.1 
[22] flashClust_1.01-2     a4Core_1.18.0         chron_2.3-47    

ADD REPLY
0
Entering edit mode

Thanks for your code, but I cannot reproduce the discrepancy you show. I find that CIs from treat and eBayes are identical to all decimal places, as they are supposed to be:

> sd <- 0.3*sqrt(4/rchisq(100,df=4))
> y <- matrix(rnorm(100*6,sd=sd),100,6)
> rownames(y) <- paste("Gene",1:100)
> design <- cbind(Grp1=1,Grp2vs1=c(0,0,0,1,1,1))
> y[1:2,4:6] <- y[1:2,4:6] + 2
> aw <- runif(6)
> fit <- lmFit(y,design,weights=aw)
> fit <- eBayes(fit,trend=TRUE)
> topTable(fit,coef=2,conf=0.95,sort="none")
             logFC       CI.L      CI.R     AveExpr         t     P.Value  adj.P.Val         B
Gene 1   2.2041832  1.7398390 2.6685275  1.13359211 10.619937 1.16890e-06 0.00011689  5.995423
Gene 2   2.1776703  1.1654081 3.1899324  1.31062540  4.812973 7.71738e-04 0.03858688 -0.838454
Gene 3   0.0968975 -0.4259556 0.6197507  0.11904920  0.414617 6.87435e-01 0.91061345 -7.219833
Gene 4  -0.0768782 -0.6188584 0.4651019  0.00431761 -0.317347 7.57700e-01 0.91289147 -7.258385
Gene 5   0.4898565 -0.0277139 1.0074269 -0.26273266  2.117453 6.10961e-02 0.49544729 -5.291235
Gene 6   0.1344993 -0.3993726 0.6683712 -0.07794255  0.563634 5.85786e-01 0.89977600 -7.141764
Gene 7   0.1133180 -0.6226684 0.8493044 -0.16774956  0.344464 7.37843e-01 0.91289147 -7.248641
Gene 8  -0.1158898 -0.5402183 0.3084387  0.01477902 -0.611023 5.55220e-01 0.89977600 -7.112253
Gene 9   0.0381164 -0.5264365 0.6026693  0.14807176  0.151050 8.83028e-01 0.96778687 -7.300885
Gene 10 -0.3363684 -0.7236405 0.0509038 -0.09758116 -1.943181 8.15270e-02 0.49544729 -5.563400
> fit.t <- treat(fit,lfc=log2(1.1),trend=TRUE)
> topTreat(fit.t,coef=2,conf=0.95,sort="none")
             logFC       CI.L      CI.R     AveExpr        t     P.Value   adj.P.Val
Gene 1   2.2041832  1.7398390 2.6685275  1.13359211  9.95743 1.37906e-06 0.000137906
Gene 2   2.1776703  1.1654081 3.1899324  1.31062540  4.50907 8.55673e-04 0.042783667
Gene 3   0.0968975 -0.4259556 0.6197507  0.11904920  0.00000 7.37290e-01 0.963533933
Gene 4  -0.0768782 -0.6188584 0.4651019  0.00431761  0.00000 7.94983e-01 0.963533933
Gene 5   0.4898565 -0.0277139 1.0074269 -0.26273266  1.52308 9.10206e-02 0.694899514
Gene 6   0.1344993 -0.3993726 0.6683712 -0.07794255  0.00000 6.45738e-01 0.963533933
Gene 7   0.1133180 -0.6226684 0.8493044 -0.16774956  0.00000 7.60515e-01 0.963533933
Gene 8  -0.1158898 -0.5402183 0.3084387  0.01477902  0.00000 6.50213e-01 0.963533933
Gene 9   0.0381164 -0.5264365 0.6026693  0.14807176  0.00000 9.00289e-01 0.991242076
Gene 10 -0.3363684 -0.7236405 0.0509038 -0.09758116 -1.14883 1.49802e-01 0.713343616
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C                       LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocInstaller_1.22.2 limma_3.27.11       

loaded via a namespace (and not attached):
[1] tools_3.3.0    splines_3.3.0  statmod_1.4.24
ADD REPLY
0
Entering edit mode

Dear Gordon,

thank you again for your answer-so this looks very strange !! you believe that is something due to the older version of R  im using-or even an older version of limma and i should be concern of ? (i still used these versions to reproduce my analysis, because my total pipeline analysis was performed on this version.)

ADD REPLY
0
Entering edit mode

There is absolutely no difference in the relevant code between the version of limma you have used and the current version.

The results you show cannot be correct. I suggest you rerun the analysis from the start. Upgrading to the current version of Bioconductor would also be a good idea.

ADD REPLY
0
Entering edit mode

Dear Gordon,

i runned your example above in my machine (without however setting a seed):

> topTable(fit,coef=2,conf=0.95,sort="none")
                   logFC       CI.L           CI.R         AveExpr           t
Gene 1   1.79170830  1.2862405 2.2971761  0.749581313  7.9429256
Gene 2   1.98538983  1.3962924 2.5744872  1.005111575  7.5520585
Gene 3   0.17306876 -0.3862442 0.7323817 -0.123517269  0.6933787
Gene 4   0.33210220 -0.3084640 0.9726684  0.163909135  1.1617546
Gene 5  -0.18644044 -0.6376320 0.2647512 -0.007629288 -0.9259463
Gene 6  -0.11947776 -0.7032756 0.4643201 -0.314986022 -0.4585970
Gene 7   0.20428896 -0.5455045 0.9540825 -0.014089852  0.6105341
Gene 8   0.22315728 -0.2667415 0.7130560 -0.032739514  1.0207312
Gene 9  -0.06332212 -0.6324495 0.5058053 -0.086547190 -0.2493174
Gene 10 -0.31835919 -1.2180570 0.5813386  0.166812813 -0.7929164
             P.Value   adj.P.Val         B
Gene 1  1.602493e-05 0.001223362  3.443573
Gene 2  2.446723e-05 0.001223362  3.007536
Gene 3  5.045028e-01 0.877024235 -6.762082
Gene 4  2.734072e-01 0.775267364 -6.327641
Gene 5  3.771661e-01 0.819926245 -6.569321
Gene 6  6.567308e-01 0.908020097 -6.905051
Gene 7  5.556865e-01 0.889967179 -6.818714
Gene 8  3.324176e-01 0.784476091 -6.477419
Gene 9  8.083688e-01 0.960429170 -6.985059
Gene 10 4.469813e-01 0.833698404 -6.685506
> fit.t <- treat(fit,lfc=log2(1.1),trend=TRUE)
> topTreat(fit.t,coef=2,conf=0.95,sort="none")
                     logFC         CI.L        CI.R         AveExpr          t
Gene 1   1.79170830  1.2862405 2.2971761  0.749581313  7.3333508
Gene 2   1.98538983  1.3962924 2.5744872  1.005111575  7.0290203
Gene 3   0.17306876 -0.3862442 0.7323817 -0.123517269  0.1424877
Gene 4   0.33210220 -0.3084640 0.9726684  0.163909135  0.6807420
Gene 5  -0.18644044 -0.6376320 0.2647512 -0.007629288 -0.2430425
Gene 6  -0.11947776 -0.7032756 0.4643201 -0.314986022  0.0000000
Gene 7   0.20428896 -0.5455045 0.9540825 -0.014089852  0.1995937
Gene 8   0.22315728 -0.2667415 0.7130560 -0.032739514  0.3917840
Gene 9  -0.06332212 -0.6324495 0.5058053 -0.086547190  0.0000000
Gene 10 -0.31835919 -1.2180570 0.5813386  0.166812813 -0.4504454
               P.Value          adj.P.Val
Gene 1  1.988366e-05 0.001454436
Gene 2  2.908871e-05 0.001454436
Gene 3  5.662828e-01 0.974604609
Gene 4  3.224276e-01 0.966464285
Gene 5  4.765440e-01 0.966464285
Gene 6  7.009478e-01 0.974604609
Gene 7  5.890229e-01 0.974604609
Gene 8  4.175356e-01 0.966464285
Gene 9  8.358818e-01 0.984476550
Gene 10 4.730792e-01 0.966464285

 

So my results above confirm again your notion--but i had checked many times my code and have carefully written every step--i don't know what is happening here--

ADD REPLY
0
Entering edit mode

Dear Gordon,

please excuse me to insist of this matter--as you are the specialist--and with all respect of course--but maybe your example above, is a bit far from my actual analysis? i mean, in no case i doubt your opinion, which i always follow so far from the time i participated to the group, but take a look in my below example:

fit_treat <- treat(fit, lfc=log2(1.1), trend=TRUE)
top_treat <- topTreat(fit_treat,coef=2, number=nrow(fit_treat), adjust.method="fdr",p.value=0.05, confint=0.95)

> head(top_treat[,c(2,4,5,6,8,10)])
              SYMBOL     logFC        CI.L              CI.R          t         adj.P.Val
343_at      AQP8      -4.913977    -5.040866 -4.787089 -12.48314 3.653124e-10
9429_at    ABCG2    -3.456660     -3.532986 -3.380334 -11.55110 1.318836e-09
759_at       CA1        -4.462299    -4.553227 -4.371371 -11.37754 1.394609e-09
6817_at  SULT1A1 -1.920571      -2.076531 -1.764611 -10.66296 3.155854e-09
65983_at  GRAMD3 -1.333407    -1.550819 -1.115994 -10.57545 3.155854e-09
1001_at     CDH3  2.295135         2.060515  2.529754  10.52476 3.155854e-09

head(top2[,c(2,4,5,6,8,10)])
             SYMBOL      logFC             CI.L       CI.R              t                adj.P.Val
343_at      AQP8     -4.913977      -5.058881  -4.769074     -12.842502    2.783136e-10
22802_at   CLCA4     -4.782187      -4.892338   -4.672036    -9.836663     4.183457e-09
54860_at  MS4A12     -4.772296      -5.481094   -4.063497    -10.568679    1.782925e-09
759_at       CA1      -4.462299     -4.554312   -4.370286     -11.739284   7.697939e-10
2980_at   GUCA2A     -4.273073      -4.378214   -4.167931    -10.928897    1.337343e-09
653808_at   ZG16     -4.079303      -4.234654   -3.923952     -9.242182    1.014976e-08

Thus, my question (notion) is that maybe the above has to do with the distribution of my expression set, as when i take a smaller cutoff in treat, the updated CIs come "more close" to the ones of topTable + eBayes ?

Best,

Efstathios

ADD REPLY
1
Entering edit mode

No, the CIs returned by topTreat() are completely unaffected by the lfc cutoff used for treat().

As I have already said, the CIs returned by topTreat() are identical to those from topTable(). This true for all datasets and all parameter options. And I do mean IDENTICAL, not just roughly equal.

If you are seeing different, then there are only two possibilities. Either you have made a mistake or there is a bug in limma. If you think the latter is a possibility, then please send me a reproducible example that illustrates the problem. I have to be able to run the example myself. Just posting output to this website won't help.

ADD REPLY
0
Entering edit mode

Dear Gordon,

thank you again for your clarification about the CIs !! Firstly, i tried the following "artificial example" :

set.seed(81)
mat <- matrix(data=sample(1:100), nrow=10, ncol=10)
new.set <- new("ExpressionSet", exprs=mat)
pdat <- data.frame(Condition=rep(c("Normal","Cancer"),each=5))

pairs <- factor(rep(1:5, each=2))
pairs
 [1] 1 1 2 2 3 3 4 4 5 5
Levels: 1 2 3 4 5

con <- factor(pdat$Condition)

f <- relevel(con, ref="Normal")

design <- model.matrix(~f + pairs)
aw <- arrayWeights(new.set, design)
fit <- lmFit(new.set, design,weights=aw)
fit2 <- eBayes(fit, trend=TRUE)

top2_example <- limma::topTable(fit2, coef=2, number=nrow(fit2), adjust.method="fdr", sort.by="none",confint=0.95)
head(top2_example)
  logFC       CI.L     CI.R    AveExpr     t       P.Value
1  27.4  -9.844791 64.644791    48.3  1.48685129 0.1448948
2 -31.0 -68.567483  6.567483    48.7 -1.66775447 0.1031759
3   2.8 -38.129908 43.729908    62.8  0.13826103 0.8907278
4  17.8 -20.546109 56.146109    50.1  0.93816929 0.3537880
5 -18.8 -69.419734 31.819734    57.0 -0.75062065 0.4572741
6   1.4 -49.210042 52.010042    64.1  0.05590799 0.9556934

fit_treat <- treat(fit,lfc=log2(1.1),trend=TRUE)
top_treat <- topTreat(fit_treat,coef=2,number=nrow(fit_treat),adjust.method="fdr",confint=0.95)

top_treat
   logFC       CI.L       CI.R    AveExpr       t     P.Value
7  -32.2 -59.884189 -4.5158115    41.1 -2.34071237 0.02375872
9  -29.8 -60.216607  0.6166067    42.5 -1.97096745 0.05460903
2  -31.0 -68.567483  6.5674827    48.7 -1.66035698 0.10318466
1   27.4  -9.844791 64.6447911    48.3  1.47938971 0.14490538
8  -18.6 -45.493179  8.2931794    40.3 -1.38749297 0.16989115
10 -17.4 -54.874234 20.0742341    50.1 -0.93100787 0.35367204
4   17.8 -20.546109 56.1461092    50.1  0.93092201 0.35380058
5  -18.8 -69.419734 31.8197336    57.0 -0.74513060 0.45728090
3    2.8 -38.129908 43.7299084    62.8  0.13147125 0.89073039
6    1.4 -49.210042 52.0100420    64.1  0.05041688 0.95569413

Thus, even my example is "random"--but i tried to replicate--my experimental design, the confidence intervals are identical !! However, this makes things even complicated, as i re-runed my analysis again from the import of my CEL files, and again noticed the same discrepancy !! My code is fixed and checked with caution in every step, as you can also see from the part i posted !!

The only thing i would like to pinpoint, is that my ExpressionSet used is batch effect corrected with ComBat--so, could even this hamper the distributions of the CIs, or is still irelevant and has nothing to do ?

Best,

Efstathios

ADD REPLY

Login before adding your answer.

Traffic: 441 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6