Question

Interpretation of results for treat & topTreat functions from limma regarding microarray abalysis

2

Entering edit mode

svlachavas ▴ 840

@svlachavas-7225

Last seen 1 hour ago

Germany/Heidelberg/German Cancer Resear…

Dear Community,

i recently implemented treat & topTreat functions in a part of my specific microarray analysis, in order to use simultaneously a fold change cutoff with a corrected p-value. Here's a small part of the code used:

...

design1 <- model.matrix(~condition +pairs)

aw <- arrayWeights(eset.filtered, design1)

fit <- lmFit(eset.filtered, design1, weights=aw)
fit2 <- treat(fit,lfc=log2(1.1),trend=TRUE) # a fold-change cutoff as proposed in the TREAT paper
top <- topTreat(fit2, coef=2, adjust.method="fdr",p.value=0.05, confint=0.95, number=nrow(fit2))

1) I noticed, that when i used the above cutoff, i get a greatly increased number of DE genes with a subsequent cutoff of interest after topTreat (absolute log2-fold change bigger than 0.5), than when i use for istance lfc=0.3 in treat -which makes sense as the latter will require a stronger significance than the predifined cutoff. Thus, it is "correct" in my approach to use the notion of lfc as the cutoff in which "below anything would be no interesting", and then apply a further cutoff after topTreat(i.e abs(logFC)>0.5), in order to narrow my DE list for subsequent analyses, like functional enrichment ?

2) Moreover, i also saw that when i utilize treat & topTreat instead of eBayes & topTable, (the logFCs are the same as expected), but the confidence intervals have changed (not to such a great extend, but differ), as also the t statistic ( i did not mention the p-values, which are explained in the paper). Thus, this is a concequence due to the different null hypothesis TREAT tests?

limma treat topTreat microarray • 5.1k views

ADD COMMENT • link updated 8.9 years ago by Gordon Smyth 52k • written 8.9 years ago by svlachavas ▴ 840

score 2 · Answer 1 · 2016-05-16

2

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 10 hours ago

WEHI, Melbourne, Australia

it is "correct" in my approach to use the notion of lfc as the cutoff in which "below anything would be no interesting", and then apply a further cutoff after topTreat (i.e abs(logFC)>0.5), in order to narrow my DE list for subsequent analyses, like functional enrichment?

No, that is unnecessarily complicated and not correct. If you have too many DE genes from treat, then just increase the treat lfc value. Please don't apply multiple thresholds -- they mess things up and defeat the purpose of treat.

Moreover, i also saw that when i utilize treat & topTreat instead of eBayes & topTable, (the logFCs are the same as expected), but the confidence intervals have changed (not to such a great extend, but differ),

No, that is not correct. The confidence intervals from treat() and topTreat() are identical to those from eBayes() and topTable().

also the t statistic. Thus, this is a concequence due to the different null hypothesis TREAT tests?

Yes. The t-statistic from treat is (coef - lfc) / se instead of coef / se.

ADD COMMENT • link 8.9 years ago Gordon Smyth 52k

0

Entering edit mode

Dear Gordon,

thank you for your comments and corrections !! However, regarding the "issue" of confidence interval values returned, i will post a quick example below to see my point:

In detail, after the basic step of lmFit, i use the two above mentioned "approaches": eBayes + topTable & treat + topTreat

fit <- lmFit(eset.filtered, design1, weights=aw)

# eBayes + topTable

fit2 <- eBayes(fit, trend=TRUE)

top2_non_treat <- topTable(fit2, coef=2, number=nrow(fit2), adjust.method="fdr", sort.by="none", p.value=0.05, lfc=0.5, confint=0.95)

top2 <- top2_non_treat[order(abs(top2_non_treat$logFC),decreasing=TRUE),]

# treat + topTreat

fit_treat <- treat(fit, lfc=0.5, trend=TRUE) top_treat <- topTreat(fit_treat,coef=2, number=nrow(fit_treat), adjust.method="fdr",p.value=0.05, confint=0.95)

and then, for a random example-gene picked--like AQP8- below:

head(top_treat[,c(2,4,5,6,8,10)])

SYMBOL logFC CI.L CI.R t adj.P.Val 343_at AQP8 -4.913977 -5.017819 -4.810135 -11.535770 2.446537e-09 759_at CA1 -4.462299 -4.603302 -4.321296 -10.423899 1.513851e-08 9429_at ABCG2 -3.456660 -3.573870 -3.339449 -10.289565 1.513851e-08 2980_at GUCA2A -4.273073 -4.357430 -4.188716 -9.650087 5.513691e-08 54860_at MS4A12 -4.772296 -5.056184 -4.488408 -9.461384 6.139835e-08 762_at CA4 -3.852401 -3.918334 -3.786468 -9.441665 6.139835e-08

head(top2[,c(2,4,5,6,8,10)]) SYMBOL logFC CI.L CI.R t adj.P.Val 343_at AQP8 -4.913977 -5.058881 -4.769074 -12.842502 2.783136e-10 22802_at CLCA4 -4.782187 -4.892338 -4.672036 -9.836663 4.183457e-09 54860_at MS4A12 -4.772296 -5.481094 -4.063497 -10.568679 1.782925e-09 759_at CA1 -4.462299 -4.554312 -4.370286 -11.739284 7.697939e-10 2980_at GUCA2A -4.273073 -4.378214 -4.167931 -10.928897 1.337343e-09 653808_at ZG16 -4.079303 -4.234654 -3.923952 -9.242182 1.014976e-08

So, this "small" deviation in the CI values, is negligible ? Or there is something else here ?

ADD REPLY • link 8.9 years ago svlachavas ▴ 840

0

Entering edit mode

Also the sessionInfo (which could be included above):

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Greek_Greece.1253 LC_CTYPE=Greek_Greece.1253
[3] LC_MONETARY=Greek_Greece.1253 LC_NUMERIC=C
[5] LC_TIME=Greek_Greece.1253

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] FactoMineR_1.32 limma_3.26.9 affy_1.48.0
[4] Biobase_2.30.0 BiocGenerics_0.16.1

loaded via a namespace (and not attached):
[1] Rcpp_0.12.4 knitr_1.12.3 cluster_2.0.4
[4] magrittr_1.5 splines_3.2.2 zlibbioc_1.16.0
[7] MASS_7.3-45 leaps_2.9 scatterplot3d_0.3-36
[10] lattice_0.20-33 R6_2.1.2 dplyr_0.4.3
[13] tools_3.2.2 grid_3.2.2 a4Base_1.18.0
[16] data.table_1.9.6 DBI_0.3.1 assertthat_0.1
[19] preprocessCore_1.32.0 affyio_1.40.0 BiocInstaller_1.20.1
[22] flashClust_1.01-2 a4Core_1.18.0 chron_2.3-47

ADD REPLY • link 8.9 years ago svlachavas ▴ 840

0

Entering edit mode

Thanks for your code, but I cannot reproduce the discrepancy you show. I find that CIs from treat and eBayes are identical to all decimal places, as they are supposed to be:

> sd <- 0.3*sqrt(4/rchisq(100,df=4))
> y <- matrix(rnorm(100*6,sd=sd),100,6)
> rownames(y) <- paste("Gene",1:100)
> design <- cbind(Grp1=1,Grp2vs1=c(0,0,0,1,1,1))
> y[1:2,4:6] <- y[1:2,4:6] + 2
> aw <- runif(6)
> fit <- lmFit(y,design,weights=aw)
> fit <- eBayes(fit,trend=TRUE)
> topTable(fit,coef=2,conf=0.95,sort="none")
             logFC       CI.L      CI.R     AveExpr         t     P.Value  adj.P.Val         B
Gene 1   2.2041832  1.7398390 2.6685275  1.13359211 10.619937 1.16890e-06 0.00011689  5.995423
Gene 2   2.1776703  1.1654081 3.1899324  1.31062540  4.812973 7.71738e-04 0.03858688 -0.838454
Gene 3   0.0968975 -0.4259556 0.6197507  0.11904920  0.414617 6.87435e-01 0.91061345 -7.219833
Gene 4  -0.0768782 -0.6188584 0.4651019  0.00431761 -0.317347 7.57700e-01 0.91289147 -7.258385
Gene 5   0.4898565 -0.0277139 1.0074269 -0.26273266  2.117453 6.10961e-02 0.49544729 -5.291235
Gene 6   0.1344993 -0.3993726 0.6683712 -0.07794255  0.563634 5.85786e-01 0.89977600 -7.141764
Gene 7   0.1133180 -0.6226684 0.8493044 -0.16774956  0.344464 7.37843e-01 0.91289147 -7.248641
Gene 8  -0.1158898 -0.5402183 0.3084387  0.01477902 -0.611023 5.55220e-01 0.89977600 -7.112253
Gene 9   0.0381164 -0.5264365 0.6026693  0.14807176  0.151050 8.83028e-01 0.96778687 -7.300885
Gene 10 -0.3363684 -0.7236405 0.0509038 -0.09758116 -1.943181 8.15270e-02 0.49544729 -5.563400
> fit.t <- treat(fit,lfc=log2(1.1),trend=TRUE)
> topTreat(fit.t,coef=2,conf=0.95,sort="none")
             logFC       CI.L      CI.R     AveExpr        t     P.Value   adj.P.Val
Gene 1   2.2041832  1.7398390 2.6685275  1.13359211  9.95743 1.37906e-06 0.000137906
Gene 2   2.1776703  1.1654081 3.1899324  1.31062540  4.50907 8.55673e-04 0.042783667
Gene 3   0.0968975 -0.4259556 0.6197507  0.11904920  0.00000 7.37290e-01 0.963533933
Gene 4  -0.0768782 -0.6188584 0.4651019  0.00431761  0.00000 7.94983e-01 0.963533933
Gene 5   0.4898565 -0.0277139 1.0074269 -0.26273266  1.52308 9.10206e-02 0.694899514
Gene 6   0.1344993 -0.3993726 0.6683712 -0.07794255  0.00000 6.45738e-01 0.963533933
Gene 7   0.1133180 -0.6226684 0.8493044 -0.16774956  0.00000 7.60515e-01 0.963533933
Gene 8  -0.1158898 -0.5402183 0.3084387  0.01477902  0.00000 6.50213e-01 0.963533933
Gene 9   0.0381164 -0.5264365 0.6026693  0.14807176  0.00000 9.00289e-01 0.991242076
Gene 10 -0.3363684 -0.7236405 0.0509038 -0.09758116 -1.14883 1.49802e-01 0.713343616
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C                       LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocInstaller_1.22.2 limma_3.27.11       

loaded via a namespace (and not attached):
[1] tools_3.3.0    splines_3.3.0  statmod_1.4.24

ADD REPLY • link 8.9 years ago Gordon Smyth 52k

0

Entering edit mode

Dear Gordon,

thank you again for your answer-so this looks very strange !! you believe that is something due to the older version of R im using-or even an older version of limma and i should be concern of ? (i still used these versions to reproduce my analysis, because my total pipeline analysis was performed on this version.)

ADD REPLY • link 8.9 years ago svlachavas ▴ 840

0

Entering edit mode

There is absolutely no difference in the relevant code between the version of limma you have used and the current version.

The results you show cannot be correct. I suggest you rerun the analysis from the start. Upgrading to the current version of Bioconductor would also be a good idea.

ADD REPLY • link 8.9 years ago Gordon Smyth 52k

0

Entering edit mode

Dear Gordon,

i runned your example above in my machine (without however setting a seed):

> topTable(fit,coef=2,conf=0.95,sort="none")
logFC CI.L CI.R AveExpr t
Gene 1 1.79170830 1.2862405 2.2971761 0.749581313 7.9429256
Gene 2 1.98538983 1.3962924 2.5744872 1.005111575 7.5520585
Gene 3 0.17306876 -0.3862442 0.7323817 -0.123517269 0.6933787
Gene 4 0.33210220 -0.3084640 0.9726684 0.163909135 1.1617546
Gene 5 -0.18644044 -0.6376320 0.2647512 -0.007629288 -0.9259463
Gene 6 -0.11947776 -0.7032756 0.4643201 -0.314986022 -0.4585970
Gene 7 0.20428896 -0.5455045 0.9540825 -0.014089852 0.6105341
Gene 8 0.22315728 -0.2667415 0.7130560 -0.032739514 1.0207312
Gene 9 -0.06332212 -0.6324495 0.5058053 -0.086547190 -0.2493174
Gene 10 -0.31835919 -1.2180570 0.5813386 0.166812813 -0.7929164
P.Value adj.P.Val B
Gene 1 1.602493e-05 0.001223362 3.443573
Gene 2 2.446723e-05 0.001223362 3.007536
Gene 3 5.045028e-01 0.877024235 -6.762082
Gene 4 2.734072e-01 0.775267364 -6.327641
Gene 5 3.771661e-01 0.819926245 -6.569321
Gene 6 6.567308e-01 0.908020097 -6.905051
Gene 7 5.556865e-01 0.889967179 -6.818714
Gene 8 3.324176e-01 0.784476091 -6.477419
Gene 9 8.083688e-01 0.960429170 -6.985059
Gene 10 4.469813e-01 0.833698404 -6.685506
> fit.t <- treat(fit,lfc=log2(1.1),trend=TRUE)
> topTreat(fit.t,coef=2,conf=0.95,sort="none")
logFC CI.L CI.R AveExpr t
Gene 1 1.79170830 1.2862405 2.2971761 0.749581313 7.3333508
Gene 2 1.98538983 1.3962924 2.5744872 1.005111575 7.0290203
Gene 3 0.17306876 -0.3862442 0.7323817 -0.123517269 0.1424877
Gene 4 0.33210220 -0.3084640 0.9726684 0.163909135 0.6807420
Gene 5 -0.18644044 -0.6376320 0.2647512 -0.007629288 -0.2430425
Gene 6 -0.11947776 -0.7032756 0.4643201 -0.314986022 0.0000000
Gene 7 0.20428896 -0.5455045 0.9540825 -0.014089852 0.1995937
Gene 8 0.22315728 -0.2667415 0.7130560 -0.032739514 0.3917840
Gene 9 -0.06332212 -0.6324495 0.5058053 -0.086547190 0.0000000
Gene 10 -0.31835919 -1.2180570 0.5813386 0.166812813 -0.4504454
P.Value adj.P.Val
Gene 1 1.988366e-05 0.001454436
Gene 2 2.908871e-05 0.001454436
Gene 3 5.662828e-01 0.974604609
Gene 4 3.224276e-01 0.966464285
Gene 5 4.765440e-01 0.966464285
Gene 6 7.009478e-01 0.974604609
Gene 7 5.890229e-01 0.974604609
Gene 8 4.175356e-01 0.966464285
Gene 9 8.358818e-01 0.984476550
Gene 10 4.730792e-01 0.966464285

So my results above confirm again your notion--but i had checked many times my code and have carefully written every step--i don't know what is happening here--

ADD REPLY • link 8.9 years ago svlachavas ▴ 840

0

Entering edit mode

Dear Gordon,

please excuse me to insist of this matter--as you are the specialist--and with all respect of course--but maybe your example above, is a bit far from my actual analysis? i mean, in no case i doubt your opinion, which i always follow so far from the time i participated to the group, but take a look in my below example:

fit_treat <- treat(fit, lfc=log2(1.1), trend=TRUE)
top_treat <- topTreat(fit_treat,coef=2, number=nrow(fit_treat), adjust.method="fdr",p.value=0.05, confint=0.95)

> head(top_treat[,c(2,4,5,6,8,10)])
SYMBOL logFC CI.L CI.R t adj.P.Val
343_at AQP8 -4.913977 -5.040866 -4.787089 -12.48314 3.653124e-10
9429_at ABCG2 -3.456660 -3.532986 -3.380334 -11.55110 1.318836e-09
759_at CA1 -4.462299 -4.553227 -4.371371 -11.37754 1.394609e-09
6817_at SULT1A1 -1.920571 -2.076531 -1.764611 -10.66296 3.155854e-09
65983_at GRAMD3 -1.333407 -1.550819 -1.115994 -10.57545 3.155854e-09
1001_at CDH3 2.295135 2.060515 2.529754 10.52476 3.155854e-09

head(top2[,c(2,4,5,6,8,10)])
SYMBOL logFC CI.L CI.R t adj.P.Val
343_at AQP8 -4.913977 -5.058881 -4.769074 -12.842502 2.783136e-10
22802_at CLCA4 -4.782187 -4.892338 -4.672036 -9.836663 4.183457e-09
54860_at MS4A12 -4.772296 -5.481094 -4.063497 -10.568679 1.782925e-09
759_at CA1 -4.462299 -4.554312 -4.370286 -11.739284 7.697939e-10
2980_at GUCA2A -4.273073 -4.378214 -4.167931 -10.928897 1.337343e-09
653808_at ZG16 -4.079303 -4.234654 -3.923952 -9.242182 1.014976e-08

Thus, my question (notion) is that maybe the above has to do with the distribution of my expression set, as when i take a smaller cutoff in treat, the updated CIs come "more close" to the ones of topTable + eBayes ?

Best,

Efstathios

ADD REPLY • link 8.9 years ago svlachavas ▴ 840

1

Entering edit mode

No, the CIs returned by topTreat() are completely unaffected by the lfc cutoff used for treat().

As I have already said, the CIs returned by topTreat() are identical to those from topTable(). This true for all datasets and all parameter options. And I do mean IDENTICAL, not just roughly equal.

If you are seeing different, then there are only two possibilities. Either you have made a mistake or there is a bug in limma. If you think the latter is a possibility, then please send me a reproducible example that illustrates the problem. I have to be able to run the example myself. Just posting output to this website won't help.

ADD REPLY • link 8.9 years ago Gordon Smyth 52k

0

Entering edit mode

Dear Gordon,

thank you again for your clarification about the CIs !! Firstly, i tried the following "artificial example" :

set.seed(81) mat <- matrix(data=sample(1:100), nrow=10, ncol=10) new.set <- new("ExpressionSet", exprs=mat) pdat <- data.frame(Condition=rep(c("Normal","Cancer"),each=5))

pairs <- factor(rep(1:5, each=2)) pairs [1] 1 1 2 2 3 3 4 4 5 5 Levels: 1 2 3 4 5

con <- factor(pdat$Condition)

f <- relevel(con, ref="Normal")

design <- model.matrix(~f + pairs) aw <- arrayWeights(new.set, design) fit <- lmFit(new.set, design,weights=aw) fit2 <- eBayes(fit, trend=TRUE)

top2_example <- limma::topTable(fit2, coef=2, number=nrow(fit2), adjust.method="fdr", sort.by="none",confint=0.95) head(top2_example) logFC CI.L CI.R AveExpr t P.Value 1 27.4 -9.844791 64.644791 48.3 1.48685129 0.1448948 2 -31.0 -68.567483 6.567483 48.7 -1.66775447 0.1031759 3 2.8 -38.129908 43.729908 62.8 0.13826103 0.8907278 4 17.8 -20.546109 56.146109 50.1 0.93816929 0.3537880 5 -18.8 -69.419734 31.819734 57.0 -0.75062065 0.4572741 6 1.4 -49.210042 52.010042 64.1 0.05590799 0.9556934

fit_treat <- treat(fit,lfc=log2(1.1),trend=TRUE) top_treat <- topTreat(fit_treat,coef=2,number=nrow(fit_treat),adjust.method="fdr",confint=0.95)

top_treat logFC CI.L CI.R AveExpr t P.Value 7 -32.2 -59.884189 -4.5158115 41.1 -2.34071237 0.02375872 9 -29.8 -60.216607 0.6166067 42.5 -1.97096745 0.05460903 2 -31.0 -68.567483 6.5674827 48.7 -1.66035698 0.10318466 1 27.4 -9.844791 64.6447911 48.3 1.47938971 0.14490538 8 -18.6 -45.493179 8.2931794 40.3 -1.38749297 0.16989115 10 -17.4 -54.874234 20.0742341 50.1 -0.93100787 0.35367204 4 17.8 -20.546109 56.1461092 50.1 0.93092201 0.35380058 5 -18.8 -69.419734 31.8197336 57.0 -0.74513060 0.45728090 3 2.8 -38.129908 43.7299084 62.8 0.13147125 0.89073039 6 1.4 -49.210042 52.0100420 64.1 0.05041688 0.95569413

Thus, even my example is "random"--but i tried to replicate--my experimental design, the confidence intervals are identical !! However, this makes things even complicated, as i re-runed my analysis again from the import of my CEL files, and again noticed the same discrepancy !! My code is fixed and checked with caution in every step, as you can also see from the part i posted !!

The only thing i would like to pinpoint, is that my ExpressionSet used is batch effect corrected with ComBat--so, could even this hamper the distributions of the CIs, or is still irelevant and has nothing to do ?

Best,

Efstathios

ADD REPLY • link 8.9 years ago svlachavas ▴ 840