Dear Community,
i recently implemented treat & topTreat functions in a part of my specific microarray analysis, in order to use simultaneously a fold change cutoff with a corrected p-value. Here's a small part of the code used:
...
design1 <- model.matrix(~condition +pairs)
aw <- arrayWeights(eset.filtered, design1)fit <- lmFit(eset.filtered, design1, weights=aw) fit2 <- treat(fit,lfc=log2(1.1),trend=TRUE) # a fold-change cutoff as proposed in the TREAT paper top <- topTreat(fit2, coef=2, adjust.method="fdr",p.value=0.05, confint=0.95, number=nrow(fit2))
1) I noticed, that when i used the above cutoff, i get a greatly increased number of DE genes with a subsequent cutoff of interest after topTreat (absolute log2-fold change bigger than 0.5), than when i use for istance lfc=0.3 in treat -which makes sense as the latter will require a stronger significance than the predifined cutoff. Thus, it is "correct" in my approach to use the notion of lfc as the cutoff in which "below anything would be no interesting", and then apply a further cutoff after topTreat(i.e abs(logFC)>0.5), in order to narrow my DE list for subsequent analyses, like functional enrichment ?
2) Moreover, i also saw that when i utilize treat & topTreat instead of eBayes & topTable, (the logFCs are the same as expected), but the confidence intervals have changed (not to such a great extend, but differ), as also the t statistic ( i did not mention the p-values, which are explained in the paper). Thus, this is a concequence due to the different null hypothesis TREAT tests?
Dear Gordon,
thank you for your comments and corrections !! However, regarding the "issue" of confidence interval values returned, i will post a quick example below to see my point:
In detail, after the basic step of lmFit, i use the two above mentioned "approaches": eBayes + topTable & treat + topTreat
fit <- lmFit(eset.filtered, design1, weights=aw)
# eBayes + topTable
fit2 <- eBayes(fit, trend=TRUE)
top2_non_treat <- topTable(fit2, coef=2, number=nrow(fit2), adjust.method="fdr", sort.by="none", p.value=0.05, lfc=0.5, confint=0.95)
top2 <- top2_non_treat[order(abs(top2_non_treat$logFC),decreasing=TRUE),]
# treat + topTreat
fit_treat <- treat(fit, lfc=0.5, trend=TRUE)
top_treat <- topTreat(fit_treat,coef=2, number=nrow(fit_treat), adjust.method="fdr",p.value=0.05, confint=0.95)
and then, for a random example-gene picked--like AQP8- below:
head(top_treat[,c(2,4,5,6,8,10)])
SYMBOL logFC CI.L CI.R t adj.P.Val
343_at AQP8 -4.913977 -5.017819 -4.810135 -11.535770 2.446537e-09
759_at CA1 -4.462299 -4.603302 -4.321296 -10.423899 1.513851e-08
9429_at ABCG2 -3.456660 -3.573870 -3.339449 -10.289565 1.513851e-08
2980_at GUCA2A -4.273073 -4.357430 -4.188716 -9.650087 5.513691e-08
54860_at MS4A12 -4.772296 -5.056184 -4.488408 -9.461384 6.139835e-08
762_at CA4 -3.852401 -3.918334 -3.786468 -9.441665 6.139835e-08
head(top2[,c(2,4,5,6,8,10)])
SYMBOL logFC CI.L CI.R t adj.P.Val
343_at AQP8 -4.913977 -5.058881 -4.769074 -12.842502 2.783136e-10
22802_at CLCA4 -4.782187 -4.892338 -4.672036 -9.836663 4.183457e-09
54860_at MS4A12 -4.772296 -5.481094 -4.063497 -10.568679 1.782925e-09
759_at CA1 -4.462299 -4.554312 -4.370286 -11.739284 7.697939e-10
2980_at GUCA2A -4.273073 -4.378214 -4.167931 -10.928897 1.337343e-09
653808_at ZG16 -4.079303 -4.234654 -3.923952 -9.242182 1.014976e-08
So, this "small" deviation in the CI values, is negligible ? Or there is something else here ?
Also the sessionInfo (which could be included above):
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
locale:
[1] LC_COLLATE=Greek_Greece.1253 LC_CTYPE=Greek_Greece.1253
[3] LC_MONETARY=Greek_Greece.1253 LC_NUMERIC=C
[5] LC_TIME=Greek_Greece.1253
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] FactoMineR_1.32 limma_3.26.9 affy_1.48.0
[4] Biobase_2.30.0 BiocGenerics_0.16.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.4 knitr_1.12.3 cluster_2.0.4
[4] magrittr_1.5 splines_3.2.2 zlibbioc_1.16.0
[7] MASS_7.3-45 leaps_2.9 scatterplot3d_0.3-36
[10] lattice_0.20-33 R6_2.1.2 dplyr_0.4.3
[13] tools_3.2.2 grid_3.2.2 a4Base_1.18.0
[16] data.table_1.9.6 DBI_0.3.1 assertthat_0.1
[19] preprocessCore_1.32.0 affyio_1.40.0 BiocInstaller_1.20.1
[22] flashClust_1.01-2 a4Core_1.18.0 chron_2.3-47
Thanks for your code, but I cannot reproduce the discrepancy you show. I find that CIs from treat and eBayes are identical to all decimal places, as they are supposed to be:
Dear Gordon,
thank you again for your answer-so this looks very strange !! you believe that is something due to the older version of R im using-or even an older version of limma and i should be concern of ? (i still used these versions to reproduce my analysis, because my total pipeline analysis was performed on this version.)
There is absolutely no difference in the relevant code between the version of limma you have used and the current version.
The results you show cannot be correct. I suggest you rerun the analysis from the start. Upgrading to the current version of Bioconductor would also be a good idea.
Dear Gordon,
i runned your example above in my machine (without however setting a seed):
> topTable(fit,coef=2,conf=0.95,sort="none")
logFC CI.L CI.R AveExpr t
Gene 1 1.79170830 1.2862405 2.2971761 0.749581313 7.9429256
Gene 2 1.98538983 1.3962924 2.5744872 1.005111575 7.5520585
Gene 3 0.17306876 -0.3862442 0.7323817 -0.123517269 0.6933787
Gene 4 0.33210220 -0.3084640 0.9726684 0.163909135 1.1617546
Gene 5 -0.18644044 -0.6376320 0.2647512 -0.007629288 -0.9259463
Gene 6 -0.11947776 -0.7032756 0.4643201 -0.314986022 -0.4585970
Gene 7 0.20428896 -0.5455045 0.9540825 -0.014089852 0.6105341
Gene 8 0.22315728 -0.2667415 0.7130560 -0.032739514 1.0207312
Gene 9 -0.06332212 -0.6324495 0.5058053 -0.086547190 -0.2493174
Gene 10 -0.31835919 -1.2180570 0.5813386 0.166812813 -0.7929164
P.Value adj.P.Val B
Gene 1 1.602493e-05 0.001223362 3.443573
Gene 2 2.446723e-05 0.001223362 3.007536
Gene 3 5.045028e-01 0.877024235 -6.762082
Gene 4 2.734072e-01 0.775267364 -6.327641
Gene 5 3.771661e-01 0.819926245 -6.569321
Gene 6 6.567308e-01 0.908020097 -6.905051
Gene 7 5.556865e-01 0.889967179 -6.818714
Gene 8 3.324176e-01 0.784476091 -6.477419
Gene 9 8.083688e-01 0.960429170 -6.985059
Gene 10 4.469813e-01 0.833698404 -6.685506
> fit.t <- treat(fit,lfc=log2(1.1),trend=TRUE)
> topTreat(fit.t,coef=2,conf=0.95,sort="none")
logFC CI.L CI.R AveExpr t
Gene 1 1.79170830 1.2862405 2.2971761 0.749581313 7.3333508
Gene 2 1.98538983 1.3962924 2.5744872 1.005111575 7.0290203
Gene 3 0.17306876 -0.3862442 0.7323817 -0.123517269 0.1424877
Gene 4 0.33210220 -0.3084640 0.9726684 0.163909135 0.6807420
Gene 5 -0.18644044 -0.6376320 0.2647512 -0.007629288 -0.2430425
Gene 6 -0.11947776 -0.7032756 0.4643201 -0.314986022 0.0000000
Gene 7 0.20428896 -0.5455045 0.9540825 -0.014089852 0.1995937
Gene 8 0.22315728 -0.2667415 0.7130560 -0.032739514 0.3917840
Gene 9 -0.06332212 -0.6324495 0.5058053 -0.086547190 0.0000000
Gene 10 -0.31835919 -1.2180570 0.5813386 0.166812813 -0.4504454
P.Value adj.P.Val
Gene 1 1.988366e-05 0.001454436
Gene 2 2.908871e-05 0.001454436
Gene 3 5.662828e-01 0.974604609
Gene 4 3.224276e-01 0.966464285
Gene 5 4.765440e-01 0.966464285
Gene 6 7.009478e-01 0.974604609
Gene 7 5.890229e-01 0.974604609
Gene 8 4.175356e-01 0.966464285
Gene 9 8.358818e-01 0.984476550
Gene 10 4.730792e-01 0.966464285
So my results above confirm again your notion--but i had checked many times my code and have carefully written every step--i don't know what is happening here--
Dear Gordon,
please excuse me to insist of this matter--as you are the specialist--and with all respect of course--but maybe your example above, is a bit far from my actual analysis? i mean, in no case i doubt your opinion, which i always follow so far from the time i participated to the group, but take a look in my below example:
fit_treat <- treat(fit, lfc=log2(1.1), trend=TRUE)
top_treat <- topTreat(fit_treat,coef=2, number=nrow(fit_treat), adjust.method="fdr",p.value=0.05, confint=0.95)
> head(top_treat[,c(2,4,5,6,8,10)])
SYMBOL logFC CI.L CI.R t adj.P.Val
343_at AQP8 -4.913977 -5.040866 -4.787089 -12.48314 3.653124e-10
9429_at ABCG2 -3.456660 -3.532986 -3.380334 -11.55110 1.318836e-09
759_at CA1 -4.462299 -4.553227 -4.371371 -11.37754 1.394609e-09
6817_at SULT1A1 -1.920571 -2.076531 -1.764611 -10.66296 3.155854e-09
65983_at GRAMD3 -1.333407 -1.550819 -1.115994 -10.57545 3.155854e-09
1001_at CDH3 2.295135 2.060515 2.529754 10.52476 3.155854e-09
head(top2[,c(2,4,5,6,8,10)])
SYMBOL logFC CI.L CI.R t adj.P.Val
343_at AQP8 -4.913977 -5.058881 -4.769074 -12.842502 2.783136e-10
22802_at CLCA4 -4.782187 -4.892338 -4.672036 -9.836663 4.183457e-09
54860_at MS4A12 -4.772296 -5.481094 -4.063497 -10.568679 1.782925e-09
759_at CA1 -4.462299 -4.554312 -4.370286 -11.739284 7.697939e-10
2980_at GUCA2A -4.273073 -4.378214 -4.167931 -10.928897 1.337343e-09
653808_at ZG16 -4.079303 -4.234654 -3.923952 -9.242182 1.014976e-08
Thus, my question (notion) is that maybe the above has to do with the distribution of my expression set, as when i take a smaller cutoff in treat, the updated CIs come "more close" to the ones of topTable + eBayes ?
Best,
Efstathios
No, the CIs returned by topTreat() are completely unaffected by the lfc cutoff used for treat().
As I have already said, the CIs returned by topTreat() are identical to those from topTable(). This true for all datasets and all parameter options. And I do mean IDENTICAL, not just roughly equal.
If you are seeing different, then there are only two possibilities. Either you have made a mistake or there is a bug in limma. If you think the latter is a possibility, then please send me a reproducible example that illustrates the problem. I have to be able to run the example myself. Just posting output to this website won't help.
Dear Gordon,
thank you again for your clarification about the CIs !! Firstly, i tried the following "artificial example" :
set.seed(81)
mat <- matrix(data=sample(1:100), nrow=10, ncol=10)
new.set <- new("ExpressionSet", exprs=mat)
pdat <- data.frame(Condition=rep(c("Normal","Cancer"),each=5))
pairs <- factor(rep(1:5, each=2))
pairs
[1] 1 1 2 2 3 3 4 4 5 5
Levels: 1 2 3 4 5
con <- factor(pdat$Condition)
f <- relevel(con, ref="Normal")
design <- model.matrix(~f + pairs)
aw <- arrayWeights(new.set, design)
fit <- lmFit(new.set, design,weights=aw)
fit2 <- eBayes(fit, trend=TRUE)
top2_example <- limma::topTable(fit2, coef=2, number=nrow(fit2), adjust.method="fdr", sort.by="none",confint=0.95)
head(top2_example)
logFC CI.L CI.R AveExpr t P.Value
1 27.4 -9.844791 64.644791 48.3 1.48685129 0.1448948
2 -31.0 -68.567483 6.567483 48.7 -1.66775447 0.1031759
3 2.8 -38.129908 43.729908 62.8 0.13826103 0.8907278
4 17.8 -20.546109 56.146109 50.1 0.93816929 0.3537880
5 -18.8 -69.419734 31.819734 57.0 -0.75062065 0.4572741
6 1.4 -49.210042 52.010042 64.1 0.05590799 0.9556934
fit_treat <- treat(fit,lfc=log2(1.1),trend=TRUE)
top_treat <- topTreat(fit_treat,coef=2,number=nrow(fit_treat),adjust.method="fdr",confint=0.95)
top_treat
logFC CI.L CI.R AveExpr t P.Value
7 -32.2 -59.884189 -4.5158115 41.1 -2.34071237 0.02375872
9 -29.8 -60.216607 0.6166067 42.5 -1.97096745 0.05460903
2 -31.0 -68.567483 6.5674827 48.7 -1.66035698 0.10318466
1 27.4 -9.844791 64.6447911 48.3 1.47938971 0.14490538
8 -18.6 -45.493179 8.2931794 40.3 -1.38749297 0.16989115
10 -17.4 -54.874234 20.0742341 50.1 -0.93100787 0.35367204
4 17.8 -20.546109 56.1461092 50.1 0.93092201 0.35380058
5 -18.8 -69.419734 31.8197336 57.0 -0.74513060 0.45728090
3 2.8 -38.129908 43.7299084 62.8 0.13147125 0.89073039
6 1.4 -49.210042 52.0100420 64.1 0.05041688 0.95569413
Thus, even my example is "random"--but i tried to replicate--my experimental design, the confidence intervals are identical !! However, this makes things even complicated, as i re-runed my analysis again from the import of my CEL files, and again noticed the same discrepancy !! My code is fixed and checked with caution in every step, as you can also see from the part i posted !!
The only thing i would like to pinpoint, is that my ExpressionSet used is batch effect corrected with ComBat--so, could even this hamper the distributions of the CIs, or is still irelevant and has nothing to do ?
Best,
Efstathios