Question

Interpretation of confidence interval values for logFC returned from topTable function in limma with microarray dataset

1

Konstantinos Yeles ▴ 80

@konstantinos-yeles-8961

Italy

Dear Community,

I would like to ask an important question about interpreting the output of the topTable function. I know that this is not a general statistics blog, but after fitting a model with limma I checked the confint argument and called topTable with confint=0.95 in order to return confidence intervals for the logFCs.

In detail, here is a small output of some selected genes (after subsetting my topTable):

> head(significant, 20)
             GENE_SYMBOL     logFC   adj.P.Val      MAD      CI.L       CI.R
A_23_P114903       HSPA6  3.595423 0.048103409 3.691256  1.729956  5.4608897
A_24_P245379    SERPINB2  2.910139 0.027037437 2.955397  1.676020  4.1442576
A_23_P161698        MMP3  2.581726 0.022127328 2.692339  1.569174  3.5942793
A_23_P66241         MT1M  2.857147 0.030223818 2.517837  1.601280  4.1130140
A_23_P206724        MT1E  2.574222 0.023364607 2.464120  1.535268  3.6131756
A_32_P87013        CXCL8  3.531625 0.015656484 2.316576  2.304604  4.7586457
A_24_P125096        MT1X  2.528568 0.017441204 2.314788  1.622389  3.4347482
A_23_P37983         MT1B  2.421996 0.018654318 2.282815  1.531431  3.3125618
A_23_P206707        MT1G  2.395756 0.025898285 2.259628  1.395022  3.3964891
A_23_P71037          IL6  2.135955 0.037906167 2.179081  1.119542  3.1523685
A_23_P427703        MT1L  2.364336 0.017614832 2.136130  1.514988  3.2136838
A_23_P163782      MT1HL1  2.284161 0.020399185 2.112673  1.414799  3.1535226
A_23_P315364       CXCL2  2.426432 0.006127387 1.995657  1.900835  2.9520286
A_23_P414343        MT1H  2.380918 0.014319566 1.995214  1.598750  3.1630865
A_23_P365738         ARC  2.102056 0.030560036 1.993690  1.170771  3.0333421
A_23_P1691          MMP1  2.154745 0.020770068 1.982411  1.329721  2.9797693
A_23_P108842       DUSP2  1.972652 0.014964790 1.973428  1.306712  2.6385916
A_23_P54840         MT1A  1.974725 0.024914831 1.839428  1.158456  2.7909943
A_23_P15727       FKBP10 -1.908220 0.040797553 1.724086 -2.840906 -0.9755334
A_24_P251764       CXCL3  1.938549 0.007849170 1.667274  1.424668  2.4524294

How can I interpret and "evaluate" the confidence interval returned for a specific gene with a specific logFC? For instance, is the first gene, HSPA6, which has a "relatively" big fold change, more "significant" because both CI.L and CI.R are > 1? Or is it enough for one of them to be > 1, given that this gene is upregulated? Or is my approach to this matter completely wrong? For instance, if a gene with a significant p-value (i.e. FDR < 0.05) and a logFC of -0.5 had CI.R=-0.4 and CI.L=-0.8, how should this example be "evaluated"?

Thank you,

Konstantinos

limma topTable confidence interval effect size microarray • 4.9k views

ADD COMMENT • link updated 8.1 years ago by Gordon Smyth 50k • written 8.1 years ago by Konstantinos Yeles ▴ 80

1

Gordon Smyth 50k

@gordon-smyth

WEHI, Melbourne, Australia

It's a little unfortunate that you've deleted the P.Value column from the topTable, because the CI relates directly to the unadjusted p-value.

If the p-value is > 0.05, then the CI will overlap 0, i.e., CI.L<0 and CI.R>0. (Note that logFC=0 corresponds to no DE.)

If the p-value is < 0.05, then the CI will be entirely above zero or entirely below zero.
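
As a rough illustration of this correspondence, the base R sketch below builds a 95% interval and a two-sided p-value from the same t-statistic. The estimates, standard errors and degrees of freedom are made-up values, not limma output; limma works with moderated standard errors and degrees of freedom, but the interval and the unadjusted p-value are tied together in the same way.

```r
# Hypothetical values: two log2-fold-change estimates with the same standard error.
logFC <- c(geneA = 1.2, geneB = 0.3)
se    <- c(geneA = 0.4, geneB = 0.4)
df    <- 10  # assumed residual degrees of freedom

# Two-sided p-value from the t-statistic.
t_stat <- logFC / se
p_val  <- 2 * pt(-abs(t_stat), df = df)

# 95% confidence interval built from the same standard error.
margin <- qt(0.975, df = df) * se
ci_l   <- logFC - margin
ci_r   <- logFC + margin

# The interval excludes zero exactly when the unadjusted p-value is below 0.05.
excludes_zero <- ci_l > 0 | ci_r < 0
data.frame(logFC, p_val, ci_l, ci_r, excludes_zero)
```

Note that this link is with the unadjusted p-value at the matching level (0.05 for a 95% interval), not with adj.P.Val.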

Your gene HSPA6 has a large logFC but also a high variability. If you were focusing on this gene as being of a priori interest, then the 95% confidence interval for the log2-fold-change is (1.73, 5.46). Another way to say the same thing is that the log2-fold change is 3.60 ± 1.87.

If I unlog the logFC and CI values for HSPA6, then the estimated fold change is 2^3.60 = 12.1, and the range of feasible values for the fold change is from as low as 2^1.73 = 3.32 to as high as 2^5.46 = 44.0.
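
The unlogging step can be reproduced in base R. This sketch uses the HSPA6 values copied from the table in the question; the variable names are only illustrative.

```r
# Values for HSPA6 copied from the table in the question (log2 scale).
logFC <- 3.595423
ci    <- c(CI.L = 1.729956, CI.R = 5.4608897)

# Back-transform from the log2 scale to the fold-change scale.
fold_change <- 2^logFC  # roughly 12.1
fc_range    <- 2^ci     # roughly 3.32 to 44

fold_change
fc_range
```

The interval is symmetric around the estimate on the log2 scale but not after unlogging, which is why the upper limit sits much further from 12.1 than the lower limit does.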

PS. You seem to be ranking genes by fold change in the table in your post. The limma documentation recommends against that, see help("topTable"). If you want to rank genes by importance, it is usually better to stick to p-value, possibly after using treat.
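
A minimal sketch of ranking by p-value, with and without treat, is given below. It assumes the limma package is installed and uses simulated data; the matrix y, the design and the lfc = 1 threshold are invented for illustration and are not taken from the original analysis.

```r
library(limma)

# Simulated log-expression matrix: 100 genes, 3 control and 3 treated samples.
set.seed(1)
y <- matrix(rnorm(100 * 6), nrow = 100, ncol = 6)
rownames(y) <- paste0("gene", 1:100)
y[1:5, 4:6] <- y[1:5, 4:6] + 3  # make the first five genes truly DE

design <- cbind(Intercept = 1, Group = c(0, 0, 0, 1, 1, 1))
fit <- lmFit(y, design)

# Standard moderated t-test, ranked by p-value, with 95% CIs for the logFC.
efit <- eBayes(fit)
tab  <- topTable(efit, coef = "Group", number = Inf,
                 confint = 0.95, sort.by = "P")

# treat() tests against a log2-fold-change threshold instead of zero.
tfit <- treat(fit, lfc = 1)
ttab <- topTreat(tfit, coef = "Group", number = Inf)

head(tab[, c("logFC", "CI.L", "CI.R", "P.Value", "adj.P.Val")])
head(ttab)
```

Keeping the P.Value column alongside CI.L and CI.R makes it easier to see how the interval and the unadjusted p-value relate for each gene.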

ADD COMMENT • link 8.1 years ago Gordon Smyth 50k

0

Dear Gordon,

Thank you for your suggestion and valuable explanations! So, in your example, do bigger numeric values for both confidence limits indicate a bigger variability for a specific gene, like HSPA6, which could imply that the estimate of logFC is then not so "precise"? Or, as you pointed out, are the confidence intervals directly related to the raw p-values? For example, suppose I have two hypothetical genes: one with raw p-value 0.03, logFC=0.5, CI.L=0.4 and CI.R=0.8, and another with p-value 0.01, logFC=0.8, CI.L=1.2 and CI.R=1.5. What is the main "interpretation" of these two genes? Regarding my initial question, is it "incorrect" to require both limits to be > 1 if logFC > 0, or < 1 if logFC < 0? And should I primarily look at my unadjusted p-value (as well as the adjusted p-value, with a logFC different from zero of course) before making any comments about the confidence intervals?

And finally, how did you estimate that "the interval for the fold change itself is from as low as 3.32 to as high as 44.0"?

ADD REPLY • link 8.1 years ago Konstantinos Yeles ▴ 80

1

Sorry, but your comments about confidence intervals are not correct, and I'm not sure how to answer. The CI is not required to be >1 or <1, I'm not sure why you would expect that. It would be best for you to read about confidence intervals in an introductory statistics textbook, or by searching the web.

I hope that you understand that the estimated logFC value is exactly in the middle of the CI for that gene. The CI just gives a wider range of feasible values. In other words it gives a plus-or-minus margin of error for the logFC.
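
This can be checked directly on the numbers posted in the question. The base R sketch below uses the first three rows of that table; the data frame is typed in by hand rather than produced by topTable.

```r
# First three rows of the table in the question (log2 scale).
tab <- data.frame(
  logFC = c(3.595423, 2.910139, 2.581726),
  CI.L  = c(1.729956, 1.676020, 1.569174),
  CI.R  = c(5.4608897, 4.1442576, 3.5942793),
  row.names = c("HSPA6", "SERPINB2", "MMP3")
)

# The midpoint of each interval should reproduce the logFC (up to rounding).
midpoint <- (tab$CI.L + tab$CI.R) / 2
all.equal(midpoint, tab$logFC, tolerance = 1e-5)

# Half the interval width is the plus-or-minus margin of error.
margin <- (tab$CI.R - tab$CI.L) / 2
data.frame(gene = rownames(tab), logFC = tab$logFC, margin = margin)
```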

The CI tells you nothing at all about statistical significance over and above the p-value and the FDR. It is entirely concerned with the magnitude of the fold change, not with its significance.

I estimated the fold change range for HSPA6 as 3.32 to 44.0 because 2^1.73=3.32 and 2^5.46=44.0. In other words, I simply unlogged the CI.

ADD REPLY • link 8.1 years ago Gordon Smyth 50k

0

Dear Gordon,

Thank you again for your explanations, and please excuse me for any inappropriate or unrelated question. I will definitely read further about confidence intervals and their interpretation, but your comments so far were very helpful!

ADD REPLY • link 8.1 years ago Konstantinos Yeles ▴ 80

0

Dear Gordon, also for your PS message:

Actually, this is just topTable output ordered by the median absolute deviation (MAD), which I used to remove duplicated probe IDs matching the same gene symbol, so I did not order anything by logFC. I will also take your suggestions into account and look into treat.

ADD REPLY • link 8.1 years ago Konstantinos Yeles ▴ 80

score 3 · Accepted Answer · 2016-03-12

3

Aaron Lun ★ 28k

@alun

The city by the bay

The confidence interval gives you a measure of how precisely the log-fold change is estimated. The tighter the interval, the more precise the estimate. This can be useful if the actual value of the log-fold change estimate is going to be used for further work. For example, I often use the confidence intervals to check if near-zero log-fold changes are being precisely estimated; this allows me to identify genes that are likely to be non-DE, which is hard to do with conventional significance testing. Confidence intervals may also be helpful for visualizing this estimation uncertainty in plots. For your results, the CIs are basically saying that the true log-fold change is probably quite large, given that the interval is quite a fair distance from zero.

However, if you want to interpret significance of DE, you're better off looking at the p-value. The p-value calculation already accounts for the estimation precision by using the standard error of the log-fold change in the moderated t-test. The p-values can also be adjusted for multiple testing, whereas that's not done (as easily) for the CIs. Of course, all of these things are interrelated. If you have tight CIs, you're more likely to get a significant result. If you have a large log-fold change, then having a wider CI doesn't matter so much as the evidence against the null is still strong.

For your example, the interpretation would be something like: "The gene is significant because it has an adjusted p-value below the 5% threshold. Also, the log-fold change estimate of -0.5 is reasonably precise as the confidence interval is fairly small."
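
The idea mentioned above of using confidence intervals to flag near-zero log-fold changes that are precisely estimated might be sketched as follows. The gene names, the numbers and the bound of 0.5 are all invented for illustration; the bound is an analyst's choice, not a limma default.

```r
# Hypothetical topTable-style output; gene names and numbers are invented.
tab <- data.frame(
  logFC = c(0.05, -0.10, 0.20),
  CI.L  = c(-0.15, -1.40, -0.05),
  CI.R  = c(0.25, 1.20, 0.45),
  row.names = c("geneA", "geneB", "geneC")
)

# Assumed bound on the log2 scale, chosen by the analyst.
bound <- 0.5

# Keep genes whose whole interval lies inside (-bound, bound).
precisely_near_zero <- tab$CI.L > -bound & tab$CI.R < bound
rownames(tab)[precisely_near_zero]
```

A gene such as geneB has a logFC close to zero but a wide interval, so a large true fold change cannot be ruled out for it.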

ADD COMMENT • link 8.1 years ago Aaron Lun ★ 28k

0

Dear Aaron, thank you for your explanation! From your last point, I think I have misinterpreted a specific part of the interpretation of the CIs. So the size of the CIs (CI.L and/or CI.R) does not count that much, provided that the adjusted p-value for the gene is smaller than a threshold, i.e. < 0.05? Thus, if my understanding is correct, an "ideal scenario" for a gene would be a significant p-value and a logFC different from zero, but with "narrower" CIs, like the example I gave above?

ADD REPLY • link 8.1 years ago Konstantinos Yeles ▴ 80

1

Well, "ideal" depends on what you want to do. If you're just interested in whether a gene is DE or not, then looking at the adjusted p-value would be sufficient. Genes with low p-values are often accompanied by non-zero log-fold changes (even more so if you use treat) and CIs that do not contain zero; but I don't think you have to explicitly select on the CIs being narrow, that's already considered in the p-value calculation.

ADD REPLY • link 8.1 years ago Aaron Lun ★ 28k

1

Yes, if you want a gene to be DE, then a small p-value, large logFC and narrow CI is the "ideal". However (as Aaron says), the latter two things are already built into the p-value so far as statistical significance is concerned.

ADD REPLY • link 8.1 years ago Gordon Smyth 50k

0

Dear Aaron, one last comment about the approach you mentioned to "detect" non-significant genes: to identify non-DE genes with near-zero logFC, apart from an adjusted p-value > 0.05, should I also require both CI.L > -1 and CI.R < 1, so that the near-zero logFC is as precise as possible?

ADD REPLY • link 8.1 years ago Konstantinos Yeles ▴ 80