Question

Strange output from goana()

0

Entering edit mode

cronanz • 0

@cronanz-12047

Last seen 6.6 years ago

Dear everyone,

I have been using goana() from edgeR to perform GO enrichment analysis on my list of DE genes and I'm puzzled by the results. My code to get the output is as follows:

design <- model.matrix(~condition)

disp <- estimateDisp(y, design, robust=TRUE)

fit <- glmFit(disp, design, robust=TRUE)

lrt <- glmLRT(fit, coef = 2)

go <- goana(lrt, FDR=0.1)

topgobp <- topGO(go, ont="BP", n=50)

And a snippet of the result from the final command is as follows:

Term Ont N Up Down P.Up P.Down
GO:0006955   immune response BP   1052   124 113   1.16395565456951e-16 0.00962505873802103
GO:0019221   cytokine-mediated signaling pathway   BP   351 61 37 8.54320669089232e-16 0.129568753575334
GO:0006952   defense response BP   934 112 100   1.49961996610012e-15 0.0158422706753765
GO:0071345   cellular response to cytokine stimulus BP   447 70 42 1.50658420177947e-15 0.326662132519315

It can be seen that even from the first GO term "immune response", the numbers of upregulated genes and downregulated genes are very similar (124 to 113) and yet the corresponding p.values are vastly different (1.16e-16 to 0.009). Am I misunderstanding how goana() works? I thought it is an analysis involving Fisher's exact test to determine overrepresentation of terms based on list of genes provided. I would appreciate if somebody has an explanation for this strange (to me) output. Thanks!

edger goana • 3.1k views

ADD COMMENT • link updated 8.7 years ago by Gordon Smyth 53k • written 8.7 years ago by cronanz • 0

0

Entering edit mode

Please show the number of DE genes, i.e., the output from:

summary(decideTestsDGE(lrt, p=0.1))

ADD REPLY • link 8.7 years ago Gordon Smyth 53k

0

Entering edit mode

Turns out I got the numbers wrong for the comments I made below. The output is:

-1 990
0 10586
1 677

So the upregulated genes are indeed more than the downregulated genes. I suppose this makes the output I got above sense? Apologies for the confusion.

ADD REPLY • link 8.7 years ago cronanz • 0

score 2 · Answer 1 · 2017-02-24

2

Entering edit mode

Aaron Lun ★ 29k

@alun

Last seen 5 hours ago

The city by the bay

The extra piece of information that determines the p-value calculation is the total number of up- or down-regulated genes. The numbers reported in Up and Down are the number of genes that are up- or down-regulated, respectively, and in the gene set corresponding to the row. Even if these numbers are similar, you can get very different p-values, e.g., if the total number of up-regulated genes is (much) lower than the total number of down-regulated genes (i.e., the proportion of up-regulated genes in the set is higher than that of the down-regulated genes). Another contributor is the fact that the hypergeometric distribution has very low variance, so modest differences in the up/down proportions can result in fairly dramatic differences in the p-values.

ADD COMMENT • link 8.7 years ago Aaron Lun ★ 29k

0

Entering edit mode

Thank you Aaron for your answer.

Well if that's the case, that's a big difference in p.values I'm seeing here. I know that it would be prudent to gauge the significance with p.value alone, but given that the above comparison had 70 genes difference between up- and down-regulated genes (more up-regulated genes), I'm still surprised that the difference is big. Is that the entirety of how p.values are computed? I suppose the FDR ranking is not taken into account.

ADD REPLY • link 8.7 years ago cronanz • 0

1

Entering edit mode

Based on your response to Gordon, the proportion of up-regulated genes in the first set is 124/677 = 0.18. The proportion of down-regulated genes is 113/990 = 0.11. So it's actually a fairly big difference in the proportions, which explains the difference in the p-values corresponding to each direction of DE.