Question

2 issues about enriched gene sets via Roast

0

Entering edit mode

Julie Leonard ▴ 110

@julie-leonard-5222

Last seen 9.6 years ago

Hi. I am using Roast to perform gene set enrichment analysis after doing differential expression analysis in edgeR. In this particular study, I had 2 variables which I joined to create a single factor in the linear model: y ~ 0 + combo_variable. Looking at the variance in the data, most of the variance was due to 1 variable and there was very little variance due to the other variable. Thus, large numbers of genes (~10,000) were found to be differentially expressed when testing the contrast for one variable and very few genes (~200) were found to be differentially expressed when testing the contrast for the other variable. When I ran roast for each of these 2 contrasts, the one that had lots of differentially expressed genes found almost all of the gene sets to be enriched. This is understandable since there were lots of genes differentially expressed, but my problem is that most of the gene sets had the same FDRs. Thus I can't even narrow down the list of enriched gene sets by using a more stringent FDR cutoff. A subset of the output is shown below. Why would all of these gene sets have the same p-values and thus the same FDRs?? NGenes PropDown PropUp Direction PValue FDR PValue.Mixed FDR.Mixed 112 0.455357 0.205357 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 38 0.578947 0.210526 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 10 0.2 0.4 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 311 0.299035 0.469453 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 540 0.233333 0.344444 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 1294 0.328439 0.257342 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 317 0.29653 0.533123 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 538 0.421933 0.256506 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 133 0.511278 0.293233 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 39 0.589744 0.205128 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 14 0.214286 0.571429 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 13 0.307692 0.538462 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 36 0.472222 0.222222 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 616 0.160714 0.688312 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 6 1 0 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 21 0.428571 0.285714 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 65 0.415385 0.246154 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 99 0.383838 0.323232 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 19 0 0.578947 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 118 0.5 0.313559 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 470 0.461702 0.323404 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 1401 0.404711 0.250535 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 631 0.272583 0.369255 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 55 0.236364 0.472727 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 5 0.2 0.6 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 On the other hand, when I ran roast for the contrast with few genes differentially expressed, I got few gene sets enriched. But what's odd is it did find some gene sets enriched with FDR.Mixed < 0.05, but none of the genes in the gene set were differentially expressed. Are these enriched gene sets false positives? I'm not sure what's going on here. NGenes PropDown PropUp Direction PValue FDR PValue.Mixed FDR.Mixed # DE genes 18 0.333333 0 Down 2.00E-04 0.01636 4.00E-04 0.024994 0 5 0.4 0 Down 2.00E-04 0.01636 5.00E-04 0.024994 0 7 0.714286 0 Down 6.00E-04 0.045444 6.00E-04 0.024994 0 50 0.08 0.32 Up 0.013 0.159882 0.001 0.032379 0 49 0.346939 0.142857 Down 0.049 0.272132 0.001 0.032379 0 4 0.25 0.25 Down 0.1722 0.457071 0.0012 0.037628 0 Please advise. Thanks, Julie Julie Leonard Computational Biologist Global Bioinformatics Syngenta Biotechnology, Inc. This message may contain confidential information. If yo...{{dropped:7}}

edgeR edgeR • 1.3k views

ADD COMMENT • link updated 10.2 years ago by Gordon Smyth 50k • written 10.3 years ago by Julie Leonard ▴ 110

score 0 · Answer 1 · 2014-01-24

Same questions minus the unformatted table data. Hi. I am using Roast to perform gene set enrichment analysis after doing differential expression analysis in edgeR. In this particular study, I had 2 variables which I joined to create a single factor in the linear model: y ~ 0 + combo_variable. Looking at the variance in the data, most of the variance was due to 1 variable and there was very little variance due to the other variable. Thus, large numbers of genes (~10,000) were found to be differentially expressed when testing the contrast for one variable and very few genes (~200) were found to be differentially expressed when testing the contrast for the other variable. When I ran roast for each of these 2 contrasts, the one that had lots of differentially expressed genes found almost all of the gene sets to be enriched. This is understandable since there were lots of genes differentially expressed, but my problem is that most of the gene sets had the same FDRs. Thus I can't even narrow down the list of enriched gene sets by using a more stringent FDR cutoff. Why would all of these gene sets have the same p-values and thus the same FDRs?? On the other hand, when I ran roast for the contrast with few genes differentially expressed, I got few gene sets enriched. But what's odd is it did find some gene sets enriched with FDR.Mixed < 0.05, but none of the genes in the gene set were differentially expressed. Are these enriched gene sets false positives? I'm not sure what's going on here. Please advise. Thanks, Julie Julie Leonard Computational Biologist Global Bioinformatics Syngenta Biotechnology, Inc. -----Original Message----- From: bioconductor-bounces@r-project.org [mailto:bioconductor- bounces@r-project.org] On Behalf Of julie.leonard@syngenta.com Sent: Thursday, January 23, 2014 5:28 PM To: bioconductor at r-project.org Subject: [BioC] 2 issues about enriched gene sets via Roast Hi. I am using Roast to perform gene set enrichment analysis after doing differential expression analysis in edgeR. In this particular study, I had 2 variables which I joined to create a single factor in the linear model: y ~ 0 + combo_variable. Looking at the variance in the data, most of the variance was due to 1 variable and there was very little variance due to the other variable. Thus, large numbers of genes (~10,000) were found to be differentially expressed when testing the contrast for one variable and very few genes (~200) were found to be differentially expressed when testing the contrast for the other variable. When I ran roast for each of these 2 contrasts, the one that had lots of differentially expressed genes found almost all of the gene sets to be enriched. This is understandable since there were lots of genes differentially expressed, but my problem is that most of the gene sets had the same FDRs. Thus I can't even narrow down the list of enriched! gene sets by using a more stringent FDR cutoff. A subset of the output is shown below. Why would all of these gene sets have the same p-values and thus the same FDRs?? NGenes PropDown PropUp Direction PValue FDR PValue.Mixed FDR.Mixed 112 0.455357 0.205357 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 38 0.578947 0.210526 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 10 0.2 0.4 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 311 0.299035 0.469453 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 540 0.233333 0.344444 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 1294 0.328439 0.257342 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 317 0.29653 0.533123 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 538 0.421933 0.256506 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 133 0.511278 0.293233 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 39 0.589744 0.205128 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 14 0.214286 0.571429 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 13 0.307692 0.538462 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 36 0.472222 0.222222 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 616 0.160714 0.688312 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 6 1 0 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 21 0.428571 0.285714 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 65 0.415385 0.246154 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 99 0.383838 0.323232 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 19 0 0.578947 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 118 0.5 0.313559 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 470 0.461702 0.323404 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 1401 0.404711 0.250535 Down 2.00E-04 0.0002 1.00E-04 5.50E-05 631 0.272583 0.369255 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 55 0.236364 0.472727 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 5 0.2 0.6 Up 2.00E-04 0.0002 1.00E-04 5.50E-05 On the other hand, when I ran roast for the contrast with few genes differentially expressed, I got few gene sets enriched. But what's odd is it did find some gene sets enriched with FDR.Mixed < 0.05, but none of the genes in the gene set were differentially expressed. Are these enriched gene sets false positives? I'm not sure what's going on here. NGenes PropDown PropUp Direction PValue FDR PValue.Mixed FDR.Mixed # DE genes 18 0.333333 0 Down 2.00E-04 0.01636 4.00E-04 0.024994 0 5 0.4 0 Down 2.00E-04 0.01636 5.00E-04 0.024994 0 7 0.714286 0 Down 6.00E-04 0.045444 6.00E-04 0.024994 0 50 0.08 0.32 Up 0.013 0.159882 0.001 0.032379 0 49 0.346939 0.142857 Down 0.049 0.272132 0.001 0.032379 0 4 0.25 0.25 Down 0.1722 0.457071 0.0012 0.037628 0 Please advise. Thanks, Julie Julie Leonard Computational Biologist Global Bioinformatics Syngenta Biotechnology, Inc. This message may contain confidential information. If yo...{{dropped:17}}

score 0 · Answer 2 · 2014-01-25

Dear Julie, When you do a rotation or permutation test, the smallest possible p-value that can be achieved depends on the number of distinct rotations or permutations that have been performed. You appear to be using nrot=9999 rotations, so the smallest one-sided p-value that is possible is p = 1 / (nrot+1) = 1e-4 The small possible two-sided p-value possible therefore is twice this, which is 2e-4. Any gene set that contains lots of differential expression, so that the observed statistic is greater than any of the rotated statistics, will be assigned this minimum p-value. One can resolve these small p-values further by doing more rotations. The more rotations than are done, the fewer gene sets will sit on the minimum. Best wishes Gordon PS. If you're not sure where the above p-value formula comes from, see Section 4 of: http://www.statsci.org/smyth/pubs/PermPValuesPreprint.pdf > Date: Thu, 23 Jan 2014 17:27:46 -0500 > From: <julie.leonard at="" syngenta.com=""> > To: <bioconductor at="" r-project.org=""> > Subject: [BioC] 2 issues about enriched gene sets via Roast > > Hi. > I am using Roast to perform gene set enrichment analysis after doing > differential expression analysis in edgeR. In this particular study, I > had 2 variables which I joined to create a single factor in the linear > model: y ~ 0 + combo_variable. Looking at the variance in the data, > most of the variance was due to 1 variable and there was very little > variance due to the other variable. Thus, large numbers of genes > (~10,000) were found to be differentially expressed when testing the > contrast for one variable and very few genes (~200) were found to be > differentially expressed when testing the contrast for the other > variable. When I ran roast for each of these 2 contrasts, the one that > had lots of differentially expressed genes found almost all of the gene > sets to be enriched. This is understandable since there were lots of > genes differentially expressed, but my problem is that most of the gene > sets had the same FDRs. Thus I can't even narrow down the list of > enriched! gene sets by using a more stringent FDR cutoff. A subset of > the output is shown below. Why would all of these gene sets have the > same p-values and thus the same FDRs?? > > NGenes > > PropDown > > PropUp > > Direction > > PValue > > FDR > > PValue.Mixed > > FDR.Mixed > > 112 > > 0.455357 > > 0.205357 > > Down > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 38 > > 0.578947 > > 0.210526 > > Down > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 10 > > 0.2 > > 0.4 > > Up > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 311 > > 0.299035 > > 0.469453 > > Up > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 540 > > 0.233333 > > 0.344444 > > Up > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 1294 > > 0.328439 > > 0.257342 > > Down > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 317 > > 0.29653 > > 0.533123 > > Up > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 538 > > 0.421933 > > 0.256506 > > Down > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 133 > > 0.511278 > > 0.293233 > > Down > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 39 > > 0.589744 > > 0.205128 > > Down > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 14 > > 0.214286 > > 0.571429 > > Up > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 13 > > 0.307692 > > 0.538462 > > Up > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 36 > > 0.472222 > > 0.222222 > > Down > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 616 > > 0.160714 > > 0.688312 > > Up > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 6 > > 1 > > 0 > > Down > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 21 > > 0.428571 > > 0.285714 > > Down > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 65 > > 0.415385 > > 0.246154 > > Down > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 99 > > 0.383838 > > 0.323232 > > Down > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 19 > > 0 > > 0.578947 > > Up > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 118 > > 0.5 > > 0.313559 > > Down > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 470 > > 0.461702 > > 0.323404 > > Down > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 1401 > > 0.404711 > > 0.250535 > > Down > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 631 > > 0.272583 > > 0.369255 > > Up > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 55 > > 0.236364 > > 0.472727 > > Up > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > 5 > > 0.2 > > 0.6 > > Up > > 2.00E-04 > > 0.0002 > > 1.00E-04 > > 5.50E-05 > > > > On the other hand, when I ran roast for the contrast with few genes differentially expressed, I got few gene sets enriched. But what's odd is it did find some gene sets enriched with FDR.Mixed < 0.05, but none of the genes in the gene set were differentially expressed. Are these enriched gene sets false positives? I'm not sure what's going on here. > > NGenes > > PropDown > > PropUp > > Direction > > PValue > > FDR > > PValue.Mixed > > FDR.Mixed > > # DE genes > > 18 > > 0.333333 > > 0 > > Down > > 2.00E-04 > > 0.01636 > > 4.00E-04 > > 0.024994 > > 0 > > 5 > > 0.4 > > 0 > > Down > > 2.00E-04 > > 0.01636 > > 5.00E-04 > > 0.024994 > > 0 > > 7 > > 0.714286 > > 0 > > Down > > 6.00E-04 > > 0.045444 > > 6.00E-04 > > 0.024994 > > 0 > > 50 > > 0.08 > > 0.32 > > Up > > 0.013 > > 0.159882 > > 0.001 > > 0.032379 > > 0 > > 49 > > 0.346939 > > 0.142857 > > Down > > 0.049 > > 0.272132 > > 0.001 > > 0.032379 > > 0 > > 4 > > 0.25 > > 0.25 > > Down > > 0.1722 > > 0.457071 > > 0.0012 > > 0.037628 > > 0 > > > Please advise. > > Thanks, > Julie > > > Julie Leonard > Computational Biologist > Global Bioinformatics > Syngenta Biotechnology, Inc. ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

score 0 · Answer 3 · 2014-01-25

> Date: Thu, 23 Jan 2014 17:27:46 -0500 > From: <julie.leonard at="" syngenta.com=""> > To: <bioconductor at="" r-project.org=""> > Subject: [BioC] 2 issues about enriched gene sets via Roast ... > On the other hand, when I ran roast for the contrast with few genes > differentially expressed, I got few gene sets enriched. But what's odd > is it did find some gene sets enriched with FDR.Mixed < 0.05, but none > of the genes in the gene set were differentially expressed. Are these > enriched gene sets false positives? I'm not sure what's going on here. This is in a way the whole point of gene set testing, to be able to pool information between genes to find an overal trend that is not apparent merely by looking at the individual genes. The original GSEA publication (Mootha et al, Nature Genetics 2003) began by saying that no genes were individually significant in their data: "When assessed with ... analytical techniques that take into account the multiple comparisons implicit in microarray analysis, no single gene had a significant difference in expression between the diagnostic categories (data not shown)." They then went on to show that GSEA nevertheless found significant gene set changes. So the phenomenon that you have observed was the original justification for the whole gene set testing approach. Best wishes Gordon ... > Please advise. > > Thanks, > Julie > > > Julie Leonard > Computational Biologist > Global Bioinformatics > Syngenta Biotechnology, Inc. ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}