Expected number of DE genes?

0

Entering edit mode

Jessica Perry Hekman ▴ 140

@jessica-perry-hekman-6556

Last seen 10.5 years ago

I'm getting only a few dozen differentially expressed genes when I analyze my RNA-Seq data with DESeq2 (79) and EdgeR (34) (even fewer when I use EBSeq). I had expected many more -- hundreds or even a thousand. If this is the real answer, I'm fine with it, but I'm concerned that I'm doing something wrong. What are the ranges of numbers of differentially expressed genes that one would expect from DESeq2 or EdgeR? More information: I'm in the midst of my first RNA-seq project (as many of you have probably surmised from my frequent postings to a variety of lists). My initial goal is to get a list of differentially expressed (DE) genes. I have 24 samples, 12 from each of 2 treatment groups. My species is fox (Vulpes vulpes), which aligns very nicely to dog (Canis familiaris). My current approach is to use the dog reference genome (to which my fox reads align at about 83%) + GTF with location of exons. Can I feel confident about DESeq2 and EdgeR's calls? Thanks very much for any insights, Jessica -- Jessica P. Hekman, DVM, MS PhD student, University of Illinois, Urbana-Champaign Animal Sciences / Genetics, Genomics, and Bioinformatics

Genetics edgeR DESeq2 Genetics edgeR DESeq2 • 3.2k views

ADD COMMENT • link updated 10.7 years ago by Thomas Hampton ▴ 750 • written 10.7 years ago by Jessica Perry Hekman ▴ 140

0

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 5 months ago

Icahn School of Medicine at Mount Sinai…

Hi Jessica, Have you done some exploratory analysis on your dataset? A good place to start would be to generate PCA plots using plotMDS for edgeR and plotPCA for DESeq2. Do your samples cluster into two groups as expected on the PCA plots? Secondly, you should have a look at the dispersion estimates from both methods and compare them to typically observed values. You can do this with plotBCV for edgeR and plotDispEsts for DESeq2 (but remember that BCV is the square root of dispersion, so pay attention to the y-axis label). If your dispersions are too high, this indicates that the variation within groups is large, which means that detecting significant differences between groups is difficult and you will get fewer genes. The edgeR User's Guide says in section 2.10 that typical BCV values are 0.1 for genetically identical animals (e.g. lab mice) and 0.4 for human samples. The latter value is probably closer to what you should expect. If your BCV is a lot higher than that, your experiment may not be well-controlled, or there may be some other problem in the methods or the data that you need to track down. Home this helps, -Ryan Thompson. On Tue Jul 15 14:54:47 2014, Jessica Perry Hekman wrote: > I'm getting only a few dozen differentially expressed genes when I > analyze my RNA-Seq data with DESeq2 (79) and EdgeR (34) (even fewer > when I use EBSeq). I had expected many more -- hundreds or even a > thousand. If this is the real answer, I'm fine with it, but I'm > concerned that I'm doing something wrong. What are the ranges of > numbers of differentially expressed genes that one would expect from > DESeq2 or EdgeR? > > More information: > > I'm in the midst of my first RNA-seq project (as many of you have > probably surmised from my frequent postings to a variety of lists). My > initial goal is to get a list of differentially expressed (DE) genes. > > I have 24 samples, 12 from each of 2 treatment groups. > > My species is fox (Vulpes vulpes), which aligns very nicely to dog > (Canis familiaris). > > My current approach is to use the dog reference genome (to which my > fox reads align at about 83%) + GTF with location of exons. > > Can I feel confident about DESeq2 and EdgeR's calls? > > Thanks very much for any insights, > > Jessica >

ADD COMMENT • link 10.7 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Ryan -- that's extremely helpful, thanks! I'll try PCA and look at my dispersion estimates and see what light that sheds. Much appreciated. Jessica Jessica P. Hekman, DVM, MS PhD student, University of Illinois, Urbana-Champaign Animal Sciences / Genetics, Genomics, and Bioinformatics On 07/15/2014 05:23 PM, Ryan wrote: > Hi Jessica, > > Have you done some exploratory analysis on your dataset? A good place to > start would be to generate PCA plots using plotMDS for edgeR and plotPCA > for DESeq2. Do your samples cluster into two groups as expected on the > PCA plots? Secondly, you should have a look at the dispersion estimates > from both methods and compare them to typically observed values. You can > do this with plotBCV for edgeR and plotDispEsts for DESeq2 (but remember > that BCV is the square root of dispersion, so pay attention to the > y-axis label). If your dispersions are too high, this indicates that the > variation within groups is large, which means that detecting significant > differences between groups is difficult and you will get fewer genes. > The edgeR User's Guide says in section 2.10 that typical BCV values are > 0.1 for genetically identical animals (e.g. lab mice) and 0.4 for human > samples. The latter value is probably closer to what you should expect. > If your BCV is a lot higher than that, your experiment may not be > well-controlled, or there may be some other problem in the methods or > the data that you need to track down. > > Home this helps, > > -Ryan Thompson. > > On Tue Jul 15 14:54:47 2014, Jessica Perry Hekman wrote: >> I'm getting only a few dozen differentially expressed genes when I >> analyze my RNA-Seq data with DESeq2 (79) and EdgeR (34) (even fewer >> when I use EBSeq). I had expected many more -- hundreds or even a >> thousand. If this is the real answer, I'm fine with it, but I'm >> concerned that I'm doing something wrong. What are the ranges of >> numbers of differentially expressed genes that one would expect from >> DESeq2 or EdgeR? >> >> More information: >> >> I'm in the midst of my first RNA-seq project (as many of you have >> probably surmised from my frequent postings to a variety of lists). My >> initial goal is to get a list of differentially expressed (DE) genes. >> >> I have 24 samples, 12 from each of 2 treatment groups. >> >> My species is fox (Vulpes vulpes), which aligns very nicely to dog >> (Canis familiaris). >> >> My current approach is to use the dog reference genome (to which my >> fox reads align at about 83%) + GTF with location of exons. >> >> Can I feel confident about DESeq2 and EdgeR's calls? >> >> Thanks very much for any insights, >> >> Jessica >>

ADD REPLY • link 10.7 years ago Jessica Perry Hekman ▴ 140

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 2.0 years ago

United States

Hi, On Tue, Jul 15, 2014 at 5:54 PM, Jessica Perry Hekman <hekman2 at="" illinois.edu=""> wrote: > I'm getting only a few dozen differentially expressed genes when I analyze > my RNA-Seq data with DESeq2 (79) and EdgeR (34) (even fewer when I use > EBSeq). I had expected many more -- hundreds or even a thousand. If this is > the real answer, I'm fine with it, but I'm concerned that I'm doing > something wrong. What are the ranges of numbers of differentially expressed > genes that one would expect from DESeq2 or EdgeR? This is an impossible question to answer as this will, of course, depend on your experiment. There are many different reasons why you found so few DGEs. Maybe this is reality, maybe this is a failed experiment (ie. some failure at the bench), or maybe there is a mistake in your analysis. We can only help you to answer the last point, but to do so we will need to see the code you used and an explanation of your dataset (ie. which samples are which condition). Also: why do you expect to get upwards of a thousand differentially expressed genes? Have you (or others) done this experiment before and verified those results, or? > More information: > > I'm in the midst of my first RNA-seq project (as many of you have probably > surmised from my frequent postings to a variety of lists). My initial goal > is to get a list of differentially expressed (DE) genes. > > I have 24 samples, 12 from each of 2 treatment groups. That's an amount of replication that most of us would be envious of. Assuming the experiment was done w/o problems, then you should be well powered to detect true DE events. > My species is fox (Vulpes vulpes), which aligns very nicely to dog (Canis > familiaris). Tricky. > My current approach is to use the dog reference genome (to which my fox > reads align at about 83%) + GTF with location of exons. What does the enrichment of reads that align to exons vs. intergenic reads look like? Have you (or others) had success doing an analysis this way before w/ previous data/experiments? > Can I feel confident about DESeq2 and EdgeR's calls? For the most part, yes, as long as you have done the analysis properly. -steve -- Steve Lianoglou Computational Biologist Genentech

ADD COMMENT • link 10.7 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

On 07/15/2014 05:32 PM, Steve Lianoglou wrote: > We can only help you to answer the last point, but to do so we will > need to see the code you used and an explanation of your dataset (ie. > which samples are which condition). > > Also: why do you expect to get upwards of a thousand differentially > expressed genes? Have you (or others) done this experiment before and > verified those results, or? Steve -- thanks for your feedback. This is my first time doing RNA-Seq and I had been under the impression that there would be a large number of DE genes just by chance in most experiments. I think to a large extent I just need to adjust my expectations. I just don't have a feel for what normal results look like! However, as you asked whether this had been done before, I sheepishly recalled that we do have some analysis from a different lab using different tissues from the same animals and a slightly different approach to dealing with the data (de novo transcriptome instead of aligning to dog). It turns out that they found 350-650 DE genes, depending on the tissue. So they did get more DE genes than I did. > What does the enrichment of reads that align to exons vs. intergenic > reads look like? I haven't done this analysis, but I'll take a look and see what it looks like. I'm guessing that I should expect 80-90% of reads to align to exons, since this is RNA? You suggested that seeing my code might be helpful, so here it is, in case you're curious: For EdgeR: x <- read.delim("cf3ncbi.genes.matrix", row.names="GeneName") group <- factor(c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2)) y <- DGEList(counts=x,group=group) y <- calcNormFactors(y) y <- estimateCommonDisp(y) y <- estimateTagwiseDisp(y) et <- exactTest(y) tt <- topTags(et, n=100) write.table(tt, file = "edgeR-cf3ncbi-topTags100.txt") ...where cf3ncbi.genes.matrix looks like: GeneName 1_481 2_124 3_23 4_124 5_125 6_126 7_127 8_128 9_129 10_130 11_131 12_132 49_169 50_170 51_171 52_172 53_173 54_174 55_535 56_176 57_177 58_178 59_179 60_180 A1BG 27 39 55 41 50 42 33 66 63 31 50 34 52 53 34 44 35 20 45 43 75 47 43 45 A1CF 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 and so forth (as suggested in the code, the first 12 samples are from one treatment group and the second 12 samples are from another). For DESeq2: sample.table <- data.frame(read.table("DESeq2SampleTable.txt", header=TRUE)) dds <- DESeqDataSetFromHTSeqCount(sample.table, directory = "cf3ncbi/", formula(~ type)) dds2 <- DESeq(dds) res <- results(dds2) write.table (res, "cf3ncbi.deseq2.results.csv") sum( res$padj < 0.05, na.rm=TRUE) ...where DESeq2SampleTable.txt looks like: sampleName sampleFile type 1_481 cf3ncbi-1_481.genes.counts tame 2_124 cf3ncbi-2_124.genes.counts tame ... Thanks, Jessica

ADD REPLY • link 10.7 years ago Jessica Perry Hekman ▴ 140

0

Entering edit mode

Thomas Hampton ▴ 750

@thomas-hampton-2820

Last seen 10.5 years ago

Dear Jessica, The answer to this sort of question depends on the question you are actually asking. You no doubt know a) that in this sort of test many genes are expected to reach nominal significance by chance and b) that software packages include algorithms intended to differentiate between these predicted "false positives" and "true positives". So maybe you are asking : do these strategies work? The short answer is that they definitely do not work perfectly for every experiment or for every gene in any experiment. I look at a lot of data where effects are pretty subtle and the usual sort of multiple hypothesis testing corrections suggest that there are *no* differentially expressed genes in my data. I have yet to run into a situation where follow up measurements and experiments validated the assertion that *no* genes were in fact differentially expressed. If I were in your shoes, as I have been on numerous occasions, I would do at least some of the following. 1) Establish that samples from your experimental groups are more like samples from the same group than they are like samples from other groups using comparisons like PCA, multi dimensional scaling or hierarchical clustering. 2) Consider your a priori knowledge of genes that you thought would go some place. Did they go in the directions you thought they should go? 3) Use qPCR or some low throughput method on single genes to reassure yourself that genes of interest (say genes with a fold change of 2 and a nominal, unadjusted p value of 0.05) are in fact regulated, even though multiple hypothesis testing correction suggests that they may be the result of chance. Good luck, Tom On Jul 15, 2014, at 5:54 PM, Jessica Perry Hekman wrote: > I'm getting only a few dozen differentially expressed genes when I analyze my RNA-Seq data with DESeq2 (79) and EdgeR (34) (even fewer when I use EBSeq). I had expected many more -- hundreds or even a thousand. If this is the real answer, I'm fine with it, but I'm concerned that I'm doing something wrong. What are the ranges of numbers of differentially expressed genes that one would expect from DESeq2 or EdgeR? > > More information: > > I'm in the midst of my first RNA-seq project (as many of you have probably surmised from my frequent postings to a variety of lists). My initial goal is to get a list of differentially expressed (DE) genes. > > I have 24 samples, 12 from each of 2 treatment groups. > > My species is fox (Vulpes vulpes), which aligns very nicely to dog (Canis familiaris). > > My current approach is to use the dog reference genome (to which my fox reads align at about 83%) + GTF with location of exons. > > Can I feel confident about DESeq2 and EdgeR's calls? > > Thanks very much for any insights, > > Jessica > > -- > Jessica P. Hekman, DVM, MS > PhD student, University of Illinois, Urbana-Champaign > Animal Sciences / Genetics, Genomics, and Bioinformatics > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 10.7 years ago Thomas Hampton ▴ 750

0

Entering edit mode

Thanks, Tom. Yes, you summarized my dilemma well, although I am more concerned with false negatives right now than false positives (as we do intend to do PCR to validate any positives we get, but any false negatives lost are lost forever :). Ryan also suggested PCA and I'll definitely be trying that next. But getting the feedback that others have used these techniques and feel moderately confident in them (with followup work) is very helpful. Thanks, Jessica Jessica P. Hekman, DVM, MS PhD student, University of Illinois, Urbana-Champaign Animal Sciences / Genetics, Genomics, and Bioinformatics On 07/15/2014 06:52 PM, Thomas H. Hampton wrote: > Dear Jessica, > > The answer to this sort of question depends on the question you are actually asking. > > You no doubt know a) that in this sort of test many genes are expected to reach > nominal significance by chance and b) that software packages include algorithms > intended to differentiate between these predicted "false positives" and "true positives". > > So maybe you are asking : do these strategies work? > > The short answer is that they definitely do not work perfectly for every experiment or for every gene in any experiment. > > I look at a lot of data where effects are pretty subtle and the usual sort of multiple hypothesis testing corrections suggest that > there are *no* differentially expressed genes in my data. > > I have yet to run into a situation where follow up measurements and experiments validated the assertion that *no* genes were > in fact differentially expressed. > > If I were in your shoes, as I have been on numerous occasions, I would do at least some of the following. > > 1) Establish that samples from your experimental groups are more like samples from the same group than > they are like samples from other groups using comparisons like PCA, multi dimensional scaling or hierarchical clustering. > > 2) Consider your a priori knowledge of genes that you thought would go some place. Did they go in the directions you thought they > should go? > > 3) Use qPCR or some low throughput method on single genes to reassure yourself that genes of interest (say genes with a fold change > of 2 and a nominal, unadjusted p value of 0.05) are in fact regulated, even though multiple hypothesis testing correction suggests that > they may be the result of chance. > > Good luck, > > Tom > > > On Jul 15, 2014, at 5:54 PM, Jessica Perry Hekman wrote: > >> I'm getting only a few dozen differentially expressed genes when I analyze my RNA-Seq data with DESeq2 (79) and EdgeR (34) (even fewer when I use EBSeq). I had expected many more -- hundreds or even a thousand. If this is the real answer, I'm fine with it, but I'm concerned that I'm doing something wrong. What are the ranges of numbers of differentially expressed genes that one would expect from DESeq2 or EdgeR? >> >> More information: >> >> I'm in the midst of my first RNA-seq project (as many of you have probably surmised from my frequent postings to a variety of lists). My initial goal is to get a list of differentially expressed (DE) genes. >> >> I have 24 samples, 12 from each of 2 treatment groups. >> >> My species is fox (Vulpes vulpes), which aligns very nicely to dog (Canis familiaris). >> >> My current approach is to use the dog reference genome (to which my fox reads align at about 83%) + GTF with location of exons. >> >> Can I feel confident about DESeq2 and EdgeR's calls? >> >> Thanks very much for any insights, >> >> Jessica >> >> -- >> Jessica P. Hekman, DVM, MS >> PhD student, University of Illinois, Urbana-Champaign >> Animal Sciences / Genetics, Genomics, and Bioinformatics >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 10.7 years ago Jessica Perry Hekman ▴ 140

0

Entering edit mode

Dear Jessica On 16/07/14 02:15, Jessica Perry Hekman wrote: > Thanks, Tom. Yes, you summarized my dilemma well, although I am more > concerned with false negatives right now than false positives (as we do > intend to do PCR to validate any positives we get, but any false > negatives lost are lost forever :). I wonder whether you might have fallen for a fundamental but quite common misunderstanding here, because false positives and false negatives are not treated equal in a hypothesis test. In both edgeR and DESeq, you choose a false discover rate (FDR); in the examples of the vignette, we use 10%, but this is by no way the only useful value. This means that you ask DESeq2 to give you a list of genes that are differentially expressed and that this list should not contain more than 10% false positives, and that you are willing to accept as many false negatives as it takes to ensure that. More succinctly: If a gene is not called significant, this does not mean that the algorithm thinks that it is not differentially expressed but merely that it cannot say whether it is. One other important issue is: What does "significantly diferentially expressed" actually mean? In biological systems, all components are so highly interconnected that is seems implausible to think that there are any genes which are not at all affected by your treatment, not even slightly. I would argue that, in typical experiments, most if not all genes change their expression strength at least a tiny bit in reaction to treatment. The question is whether the difference that you observe between the mean expression in treatment and control samples is driven by this reaction to treatment, or whether it is mainly driven by random fluctuations, i.e., by those differences that you also see when comparing samples treated the same way (replicates). When the random noise has the stronger effect, then the observed difference (log fold change) will be in a random direction and may or may not be in the direction that the treatment has affected the gene. Hence, my (somewhat personal) opinion on what a significant p value means in DE analysis, namely: We got the sign right. A significant call means that we can have confidence in the observed direction of the change. The effect of treatment on this gene was strong enough that we can say with confidence whether the gene reacted with up- or with down-regulation. Hence, if you see less DE genes than expected, this means: The effect of your treatment was too weak to be be seen against the noise from random sample-to-sample variation (or equivalently: the variation within treatment groups was too strong and drowned the treatment signal). It does not mean that there was no effect. To judge whether your results are typical, you would need to tell us more about your experiment. Simon

ADD REPLY • link 10.7 years ago Simon Anders ★ 3.8k

0

Entering edit mode

Thanks, Simon. I did expect a fair amount of noise, and I think a lot of my surprise came from the fact that I expected the noise to be expressed as a large number of positives. Your explanation helps adjust my perspective. Jessica On 07/16/2014 01:12 AM, Simon Anders wrote: > Dear Jessica > > On 16/07/14 02:15, Jessica Perry Hekman wrote: >> Thanks, Tom. Yes, you summarized my dilemma well, although I am more >> concerned with false negatives right now than false positives (as we do >> intend to do PCR to validate any positives we get, but any false >> negatives lost are lost forever :). > > I wonder whether you might have fallen for a fundamental but quite > common misunderstanding here, because false positives and false > negatives are not treated equal in a hypothesis test. > > In both edgeR and DESeq, you choose a false discover rate (FDR); in the > examples of the vignette, we use 10%, but this is by no way the only > useful value. This means that you ask DESeq2 to give you a list of genes > that are differentially expressed and that this list should not contain > more than 10% false positives, and that you are willing to accept as > many false negatives as it takes to ensure that. > > More succinctly: If a gene is not called significant, this does not mean > that the algorithm thinks that it is not differentially expressed but > merely that it cannot say whether it is. > > > One other important issue is: What does "significantly diferentially > expressed" actually mean? In biological systems, all components are so > highly interconnected that is seems implausible to think that there are > any genes which are not at all affected by your treatment, not even > slightly. I would argue that, in typical experiments, most if not all > genes change their expression strength at least a tiny bit in reaction > to treatment. The question is whether the difference that you observe > between the mean expression in treatment and control samples is driven > by this reaction to treatment, or whether it is mainly driven by random > fluctuations, i.e., by those differences that you also see when > comparing samples treated the same way (replicates). When the random > noise has the stronger effect, then the observed difference (log fold > change) will be in a random direction and may or may not be in the > direction that the treatment has affected the gene. > > Hence, my (somewhat personal) opinion on what a significant p value > means in DE analysis, namely: We got the sign right. > > A significant call means that we can have confidence in the observed > direction of the change. The effect of treatment on this gene was strong > enough that we can say with confidence whether the gene reacted with up- > or with down-regulation. > > Hence, if you see less DE genes than expected, this means: The effect of > your treatment was too weak to be be seen against the noise from random > sample-to-sample variation (or equivalently: the variation within > treatment groups was too strong and drowned the treatment signal). It > does not mean that there was no effect. > > > To judge whether your results are typical, you would need to tell us > more about your experiment. > > Simon > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 10.7 years ago Jessica Perry Hekman ▴ 140

Login before adding your answer.