edgeR vs DESeq for comparison without replicate


Woo, Sangsoon ▴ 20

@woo-sangsoon-4745

Last seen 9.6 years ago

Dear all,

I am working on a ChIP-Seq data set. I want to compare two groups with only one sample in each group (no replicates in either group). I generated a count matrix in which each element is the number of reads within a gene region for each data set.

I applied the edgeR and DESeq methods to this comparison. For this case:

1. edgeR uses a Poisson model by setting common.disp=1e-6 (effectively zero).
2. DESeq still uses the NB model, estimating the dispersion by assuming there is no difference between the two samples.

The results are:

1. edgeR identifies many genes with very small p-values / adjusted p-values when I use the common.disp approach.
2. edgeR gives no significant genes with the tagwise.disp option.
3. DESeq does not identify any significant genes.

I think that the p-values in #2 and #3 are based on summing over all sums of counts that have a probability less than the probability, under the null hypothesis, of the observed sum of counts. But #1 is based on a Poisson distribution with much smaller variation than the actual data. Am I right? Looking at the raw counts for the top genes is not helpful because it is just comparing two numbers.

Which package is better for the case without replicates, based on your experience?

Thanks in advance for your help.

Sangsoon

edgeR DESeq • 8.3k views

updated 12.8 years ago by Davis McCarthy ▴ 260 • written 12.8 years ago by Woo, Sangsoon ▴ 20


Davis McCarthy ▴ 260

@davis-mccarthy-4138

Last seen 9.6 years ago

Dear Sangsoon,

Analysing count data for significance without replicates is always somewhat problematic. Experience tells us that genomic count data (ChIP-Seq, RNA-Seq, etc.) has substantial variability, more than a Poisson distribution is able to account for. However, if you do not have replicates then it is not possible to account for the extra-Poisson variability (overdispersion) in a completely satisfying way.

I don't think that there really is an answer to the question of which of edgeR or DESeq is "better" for analysing data without replicates. Given that both packages assess significance using Robinson & Smyth's exact test (Biostatistics, 2008), both will give essentially the same significance results if the dispersion modeling is the same. In this case, you are using very different dispersion modeling approaches in edgeR and DESeq, so the results are not all that comparable. Previous discussions on this mailing list have suggested using the NB model, assuming there is no difference between samples, to roughly estimate the dispersion in both edgeR and DESeq.

The results that you describe are not surprising. The edgeR analysis that you did is a Poisson model analysis, which we would expect to yield many significant DE genes. The DESeq analysis that you have described (which I would normally also recommend as the better approach to use in edgeR) roughly estimates the dispersion; once you allow for some variability in the data, you see no DE. Again, this is not unexpected behaviour.

There is currently another thread on Bioconductor in which Gordon has discussed more strategies for analysis when there are no replicates. I recommend that you have a look at his thoughts there.

What you haven't told us is the size of the dispersion estimates that DESeq is using. In my experience, (common) dispersion values for biological replicate data are often in the range of 0.1-0.6. If the dispersion values that you are using are much higher than this, then I would be looking at things much more closely.

Fundamentally, however, assessing statistical significance without replicate samples is very difficult. It's a lot to ask of a software package to pull out sensible DE genes without replication. I am somewhat relieved that the DESeq approach you took, and tagwise dispersions in edgeR, yield no DE genes. In the end, robust statistical inference on differential expression requires (biologically) replicate samples, and there's no easy way around that.

Best wishes,
Davis
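To make the effect of the dispersion value concrete, here is a minimal Python sketch of a two-sample exact test conditioned on the total count. It is a simplified illustration of the idea, not the edgeR or DESeq implementation: it assumes equal library sizes, a single gene, and a dispersion supplied by hand. The function names `nb_log_pmf` and `exact_test_pvalue`, and the example counts, are hypothetical.

```python
import math


def nb_log_pmf(k, mu, dispersion):
    """Log-probability of count k under a negative binomial with mean mu
    and variance mu + dispersion * mu**2."""
    size = 1.0 / dispersion
    return (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )


def exact_test_pvalue(count_a, count_b, dispersion):
    """Two-sided p-value for one gene with one sample per group.

    Conditions on the total count and sums the probabilities of every split
    of that total that is no more likely than the observed split.
    Assumption: both libraries have the same size.
    """
    total = count_a + count_b
    if total == 0:
        return 1.0
    mu = total / 2.0  # common mean under the null hypothesis
    log_w = [
        nb_log_pmf(a, mu, dispersion) + nb_log_pmf(total - a, mu, dispersion)
        for a in range(total + 1)
    ]
    peak = max(log_w)  # subtract the maximum before exponentiating
    weights = [math.exp(v - peak) for v in log_w]
    observed = weights[count_a]
    # Small relative tolerance so that ties with the observed split are included.
    tail = sum(w for w in weights if w <= observed * (1.0 + 1e-7))
    return min(1.0, tail / sum(weights))


if __name__ == "__main__":
    # Same pair of counts, three dispersion settings (values are illustrative).
    for disp in (1e-6, 0.1, 0.4):
        print(disp, exact_test_pvalue(100, 200, disp))
```

With the dispersion set near zero the test behaves like a Poisson test and a two-fold difference at these counts looks highly significant; with a dispersion in the 0.1-0.6 range the same pair of counts is no longer remarkable.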

12.8 years ago • Davis McCarthy ▴ 260


Dear Davis,

Thank you so much for your explanation and details. I know that it's always a difficult problem when we do not have replicates.

I think that I'd better rank genes by their logFC instead of establishing a p-value threshold. At least we can see biological differences. Of course, we need to be careful with genes that have only a few reads in just one group.

I will take a look at Gordon's discussion.

Thanks again.

Sangsoon
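The logFC ranking idea, including the caution about genes with only a few reads in one group, could be sketched roughly as follows in Python. This is an illustration under stated assumptions rather than a recommended pipeline: a small prior count is added to each count (scaled by library size) before taking the log ratio, so that pairs like 0 vs 3 reads do not dominate the ranking. The function names, the default `prior_count=2.0`, and the gene labels are all hypothetical.

```python
import math


def moderated_log2fc(count_a, count_b, lib_a, lib_b, prior_count=2.0):
    """log2 fold change (sample B over sample A) with a prior count added.

    The prior is scaled by relative library size so that it does not
    introduce a fold change of its own.
    """
    mean_lib = (lib_a + lib_b) / 2.0
    prior_a = prior_count * lib_a / mean_lib
    prior_b = prior_count * lib_b / mean_lib
    rate_a = (count_a + prior_a) / lib_a
    rate_b = (count_b + prior_b) / lib_b
    return math.log2(rate_b / rate_a)


def rank_by_log2fc(counts, lib_a, lib_b, prior_count=2.0):
    """Return (gene, log2FC) pairs sorted by decreasing absolute log2FC.

    `counts` maps a gene name to a (count_a, count_b) tuple.
    """
    scored = [
        (gene, moderated_log2fc(a, b, lib_a, lib_b, prior_count))
        for gene, (a, b) in counts.items()
    ]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    example = {"low": (0, 3), "high": (100, 400), "flat": (50, 50)}
    for gene, lfc in rank_by_log2fc(example, 1_000_000, 1_000_000):
        print(gene, round(lfc, 3))
```

A larger prior count pulls low-count genes more strongly towards a fold change of zero; the choice of value is a judgment call and does not replace a proper dispersion estimate from replicates.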

12.8 years ago • Woo, Sangsoon ▴ 20
