Question: DESeq2
0
5.7 years ago by
Guest User12k
Guest User12k wrote:
Hi, I've been using the DESeq as well as DESeq2 packages for my RNAseq analysis. For one particular study alone, I find some discrepancies between DESeq and DESeq2 results. I've read in forums that the gene list is longer when we use DESeq2, which is true in my case. But what I also notice for this study is that there is no overlap between the gene list obtained using DESeq and DESeq2. Could you please help in understanding this difference? -- output of sessionInfo(): R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] DESeq2_1.2.8 RcppArmadillo_0.4.000 Rcpp_0.10.6 GenomicRanges_1.14.4 [5] XVector_0.2.0 IRanges_1.20.6 BiocGenerics_0.8.0 -- Sent via the guest posting facility at bioconductor.org.
deseq deseq2 • 955 views
modified 5.7 years ago by Michael Love25k • written 5.7 years ago by Guest User12k
0
5.7 years ago by
Michael Love25k
United States
Michael Love25k wrote:
hi, There is not really enough information here to give a concrete answer, but yes, typically DESeq2 will contain most of the DESeq genes found for a given adjusted p-value. If this were not the case, I would first check to see if the rows of the results tables are not lining up for some reason. How many genes are found by either software? Mike On Tue, Feb 18, 2014 at 4:36 PM, Shobana Sekar [guest] < guest@bioconductor.org> wrote: > > Hi, > > I've been using the DESeq as well as DESeq2 packages for my RNAseq > analysis. For one particular study alone, I find some discrepancies between > DESeq and DESeq2 results. I've read in forums that the gene list is longer > when we use DESeq2, which is true in my case. But what I also notice for > this study is that there is no overlap between the gene list obtained using > DESeq and DESeq2. Could you please help in understanding this difference? > > -- output of sessionInfo(): > > R version 3.0.2 (2013-09-25) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > base > > other attached packages: > [1] DESeq2_1.2.8 RcppArmadillo_0.4.000 Rcpp_0.10.6 > GenomicRanges_1.14.4 > [5] XVector_0.2.0 IRanges_1.20.6 BiocGenerics_0.8.0 > > > -- > Sent via the guest posting facility at bioconductor.org. > [[alternative HTML version deleted]]
In general, for looking at sets of genes, you should always use adjusted p-value. If you want to be less conservative/get longer lists, you can raise the threshold to 0.2 etc. This makes more sense than working with unadjusted p-values. It is a bit surprising that the intersection of the 222 and 1008 genes by p-value is 0, as in comparisons I have seen it is usually mostly a subset of the DESeq2 calls. You might try working with gene names to make sure you haven't accidentally reordered rows. length( intersection( rownames(DESeq2Res)[which(DESeq2Res$padj < .33)], DESeqRes$id[which(DESeqRes$padj < .33)] ) ) If you paste the code you are using for comparison and for DE analysis that would help. Also knowing how many samples and conditions do you have. On Tue, Feb 18, 2014 at 5:00 PM, <ssekar@tgen.org> wrote: > Hi Mike, > > Thank you for the quick reply. So when I filtered using the padj < 0.1 > criterion, DESeq did not give me any results but DESeq2 showed 53 genes. > But when I used the pval < 0.05 criterion and filtered, I got 1008 genes in > DESeq2 and 222 genes in DESeq, with no overlap between the 2 lists. Is it > because we should be looking at padj rather than pval? > > Thank you once again! > > -Shobana > > From: Michael Love <michaelisaiahlove@gmail.com> > Date: Tuesday, February 18, 2014 2:49 PM > To: "Shobana Sekar [guest]" <guest@bioconductor.org> > Cc: "bioconductor@r-project.org" <bioconductor@r-project.org>, Shobana > Sekar <ssekar@tgen.org> > Subject: Re: DESeq2 > > hi, > > There is not really enough information here to give a concrete answer, > but yes, typically DESeq2 will contain most of the DESeq genes found for a > given adjusted p-value. If this were not the case, I would first check to > see if the rows of the results tables are not lining up for some reason. > How many genes are found by either software? > > Mike > > > On Tue, Feb 18, 2014 at 4:36 PM, Shobana Sekar [guest] < > guest@bioconductor.org> wrote: > >> >> Hi, >> >> I've been using the DESeq as well as DESeq2 packages for my RNAseq >> analysis. For one particular study alone, I find some discrepancies between >> DESeq and DESeq2 results. I've read in forums that the gene list is longer >> when we use DESeq2, which is true in my case. But what I also notice for >> this study is that there is no overlap between the gene list obtained using >> DESeq and DESeq2. Could you please help in understanding this difference? >> >> -- output of sessionInfo(): >> >> R version 3.0.2 (2013-09-25) >> Platform: x86_64-apple-darwin10.8.0 (64-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> base >> >> other attached packages: >> [1] DESeq2_1.2.8 RcppArmadillo_0.4.000 Rcpp_0.10.6 >> GenomicRanges_1.14.4 >> [5] XVector_0.2.0 IRanges_1.20.6 BiocGenerics_0.8.0 >> >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> > > [[alternative HTML version deleted]] ADD REPLYlink written 5.7 years ago by Michael Love25k hi Shobana, I'm continuing our discussion where we left off on the Bioconductor list. Note that you can save to CSV files straight from R using write.csv(). (some of your columns were exported from Excel as #NAME?) I used the following code to line up your two results tables: old <- read.delim(...,header=TRUE) new <- read.delim(...,header=TRUE) old$id <- as.character(old$id) new$id <- as.character(new$id) rownames(old) <- old$id rownames(new) <- new$id old <- old[!is.na(old$pval),] new <- new[!is.na(new$pvalue),] common <- intersect(old$id, new$id) While there were ~200 genes for DESeq with p-value less than .05, all but 1 of these are not available in the DESeq2 results: > table(old$pval < .05) FALSE TRUE 39785 222 > table(old[common,"pval"] < .05) FALSE TRUE 17129 1 That these genes have NA p-values in the DESeq2 results table is due to automatic outlier detection using Cook's distance in DESeq2. You might see how the results change if you turn this filtering off by using results(dds, cooksCutoff=FALSE). In the development branch (DESeq2 version >= 1.3), we have implemented a better solution to filtering count outliers when there are many replicates (7 or more), so genes are not set aside from the analysis (p-values set to NA) due to one or more extreme count. If you look however at the intersection using a higher p-value threshold (not recommended for analyses, just for sanity check), the intersection is not empty: table(DESeq = old[common,"pval"] < .2, DESeq2 = new[common,"pvalue"] < .2) DESeq2 DESeq FALSE TRUE FALSE 14166 2878 TRUE 29 57 It is expected that DESeq2 in general calls more DEG than DESeq, which appeared overly conservative in our testing due to the "maximum" rule of estimating dispersion. Mike On Tue, Feb 18, 2014 at 5:59 PM, Michael Love <michaelisaiahlove@gmail.com>wrote: > In general, for looking at sets of genes, you should always use adjusted > p-value. If you want to be less conservative/get longer lists, you can > raise the threshold to 0.2 etc. This makes more sense than working with > unadjusted p-values. It is a bit surprising that the intersection of the > 222 and 1008 genes by p-value is 0, as in comparisons I have seen it is > usually mostly a subset of the DESeq2 calls. You might try working with > gene names to make sure you haven't accidentally reordered rows. > > length( intersection( rownames(DESeq2Res)[which(DESeq2Res$padj < .33)], > DESeqRes$id[which(DESeqRes$padj < .33)] ) ) > > If you paste the code you are using for comparison and for DE analysis > that would help. Also knowing how many samples and conditions do you have. > > > On Tue, Feb 18, 2014 at 5:00 PM, <ssekar@tgen.org> wrote: > >> Hi Mike, >> >> Thank you for the quick reply. So when I filtered using the padj < 0.1 >> criterion, DESeq did not give me any results but DESeq2 showed 53 genes. >> But when I used the pval < 0.05 criterion and filtered, I got 1008 genes in >> DESeq2 and 222 genes in DESeq, with no overlap between the 2 lists. Is it >> because we should be looking at padj rather than pval? >> >> Thank you once again! >> >> -Shobana >> >> From: Michael Love <michaelisaiahlove@gmail.com> >> Date: Tuesday, February 18, 2014 2:49 PM >> To: "Shobana Sekar [guest]" <guest@bioconductor.org> >> Cc: "bioconductor@r-project.org" <bioconductor@r-project.org>, Shobana >> Sekar <ssekar@tgen.org> >> Subject: Re: DESeq2 >> >> hi, >> >> There is not really enough information here to give a concrete answer, >> but yes, typically DESeq2 will contain most of the DESeq genes found for a >> given adjusted p-value. If this were not the case, I would first check to >> see if the rows of the results tables are not lining up for some reason. >> How many genes are found by either software? >> >> Mike >> >> >> On Tue, Feb 18, 2014 at 4:36 PM, Shobana Sekar [guest] < >> guest@bioconductor.org> wrote: >> >>> >>> Hi, >>> >>> I've been using the DESeq as well as DESeq2 packages for my RNAseq >>> analysis. For one particular study alone, I find some discrepancies between >>> DESeq and DESeq2 results. I've read in forums that the gene list is longer >>> when we use DESeq2, which is true in my case. But what I also notice for >>> this study is that there is no overlap between the gene list obtained using >>> DESeq and DESeq2. Could you please help in understanding this difference? >>> >>> -- output of sessionInfo(): >>> >>> R version 3.0.2 (2013-09-25) >>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>> >>> locale: >>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> base >>> >>> other attached packages: >>> [1] DESeq2_1.2.8 RcppArmadillo_0.4.000 Rcpp_0.10.6 >>> GenomicRanges_1.14.4 >>> [5] XVector_0.2.0 IRanges_1.20.6 BiocGenerics_0.8.0 >>> >>> >>> -- >>> Sent via the guest posting facility at bioconductor.org. >>> >> >> > [[alternative HTML version deleted]] ADD REPLYlink written 5.7 years ago by Michael Love25k Thank you so much for the detailed answer! -Shobana From: Michael Love <michaelisaiahlove@gmail.com<mailto:michaelisaiahlove@gmail.com>> Date: Tuesday, February 25, 2014 5:01 PM To: Shobana Sekar <ssekar@tgen.org<mailto:ssekar@tgen.org>> Cc: "guest@bioconductor.org<mailto:guest@bioconductor.org>" <guest@bioconductor.org<mailto:guest@bioconductor.org>>, "bioconductor@r-project.org<mailto:bioconductor@r-project.org>" <bioconductor@r-project.org<mailto:bioconductor@r-project.org>> Subject: Re: DESeq2 hi Shobana, I'm continuing our discussion where we left off on the Bioconductor list. Note that you can save to CSV files straight from R using write.csv(). (some of your columns were exported from Excel as #NAME?) I used the following code to line up your two results tables: old <- read.delim(...,header=TRUE) new <- read.delim(...,header=TRUE) old$id <- as.character(old$id) new$id <- as.character(new$id) rownames(old) <- old$id rownames(new) <- new$id old <- old[!is.na<http: is.na="">(old$pval),] new <- new[!is.na<http: is.na="">(new$pvalue),] common <- intersect(old$id, new$id) While there were ~200 genes for DESeq with p-value less than .05, all but 1 of these are not available in the DESeq2 results: > table(old$pval < .05) FALSE TRUE 39785 222 > table(old[common,"pval"] < .05) FALSE TRUE 17129 1 That these genes have NA p-values in the DESeq2 results table is due to automatic outlier detection using Cook's distance in DESeq2. You might see how the results change if you turn this filtering off by using results(dds, cooksCutoff=FALSE). In the development branch (DESeq2 version >= 1.3), we have implemented a better solution to filtering count outliers when there are many replicates (7 or more), so genes are not set aside from the analysis (p-values set to NA) due to one or more extreme count. If you look however at the intersection using a higher p-value threshold (not recommended for analyses, just for sanity check), the intersection is not empty: table(DESeq = old[common,"pval"] < .2, DESeq2 = new[common,"pvalue"] < .2) DESeq2 DESeq FALSE TRUE FALSE 14166 2878 TRUE 29 57 It is expected that DESeq2 in general calls more DEG than DESeq, which appeared overly conservative in our testing due to the "maximum" rule of estimating dispersion. Mike On Tue, Feb 18, 2014 at 5:59 PM, Michael Love <michaelisaiahlove@gmail.com<mailto:michaelisaiahlove@gmail.com>> wrote: In general, for looking at sets of genes, you should always use adjusted p-value. If you want to be less conservative/get longer lists, you can raise the threshold to 0.2 etc. This makes more sense than working with unadjusted p-values. It is a bit surprising that the intersection of the 222 and 1008 genes by p-value is 0, as in comparisons I have seen it is usually mostly a subset of the DESeq2 calls. You might try working with gene names to make sure you haven't accidentally reordered rows. length( intersection( rownames(DESeq2Res)[which(DESeq2Res$padj < .33)], DESeqRes$id[which(DESeqRes\$padj < .33)] ) ) If you paste the code you are using for comparison and for DE analysis that would help. Also knowing how many samples and conditions do you have. On Tue, Feb 18, 2014 at 5:00 PM, <ssekar@tgen.org<mailto:ssekar@tgen.org>> wrote: Hi Mike, Thank you for the quick reply. So when I filtered using the padj < 0.1 criterion, DESeq did not give me any results but DESeq2 showed 53 genes. But when I used the pval < 0.05 criterion and filtered, I got 1008 genes in DESeq2 and 222 genes in DESeq, with no overlap between the 2 lists. Is it because we should be looking at padj rather than pval? Thank you once again! -Shobana From: Michael Love <michaelisaiahlove@gmail.com<mailto:michaelisaiahlove@gmail.com>> Date: Tuesday, February 18, 2014 2:49 PM To: "Shobana Sekar [guest]" <guest@bioconductor.org<mailto:guest@bioconductor.org>> Cc: "bioconductor@r-project.org<mailto:bioconductor@r-project.org>" <bioconductor@r-project.org<mailto:bioconductor@r-project.org>>, Shobana Sekar <ssekar@tgen.org<mailto:ssekar@tgen.org>> Subject: Re: DESeq2 hi, There is not really enough information here to give a concrete answer, but yes, typically DESeq2 will contain most of the DESeq genes found for a given adjusted p-value. If this were not the case, I would first check to see if the rows of the results tables are not lining up for some reason. How many genes are found by either software? Mike On Tue, Feb 18, 2014 at 4:36 PM, Shobana Sekar [guest] <guest@bioconductor.org<mailto:guest@bioconductor.org>> wrote: Hi, I've been using the DESeq as well as DESeq2 packages for my RNAseq analysis. For one particular study alone, I find some discrepancies between DESeq and DESeq2 results. I've read in forums that the gene list is longer when we use DESeq2, which is true in my case. But what I also notice for this study is that there is no overlap between the gene list obtained using DESeq and DESeq2. Could you please help in understanding this difference? -- output of sessionInfo(): R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] DESeq2_1.2.8 RcppArmadillo_0.4.000 Rcpp_0.10.6 GenomicRanges_1.14.4 [5] XVector_0.2.0 IRanges_1.20.6 BiocGenerics_0.8.0 -- Sent via the guest posting facility at bioconductor.org<http: bioconductor.org="">. [[alternative HTML version deleted]]