edgeR i get 377 significant genes where in DESeq i got 0
2
0
Entering edit mode
Guest User ★ 13k
@guest-user-4897
Last seen 9.6 years ago
Hi, Assuming i have 2 files: 1's have 1,000,000 reads- one condition 2's have 3,000,000 reads- second condition i used rsem for differential expression for each of the 2 files. i used both programs to analyse differential expression.(edgeR & DESeq) i have a problem in the results..(maybe you can help me..) i used this manual http://manuals.bioinformatics.ucr.edu/home/ht-seq so may script is that: library("DESeq"); conditions <- c("1","2") cds <- newCountDataSet(x,conditions) cds <- estimateSizeFactors(cds) cds <- estimateDispersions(cds,method="blind",sharingMode="maximum",fi tType="local") res <- nbinomTest(cds,condA="1",condB="2") res[order(res$padj), ][1:4,] sigDESeq <- res[res$padj <= 0.05, ] sigDESeq <- na.omit(sigDESeq) sigDESeq <- as.character(sigDESeq[,1]) library("edgeR"); group <- factor(c(1,2)) cdsR <- DGEList(counts=x, group=group) cdsR <- estimateCommonDisp(cdsR) cdsR <- estimateTagwiseDisp(cdsR) et <- exactTest(cdsR, pair=c("1", "2")) all <- as.data.frame(topTags(et, n=100000)) sigedgeR <- all[all$adj.P.Val <= 0.05, ] sigedgeR <- na.omit(sigedgeR) sigedgeR <- row.names(sigedgeR) source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/ My_R_Scripts/overLapper.R") OLlist <- overLapper(setlist=list(DESeq=sigDESeq, edgeR=sigedgeR), sep="_", type="vennsets") counts <- sapply(OLlist$Venn_List, length) vennPlot(counts=counts) counts oveLapp <- OLlist$Venn_List $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$ i want to check Differential expression between biological replicates. (here: i have only 2) 1 & 2 are biological replicates. the problem is: in edgeR i get 377 significant genes where in DESeq i got 0 (ZERO) it looks weird to me, but i do'nt know what i did wrong.. Do you know? Thanks, Pap -- output of sessionInfo(): R version 2.14.0 (2011-10-31) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] edgeR_2.4.3 limma_3.10.2 loaded via a namespace (and not attached): [1] annotate_1.32.1 AnnotationDbi_1.16.16 Biobase_2.14.0 DBI_0.2-5 DESeq_1.6.1 [6] genefilter_1.36.0 geneplotter_1.32.1 grid_2.14.0 IRanges_1.12.6 RColorBrewer_1.0-5 [11] RSQLite_0.11.1 splines_2.14.0 survival_2.36-12 tools_2.14.0 xtable_1.7-0 -- Sent via the guest posting facility at bioconductor.org.
edgeR DESeq edgeR DESeq • 1.3k views
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia
Dear Papori, Probably you haven't made a mistake. In our experience, this is the typical behaviour of the two packages. The DESeq people should be applauded for trying something different, but commonsense would tell you that setting dispersions equal to the "maximum" of two extremes is likely to be conservative, especially when there are so few replicates. Best wishes Gordon > Date: Mon, 12 Mar 2012 06:42:47 -0700 (PDT) > From: "papori [guest]" <guest at="" bioconductor.org=""> > To: bioconductor at r-project.org, drordh at gmail.com > Subject: [BioC] edgeR i get 377 significant genes where in DESeq i got 0 > > Hi, > Assuming i have 2 files: > 1's have 1,000,000 reads- one condition > 2's have 3,000,000 reads- second condition > > i used rsem for differential expression for each of the 2 files. > > i used both programs to analyse differential expression.(edgeR & DESeq) > > i have a problem in the results..(maybe you can help me..) > i used this manual http://manuals.bioinformatics.ucr.edu/home/ht-seq > so may script is that: > > library("DESeq"); > conditions <- c("1","2") > cds <- newCountDataSet(x,conditions) > cds <- estimateSizeFactors(cds) > cds <- estimateDispersions(cds,method="blind",sharingMode="maximum", fitType="local") > res <- nbinomTest(cds,condA="1",condB="2") > res[order(res$padj), ][1:4,] > > sigDESeq <- res[res$padj <= 0.05, ] > sigDESeq <- na.omit(sigDESeq) > sigDESeq <- as.character(sigDESeq[,1]) > > library("edgeR"); > group <- factor(c(1,2)) > cdsR <- DGEList(counts=x, group=group) > cdsR <- estimateCommonDisp(cdsR) > cdsR <- estimateTagwiseDisp(cdsR) > et <- exactTest(cdsR, pair=c("1", "2")) > > all <- as.data.frame(topTags(et, n=100000)) > sigedgeR <- all[all$adj.P.Val <= 0.05, ] > sigedgeR <- na.omit(sigedgeR) > sigedgeR <- row.names(sigedgeR) > > source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/ > My_R_Scripts/overLapper.R") > OLlist <- overLapper(setlist=list(DESeq=sigDESeq, edgeR=sigedgeR), > sep="_", type="vennsets") > counts <- sapply(OLlist$Venn_List, length) > vennPlot(counts=counts) > counts > > oveLapp <- OLlist$Venn_List > > $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$ > > i want to check Differential expression between biological replicates. > (here: i have only 2) > 1 & 2 are biological replicates. > > the problem is: > in edgeR i get 377 significant genes where in DESeq i got 0 (ZERO) > > it looks weird to me, but i do'nt know what i did wrong.. > > Do you know? > > > Thanks, > Pap > > -- output of sessionInfo(): > > R version 2.14.0 (2011-10-31) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] edgeR_2.4.3 limma_3.10.2 > > loaded via a namespace (and not attached): > [1] annotate_1.32.1 AnnotationDbi_1.16.16 Biobase_2.14.0 DBI_0.2-5 DESeq_1.6.1 > [6] genefilter_1.36.0 geneplotter_1.32.1 grid_2.14.0 IRanges_1.12.6 RColorBrewer_1.0-5 > [11] RSQLite_0.11.1 splines_2.14.0 survival_2.36-12 tools_2.14.0 xtable_1.7-0 > > -- > Sent via the guest posting facility at bioconductor.org. > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
ADD COMMENT
0
Entering edit mode
Hi On 2012-03-14 00:16, Gordon K Smyth wrote: > Probably you haven't made a mistake. In our experience, this is the > typical behaviour of the two packages. > > The DESeq people should be applauded for trying something different, but > commonsense would tell you that setting dispersions equal to the > "maximum" of two extremes is likely to be conservative, especially when > there are so few replicates. Of course, I need to chime in here. First of all, the poster did indeed make a mistake, see below. Second, in our experience, the use of the maximum rule, while admittedly looking overly conservative, costs surprisingly little power for typical data. We performed simulations that convinced at least us that this simple scheme is more robust than the WML scheme in edgeR, which, in our hands, failed to control type-I error when presented with simulated data with few replicates when the dispersion values were drawn from rather wide distributions. Of course, a systematic comparison is still lacking and should maybe be done by somebody more unbiased. In my opinion, such a comparison should be based on a simulation study that tests how the methods deal with simulated data with true dispersion values drawn from distributions of different shapes and widths, modeled after real data where available. Now, to Pap's data set: >> Hi, >> Assuming i have 2 files: >> 1's have 1,000,000 reads- one condition >> 2's have 3,000,000 reads- second condition Pap has two samples in total, not two replicates per condition, and so the whole discussion above is not applicable anyway. In this case, a proper statistical analysis is not possible. We can try to get at least something with workarounds, though. EdgeR used to switch to Poisson mode if presented with data without replication, i.e., it assumed zero biological variation, which, of course, leads to a large number of hits, which one cannot expect to be reproducible. Given that there are only 377 hits, this seems to have changed, and the edgeR authors will be able to comment on that. DESeq offers the method "blind" to deal with data without replicates, where it assumes that most genes are not differentially expressed and hence estimates the dispersion from a comparison across the two samples. Only those genes that "stick out" by showing much stronger differences than most genes will be reported. This method cannot be combined with the "maximum" rule, because then, every gene that is "sticking out" would be compared to itself. This is why this command here >> estimateDispersions(cds,method="blind",sharingMode="maximum",fitTyp e="local") produces a warning informing the user that 'method="blind"' should only be used together with 'sharingMode="fit-only"'. Pap, you may have overlooked this warning. Maybe, I should maybe change it to an error. Please try again with 'sharingMode="fit-only"' and let us know what you get. Simon
ADD REPLY
0
Entering edit mode
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia
Dear Simon, On re-reading the original post, you may well be right that Pap has no biological replicates, in which case he or she has ignored warning messages from both our packages, and the discussion is not very interesting. There's ambiguity in the post -- it does mention biological replicates and, although there are only two files, a file may contain more than one library. I have been disappointed by users ignoring the warning message from edgeR about lack of replication, so I recently changed the developmental version of edgeR so that it returns NA dispersion when there is no replication, so a user is now forced to think more about what they are doing and take one of the actions outlined in the edge User's Guide. You seem to be suggesting that edgeR has inserted a numeric dispersion value different from zero in the case of no replication. If that is what you are suggesting, then you are incorrect. My comment about the conservatism of DESeq stands, as it is our experience. I think that you and I have a disagreement about what constitutes "typical data". In my opinion, some of the datasets being publicly presented as examples of RNA-Seq data show obvious signs that there are missing explanatory variables such as batch effects. Such data will produce outlier dispersions. You may wish to design an algorithm for these circumstances. My concern however is with experimental data that my biomedical collaborators present to me, and we take pains to avoid important uncontrolled factors or to identify and model them. Best wishes Gordon > Date: Wed, 14 Mar 2012 08:41:12 +0100 > From: Simon Anders <anders at="" embl.de=""> > To: bioconductor at r-project.org > Subject: Re: [BioC] edgeR i get 377 significant genes where in DESeq i > got 0 > > Hi > > On 2012-03-14 00:16, Gordon K Smyth wrote: >> Probably you haven't made a mistake. In our experience, this is the >> typical behaviour of the two packages. >> >> The DESeq people should be applauded for trying something different, but >> commonsense would tell you that setting dispersions equal to the >> "maximum" of two extremes is likely to be conservative, especially when >> there are so few replicates. > > Of course, I need to chime in here. > > First of all, the poster did indeed make a mistake, see below. > > Second, in our experience, the use of the maximum rule, while admittedly > looking overly conservative, costs surprisingly little power for typical > data. We performed simulations that convinced at least us that this > simple scheme is more robust than the WML scheme in edgeR, which, in our > hands, failed to control type-I error when presented with simulated data > with few replicates when the dispersion values were drawn from rather > wide distributions. Of course, a systematic comparison is still lacking > and should maybe be done by somebody more unbiased. In my opinion, such > a comparison should be based on a simulation study that tests how the > methods deal with simulated data with true dispersion values drawn from > distributions of different shapees and widths, modeled after real data > where available. > > Now, to Pap's data set: > >>> Hi, >>> Assuming i have 2 files: >>> 1's have 1,000,000 reads- one condition >>> 2's have 3,000,000 reads- second condition > > Pap has two samples in total, not two replicates per condition, and so > the whole discussion above is not applicable anyway. > > In this case, a proper statistical analysis is not possible. We can try > to get at least something with workarounds, though. > > EdgeR used to switch to Poisson mode if presented with data without > replication, i.e., it assumed zero biological variation, which, of > course, leads to a large number of hits, which one cannot expect to be > reproducible. Given that there are only 377 hits, this seems to have > changed, and the edgeR authors will be able to comment on that. > > DESeq offers the method "blind" to deal with data without replicates, > where it assumes that most genes are not differentially expressed and > hence estimates the dispersion from a comparison across the two samples. > Only those genes that "stick out" by showing much stronger differences > than most genes will be reported. > > This method cannot be combined with the "maximum" rule, because then, > every gene that is "sticking out" would be compared to itself. > > This is why this command here > >>> estimateDispersions(cds,method="blind",sharingMode="maximum",fitTy pe="local") > > produces a warning informing the user that 'method="blind"' should only > be used together with 'sharingMode="fit-only"'. > > Pap, you may have overlooked this warning. Maybe, I should maybe change > it to an error. > > Please try again with 'sharingMode="fit-only"' and let us know what you get. > > Simon > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
ADD COMMENT

Login before adding your answer.

Traffic: 1059 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6