DESeq2; FoldChange values with only 0's
1
0
Entering edit mode
@hanneke-van-deutekom-6220
Last seen 9.6 years ago
Hi, My experiment consist of control and diseased samples; looking like this: colData<-DataFrame(condition=factor(c("A","A","B","B","C","C","D","D", "E","E","F","F","H","H","I","I")), type=factor(c("plasma","plasma","plasma","plasma","plasma","plasma","p lasma","plasma","tissue","tissue","tissue","tissue","tissue","tissue", "tissue","tissue")), diseased=factor(c("no","no","no","no","yes","yes","yes","yes","no","no ","no","no","yes","yes","yes","yes"))) To search for differential expression I compare AvsB, AvsC, BvsD and BvsD then EvsF, EvsH, FvsI and HvsI, So I use the results(dds,contrast=c("condition","H","I")) option in DESeq2. And then, these are the results I do not get; look at this example: geneid comparison log2foldchange padj countA1 countA2 countC1 countC2 (counts are the normalized values) xxx AC 2.40588491829641 0.281844992240883 0 0 0 0 yyy AB 3.98976903585264 0.0455960023534269 0 0 0 0 or .. a log2fold change has been found between A and C of 2.4 and almost 4 in AvsB, while all of these samples are. I noticed that in all cases where the counts and the normalized counts of the two conditions that I tested were all zero I do get a fold change, even up to (2log) 5. Often I see a padj of "NA". So I was satisfied with that, as pseudocounts could result in some kind of fold change. Leaving those with padj "NA" out would be good then. But as the example above, I sometimes get a padj, which is even significant in the AB example above. This propably has something to do with all the other values (A1, A2, B1, B2,....,I1,I2); yyy 0 0 0 0 0 0 0 11 4897697 4167458 4337956 4139125 3731919 9711412 4043504 3796036 (i.e. massively present in tissue samples but not in blood, because of the discrepancies between blood and tissue I am not comparing them directly) I would like to understand why there is a significance here between A and B?? Or should I 'just' split my data in tissue and plasma?? But then still. I will find the AC example above, as B and D have counts in that example, however small numbers (E-I are also small numbers) Hanneke > sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] DESeq2_1.2.2 RcppArmadillo_0.3.910.0 Rcpp_0.10.4 [4] GenomicRanges_1.14.3 XVector_0.2.0 IRanges_1.20.4 [7] BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] annotate_1.40.0 AnnotationDbi_1.24.0 Biobase_2.22.0 [4] DBI_0.2-7 genefilter_1.44.0 grid_3.0.2 [7] lattice_0.20-23 locfit_1.5-9.1 RColorBrewer_1.0-5 [10] RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2 [13] survival_2.37-4 XML_3.98-1.1 xtable_1.7-1 [[alternative HTML version deleted]]
• 867 views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 9 hours ago
United States
hi Hanneke, That looks like a bug. Could you email me your data and a reproducible example so I can look into this? Can you also try rerunning DESeq() with betaPrior=FALSE? thanks, Mike On Fri, Nov 22, 2013 at 9:48 AM, Hanneke van Deutekom < hwmvandeutekom@gmail.com> wrote: > Hi, > > > My experiment consist of control and diseased samples; looking like this: > > > colData<-DataFrame(condition=factor(c("A","A","B","B","C","C","D","D ","E","E","F","F","H","H","I","I")), > > type=factor(c("plasma","plasma","plasma","plasma","plasma","plasma", "plasma","plasma","tissue","tissue","tissue","tissue","tissue","tissue ","tissue","tissue")), > > diseased=factor(c("no","no","no","no","yes","yes","yes","yes","no"," no","no","no","yes","yes","yes","yes"))) > > To search for differential expression I compare > AvsB, AvsC, BvsD and BvsD > then > EvsF, EvsH, FvsI and HvsI, > > So I use the results(dds,contrast=c("condition","H","I")) option in DESeq2. > > And then, these are the results I do not get; > look at this example: > > geneid comparison log2foldchange padj countA1 countA2 countC1 countC2 > (counts are the normalized values) > xxx AC 2.40588491829641 0.281844992240883 0 0 0 0 > yyy AB 3.98976903585264 0.0455960023534269 0 0 0 0 > or .. a log2fold change has been found between A and C of 2.4 and almost 4 > in AvsB, while all of these samples are. > > I noticed that in all cases where the counts and the normalized counts of > the two conditions that I tested were all zero I do get a fold change, even > up to (2log) 5. Often I see a padj of "NA". So I was satisfied with that, > as pseudocounts could result in some kind of fold change. Leaving those > with padj "NA" out would be good then. > But as the example above, I sometimes get a padj, which is even significant > in the AB example above. This propably has something to do with all the > other values (A1, A2, B1, B2,....,I1,I2); > yyy 0 0 0 0 0 0 0 11 4897697 4167458 > 4337956 4139125 3731919 9711412 4043504 3796036 (i.e. > massively present in tissue samples but not in blood, because of the > discrepancies between blood and tissue I am not comparing them directly) > > I would like to understand why there is a significance here between A and > B?? > Or should I 'just' split my data in tissue and plasma?? But then still. I > will find the AC example above, as B and D have counts in that example, > however small numbers (E-I are also small numbers) > > > Hanneke > > > > > > > sessionInfo() > R version 3.0.2 (2013-09-25) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] DESeq2_1.2.2 RcppArmadillo_0.3.910.0 Rcpp_0.10.4 > [4] GenomicRanges_1.14.3 XVector_0.2.0 IRanges_1.20.4 > [7] BiocGenerics_0.8.0 > > loaded via a namespace (and not attached): > [1] annotate_1.40.0 AnnotationDbi_1.24.0 Biobase_2.22.0 > [4] DBI_0.2-7 genefilter_1.44.0 grid_3.0.2 > [7] lattice_0.20-23 locfit_1.5-9.1 RColorBrewer_1.0-5 > [10] RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2 > [13] survival_2.37-4 XML_3.98-1.1 xtable_1.7-1 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
hi Hanneke, Never mind about the example, I can reproduce this and will look into it. thanks, Mike On Fri, Nov 22, 2013 at 10:01 AM, Michael Love <michaelisaiahlove@gmail.com>wrote: > hi Hanneke, > > That looks like a bug. Could you email me your data and a reproducible > example so I can look into this? > > Can you also try rerunning DESeq() with betaPrior=FALSE? > > thanks, > > Mike > > > On Fri, Nov 22, 2013 at 9:48 AM, Hanneke van Deutekom < > hwmvandeutekom@gmail.com> wrote: > >> Hi, >> >> >> My experiment consist of control and diseased samples; looking like this: >> >> >> colData<-DataFrame(condition=factor(c("A","A","B","B","C","C","D"," D","E","E","F","F","H","H","I","I")), >> >> type=factor(c("plasma","plasma","plasma","plasma","plasma","plasma" ,"plasma","plasma","tissue","tissue","tissue","tissue","tissue","tissu e","tissue","tissue")), >> >> diseased=factor(c("no","no","no","no","yes","yes","yes","yes","no", "no","no","no","yes","yes","yes","yes"))) >> >> To search for differential expression I compare >> AvsB, AvsC, BvsD and BvsD >> then >> EvsF, EvsH, FvsI and HvsI, >> >> So I use the results(dds,contrast=c("condition","H","I")) option in >> DESeq2. >> >> And then, these are the results I do not get; >> look at this example: >> >> geneid comparison log2foldchange padj countA1 countA2 countC1 countC2 >> (counts are the normalized values) >> xxx AC 2.40588491829641 0.281844992240883 0 0 0 0 >> yyy AB 3.98976903585264 0.0455960023534269 0 0 0 0 >> or .. a log2fold change has been found between A and C of 2.4 and almost 4 >> in AvsB, while all of these samples are. >> >> I noticed that in all cases where the counts and the normalized counts of >> the two conditions that I tested were all zero I do get a fold change, >> even >> up to (2log) 5. Often I see a padj of "NA". So I was satisfied with that, >> as pseudocounts could result in some kind of fold change. Leaving those >> with padj "NA" out would be good then. >> But as the example above, I sometimes get a padj, which is even >> significant >> in the AB example above. This propably has something to do with all the >> other values (A1, A2, B1, B2,....,I1,I2); >> yyy 0 0 0 0 0 0 0 11 4897697 4167458 >> 4337956 4139125 3731919 9711412 4043504 3796036 (i.e. >> massively present in tissue samples but not in blood, because of the >> discrepancies between blood and tissue I am not comparing them directly) >> >> I would like to understand why there is a significance here between A and >> B?? >> Or should I 'just' split my data in tissue and plasma?? But then still. I >> will find the AC example above, as B and D have counts in that example, >> however small numbers (E-I are also small numbers) >> >> >> Hanneke >> >> >> >> >> >> > sessionInfo() >> R version 3.0.2 (2013-09-25) >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> [8] base >> >> other attached packages: >> [1] DESeq2_1.2.2 RcppArmadillo_0.3.910.0 Rcpp_0.10.4 >> [4] GenomicRanges_1.14.3 XVector_0.2.0 IRanges_1.20.4 >> [7] BiocGenerics_0.8.0 >> >> loaded via a namespace (and not attached): >> [1] annotate_1.40.0 AnnotationDbi_1.24.0 Biobase_2.22.0 >> [4] DBI_0.2-7 genefilter_1.44.0 grid_3.0.2 >> [7] lattice_0.20-23 locfit_1.5-9.1 RColorBrewer_1.0-5 >> [10] RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2 >> [13] survival_2.37-4 XML_3.98-1.1 xtable_1.7-1 >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
hi Hanneke, Thanks for pointing this out. It is a general issue with asymmetric behavior of the prior on log2 fold changes for factors with 3 or more levels, which we are interested in fixing. Other users had noted that the log2 fold changes and p-values for a factor with 3 or more levels can change if the base level of that factor is changed. I will start implementing a solution for this in the development branch (v1.3), but I don't want to disrupt functionality in the release branch (v1.2). My recommendation for version 1.2 is to use DESeq(dds, betaPrior=FALSE) if the factor you are interested in generating results for has 3 or more levels. If you have a design "~ group + condition", where group has 3 levels and condition has 2 levels, and you were only interested in results for condition, then you could continue to use the default settings. I pushed out a new version to the release branch, 1.2.6, which provides this message to the user if a DESeqDataSet is constructed with a factor with 3 or more levels. Usage note: the following factors have 3 or more levels: > group > For DESeq2 versions < 1.3, if you plan on extracting results for > these factors, we recommend using betaPrior=FALSE as an argument > when calling DESeq(). > As currently implemented in version 1.2, the log2 fold changes can > vary if the base level is changed, when extracting results for a > factor with 3 or more levels. A solution will be implemented in > version 1.3 which allows for the use of a beta prior and symmetric > log2 fold change estimates regardless of the selection of base level. Mike On Fri, Nov 22, 2013 at 10:20 AM, Michael Love <michaelisaiahlove@gmail.com>wrote: > hi Hanneke, > > Never mind about the example, I can reproduce this and will look into it. > > thanks, > > Mike > > > On Fri, Nov 22, 2013 at 10:01 AM, Michael Love < > michaelisaiahlove@gmail.com> wrote: > >> hi Hanneke, >> >> That looks like a bug. Could you email me your data and a reproducible >> example so I can look into this? >> >> Can you also try rerunning DESeq() with betaPrior=FALSE? >> >> thanks, >> >> Mike >> >> >> On Fri, Nov 22, 2013 at 9:48 AM, Hanneke van Deutekom < >> hwmvandeutekom@gmail.com> wrote: >> >>> Hi, >>> >>> >>> My experiment consist of control and diseased samples; looking like this: >>> >>> >>> colData<-DataFrame(condition=factor(c("A","A","B","B","C","C","D", "D","E","E","F","F","H","H","I","I")), >>> >>> type=factor(c("plasma","plasma","plasma","plasma","plasma","plasma ","plasma","plasma","tissue","tissue","tissue","tissue","tissue","tiss ue","tissue","tissue")), >>> >>> diseased=factor(c("no","no","no","no","yes","yes","yes","yes","no" ,"no","no","no","yes","yes","yes","yes"))) >>> >>> To search for differential expression I compare >>> AvsB, AvsC, BvsD and BvsD >>> then >>> EvsF, EvsH, FvsI and HvsI, >>> >>> So I use the results(dds,contrast=c("condition","H","I")) option in >>> DESeq2. >>> >>> And then, these are the results I do not get; >>> look at this example: >>> >>> geneid comparison log2foldchange padj countA1 countA2 countC1 countC2 >>> (counts are the normalized values) >>> xxx AC 2.40588491829641 0.281844992240883 0 0 0 0 >>> yyy AB 3.98976903585264 0.0455960023534269 0 0 0 0 >>> or .. a log2fold change has been found between A and C of 2.4 and almost >>> 4 >>> in AvsB, while all of these samples are. >>> >>> I noticed that in all cases where the counts and the normalized counts of >>> the two conditions that I tested were all zero I do get a fold change, >>> even >>> up to (2log) 5. Often I see a padj of "NA". So I was satisfied with that, >>> as pseudocounts could result in some kind of fold change. Leaving those >>> with padj "NA" out would be good then. >>> But as the example above, I sometimes get a padj, which is even >>> significant >>> in the AB example above. This propably has something to do with all the >>> other values (A1, A2, B1, B2,....,I1,I2); >>> yyy 0 0 0 0 0 0 0 11 4897697 4167458 >>> 4337956 4139125 3731919 9711412 4043504 3796036 (i.e. >>> massively present in tissue samples but not in blood, because of the >>> discrepancies between blood and tissue I am not comparing them directly) >>> >>> I would like to understand why there is a significance here between A and >>> B?? >>> Or should I 'just' split my data in tissue and plasma?? But then still. I >>> will find the AC example above, as B and D have counts in that example, >>> however small numbers (E-I are also small numbers) >>> >>> >>> Hanneke >>> >>> >>> >>> >>> >>> > sessionInfo() >>> R version 3.0.2 (2013-09-25) >>> Platform: x86_64-pc-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> [8] base >>> >>> other attached packages: >>> [1] DESeq2_1.2.2 RcppArmadillo_0.3.910.0 Rcpp_0.10.4 >>> [4] GenomicRanges_1.14.3 XVector_0.2.0 IRanges_1.20.4 >>> [7] BiocGenerics_0.8.0 >>> >>> loaded via a namespace (and not attached): >>> [1] annotate_1.40.0 AnnotationDbi_1.24.0 Biobase_2.22.0 >>> [4] DBI_0.2-7 genefilter_1.44.0 grid_3.0.2 >>> [7] lattice_0.20-23 locfit_1.5-9.1 RColorBrewer_1.0-5 >>> [10] RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2 >>> [13] survival_2.37-4 XML_3.98-1.1 xtable_1.7-1 >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 697 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6