Question: DESeq2 high fold change values for comparison of samples with zero counts
0
5.3 years ago by
Noa Henig50
Noa Henig50 wrote:
Hi DESeq developers, I am running DESeq2 1.2.8 to receive differential expression in 2 different cell types. For each cell type I have control samples, over-expression of gene A, and over-expression of gene B, while the second cell type is used in order to support the findings from the first cell type. I have 2 questions: First, I found some genes that had zero normalized counts in 7 out of 8 replicates (the last one had 1 count or 1.6 counts). The fold change for comparison for such two genes was -13.4 and 2.47. Since the 'condition' has more than 3 levels I set the betaPrior to False as Michael indicated in the earlier correspondence (the code appears below). Can you tell the reason for this this and how to correct this? In addition, I received the following warning: Warning message: In parametricDispersionFit(mcols(objectNZ)$baseMean[useForFit], : the parametric dispersion fit did not converge, which occurs when the dispersion estimates over the mean do not fit to the curve y = a/x + b. You should re-run the analysis using fitType='local' or 'mean' (DESeq2 > v1.2 will re-run automatically). When using local regression fit, the user should examine plotDispEsts(dds) to make sure the fitted line is not sharply curving up or down based on the position of individual points. I guess that by running version 1.2.8 the additional fit types were ran automatically but can I see the actual code that was ran ? This is the sessionInfo: R version 3.0.2 (2013-09-25) Platform: i386-w64-mingw32/i386 (32-bit) locale: [1] LC_COLLATE=Hebrew_Israel.1255 LC_CTYPE=Hebrew_Israel.1255 LC_MONETARY=Hebrew_Israel.1255 [4] LC_NUMERIC=C LC_TIME=Hebrew_Israel.1255 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] DESeq2_1.2.8 RcppArmadillo_0.4.000 Rcpp_0.10.6 GenomicRanges_1.14.4 XVector_0.2.0 [6] IRanges_1.20.6 BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] annotate_1.40.0 AnnotationDbi_1.24.0 Biobase_2.22.0 DBI_0.2-7 genefilter_1.44.0 [6] grid_3.0.2 lattice_0.20-23 locfit_1.5-9.1 RColorBrewer_1.0-5 RSQLite_0.11.4 [11] splines_3.0.2 stats4_3.0.2 survival_2.37-4 XML_3.98-1.1 xtable_1.7-1 And this is the code I was running: tmpcountData = read.table(countsTable, header=TRUE, row.names=1) numOfConds=6 samplesList = c("U_V0", "U_V0","U_V0", "U_V0", "U_V0", "U_V0", "U_K1", "U_K1", "U_K1", "U_K1", "U_p5", "U_p5", "M_V0", "M_V0", "M_V0", "M_K1","M_K1", "M_p5", "M_p5") countData = tmpcountData[,3:21] # counts table contains 2 more columns before the counts colData = data.frame(row.names=colnames(countData), condition = samplesList) conditionsList = c("U_V0","U_K1", "U_p5","M_V0","M_K1","M_p5") colData$condition = factor( colData$condition, levels = conditionsList) dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~condition) dds <- estimateSizeFactors(dds) dds <- estimateDispersions(dds, maxit = 100) dds <- nbinomWaldTest(dds, betaPrior = FALSE) png("plotDispEsts_final.png") plotDispEsts(dds) dev.off() tmp_counts = counts(dds, normalized=TRUE) normalized_counts = data.frame(tmp_counts) ... reading the columns of counts for comparisons and writing the results to a table.... Thanks! Noa Henig On Tue, Jan 7, 2014 at 4:21 PM, Michael Love <michaelisaiahlove at="" gmail.com="">wrote: > Hi Noa, > > Thanks for reporting in. This sounds like the same problem that a user > had in November: > > > http://permalink.gmane.org/gmane.science.biology.informatics.conduct or/51749 > > > If you are using the updated release version (>= 1.2.6), there should > have been a warning printed when you build the DESeqDataSet that you should > set betaPrior=FALSE because factors are present in the design with 3 or > more levels. > > In the devel branch of DESeq2 (v1.3), we have implemented a solution that > allows using a prior on log fold changes for factors with 3 or more levels > results. Then DESeq() or nbinomWaldTest() will provide symmetric results > regardless of the base level and it takes care of this situation with > nonzero log fold changes from contrasts in which both conditions have zeros. > > With future questions please > post your full code, the output of sessionInfo() > , as this helps us provide better answers. > > Mike > On Jan 7, 2014 8:49 AM, "Noa Henig" <ntzunz at="" gmail.com=""> wrote: > >> Hi, >> I am using DESeq2 to get differential expression of 45 samples which >> represent 26 different conditions. >> I read that the calculation of the fold change in changed DESeq2 and >> affects the genes having low counts more than the highly expressed genes. >> However, I found genes that had zero counts in series of 4 samples (2 >> replicates in each of 2 conditions), while their log2fold change was 3. >> what is the reason for that? or how can this be prevented? >> >> I used the nbinomWaldTest and did not replace outliers with the trimmed >> mean since I have only 2 replicates per condition. >> >> I'd appreciate your help very much, >> Noa Henig >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > -------------- next part -------------- A non-text attachment was scrubbed... Name: plotDispEsts_default.png Type: image/png Size: 15011 bytes Desc: not available URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140209="" e42fa806="" attachment.png=""> -------------- next part -------------- A non-text attachment was scrubbed... Name: plotDispEsts_mean.png Type: image/png Size: 14656 bytes Desc: not available URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140209="" e42fa806="" attachment-0001.png=""> -------------- next part -------------- A non-text attachment was scrubbed... Name: plotDispEsts_local.png Type: image/png Size: 14842 bytes Desc: not available URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140209="" e42fa806="" attachment-0002.png=""> regression deseq deseq2 • 1.7k views ADD COMMENTlink modified 5.3 years ago by Michael Love23k • written 5.3 years ago by Noa Henig50 Answer: DESeq2 high fold change values for comparison of samples with zero counts 0 5.3 years ago by Michael Love23k United States Michael Love23k wrote: hi Noa, On Sun, Feb 9, 2014 at 10:25 AM, Noa Henig <ntzunz@gmail.com> wrote: > Hi DESeq developers, > I am running DESeq2 1.2.8 to receive differential expression in 2 different > cell types. For each cell type I have control samples, over- expression of > gene A, and over-expression of gene B, while the second cell type is used > in order to support the findings from the first cell type. > > I have 2 questions: > First, I found some genes that had zero normalized counts in 7 out of 8 > replicates (the last one had 1 count or 1.6 counts). The fold change for > comparison for such two genes was -13.4 and 2.47. Since the 'condition' has > more than 3 levels I set the betaPrior to False as Michael indicated in the > earlier correspondence (the code appears below). > Can you tell the reason for this this and how to correct this? > âI don't understand what is unexpected here. It looks like you are not following my recommendation in a previous email to use a design formula with celltype and condition, e.g.: ~ celltype + condition where celltype would be a column in colData with levels U or M, and condition would be a column in colData with levels V0, K1 or p5. When one of the levels specified in the contrast has zero counts, and we do not use a prior on log fold change, the log fold change estimate will slope toward + or - infinity, but the algorithm will stop if the increase in the likelihood is less than a threshold (as does the glm() function in base R), or if the estimate passes fold change of 1024 or 1/1024. You don't need to correct these estimates, they will either be significant or not by adjusted p-value (as can be seen in the MA-plot) depending on the estimate of dispersion. There is typically a curved boundary region in the MA-plot where fold changes from low count genes are not significant by adjusted p-value, while fold change from high count genes are âsignificant.â âMikeâ > > In addition, I received the following warning: > Warning message: > In parametricDispersionFit(mcols(objectNZ)$baseMean[useForFit], : > the parametric dispersion fit did not converge, > which occurs when the dispersion estimates over the mean do not fit to the > curve y = a/x + b. You should re-run the analysis using fitType='local' > or 'mean' (DESeq2 > v1.2 will re-run automatically). > When using local regression fit, the user should examine plotDispEsts(dds) > to make sure the fitted line is not sharply curving up or down based on > the position of individual points. > > I guess that by running version 1.2.8 the additional fit types were ran > automatically but can I see the actual code that was ran ? > > âThis error message is admittedly a bit confusing, and I've improved this in the devel branch. Better would be to say: fitType="parametric" didn't work. The dispersion estimates do not fit well to the parametrization.â So fitType="local" was automatically substituted. You can avoid this warning if you rerun the analysis by specifying fitType="local" or "mean" from the start. > This is the sessionInfo: > R version 3.0.2 (2013-09-25) > Platform: i386-w64-mingw32/i386 (32-bit) > > locale: > [1] LC_COLLATE=Hebrew_Israel.1255 LC_CTYPE=Hebrew_Israel.1255 > LC_MONETARY=Hebrew_Israel.1255 > [4] LC_NUMERIC=C LC_TIME=Hebrew_Israel.1255 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > base > > other attached packages: > [1] DESeq2_1.2.8 RcppArmadillo_0.4.000 Rcpp_0.10.6 > GenomicRanges_1.14.4 XVector_0.2.0 > [6] IRanges_1.20.6 BiocGenerics_0.8.0 > > loaded via a namespace (and not attached): > [1] annotate_1.40.0 AnnotationDbi_1.24.0 Biobase_2.22.0 > DBI_0.2-7 genefilter_1.44.0 > [6] grid_3.0.2 lattice_0.20-23 locfit_1.5-9.1 > RColorBrewer_1.0-5 RSQLite_0.11.4 > [11] splines_3.0.2 stats4_3.0.2 survival_2.37-4 > XML_3.98-1.1 xtable_1.7-1 > > And this is the code I was running: > > tmpcountData = read.table(countsTable, header=TRUE, row.names=1) > > numOfConds=6 > > samplesList = c("U_V0", "U_V0","U_V0", "U_V0", "U_V0", "U_V0", "U_K1", > "U_K1", "U_K1", "U_K1", "U_p5", "U_p5", "M_V0", "M_V0", "M_V0", > "M_K1","M_K1", "M_p5", "M_p5") > > countData = tmpcountData[,3:21] # counts table contains 2 more columns > before the counts > > colData = data.frame(row.names=colnames(countData), condition = > samplesList) > > conditionsList = c("U_V0","U_K1", "U_p5","M_V0","M_K1","M_p5") > > colData$condition = factor( colData$condition, levels = > conditionsList) > > dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, > design = ~condition) > > dds <- estimateSizeFactors(dds) > > dds <- estimateDispersions(dds, maxit = 100) > > dds <- nbinomWaldTest(dds, betaPrior = FALSE) > > png("plotDispEsts_final.png") > > plotDispEsts(dds) > > dev.off() > > tmp_counts = counts(dds, normalized=TRUE) > > normalized_counts = data.frame(tmp_counts) > > > ... reading the columns of counts for comparisons and writing the results > to > a table.... > > > Thanks! > Noa Henig > > > > > > On Tue, Jan 7, 2014 at 4:21 PM, Michael Love <michaelisaiahlove@gmail.com> >wrote: > > > Hi Noa, > > > > Thanks for reporting in. This sounds like the same problem that a user > > had in November: > > > > > > > http://permalink.gmane.org/gmane.science.biology.informatics.conduct or/51749 > > > > > > If you are using the updated release version (>= 1.2.6), there should > > have been a warning printed when you build the DESeqDataSet that you > should > > set betaPrior=FALSE because factors are present in the design with 3 or > > more levels. > > > > In the devel branch of DESeq2 (v1.3), we have implemented a solution that > > allows using a prior on log fold changes for factors with 3 or more > levels > > results. Then DESeq() or nbinomWaldTest() will provide symmetric results > > regardless of the base level and it takes care of this situation with > > nonzero log fold changes from contrasts in which both conditions have > zeros. > > > > With future questions please > > post your full code, the output of sessionInfo() > > , as this helps us provide better answers. > > > > Mike > > On Jan 7, 2014 8:49 AM, "Noa Henig" <ntzunz@gmail.com> wrote: > > > >> Hi, > >> I am using DESeq2 to get differential expression of 45 samples which > >> represent 26 different conditions. > >> I read that the calculation of the fold change in changed DESeq2 and > >> affects the genes having low counts more than the highly expressed > genes. > >> However, I found genes that had zero counts in series of 4 samples (2 > >> replicates in each of 2 conditions), while their log2fold change was 3. > >> what is the reason for that? or how can this be prevented? > >> > >> I used the nbinomWaldTest and did not replace outliers with the trimmed > >> mean since I have only 2 replicates per condition. > >> > >> I'd appreciate your help very much, > >> Noa Henig > >> > >> [[alternative HTML version deleted]] > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor@r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
Sorry, I left out the interaction term in my earlier response, which is important in order to test if the effect of condition is different in the second cell type as in the first cell type. On Sun, Feb 9, 2014 at 12:31 PM, Michael Love <michaelisaiahlove@gmail.com>wrote: > hi Noa, > > > On Sun, Feb 9, 2014 at 10:25 AM, Noa Henig <ntzunz@gmail.com> wrote: > >> Hi DESeq developers, >> I am running DESeq2 1.2.8 to receive differential expression in 2 >> different >> cell types. For each cell type I have control samples, over- expression of >> gene A, and over-expression of gene B, while the second cell type is used >> in order to support the findings from the first cell type. >> >> I have 2 questions: >> First, I found some genes that had zero normalized counts in 7 out of 8 >> replicates (the last one had 1 count or 1.6 counts). The fold change for >> comparison for such two genes was -13.4 and 2.47. Since the 'condition' >> has >> more than 3 levels I set the betaPrior to False as Michael indicated in >> the >> earlier correspondence (the code appears below). >> Can you tell the reason for this this and how to correct this? >> > > âI don't understand what is unexpected here. It looks like you are not > following my recommendation in a previous email to use a design formula > with celltype and condition, e.g.: > > ~ celltype + condition > I would recommend you use for this experiment the design formula:ââ â~ celltype + condition + celltype:condition Let's say that celltype has two levels "Cell1" and "Cell2", and condition has three levels "Level1", "Level2", "Level3". Then to examine the effect of one of the levels of condition you would use, e.g.: results(dds, name="condition_Level2_vs_Level1") ...or equivalently: results(dds, contrast=c("condition","Level2","Level1")â) âYou can compare any pairs of levels using the contrast argument.â âIn order to test if the second cell type had a different effect of condition as the first cell-type, you would use: results(dds, name="celltypeCell2:conditionLevel2â") ...for the level2 effect and, results(dds, name="celltypeCell2:conditionLevel3") ...for the level3 effect. âMikeâ > > where celltype would be a column in colData with levels U or M, and > condition would be a column in colData with levels V0, K1 or p5. > > When one of the levels specified in the contrast has zero counts, and we > do not use a prior on log fold change, the log fold change estimate will > slope toward + or - infinity, but the algorithm will stop if the increase > in the likelihood is less than a threshold (as does the glm() function in > base R), or if the estimate passes fold change of 1024 or 1/1024. > > You don't need to correct these estimates, they will either be significant > or not by adjusted p-value (as can be seen in the MA-plot) depending on the > estimate of dispersion. There is typically a curved boundary region in the > MA-plot where fold changes from low count genes are not significant by > adjusted p-value, while fold change from high count genes are âsignificant.â > > > âMikeâ > > > >> >> In addition, I received the following warning: >> Warning message: >> In parametricDispersionFit(mcols(objectNZ)$baseMean[useForFit], : >> the parametric dispersion fit did not converge, >> which occurs when the dispersion estimates over the mean do not fit to the >> curve y = a/x + b. You should re-run the analysis using fitType='local' >> or 'mean' (DESeq2 > v1.2 will re-run automatically). >> When using local regression fit, the user should examine plotDispEsts(dds) >> to make sure the fitted line is not sharply curving up or down based on >> the position of individual points. >> >> I guess that by running version 1.2.8 the additional fit types were ran >> automatically but can I see the actual code that was ran ? >> >> > âThis error message is admittedly a bit confusing, and I've improved this > in the devel branch. Better would be to say: fitType="parametric" didn't > work. The dispersion estimates do not fit well to the parametrization.â So > fitType="local" was automatically substituted. You can avoid this warning > if you rerun the analysis by specifying fitType="local" or "mean" from the > start. > > > >> This is the sessionInfo: >> R version 3.0.2 (2013-09-25) >> Platform: i386-w64-mingw32/i386 (32-bit) >> >> locale: >> [1] LC_COLLATE=Hebrew_Israel.1255 LC_CTYPE=Hebrew_Israel.1255 >> LC_MONETARY=Hebrew_Israel.1255 >> [4] LC_NUMERIC=C LC_TIME=Hebrew_Israel.1255 >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> base >> >> other attached packages: >> [1] DESeq2_1.2.8 RcppArmadillo_0.4.000 Rcpp_0.10.6 >> GenomicRanges_1.14.4 XVector_0.2.0 >> [6] IRanges_1.20.6 BiocGenerics_0.8.0 >> >> loaded via a namespace (and not attached): >> [1] annotate_1.40.0 AnnotationDbi_1.24.0 Biobase_2.22.0 >> DBI_0.2-7 genefilter_1.44.0 >> [6] grid_3.0.2 lattice_0.20-23 locfit_1.5-9.1 >> RColorBrewer_1.0-5 RSQLite_0.11.4 >> [11] splines_3.0.2 stats4_3.0.2 survival_2.37-4 >> XML_3.98-1.1 xtable_1.7-1 >> >> And this is the code I was running: >> >> tmpcountData = read.table(countsTable, header=TRUE, row.names=1) >> >> numOfConds=6 >> >> samplesList = c("U_V0", "U_V0","U_V0", "U_V0", "U_V0", "U_V0", "U_K1", >> "U_K1", "U_K1", "U_K1", "U_p5", "U_p5", "M_V0", "M_V0", "M_V0", >> "M_K1","M_K1", "M_p5", "M_p5") >> >> countData = tmpcountData[,3:21] # counts table contains 2 more columns >> before the counts >> >> colData = data.frame(row.names=colnames(countData), condition = >> samplesList) >> >> conditionsList = c("U_V0","U_K1", "U_p5","M_V0","M_K1","M_p5") >> >> colData$condition = factor( colData$condition, levels = >> conditionsList) >> >> dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, >> design = ~condition) >> >> dds <- estimateSizeFactors(dds) >> >> dds <- estimateDispersions(dds, maxit = 100) >> >> dds <- nbinomWaldTest(dds, betaPrior = FALSE) >> >> png("plotDispEsts_final.png") >> >> plotDispEsts(dds) >> >> dev.off() >> >> tmp_counts = counts(dds, normalized=TRUE) >> >> normalized_counts = data.frame(tmp_counts) >> >> >> ... reading the columns of counts for comparisons and writing the >> results to >> a table.... >> >> >> Thanks! >> Noa Henig >> >> >> >> >> >> On Tue, Jan 7, 2014 at 4:21 PM, Michael Love <michaelisaiahlove@gmail.com>> >wrote: >> >> > Hi Noa, >> > >> > Thanks for reporting in. This sounds like the same problem that a user >> > had in November: >> > >> > >> > >> http://permalink.gmane.org/gmane.science.biology.informatics.conduc tor/51749 >> > >> > >> > If you are using the updated release version (>= 1.2.6), there should >> > have been a warning printed when you build the DESeqDataSet that you >> should >> > set betaPrior=FALSE because factors are present in the design with 3 or >> > more levels. >> > >> > In the devel branch of DESeq2 (v1.3), we have implemented a solution >> that >> > allows using a prior on log fold changes for factors with 3 or more >> levels >> > results. Then DESeq() or nbinomWaldTest() will provide symmetric results >> > regardless of the base level and it takes care of this situation with >> > nonzero log fold changes from contrasts in which both conditions have >> zeros. >> > >> > With future questions please >> > post your full code, the output of sessionInfo() >> > , as this helps us provide better answers. >> > >> > Mike >> > On Jan 7, 2014 8:49 AM, "Noa Henig" <ntzunz@gmail.com> wrote: >> > >> >> Hi, >> >> I am using DESeq2 to get differential expression of 45 samples which >> >> represent 26 different conditions. >> >> I read that the calculation of the fold change in changed DESeq2 and >> >> affects the genes having low counts more than the highly expressed >> genes. >> >> However, I found genes that had zero counts in series of 4 samples (2 >> >> replicates in each of 2 conditions), while their log2fold change was 3. >> >> what is the reason for that? or how can this be prevented? >> >> >> >> I used the nbinomWaldTest and did not replace outliers with the trimmed >> >> mean since I have only 2 replicates per condition. >> >> >> >> I'd appreciate your help very much, >> >> Noa Henig >> >> >> >> [[alternative HTML version deleted]] >> >> >> >> _______________________________________________ >> >> Bioconductor mailing list >> >> Bioconductor@r-project.org >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> Search the archives: >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> > >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]] ADD REPLYlink written 5.3 years ago by Michael Love23k Hi Michael, I am working on an a new setup that follows the design you suggested, should have the results soon. Meanwhile I wanted to get a quick estimate for the differential expression only in one of the cell types but I still have a problem with high fold change for samples having zero or very low counts: I used a partial matrix that includes only the replicates of one of the cell types, so there is a single factor with 3 possible levels in this design. The code was the same as I sent last Sunday, where betaPrior was set to FALSE, except the counts matrix that included this time only 6 replicates of U_V0, 4 replicates of U_K1 and 2 replicates of U_p5: "U_V0", "U_V0","U_V0", "U_V0", "U_V0", "U_V0", "U_K1","U_K1", "U_K1", "U_K1", "U_p5", "U_p5" The normalized counts for a certain gene were the following: 0, 0, 0, 0, 0, 0.83, 0, 1.12, 0, 0, 0, 0 The log2 fold change for that gene for the comparison U_V0 vs. U_p5 was *-13.38* For the comparison of U_V0 vs. U_K1 it was 1.01. Can you tell what is the problem in this case? Thanks again, Noa On Sun, Feb 9, 2014 at 8:27 PM, Michael Love <michaelisaiahlove@gmail.com>wrote: > Sorry, I left out the interaction term in my earlier response, which is > important in order to test if the effect of condition is different in the > second cell type as in the first cell type. > > On Sun, Feb 9, 2014 at 12:31 PM, Michael Love <michaelisaiahlove@gmail.com> > wrote: > >> hi Noa, >> >> >> On Sun, Feb 9, 2014 at 10:25 AM, Noa Henig <ntzunz@gmail.com> wrote: >> >>> Hi DESeq developers, >>> I am running DESeq2 1.2.8 to receive differential expression in 2 >>> different >>> cell types. For each cell type I have control samples, over- expression of >>> gene A, and over-expression of gene B, while the second cell type is used >>> in order to support the findings from the first cell type. >>> >>> I have 2 questions: >>> First, I found some genes that had zero normalized counts in 7 out of 8 >>> replicates (the last one had 1 count or 1.6 counts). The fold change for >>> comparison for such two genes was -13.4 and 2.47. Since the 'condition' >>> has >>> more than 3 levels I set the betaPrior to False as Michael indicated in >>> the >>> earlier correspondence (the code appears below). >>> Can you tell the reason for this this and how to correct this? >>> >> >> I don't understand what is unexpected here. It looks like you are not >> following my recommendation in a previous email to use a design formula >> with celltype and condition, e.g.: >> >> ~ celltype + condition >> > > > I would recommend you use for this experiment the design formula: > > ~ celltype + condition + celltype:condition > > Let's say that celltype has two levels "Cell1" and "Cell2", and condition > has three levels "Level1", "Level2", "Level3". Then to examine the effect > of one of the levels of condition you would use, e.g.: > > results(dds, name="condition_Level2_vs_Level1") > ...or equivalently: > results(dds, contrast=c("condition","Level2","Level1")) > > You can compare any pairs of levels using the contrast argument. > > In order to test if the second cell type had a different effect of > condition as the first cell-type, you would use: > > results(dds, name="celltypeCell2:conditionLevel2") > > ...for the level2 effect and, > > results(dds, name="celltypeCell2:conditionLevel3") > > ...for the level3 effect. > > > Mike > > > >> >> where celltype would be a column in colData with levels U or M, and >> condition would be a column in colData with levels V0, K1 or p5. >> >> When one of the levels specified in the contrast has zero counts, and we >> do not use a prior on log fold change, the log fold change estimate will >> slope toward + or - infinity, but the algorithm will stop if the increase >> in the likelihood is less than a threshold (as does the glm() function in >> base R), or if the estimate passes fold change of 1024 or 1/1024. >> >> You don't need to correct these estimates, they will either be >> significant or not by adjusted p-value (as can be seen in the MA- plot) >> depending on the estimate of dispersion. There is typically a curved >> boundary region in the MA-plot where fold changes from low count genes are >> not significant by adjusted p-value, while fold change from high count >> genes are significant. >> >> >> Mike >> >> >> >>> >>> In addition, I received the following warning: >>> Warning message: >>> In parametricDispersionFit(mcols(objectNZ)$baseMean[useForFit], : >>> the parametric dispersion fit did not converge, >>> which occurs when the dispersion estimates over the mean do not fit to >>> the >>> curve y = a/x + b. You should re-run the analysis using fitType='local' >>> or 'mean' (DESeq2 > v1.2 will re-run automatically). >>> When using local regression fit, the user should examine >>> plotDispEsts(dds) >>> to make sure the fitted line is not sharply curving up or down based on >>> the position of individual points. >>> >>> I guess that by running version 1.2.8 the additional fit types were ran >>> automatically but can I see the actual code that was ran ? >>> >>> >> This error message is admittedly a bit confusing, and I've improved this >> in the devel branch. Better would be to say: fitType="parametric" didn't >> work. The dispersion estimates do not fit well to the parametrization. So >> fitType="local" was automatically substituted. You can avoid this warning >> if you rerun the analysis by specifying fitType="local" or "mean" from the >> start. >> >> >> >>> This is the sessionInfo: >>> R version 3.0.2 (2013-09-25) >>> Platform: i386-w64-mingw32/i386 (32-bit) >>> >>> locale: >>> [1] LC_COLLATE=Hebrew_Israel.1255 LC_CTYPE=Hebrew_Israel.1255 >>> LC_MONETARY=Hebrew_Israel.1255 >>> [4] LC_NUMERIC=C LC_TIME=Hebrew_Israel.1255 >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> base >>> >>> other attached packages: >>> [1] DESeq2_1.2.8 RcppArmadillo_0.4.000 Rcpp_0.10.6 >>> GenomicRanges_1.14.4 XVector_0.2.0 >>> [6] IRanges_1.20.6 BiocGenerics_0.8.0 >>> >>> loaded via a namespace (and not attached): >>> [1] annotate_1.40.0 AnnotationDbi_1.24.0 Biobase_2.22.0 >>> DBI_0.2-7 genefilter_1.44.0 >>> [6] grid_3.0.2 lattice_0.20-23 locfit_1.5-9.1 >>> RColorBrewer_1.0-5 RSQLite_0.11.4 >>> [11] splines_3.0.2 stats4_3.0.2 survival_2.37-4 >>> XML_3.98-1.1 xtable_1.7-1 >>> >>> And this is the code I was running: >>> >>> tmpcountData = read.table(countsTable, header=TRUE, row.names=1) >>> >>> numOfConds=6 >>> >>> samplesList = c("U_V0", "U_V0","U_V0", "U_V0", "U_V0", "U_V0", "U_K1", >>> "U_K1", "U_K1", "U_K1", "U_p5", "U_p5", "M_V0", "M_V0", "M_V0", >>> "M_K1","M_K1", "M_p5", "M_p5") >>> >>> countData = tmpcountData[,3:21] # counts table contains 2 more columns >>> before the counts >>> >>> colData = data.frame(row.names=colnames(countData), condition = >>> samplesList) >>> >>> conditionsList = c("U_V0","U_K1", "U_p5","M_V0","M_K1","M_p5") >>> >>> colData$condition = factor( colData$condition, levels = >>> conditionsList) >>> >>> dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, >>> design = ~condition) >>> >>> dds <- estimateSizeFactors(dds) >>> >>> dds <- estimateDispersions(dds, maxit = 100) >>> >>> dds <- nbinomWaldTest(dds, betaPrior = FALSE) >>> >>> png("plotDispEsts_final.png") >>> >>> plotDispEsts(dds) >>> >>> dev.off() >>> >>> tmp_counts = counts(dds, normalized=TRUE) >>> >>> normalized_counts = data.frame(tmp_counts) >>> >>> >>> ... reading the columns of counts for comparisons and writing the >>> results to >>> a table.... >>> >>> >>> Thanks! >>> Noa Henig >>> >>> >>> >>> >>> >>> On Tue, Jan 7, 2014 at 4:21 PM, Michael Love < >>> michaelisaiahlove@gmail.com>wrote: >>> >>> > Hi Noa, >>> > >>> > Thanks for reporting in. This sounds like the same problem that a user >>> > had in November: >>> > >>> > >>> > >>> http://permalink.gmane.org/gmane.science.biology.informatics.condu ctor/51749 >>> > >>> > >>> > If you are using the updated release version (>= 1.2.6), there should >>> > have been a warning printed when you build the DESeqDataSet that you >>> should >>> > set betaPrior=FALSE because factors are present in the design with 3 or >>> > more levels. >>> > >>> > In the devel branch of DESeq2 (v1.3), we have implemented a solution >>> that >>> > allows using a prior on log fold changes for factors with 3 or more >>> levels >>> > results. Then DESeq() or nbinomWaldTest() will provide symmetric >>> results >>> > regardless of the base level and it takes care of this situation with >>> > nonzero log fold changes from contrasts in which both conditions have >>> zeros. >>> > >>> > With future questions please >>> > post your full code, the output of sessionInfo() >>> > , as this helps us provide better answers. >>> > >>> > Mike >>> > On Jan 7, 2014 8:49 AM, "Noa Henig" <ntzunz@gmail.com> wrote: >>> > >>> >> Hi, >>> >> I am using DESeq2 to get differential expression of 45 samples which >>> >> represent 26 different conditions. >>> >> I read that the calculation of the fold change in changed DESeq2 and >>> >> affects the genes having low counts more than the highly expressed >>> genes. >>> >> However, I found genes that had zero counts in series of 4 samples (2 >>> >> replicates in each of 2 conditions), while their log2fold change was >>> 3. >>> >> what is the reason for that? or how can this be prevented? >>> >> >>> >> I used the nbinomWaldTest and did not replace outliers with the >>> trimmed >>> >> mean since I have only 2 replicates per condition. >>> >> >>> >> I'd appreciate your help very much, >>> >> Noa Henig >>> >> >>> >> [[alternative HTML version deleted]] >>> >> >>> >> _______________________________________________ >>> >> Bioconductor mailing list >>> >> Bioconductor@r-project.org >>> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> >> Search the archives: >>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >>> > >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > [[alternative HTML version deleted]]
Hi Noa, On Feb 11, 2014 7:52 AM, "Noa Henig" <ntzunz@gmail.com> wrote: > > Hi Michael, > I am working on an a new setup that follows the design you suggested, should have the results soon. Meanwhile I wanted to get a quick estimate for the differential expression only in one of the cell types but I still have a problem with high fold change for samples having zero or very low counts: > I used a partial matrix that includes only the replicates of one of the cell types, so there is a single factor with 3 possible levels in this design. The code was the same as I sent last Sunday, where betaPrior was set to FALSE, except the counts matrix that included this time only 6 replicates of U_V0, 4 replicates of U_K1 and 2 replicates of U_p5: > > "U_V0", "U_V0","U_V0", "U_V0", "U_V0", "U_V0", "U_K1","U_K1", "U_K1", "U_K1", "U_p5", "U_p5" > > The normalized counts for a certain gene were the following: > 0, 0, 0, 0, 0, 0.83, 0, 1.12, 0, 0, 0, 0 > > The log2 fold change for that gene for the comparison U_V0 vs. U_p5 was -13.38 > For the comparison of U_V0 vs. U_K1 it was 1.01. > > Can you tell what is the problem in this case? See my last email. This is not a problem. U_p5 has zero counts, so expected value zero. The base level has small, positive expected value. The fold change is from a small positive value to zero. So log(0). Note that U_V0 is the base level in your design. So I assume you are actually reporting log fold changes over the base level. Mike > Thanks again, > Noa > > > On Sun, Feb 9, 2014 at 8:27 PM, Michael Love <michaelisaiahlove@gmail.com> wrote: >> >> Sorry, I left out the interaction term in my earlier response, which is important in order to test if the effect of condition is different in the second cell type as in the first cell type. >> >> On Sun, Feb 9, 2014 at 12:31 PM, Michael Love < michaelisaiahlove@gmail.com> wrote: >>> >>> hi Noa, >>> >>> >>> On Sun, Feb 9, 2014 at 10:25 AM, Noa Henig <ntzunz@gmail.com> wrote: >>>> >>>> Hi DESeq developers, >>>> I am running DESeq2 1.2.8 to receive differential expression in 2 different >>>> cell types. For each cell type I have control samples, over- expression of >>>> gene A, and over-expression of gene B, while the second cell type is used >>>> in order to support the findings from the first cell type. >>>> >>>> I have 2 questions: >>>> First, I found some genes that had zero normalized counts in 7 out of 8 >>>> replicates (the last one had 1 count or 1.6 counts). The fold change for >>>> comparison for such two genes was -13.4 and 2.47. Since the 'condition' has >>>> more than 3 levels I set the betaPrior to False as Michael indicated in the >>>> earlier correspondence (the code appears below). >>>> Can you tell the reason for this this and how to correct this? >>> >>> >>> I don't understand what is unexpected here. It looks like you are not following my recommendation in a previous email to use a design formula with celltype and condition, e.g.: >>> >>> ~ celltype + condition >> >> >> >> I would recommend you use for this experiment the design formula: >> >> ~ celltype + condition + celltype:condition >> >> Let's say that celltype has two levels "Cell1" and "Cell2", and condition has three levels "Level1", "Level2", "Level3". Then to examine the effect of one of the levels of condition you would use, e.g.: >> >> results(dds, name="condition_Level2_vs_Level1") >> ...or equivalently: >> results(dds, contrast=c("condition","Level2","Level1")) >> >> You can compare any pairs of levels using the contrast argument. >> >> In order to test if the second cell type had a different effect of condition as the first cell-type, you would use: >> >> results(dds, name="celltypeCell2:conditionLevel2") >> >> ...for the level2 effect and, >> >> results(dds, name="celltypeCell2:conditionLevel3") >> >> ...for the level3 effect. >> >> >> Mike >> >> >>> >>> >>> where celltype would be a column in colData with levels U or M, and condition would be a column in colData with levels V0, K1 or p5. >>> >>> When one of the levels specified in the contrast has zero counts, and we do not use a prior on log fold change, the log fold change estimate will slope toward + or - infinity, but the algorithm will stop if the increase in the likelihood is less than a threshold (as does the glm() function in base R), or if the estimate passes fold change of 1024 or 1/1024. >>> >>> You don't need to correct these estimates, they will either be significant or not by adjusted p-value (as can be seen in the MA-plot) depending on the estimate of dispersion. There is typically a curved boundary region in the MA-plot where fold changes from low count genes are not significant by adjusted p-value, while fold change from high count genes are significant. >>> >>> >>> Mike >>> >>> >>>> >>>> >>>> In addition, I received the following warning: >>>> Warning message: >>>> In parametricDispersionFit(mcols(objectNZ)$baseMean[useForFit], : >>>> the parametric dispersion fit did not converge, >>>> which occurs when the dispersion estimates over the mean do not fit to the >>>> curve y = a/x + b. You should re-run the analysis using fitType='local' >>>> or 'mean' (DESeq2 > v1.2 will re-run automatically). >>>> When using local regression fit, the user should examine plotDispEsts(dds) >>>> to make sure the fitted line is not sharply curving up or down based on >>>> the position of individual points. >>>> >>>> I guess that by running version 1.2.8 the additional fit types were ran >>>> automatically but can I see the actual code that was ran ? >>>> >>> >>> This error message is admittedly a bit confusing, and I've improved this in the devel branch. Better would be to say: fitType="parametric" didn't work. The dispersion estimates do not fit well to the parametrization. So fitType="local" was automatically substituted. You can avoid this warning if you rerun the analysis by specifying fitType="local" or "mean" from the start. >>> >>> >>>> >>>> This is the sessionInfo: >>>> R version 3.0.2 (2013-09-25) >>>> Platform: i386-w64-mingw32/i386 (32-bit) >>>> >>>> locale: >>>> [1] LC_COLLATE=Hebrew_Israel.1255 LC_CTYPE=Hebrew_Israel.1255 >>>> LC_MONETARY=Hebrew_Israel.1255 >>>> [4] LC_NUMERIC=C LC_TIME=Hebrew_Israel.1255 >>>> >>>> attached base packages: >>>> [1] parallel stats graphics grDevices utils datasets methods >>>> base >>>> >>>> other attached packages: >>>> [1] DESeq2_1.2.8 RcppArmadillo_0.4.000 Rcpp_0.10.6 >>>> GenomicRanges_1.14.4 XVector_0.2.0 >>>> [6] IRanges_1.20.6 BiocGenerics_0.8.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] annotate_1.40.0 AnnotationDbi_1.24.0 Biobase_2.22.0 >>>> DBI_0.2-7 genefilter_1.44.0 >>>> [6] grid_3.0.2 lattice_0.20-23 locfit_1.5-9.1 >>>> RColorBrewer_1.0-5 RSQLite_0.11.4 >>>> [11] splines_3.0.2 stats4_3.0.2 survival_2.37-4 >>>> XML_3.98-1.1 xtable_1.7-1 >>>> >>>> And this is the code I was running: >>>> >>>> tmpcountData = read.table(countsTable, header=TRUE, row.names=1) >>>> >>>> numOfConds=6 >>>> >>>> samplesList = c("U_V0", "U_V0","U_V0", "U_V0", "U_V0", "U_V0", "U_K1", >>>> "U_K1", "U_K1", "U_K1", "U_p5", "U_p5", "M_V0", "M_V0", "M_V0", >>>> "M_K1","M_K1", "M_p5", "M_p5") >>>> >>>> countData = tmpcountData[,3:21] # counts table contains 2 more columns >>>> before the counts >>>> >>>> colData = data.frame(row.names=colnames(countData), condition = samplesList) >>>> >>>> conditionsList = c("U_V0","U_K1", "U_p5","M_V0","M_K1","M_p5") >>>> >>>> colData$condition = factor( colData\$condition, levels = >>>> conditionsList) >>>> >>>> dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, >>>> design = ~condition) >>>> >>>> dds <- estimateSizeFactors(dds) >>>> >>>> dds <- estimateDispersions(dds, maxit = 100) >>>> >>>> dds <- nbinomWaldTest(dds, betaPrior = FALSE) >>>> >>>> png("plotDispEsts_final.png") >>>> >>>> plotDispEsts(dds) >>>> >>>> dev.off() >>>> >>>> tmp_counts = counts(dds, normalized=TRUE) >>>> >>>> normalized_counts = data.frame(tmp_counts) >>>> >>>> >>>> ... reading the columns of counts for comparisons and writing the results to >>>> a table.... >>>> >>>> >>>> Thanks! >>>> Noa Henig >>>> >>>> >>>> >>>> >>>> >>>> On Tue, Jan 7, 2014 at 4:21 PM, Michael Love < michaelisaiahlove@gmail.com>wrote: >>>> >>>> > Hi Noa, >>>> > >>>> > Thanks for reporting in. This sounds like the same problem that a user >>>> > had in November: >>>> > >>>> > >>>> > http://permalink.gmane.org/gmane.science.biology.informatics.conductor /51749 >>>> > >>>> > >>>> > If you are using the updated release version (>= 1.2.6), there should >>>> > have been a warning printed when you build the DESeqDataSet that you should >>>> > set betaPrior=FALSE because factors are present in the design with 3 or >>>> > more levels. >>>> > >>>> > In the devel branch of DESeq2 (v1.3), we have implemented a solution that >>>> > allows using a prior on log fold changes for factors with 3 or more levels >>>> > results. Then DESeq() or nbinomWaldTest() will provide symmetric results >>>> > regardless of the base level and it takes care of this situation with >>>> > nonzero log fold changes from contrasts in which both conditions have zeros. >>>> > >>>> > With future questions please >>>> > post your full code, the output of sessionInfo() >>>> > , as this helps us provide better answers. >>>> > >>>> > Mike >>>> > On Jan 7, 2014 8:49 AM, "Noa Henig" <ntzunz@gmail.com> wrote: >>>> > >>>> >> Hi, >>>> >> I am using DESeq2 to get differential expression of 45 samples which >>>> >> represent 26 different conditions. >>>> >> I read that the calculation of the fold change in changed DESeq2 and >>>> >> affects the genes having low counts more than the highly expressed genes. >>>> >> However, I found genes that had zero counts in series of 4 samples (2 >>>> >> replicates in each of 2 conditions), while their log2fold change was 3. >>>> >> what is the reason for that? or how can this be prevented? >>>> >> >>>> >> I used the nbinomWaldTest and did not replace outliers with the trimmed >>>> >> mean since I have only 2 replicates per condition. >>>> >> >>>> >> I'd appreciate your help very much, >>>> >> Noa Henig >>>> >> >>>> >> [[alternative HTML version deleted]] >>>> >> >>>> >> _______________________________________________ >>>> >> Bioconductor mailing list >>>> >> Bioconductor@r-project.org >>>> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> >> Search the archives: >>>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >> >>>> > >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor@r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> > [[alternative HTML version deleted]]