Search
Question: A question with DEXSeq package: inconsistency between normalized counts vs. fitted expression, fitted splicing or fold changes
0
5.9 years ago by
Yao,Hui40
Yao,Hui40 wrote:
Dear DEXSeq authors and users, I am using DEXSeq package (1.2.0) to do splicing analysis for human genome. One part of results from the analysis is really confusing me. It seems that for many genes the fitted expression, the fitted splicing and estimated fold changes are inconsistent (actually reversed) with its normalized counts. To clearly explain the problem, I am showing a simple example with only two genes with 30 samples as below. > load("toBioConductor.RData") > library(DEXSeq) > ls() [1] "testgene" ### For this data set, our interest is to investigate the differential usage of exons between two tissue "type", X and N. ### The samples were collected from two batches, So we need to adjust the batch effects in the following model. > formu <- count ~ sample + (exon + from)*type > testgene <- estimateDispersions(testgene, formula=formu) > testgene <- fitDispersionFunction(testgene) > f0 <- count~sample + from*exon + type > f1 <- count~sample + from*exon + type*I(exon==exonID) > testgene <- testForDEU(testgene,formula0=f0, formula1=f1) > testgene <- estimatelog2FoldChanges(testgene,fitExpToVar="type" ) > res <- DEUresultTable(testgene) n the attached figure, toBioConductor-plotDEXSeq.pdf, for E011, its normalized counts of "N" are clearly smaller than those of "X". However, for both Fitted splicing and Fitted expression, the level of "N" is larger than that of "X". And if we checked the estimated fold change as below, the fold change is consistent with fitted expression and fitted splicing. > res["ENSG00000001497:011",] geneID exonID dispersion pvalue padjust ENSG00000001497:011 ENSG00000001497 E011 0.4096852 0.0006160211 0.01334712 meanBase log2fold(X/N) ENSG00000001497:011 5.714405 -1.936449 We also explicitly check the normalized counts for E011 of this gene. As below shows the median of each type. Those of "N" is clearly smaller than "X". > dtbl <- data.frame(counts=counts(testgene,normalized=T)["ENSG0000000 1497:011",],type=pData(testgene)$type) > with(dtbl,tapply(counts,type,median)) X N 8.534784 2.064785 > sessionInfo() R version 2.14.0 (2011-10-31) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] DEXSeq_1.2.0 Biobase_2.14.0 loaded via a namespace (and not attached): [1] biomaRt_2.10.0 hwriter_1.3 plyr_1.7.1 RCurl_1.91-1 statmod_1.4.14 [6] stringr_0.6 tools_2.14.0 XML_3.9-4 So, have I made any mistakes in the analysis? Many thanks in advance, Hui Hui Yao, Ph.D. Principal Statistical Analyst MD Anderson Cancer Center -------------- next part -------------- A non-text attachment was scrubbed... Name: toBioConductor-plotDEXSeq.pdf Type: application/pdf Size: 18588 bytes Desc: toBioConductor-plotDEXSeq.pdf URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20120625="" 45c8527d="" attachment.pdf=""> ADD COMMENTlink modified 5.9 years ago by Alejandro Reyes1.5k • written 5.9 years ago by Yao,Hui40 0 5.9 years ago by Alejandro Reyes1.5k Dana-Farber Cancer Institute, Boston, USA Alejandro Reyes1.5k wrote: Dear Yao Hui, Thank you for your email and detailed explanations! First, there seems to be a typo in your dispersion formula (which I introduced in the vignette and apparently I forgot to change in the 1.2.0 release). The correct formula would be: formu<- count ~ sample + (type + from) * exon Regarding your question, it sounds like an embarrassing bug... but I could not find in the DEXSeq code a place where the sample names could be switched by mistake. Could you send me your "toBioConductor.RData" object to give it a closer look, as well as the code used to generate the plot?? Best wishes, Alejandro Reyes > Dear DEXSeq authors and users, > > I am using DEXSeq package (1.2.0) to do splicing analysis for human genome. One part of results from the analysis is really confusing me. It seems that for many genes the fitted expression, the fitted splicing and estimated fold changes are inconsistent (actually reversed) with its normalized counts. > > To clearly explain the problem, I am showing a simple example with only two genes with 30 samples as below. > >> load("toBioConductor.RData") >> library(DEXSeq) >> ls() > [1] "testgene" > > ### For this data set, our interest is to investigate the differential usage of exons between two tissue "type", X and N. > ### The samples were collected from two batches, So we need to adjust the batch effects in the following model. > > >> formu<- count ~ sample + (exon + from)*type >> testgene<- estimateDispersions(testgene, formula=formu) >> testgene<- fitDispersionFunction(testgene) >> f0<- count~sample + from*exon + type >> f1<- count~sample + from*exon + type*I(exon==exonID) >> testgene<- testForDEU(testgene,formula0=f0, formula1=f1) >> testgene<- estimatelog2FoldChanges(testgene,fitExpToVar="type" ) >> res<- DEUresultTable(testgene) > n the attached figure, toBioConductor-plotDEXSeq.pdf, for E011, its normalized counts of "N" are clearly smaller than those of "X". However, for both Fitted splicing and Fitted expression, the level of "N" is larger than that of "X". And if we checked the estimated fold change as below, the fold change is consistent with fitted expression and fitted splicing. > >> res["ENSG00000001497:011",] > geneID exonID dispersion pvalue padjust > ENSG00000001497:011 ENSG00000001497 E011 0.4096852 0.0006160211 0.01334712 > meanBase log2fold(X/N) > ENSG00000001497:011 5.714405 -1.936449 > > We also explicitly check the normalized counts for E011 of this gene. As below shows the median of each type. Those of "N" is clearly smaller than "X". > >> dtbl<- data.frame(counts=counts(testgene,normalized=T)["ENSG0000000 1497:011",],type=pData(testgene)$type) >> with(dtbl,tapply(counts,type,median)) > X N > 8.534784 2.064785 > >> sessionInfo() > R version 2.14.0 (2011-10-31) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US > [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] DEXSeq_1.2.0 Biobase_2.14.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.10.0 hwriter_1.3 plyr_1.7.1 RCurl_1.91-1 statmod_1.4.14 > [6] stringr_0.6 tools_2.14.0 XML_3.9-4 > > So, have I made any mistakes in the analysis? > > > Many thanks in advance, > > Hui > > Hui Yao, Ph.D. > Principal Statistical Analyst > MD Anderson Cancer Center > > > > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]
Dear Alejandro, Thanks a lot for quick reply! First I appreciate that you pinpoint the typo and will correct it in the analysis. Then I will send you the toBioConductor.RData and the code in a separate e-mail to avoid exceeding the limit of bio-conductor email size. Thanks again, Hui On Jun 26, 2012, at 4:14 AM, Alejandro Reyes wrote: Dear Yao Hui, Thank you for your email and detailed explanations! First, there seems to be a typo in your dispersion formula (which I introduced in the vignette and apparently I forgot to change in the 1.2.0 release). The correct formula would be: formu <- count ~ sample + (type + from) * exon Regarding your question, it sounds like an embarrassing bug... but I could not find in the DEXSeq code a place where the sample names could be switched by mistake. Could you send me your "toBioConductor.RData" object to give it a closer look, as well as the code used to generate the plot?? Best wishes, Alejandro Reyes Dear DEXSeq authors and users, I am using DEXSeq package (1.2.0) to do splicing analysis for human genome. One part of results from the analysis is really confusing me. It seems that for many genes the fitted expression, the fitted splicing and estimated fold changes are inconsistent (actually reversed) with its normalized counts. To clearly explain the problem, I am showing a simple example with only two genes with 30 samples as below. load("toBioConductor.RData") library(DEXSeq) ls() [1] "testgene" ### For this data set, our interest is to investigate the differential usage of exons between two tissue "type", X and N. ### The samples were collected from two batches, So we need to adjust the batch effects in the following model. formu <- count ~ sample + (exon + from)*type testgene <- estimateDispersions(testgene, formula=formu) testgene <- fitDispersionFunction(testgene) f0 <- count~sample + from*exon + type f1 <- count~sample + from*exon + type*I(exon==exonID) testgene <- testForDEU(testgene,formula0=f0, formula1=f1) testgene <- estimatelog2FoldChanges(testgene,fitExpToVar="type" ) res <- DEUresultTable(testgene) n the attached figure, toBioConductor-plotDEXSeq.pdf, for E011, its normalized counts of "N" are clearly smaller than those of "X". However, for both Fitted splicing and Fitted expression, the level of "N" is larger than that of "X". And if we checked the estimated fold change as below, the fold change is consistent with fitted expression and fitted splicing. res["ENSG00000001497:011",] geneID exonID dispersion pvalue padjust ENSG00000001497:011 ENSG00000001497 E011 0.4096852 0.0006160211 0.01334712 meanBase log2fold(X/N) ENSG00000001497:011 5.714405 -1.936449 We also explicitly check the normalized counts for E011 of this gene. As below shows the median of each type. Those of "N" is clearly smaller than "X". dtbl <- data.frame(counts=counts(testgene,normalized=T)["ENSG000000014 97:011",],type=pData(testgene)$type) with(dtbl,tapply(counts,type,median)) X N 8.534784 2.064785 sessionInfo() R version 2.14.0 (2011-10-31) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] DEXSeq_1.2.0 Biobase_2.14.0 loaded via a namespace (and not attached): [1] biomaRt_2.10.0 hwriter_1.3 plyr_1.7.1 RCurl_1.91-1 statmod_1.4.14 [6] stringr_0.6 tools_2.14.0 XML_3.9-4 So, have I made any mistakes in the analysis? Many thanks in advance, Hui Hui Yao, Ph.D. Principal Statistical Analyst MD Anderson Cancer Center _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]] ADD REPLYlink written 5.9 years ago by Yao,Hui40 Dear Yao Hui, It was an error in the code, the levels in one of your variables in the design matrix were not sorted! At some point part of the code was (wrongly) relying on the order of the factors, some the conditions were mixed. You could stay with the same version and do this before starting the analysis: levels( pData(testgene)$type ) <- sort( levels( pData(testgene)$type ) ) Or update to version DEXSeq_1.3.3? ( it is already fixed here). An apology for the bug! Alejandro ADD REPLYlink written 5.9 years ago by Alejandro Reyes1.5k Dear Alejandro, It works now. Thanks a lot! Hui On Jun 26, 2012, at 11:28 AM, Alejandro Reyes wrote: levels( pData(testgene)$type ) <- sort( levels( pData(testgene)\$type ) ) [[alternative HTML version deleted]]