A question with DEXSeq package: inconsistency between normalized counts vs. fitted expression, fitted splicing or fold changes

0

Entering edit mode

Yao,Hui ▴ 40

@yaohui-5209

Last seen 9.6 years ago

Dear DEXSeq authors and users, I am using DEXSeq package (1.2.0) to do splicing analysis for human genome. One part of results from the analysis is really confusing me. It seems that for many genes the fitted expression, the fitted splicing and estimated fold changes are inconsistent (actually reversed) with its normalized counts. To clearly explain the problem, I am showing a simple example with only two genes with 30 samples as below. > load("toBioConductor.RData") > library(DEXSeq) > ls() [1] "testgene" ### For this data set, our interest is to investigate the differential usage of exons between two tissue "type", X and N. ### The samples were collected from two batches, So we need to adjust the batch effects in the following model. > formu <- count ~ sample + (exon + from)*type > testgene <- estimateDispersions(testgene, formula=formu) > testgene <- fitDispersionFunction(testgene) > f0 <- count~sample + from*exon + type > f1 <- count~sample + from*exon + type*I(exon==exonID) > testgene <- testForDEU(testgene,formula0=f0, formula1=f1) > testgene <- estimatelog2FoldChanges(testgene,fitExpToVar="type" ) > res <- DEUresultTable(testgene) n the attached figure, toBioConductor-plotDEXSeq.pdf, for E011, its normalized counts of "N" are clearly smaller than those of "X". However, for both Fitted splicing and Fitted expression, the level of "N" is larger than that of "X". And if we checked the estimated fold change as below, the fold change is consistent with fitted expression and fitted splicing. > res["ENSG00000001497:011",] geneID exonID dispersion pvalue padjust ENSG00000001497:011 ENSG00000001497 E011 0.4096852 0.0006160211 0.01334712 meanBase log2fold(X/N) ENSG00000001497:011 5.714405 -1.936449 We also explicitly check the normalized counts for E011 of this gene. As below shows the median of each type. Those of "N" is clearly smaller than "X". > dtbl <- data.frame(counts=counts(testgene,normalized=T)["ENSG0000000 1497:011",],type=pData(testgene)$type) > with(dtbl,tapply(counts,type,median)) X N 8.534784 2.064785 > sessionInfo() R version 2.14.0 (2011-10-31) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] DEXSeq_1.2.0 Biobase_2.14.0 loaded via a namespace (and not attached): [1] biomaRt_2.10.0 hwriter_1.3 plyr_1.7.1 RCurl_1.91-1 statmod_1.4.14 [6] stringr_0.6 tools_2.14.0 XML_3.9-4 So, have I made any mistakes in the analysis? Many thanks in advance, Hui Hui Yao, Ph.D. Principal Statistical Analyst MD Anderson Cancer Center -------------- next part -------------- A non-text attachment was scrubbed... Name: toBioConductor-plotDEXSeq.pdf Type: application/pdf Size: 18588 bytes Desc: toBioConductor-plotDEXSeq.pdf URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20120625="" 45c8527d="" attachment.pdf="">

Cancer DEXSeq Cancer DEXSeq • 1.5k views

ADD COMMENT • link updated 11.8 years ago by Alejandro Reyes ★ 1.9k • written 11.8 years ago by Yao,Hui ▴ 40

0

Entering edit mode

Alejandro Reyes ★ 1.9k

@alejandro-reyes-5124

Last seen 8 days ago

Novartis Institutes for BioMedical Rese…

Dear Yao Hui, Thank you for your email and detailed explanations! First, there seems to be a typo in your dispersion formula (which I introduced in the vignette and apparently I forgot to change in the 1.2.0 release). The correct formula would be: formu<- count ~ sample + (type + from) * exon Regarding your question, it sounds like an embarrassing bug... but I could not find in the DEXSeq code a place where the sample names could be switched by mistake. Could you send me your "toBioConductor.RData" object to give it a closer look, as well as the code used to generate the plot?? Best wishes, Alejandro Reyes > Dear DEXSeq authors and users, > > I am using DEXSeq package (1.2.0) to do splicing analysis for human genome. One part of results from the analysis is really confusing me. It seems that for many genes the fitted expression, the fitted splicing and estimated fold changes are inconsistent (actually reversed) with its normalized counts. > > To clearly explain the problem, I am showing a simple example with only two genes with 30 samples as below. > >> load("toBioConductor.RData") >> library(DEXSeq) >> ls() > [1] "testgene" > > ### For this data set, our interest is to investigate the differential usage of exons between two tissue "type", X and N. > ### The samples were collected from two batches, So we need to adjust the batch effects in the following model. > > >> formu<- count ~ sample + (exon + from)*type >> testgene<- estimateDispersions(testgene, formula=formu) >> testgene<- fitDispersionFunction(testgene) >> f0<- count~sample + from*exon + type >> f1<- count~sample + from*exon + type*I(exon==exonID) >> testgene<- testForDEU(testgene,formula0=f0, formula1=f1) >> testgene<- estimatelog2FoldChanges(testgene,fitExpToVar="type" ) >> res<- DEUresultTable(testgene) > n the attached figure, toBioConductor-plotDEXSeq.pdf, for E011, its normalized counts of "N" are clearly smaller than those of "X". However, for both Fitted splicing and Fitted expression, the level of "N" is larger than that of "X". And if we checked the estimated fold change as below, the fold change is consistent with fitted expression and fitted splicing. > >> res["ENSG00000001497:011",] > geneID exonID dispersion pvalue padjust > ENSG00000001497:011 ENSG00000001497 E011 0.4096852 0.0006160211 0.01334712 > meanBase log2fold(X/N) > ENSG00000001497:011 5.714405 -1.936449 > > We also explicitly check the normalized counts for E011 of this gene. As below shows the median of each type. Those of "N" is clearly smaller than "X". > >> dtbl<- data.frame(counts=counts(testgene,normalized=T)["ENSG0000000 1497:011",],type=pData(testgene)$type) >> with(dtbl,tapply(counts,type,median)) > X N > 8.534784 2.064785 > >> sessionInfo() > R version 2.14.0 (2011-10-31) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US > [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] DEXSeq_1.2.0 Biobase_2.14.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.10.0 hwriter_1.3 plyr_1.7.1 RCurl_1.91-1 statmod_1.4.14 > [6] stringr_0.6 tools_2.14.0 XML_3.9-4 > > So, have I made any mistakes in the analysis? > > > Many thanks in advance, > > Hui > > Hui Yao, Ph.D. > Principal Statistical Analyst > MD Anderson Cancer Center > > > > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD COMMENT • link 11.8 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

Dear Alejandro, Thanks a lot for quick reply! First I appreciate that you pinpoint the typo and will correct it in the analysis. Then I will send you the toBioConductor.RData and the code in a separate e-mail to avoid exceeding the limit of bio-conductor email size. Thanks again, Hui On Jun 26, 2012, at 4:14 AM, Alejandro Reyes wrote: Dear Yao Hui, Thank you for your email and detailed explanations! First, there seems to be a typo in your dispersion formula (which I introduced in the vignette and apparently I forgot to change in the 1.2.0 release). The correct formula would be: formu <- count ~ sample + (type + from) * exon Regarding your question, it sounds like an embarrassing bug... but I could not find in the DEXSeq code a place where the sample names could be switched by mistake. Could you send me your "toBioConductor.RData" object to give it a closer look, as well as the code used to generate the plot?? Best wishes, Alejandro Reyes Dear DEXSeq authors and users, I am using DEXSeq package (1.2.0) to do splicing analysis for human genome. One part of results from the analysis is really confusing me. It seems that for many genes the fitted expression, the fitted splicing and estimated fold changes are inconsistent (actually reversed) with its normalized counts. To clearly explain the problem, I am showing a simple example with only two genes with 30 samples as below. load("toBioConductor.RData") library(DEXSeq) ls() [1] "testgene" ### For this data set, our interest is to investigate the differential usage of exons between two tissue "type", X and N. ### The samples were collected from two batches, So we need to adjust the batch effects in the following model. formu <- count ~ sample + (exon + from)*type testgene <- estimateDispersions(testgene, formula=formu) testgene <- fitDispersionFunction(testgene) f0 <- count~sample + from*exon + type f1 <- count~sample + from*exon + type*I(exon==exonID) testgene <- testForDEU(testgene,formula0=f0, formula1=f1) testgene <- estimatelog2FoldChanges(testgene,fitExpToVar="type" ) res <- DEUresultTable(testgene) n the attached figure, toBioConductor-plotDEXSeq.pdf, for E011, its normalized counts of "N" are clearly smaller than those of "X". However, for both Fitted splicing and Fitted expression, the level of "N" is larger than that of "X". And if we checked the estimated fold change as below, the fold change is consistent with fitted expression and fitted splicing. res["ENSG00000001497:011",] geneID exonID dispersion pvalue padjust ENSG00000001497:011 ENSG00000001497 E011 0.4096852 0.0006160211 0.01334712 meanBase log2fold(X/N) ENSG00000001497:011 5.714405 -1.936449 We also explicitly check the normalized counts for E011 of this gene. As below shows the median of each type. Those of "N" is clearly smaller than "X". dtbl <- data.frame(counts=counts(testgene,normalized=T)["ENSG000000014 97:011",],type=pData(testgene)$type) with(dtbl,tapply(counts,type,median)) X N 8.534784 2.064785 sessionInfo() R version 2.14.0 (2011-10-31) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] DEXSeq_1.2.0 Biobase_2.14.0 loaded via a namespace (and not attached): [1] biomaRt_2.10.0 hwriter_1.3 plyr_1.7.1 RCurl_1.91-1 statmod_1.4.14 [6] stringr_0.6 tools_2.14.0 XML_3.9-4 So, have I made any mistakes in the analysis? Many thanks in advance, Hui Hui Yao, Ph.D. Principal Statistical Analyst MD Anderson Cancer Center _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 11.8 years ago Yao,Hui ▴ 40

0

Entering edit mode

Dear Yao Hui, It was an error in the code, the levels in one of your variables in the design matrix were not sorted! At some point part of the code was (wrongly) relying on the order of the factors, some the conditions were mixed. You could stay with the same version and do this before starting the analysis: levels( pData(testgene)$type ) <- sort( levels( pData(testgene)$type ) ) Or update to version DEXSeq_1.3.3? ( it is already fixed here). An apology for the bug! Alejandro

ADD REPLY • link 11.8 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

Dear Alejandro, It works now. Thanks a lot! Hui On Jun 26, 2012, at 11:28 AM, Alejandro Reyes wrote: levels( pData(testgene)$type ) <- sort( levels( pData(testgene)$type ) ) [[alternative HTML version deleted]]

ADD REPLY • link 11.8 years ago Yao,Hui ▴ 40

Login before adding your answer.