Question: Efficiently running DEXSeq for Large Cohorts
0
6.9 years ago by
Fong Chun Chan320 wrote:
Hi all, I've been trying to get DEXSeq to run on a fairly large RNA-seq cohort that I have. To be specific, I have 89 samples and I am attempt to generate DE exon usage results on > 500,000 exons. I've followed the latest tutorial (1.5.6) on Bioconductor and it so far I've had relatively no problems. It just the two steps that are mentioned, estimateDispersions and testForDEU, are taking a fairly long time. I've already attempted to parallelize this on a 48-core 256GB machine, but I get very little progress on the run-time of these functions. I was just wondering if anyone has a good way of running DEXSeq on such a large cohort. Tips on how to reduce run time? Are there way to parallelize these jobs across a cluster rather than rely on a single machine with multi-cores? Any help would be greatly appreciated. Thanks, Fong [[alternative HTML version deleted]]
dexseq • 1.6k views
modified 4.4 years ago by rahel1435010 • written 6.9 years ago by Fong Chun Chan320
Answer: Efficiently running DEXSeq for Large Cohorts
0
6.9 years ago by
Denali
Steve Lianoglou12k wrote:
Hi, On Mon, Jan 14, 2013 at 3:57 AM, Fong Chun Chan <fongchun at="" alumni.ubc.ca=""> wrote: > Hi all, > > I've been trying to get DEXSeq to run on a fairly large RNA-seq cohort that > I have. To be specific, I have 89 samples and I am attempt to generate DE > exon usage results on > 500,000 exons. > > I've followed the latest tutorial (1.5.6) on Bioconductor and it so far > I've had relatively no problems. It just the two steps that are mentioned, > estimateDispersions and testForDEU, are taking a fairly long time. I've > already attempted to parallelize this on a 48-core 256GB machine, but I get > very little progress on the run-time of these functions. If you look at the cpu usage of your machine -- on linux you might us top or htop, for instance -- are all 48 cores typically pegged at 100% usage? -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
Answer: Efficiently running DEXSeq for Large Cohorts
0
6.9 years ago by
Alejandro Reyes1.7k
Dana-Farber Cancer Institute, Boston, USA
Alejandro Reyes1.7k wrote:
Dear Fong Chun Chan, Thank you for your interest in DEXSeq and sorry in advance for the long e-mail. We have also noticed that the computing time increases considerably when you have a large number of samples, conditions or number of exons of a gene. For users in these situations, we have implemented a variant of this functions (estimateDispersionsTRT and testForDEUTRT) in the most recent versions of DEXSeq in the svn. The difference relies on how the model matrix is prepared, in the "normal" functions, the model matrices used to fit the glms are prepared for each exon, such that each exon bin is treated individually, independently of which exon you are testing. For example, if you have a gene with 5 exons, when testing for exon E001, you would consider independently E002, E003, ... , E005 in the model. In the "TRT" implementation the same model matrix is used for all the exons. In the same example as before, you would consider E001 and the sum of all the rest exons of the same gene. This reduces the model and allows to use DEXSeq with a large number of samples. For more clarity, you could try to compare the normal model frame of a gene with the TRT model frame: data(pasillaExons, package="pasilla") modelFrameForGene(pasillaExons, "FBgn0000256") # vs modelFrameForTRT( pasillaExons ) Using the same example, in the last model frame, "this" would be the "E001" and "others" would be the sum of E002 + E003 + ... + E005. This would be the "normal" DEXSeq analysis: pasillaExons <- estimateSizeFactors( pasillaExons ) pasillaExons <- estimateDispersions( pasillaExons ) pasillaExons <- fitDispersionFunction( pasillaExons ) pasillaExons <- testForDEU( pasillaExons ) This would be the "TRT", pasillaExonsTRT <- estimateSizeFactors( pasillaExons ) pasillaExonsTRT <- estimateDispersionsTRT( pasillaExons ) pasillaExonsTRT <- fitDispersionFunction( pasillaExons ) pasillaExonsTRT <- testForDEUTRT( pasillaExons ) And you can see that you get the same results: plot(fData(pasillaExons)$pvalue, fData(pasillaExonsTRT)$pvalue, log="xy") I have the "TRT" tried this for large cohorts with complex models and it works nicely and in reasonable computing times. Best regards, Alejandro Reyes ps. this changes need to be added to the vignette. > Hi all, > > I've been trying to get DEXSeq to run on a fairly large RNA-seq cohort that > I have. To be specific, I have 89 samples and I am attempt to generate DE > exon usage results on > 500,000 exons. > > I've followed the latest tutorial (1.5.6) on Bioconductor and it so far > I've had relatively no problems. It just the two steps that are mentioned, > estimateDispersions and testForDEU, are taking a fairly long time. I've > already attempted to parallelize this on a 48-core 256GB machine, but I get > very little progress on the run-time of these functions. > > I was just wondering if anyone has a good way of running DEXSeq on such a > large cohort. Tips on how to reduce run time? Are there way to parallelize > these jobs across a cluster rather than rely on a single machine with > multi-cores? Any help would be greatly appreciated. > > Thanks, > > Fong > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
Sorry, there was an error in part of my code: The "TRT" would be like this, data(pasillaExons, package="pasilla") pasillaExonsTRT <- estimateSizeFactors( pasillaExons ) pasillaExonsTRT <- estimateDispersionsTRT( pasillaExonsTRT ) pasillaExonsTRT <- fitDispersionFunction( pasillaExonsTRT ) pasillaExonsTRT <- testForDEUTRT( pasillaExonsTRT ) Bests, Alejandro > Dear Fong Chun Chan, > > Thank you for your interest in DEXSeq and sorry in advance for the > long e-mail. We have also noticed that the computing time increases > considerably when you have a large number of samples, conditions or > number of exons of a gene. For users in these situations, we have > implemented a variant of this functions (estimateDispersionsTRT and > testForDEUTRT) in the most recent versions of DEXSeq in the svn. > > The difference relies on how the model matrix is prepared, in the > "normal" functions, the model matrices used to fit the glms are > prepared for each exon, such that each exon bin is treated > individually, independently of which exon you are testing. For > example, if you have a gene with 5 exons, when testing for exon E001, > you would consider independently E002, E003, ... , E005 in the model. > > In the "TRT" implementation the same model matrix is used for all the > exons. In the same example as before, you would consider E001 and the > sum of all the rest exons of the same gene. This reduces the model and > allows to use DEXSeq with a large number of samples. For more clarity, > you could try to compare the normal model frame of a gene with the TRT > model frame: > > data(pasillaExons, package="pasilla") > modelFrameForGene(pasillaExons, "FBgn0000256") > # vs > modelFrameForTRT( pasillaExons ) > > Using the same example, in the last model frame, "this" would be the > "E001" and "others" would be the sum of E002 + E003 + ... + E005. > > This would be the "normal" DEXSeq analysis: > > pasillaExons <- estimateSizeFactors( pasillaExons ) > pasillaExons <- estimateDispersions( pasillaExons ) > pasillaExons <- fitDispersionFunction( pasillaExons ) > pasillaExons <- testForDEU( pasillaExons ) > > This would be the "TRT", > > pasillaExonsTRT <- estimateSizeFactors( pasillaExons ) > pasillaExonsTRT <- estimateDispersionsTRT( pasillaExons ) > pasillaExonsTRT <- fitDispersionFunction( pasillaExons ) > pasillaExonsTRT <- testForDEUTRT( pasillaExons ) > > And you can see that you get the same results: > > plot(fData(pasillaExons)$pvalue, fData(pasillaExonsTRT)$pvalue, log="xy") > > I have the "TRT" tried this for large cohorts with complex models and > it works nicely and in reasonable computing times. > > Best regards, > Alejandro Reyes > > ps. this changes need to be added to the vignette. > > >> Hi all, >> >> I've been trying to get DEXSeq to run on a fairly large RNA-seq >> cohort that >> I have. To be specific, I have 89 samples and I am attempt to >> generate DE >> exon usage results on > 500,000 exons. >> >> I've followed the latest tutorial (1.5.6) on Bioconductor and it so far >> I've had relatively no problems. It just the two steps that are >> mentioned, >> estimateDispersions and testForDEU, are taking a fairly long time. I've >> already attempted to parallelize this on a 48-core 256GB machine, but >> I get >> very little progress on the run-time of these functions. >> >> I was just wondering if anyone has a good way of running DEXSeq on >> such a >> large cohort. Tips on how to reduce run time? Are there way to >> parallelize >> these jobs across a cluster rather than rely on a single machine with >> multi-cores? Any help would be greatly appreciated. >> >> Thanks, >> >> Fong >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
Hi Alejandro, Thanks for the reply. I am trying to use the TRT functions from the DEXSeq svn repository in Bioconductor. When I tried to run: source('/bioconductor.svn/DEXSeq/R/TRT.R') estimateDispersionsTRT( exonCounSet, nCores = 48 ) I got a series of errors complaining about "glmnb.fit": In addition: Warning message: In FUN(c(10L, 20L, 30L, 40L, 50L, 60L, 70L, 80L, 90L, 100L, 110L, : Unable to estimate dispersions for ENSG00000187634:ENSE00001631320 Error : could not find function "glmnb.fit" In addition: Warning message: In FUN(c(7L, 17L, 27L, 37L, 47L, 57L, 67L, 77L, 87L, 97L, 107L, : Unable to estimate dispersions for ENSG00000187634:ENSE00001884257 Error : could not find function "glmnb.fit" .... According to this post, http://seqanswers.com/forums/showthread.php?t=16040, glmnb.fit is supposed to come with statmod. According to my sessionInfo(), I should have the latest versions: > sessionInfo() R version 2.15.2 (2012-10-26) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] optparse_1.0.0 DEXSeq_1.4.0 Biobase_2.18.0 BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] biomaRt_2.14.0 getopt_1.19.0 hwriter_1.3 RCurl_1.95-3 statmod_1.4.16 [6] stringr_0.6.2 tools_2.15.2 XML_3.95-0.1 > Am I doing something wrong here? Thanks for your help, Fong On Mon, Jan 14, 2013 at 1:44 AM, Alejandro Reyes <alejandro.reyes@embl.de>wrote: > Sorry, there was an error in part of my code: > > The "TRT" would be like this, > > data(pasillaExons, package="pasilla") > > pasillaExonsTRT <- estimateSizeFactors( pasillaExons ) > pasillaExonsTRT <- estimateDispersionsTRT( pasillaExonsTRT ) > pasillaExonsTRT <- fitDispersionFunction( pasillaExonsTRT ) > pasillaExonsTRT <- testForDEUTRT( pasillaExonsTRT ) > > Bests, > Alejandro > > > Dear Fong Chun Chan, >> >> Thank you for your interest in DEXSeq and sorry in advance for the long >> e-mail. We have also noticed that the computing time increases considerably >> when you have a large number of samples, conditions or number of exons of a >> gene. For users in these situations, we have implemented a variant of this >> functions (estimateDispersionsTRT and testForDEUTRT) in the most recent >> versions of DEXSeq in the svn. >> >> The difference relies on how the model matrix is prepared, in the >> "normal" functions, the model matrices used to fit the glms are prepared >> for each exon, such that each exon bin is treated individually, >> independently of which exon you are testing. For example, if you have a >> gene with 5 exons, when testing for exon E001, you would consider >> independently E002, E003, ... , E005 in the model. >> >> In the "TRT" implementation the same model matrix is used for all the >> exons. In the same example as before, you would consider E001 and the sum >> of all the rest exons of the same gene. This reduces the model and allows >> to use DEXSeq with a large number of samples. For more clarity, you could >> try to compare the normal model frame of a gene with the TRT model frame: >> >> data(pasillaExons, package="pasilla") >> modelFrameForGene(**pasillaExons, "FBgn0000256") >> # vs >> modelFrameForTRT( pasillaExons ) >> >> Using the same example, in the last model frame, "this" would be the >> "E001" and "others" would be the sum of E002 + E003 + ... + E005. >> >> This would be the "normal" DEXSeq analysis: >> >> pasillaExons <- estimateSizeFactors( pasillaExons ) >> pasillaExons <- estimateDispersions( pasillaExons ) >> pasillaExons <- fitDispersionFunction( pasillaExons ) >> pasillaExons <- testForDEU( pasillaExons ) >> >> This would be the "TRT", >> >> pasillaExonsTRT <- estimateSizeFactors( pasillaExons ) >> pasillaExonsTRT <- estimateDispersionsTRT( pasillaExons ) >> pasillaExonsTRT <- fitDispersionFunction( pasillaExons ) >> pasillaExonsTRT <- testForDEUTRT( pasillaExons ) >> >> And you can see that you get the same results: >> >> plot(fData(pasillaExons)$**pvalue, fData(pasillaExonsTRT)$pvalue, >> log="xy") >> >> I have the "TRT" tried this for large cohorts with complex models and it >> works nicely and in reasonable computing times. >> >> Best regards, >> Alejandro Reyes >> >> ps. this changes need to be added to the vignette. >> >> >> Hi all, >>> >>> I've been trying to get DEXSeq to run on a fairly large RNA-seq cohort >>> that >>> I have. To be specific, I have 89 samples and I am attempt to generate DE >>> exon usage results on > 500,000 exons. >>> >>> I've followed the latest tutorial (1.5.6) on Bioconductor and it so far >>> I've had relatively no problems. It just the two steps that are >>> mentioned, >>> estimateDispersions and testForDEU, are taking a fairly long time. I've >>> already attempted to parallelize this on a 48-core 256GB machine, but I >>> get >>> very little progress on the run-time of these functions. >>> >>> I was just wondering if anyone has a good way of running DEXSeq on such a >>> large cohort. Tips on how to reduce run time? Are there way to >>> parallelize >>> these jobs across a cluster rather than rely on a single machine with >>> multi-cores? Any help would be greatly appreciated. >>> >>> Thanks, >>> >>> Fong >>> >>> [[alternative HTML version deleted]] >>> >>> ______________________________**_________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: http://news.gmane.org/gmane.** >>> science.biology.informatics.**conductor<http: news.gmane.org="" gman="" e.science.biology.informatics.conductor=""> >>> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > [[alternative HTML version deleted]]
ADD REPLYlink written 6.9 years ago by Fong Chun Chan320
Dear Fong Chun Chan, If you want to source the code, you need to attach the libraries required directly to your environment. Before calling the TRT functions, just do: library(statmod) I think that should work, Best regards, Alejandro > Hi Alejandro, > > Thanks for the reply. I am trying to use the TRT functions from the > DEXSeq svn repository in Bioconductor. When I tried to run: > > source('/bioconductor.svn/DEXSeq/R/TRT.R') > estimateDispersionsTRT( exonCounSet, nCores = 48 ) > > I got a series of errors complaining about "glmnb.fit": > > In addition: Warning message: > In FUN(c(10L, 20L, 30L, 40L, 50L, 60L, 70L, 80L, 90L, 100L, 110L, : > Unable to estimate dispersions for ENSG00000187634:ENSE00001631320 > Error : could not find function "glmnb.fit" > In addition: Warning message: > In FUN(c(7L, 17L, 27L, 37L, 47L, 57L, 67L, 77L, 87L, 97L, 107L, : > Unable to estimate dispersions for ENSG00000187634:ENSE00001884257 > Error : could not find function "glmnb.fit" > .... > > According to this post, > http://seqanswers.com/forums/showthread.php?t=16040, glmnb.fit is > supposed to come with statmod. According to my sessionInfo(), I should > have the latest versions: > > > sessionInfo() > R version 2.15.2 (2012-10-26) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] optparse_1.0.0 DEXSeq_1.4.0 Biobase_2.18.0 BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.14.0 getopt_1.19.0 hwriter_1.3 RCurl_1.95-3 > statmod_1.4.16 > [6] stringr_0.6.2 tools_2.15.2 XML_3.95-0.1 > > > > Am I doing something wrong here? Thanks for your help, > > Fong > > > On Mon, Jan 14, 2013 at 1:44 AM, Alejandro Reyes > <alejandro.reyes at="" embl.de="" <mailto:alejandro.reyes="" at="" embl.de="">> wrote: > > Sorry, there was an error in part of my code: > > The "TRT" would be like this, > > data(pasillaExons, package="pasilla") > > pasillaExonsTRT <- estimateSizeFactors( pasillaExons ) > pasillaExonsTRT <- estimateDispersionsTRT( pasillaExonsTRT ) > pasillaExonsTRT <- fitDispersionFunction( pasillaExonsTRT ) > pasillaExonsTRT <- testForDEUTRT( pasillaExonsTRT ) > > Bests, > Alejandro > > > Dear Fong Chun Chan, > > Thank you for your interest in DEXSeq and sorry in advance for > the long e-mail. We have also noticed that the computing time > increases considerably when you have a large number of > samples, conditions or number of exons of a gene. For users in > these situations, we have implemented a variant of this > functions (estimateDispersionsTRT and testForDEUTRT) in the > most recent versions of DEXSeq in the svn. > > The difference relies on how the model matrix is prepared, in > the "normal" functions, the model matrices used to fit the > glms are prepared for each exon, such that each exon bin is > treated individually, independently of which exon you are > testing. For example, if you have a gene with 5 exons, when > testing for exon E001, you would consider independently E002, > E003, ... , E005 in the model. > > In the "TRT" implementation the same model matrix is used for > all the exons. In the same example as before, you would > consider E001 and the sum of all the rest exons of the same > gene. This reduces the model and allows to use DEXSeq with a > large number of samples. For more clarity, you could try to > compare the normal model frame of a gene with the TRT model frame: > > data(pasillaExons, package="pasilla") > modelFrameForGene(pasillaExons, "FBgn0000256") > # vs > modelFrameForTRT( pasillaExons ) > > Using the same example, in the last model frame, "this" would > be the "E001" and "others" would be the sum of E002 + E003 + > ... + E005. > > This would be the "normal" DEXSeq analysis: > > pasillaExons <- estimateSizeFactors( pasillaExons ) > pasillaExons <- estimateDispersions( pasillaExons ) > pasillaExons <- fitDispersionFunction( pasillaExons ) > pasillaExons <- testForDEU( pasillaExons ) > > This would be the "TRT", > > pasillaExonsTRT <- estimateSizeFactors( pasillaExons ) > pasillaExonsTRT <- estimateDispersionsTRT( pasillaExons ) > pasillaExonsTRT <- fitDispersionFunction( pasillaExons ) > pasillaExonsTRT <- testForDEUTRT( pasillaExons ) > > And you can see that you get the same results: > > plot(fData(pasillaExons)$pvalue, > fData(pasillaExonsTRT)$pvalue, log="xy") > > I have the "TRT" tried this for large cohorts with complex > models and it works nicely and in reasonable computing times. > > Best regards, > Alejandro Reyes > > ps. this changes need to be added to the vignette. > > > Hi all, > > I've been trying to get DEXSeq to run on a fairly large > RNA-seq cohort that > I have. To be specific, I have 89 samples and I am attempt > to generate DE > exon usage results on > 500,000 exons. > > I've followed the latest tutorial (1.5.6) on Bioconductor > and it so far > I've had relatively no problems. It just the two steps > that are mentioned, > estimateDispersions and testForDEU, are taking a fairly > long time. I've > already attempted to parallelize this on a 48-core 256GB > machine, but I get > very little progress on the run-time of these functions. > > I was just wondering if anyone has a good way of running > DEXSeq on such a > large cohort. Tips on how to reduce run time? Are there > way to parallelize > these jobs across a cluster rather than rely on a single > machine with > multi-cores? Any help would be greatly appreciated. > > Thanks, > > Fong > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > >
Great. Seems to work now. Thanks, Fong On Tue, Jan 15, 2013 at 2:25 PM, Alejandro Reyes <alejandro.reyes@embl.de>wrote: > Dear Fong Chun Chan, > > If you want to source the code, you need to attach the libraries required > directly to your environment. Before calling the TRT functions, just do: > > library(statmod) > > I think that should work, > > Best regards, > Alejandro > > > > Hi Alejandro, >> >> Thanks for the reply. I am trying to use the TRT functions from the >> DEXSeq svn repository in Bioconductor. When I tried to run: >> >> source('/bioconductor.svn/**DEXSeq/R/TRT.R') >> estimateDispersionsTRT( exonCounSet, nCores = 48 ) >> >> I got a series of errors complaining about "glmnb.fit": >> >> In addition: Warning message: >> In FUN(c(10L, 20L, 30L, 40L, 50L, 60L, 70L, 80L, 90L, 100L, 110L, : >> Unable to estimate dispersions for ENSG00000187634:**ENSE00001631320 >> Error : could not find function "glmnb.fit" >> In addition: Warning message: >> In FUN(c(7L, 17L, 27L, 37L, 47L, 57L, 67L, 77L, 87L, 97L, 107L, : >> Unable to estimate dispersions for ENSG00000187634:**ENSE00001884257 >> Error : could not find function "glmnb.fit" >> .... >> >> According to this post, http://seqanswers.com/forums/** >> showthread.php?t=16040<http: seqanswers.com="" forums="" showthread.php?="" t="16040">, >> glmnb.fit is supposed to come with statmod. According to my sessionInfo(), >> I should have the latest versions: >> >> > sessionInfo() >> R version 2.15.2 (2012-10-26) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> [8] base >> >> other attached packages: >> [1] optparse_1.0.0 DEXSeq_1.4.0 Biobase_2.18.0 BiocGenerics_0.4.0 >> >> loaded via a namespace (and not attached): >> [1] biomaRt_2.14.0 getopt_1.19.0 hwriter_1.3 RCurl_1.95-3 >> statmod_1.4.16 >> [6] stringr_0.6.2 tools_2.15.2 XML_3.95-0.1 >> > >> >> Am I doing something wrong here? Thanks for your help, >> >> Fong >> >> >> On Mon, Jan 14, 2013 at 1:44 AM, Alejandro Reyes <alejandro.reyes@embl.de<mailto:>> alejandro.reyes@embl.**de <alejandro.reyes@embl.de>>> wrote: >> >> Sorry, there was an error in part of my code: >> >> The "TRT" would be like this, >> >> data(pasillaExons, package="pasilla") >> >> pasillaExonsTRT <- estimateSizeFactors( pasillaExons ) >> pasillaExonsTRT <- estimateDispersionsTRT( pasillaExonsTRT ) >> pasillaExonsTRT <- fitDispersionFunction( pasillaExonsTRT ) >> pasillaExonsTRT <- testForDEUTRT( pasillaExonsTRT ) >> >> Bests, >> Alejandro >> >> >> Dear Fong Chun Chan, >> >> Thank you for your interest in DEXSeq and sorry in advance for >> the long e-mail. We have also noticed that the computing time >> increases considerably when you have a large number of >> samples, conditions or number of exons of a gene. For users in >> these situations, we have implemented a variant of this >> functions (estimateDispersionsTRT and testForDEUTRT) in the >> most recent versions of DEXSeq in the svn. >> >> The difference relies on how the model matrix is prepared, in >> the "normal" functions, the model matrices used to fit the >> glms are prepared for each exon, such that each exon bin is >> treated individually, independently of which exon you are >> testing. For example, if you have a gene with 5 exons, when >> testing for exon E001, you would consider independently E002, >> E003, ... , E005 in the model. >> >> In the "TRT" implementation the same model matrix is used for >> all the exons. In the same example as before, you would >> consider E001 and the sum of all the rest exons of the same >> gene. This reduces the model and allows to use DEXSeq with a >> large number of samples. For more clarity, you could try to >> compare the normal model frame of a gene with the TRT model frame: >> >> data(pasillaExons, package="pasilla") >> modelFrameForGene(**pasillaExons, "FBgn0000256") >> # vs >> modelFrameForTRT( pasillaExons ) >> >> Using the same example, in the last model frame, "this" would >> be the "E001" and "others" would be the sum of E002 + E003 + >> ... + E005. >> >> This would be the "normal" DEXSeq analysis: >> >> pasillaExons <- estimateSizeFactors( pasillaExons ) >> pasillaExons <- estimateDispersions( pasillaExons ) >> pasillaExons <- fitDispersionFunction( pasillaExons ) >> pasillaExons <- testForDEU( pasillaExons ) >> >> This would be the "TRT", >> >> pasillaExonsTRT <- estimateSizeFactors( pasillaExons ) >> pasillaExonsTRT <- estimateDispersionsTRT( pasillaExons ) >> pasillaExonsTRT <- fitDispersionFunction( pasillaExons ) >> pasillaExonsTRT <- testForDEUTRT( pasillaExons ) >> >> And you can see that you get the same results: >> >> plot(fData(pasillaExons)$**pvalue, >> fData(pasillaExonsTRT)$pvalue, log="xy") >> >> I have the "TRT" tried this for large cohorts with complex >> models and it works nicely and in reasonable computing times. >> >> Best regards, >> Alejandro Reyes >> >> ps. this changes need to be added to the vignette. >> >> >> Hi all, >> >> I've been trying to get DEXSeq to run on a fairly large >> RNA-seq cohort that >> I have. To be specific, I have 89 samples and I am attempt >> to generate DE >> exon usage results on > 500,000 exons. >> >> I've followed the latest tutorial (1.5.6) on Bioconductor >> and it so far >> I've had relatively no problems. It just the two steps >> that are mentioned, >> estimateDispersions and testForDEU, are taking a fairly >> long time. I've >> already attempted to parallelize this on a 48-core 256GB >> machine, but I get >> very little progress on the run-time of these functions. >> >> I was just wondering if anyone has a good way of running >> DEXSeq on such a >> large cohort. Tips on how to reduce run time? Are there >> way to parallelize >> these jobs across a cluster rather than rely on a single >> machine with >> multi-cores? Any help would be greatly appreciated. >> >> Thanks, >> >> Fong >> >> [[alternative HTML version deleted]] >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-**>> project.org <bioconductor@r-project.org>> >> >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<ht tps:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.**science.biology.informatics.** >> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> > >> >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.**science.biology.informatics.** >> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> >> >> >> > [[alternative HTML version deleted]]
ADD REPLYlink written 6.9 years ago by Fong Chun Chan320
Hi Alejandro, I am getting these errors when running the TRT functions: [1] "Using the TRT functions ..." ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ....................................................................Er ror in qr.default(mm * sqrt(w)) : NA/NaN/Inf in foreign function call (arg 1) In addition: There were 50 or more warnings (use warnings() to see the first 50) .Error in qr.default(mm * sqrt(w)) : NA/NaN/Inf in foreign function call (arg 1) In addition: There were 50 or more warnings (use warnings() to see the first 50) ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ................. Perhaps I am not using the svn version fo DEXseq properly. I am actually still calling library(DEXSeq) and then sourcing the following files from the svn repository: library('statmod') source('scripts/TRT.R') source('~/bioconductor.svn/DEXSeq/R/core_functions.R') source('~/bioconductor.svn/DEXSeq/R/coef_handling.R') source('~/bioconductor.svn/DEXSeq/R/visual_funct.R') The reason why I was doing this was because it seemed that the order of which files to source mattered as it was complaining. So I worked out the most parsimonious set of files to source that still work with the current DEXSeq library. Should I just be not bothering to use the current DEXSeq library and depend on sourcing all the svn .R files? Thanks, Fong On Tue, Jan 15, 2013 at 4:05 PM, Fong Chun Chan <fongchun@alumni.ubc.ca>wrote: > Great. Seems to work now. Thanks, > > Fong > > > On Tue, Jan 15, 2013 at 2:25 PM, Alejandro Reyes <alejandro.reyes@embl.de>wrote: > >> Dear Fong Chun Chan, >> >> If you want to source the code, you need to attach the libraries required >> directly to your environment. Before calling the TRT functions, just do: >> >> library(statmod) >> >> I think that should work, >> >> Best regards, >> Alejandro >> >> >> >> Hi Alejandro, >>> >>> Thanks for the reply. I am trying to use the TRT functions from the >>> DEXSeq svn repository in Bioconductor. When I tried to run: >>> >>> source('/bioconductor.svn/**DEXSeq/R/TRT.R') >>> estimateDispersionsTRT( exonCounSet, nCores = 48 ) >>> >>> I got a series of errors complaining about "glmnb.fit": >>> >>> In addition: Warning message: >>> In FUN(c(10L, 20L, 30L, 40L, 50L, 60L, 70L, 80L, 90L, 100L, 110L, : >>> Unable to estimate dispersions for ENSG00000187634:**ENSE00001631320 >>> Error : could not find function "glmnb.fit" >>> In addition: Warning message: >>> In FUN(c(7L, 17L, 27L, 37L, 47L, 57L, 67L, 77L, 87L, 97L, 107L, : >>> Unable to estimate dispersions for ENSG00000187634:**ENSE00001884257 >>> Error : could not find function "glmnb.fit" >>> .... >>> >>> According to this post, http://seqanswers.com/forums/** >>> showthread.php?t=16040<http: seqanswers.com="" forums="" showthread.php="" ?t="16040">, >>> glmnb.fit is supposed to come with statmod. According to my sessionInfo(), >>> I should have the latest versions: >>> >>> > sessionInfo() >>> R version 2.15.2 (2012-10-26) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> [8] base >>> >>> other attached packages: >>> [1] optparse_1.0.0 DEXSeq_1.4.0 Biobase_2.18.0 BiocGenerics_0.4.0 >>> >>> loaded via a namespace (and not attached): >>> [1] biomaRt_2.14.0 getopt_1.19.0 hwriter_1.3 RCurl_1.95-3 >>> statmod_1.4.16 >>> [6] stringr_0.6.2 tools_2.15.2 XML_3.95-0.1 >>> > >>> >>> Am I doing something wrong here? Thanks for your help, >>> >>> Fong >>> >>> >>> On Mon, Jan 14, 2013 at 1:44 AM, Alejandro Reyes < >>> alejandro.reyes@embl.de <mailto:alejandro.reyes@embl.**de<alejandro.reyes@embl.de>>> >>> wrote: >>> >>> Sorry, there was an error in part of my code: >>> >>> The "TRT" would be like this, >>> >>> data(pasillaExons, package="pasilla") >>> >>> pasillaExonsTRT <- estimateSizeFactors( pasillaExons ) >>> pasillaExonsTRT <- estimateDispersionsTRT( pasillaExonsTRT ) >>> pasillaExonsTRT <- fitDispersionFunction( pasillaExonsTRT ) >>> pasillaExonsTRT <- testForDEUTRT( pasillaExonsTRT ) >>> >>> Bests, >>> Alejandro >>> >>> >>> Dear Fong Chun Chan, >>> >>> Thank you for your interest in DEXSeq and sorry in advance for >>> the long e-mail. We have also noticed that the computing time >>> increases considerably when you have a large number of >>> samples, conditions or number of exons of a gene. For users in >>> these situations, we have implemented a variant of this >>> functions (estimateDispersionsTRT and testForDEUTRT) in the >>> most recent versions of DEXSeq in the svn. >>> >>> The difference relies on how the model matrix is prepared, in >>> the "normal" functions, the model matrices used to fit the >>> glms are prepared for each exon, such that each exon bin is >>> treated individually, independently of which exon you are >>> testing. For example, if you have a gene with 5 exons, when >>> testing for exon E001, you would consider independently E002, >>> E003, ... , E005 in the model. >>> >>> In the "TRT" implementation the same model matrix is used for >>> all the exons. In the same example as before, you would >>> consider E001 and the sum of all the rest exons of the same >>> gene. This reduces the model and allows to use DEXSeq with a >>> large number of samples. For more clarity, you could try to >>> compare the normal model frame of a gene with the TRT model >>> frame: >>> >>> data(pasillaExons, package="pasilla") >>> modelFrameForGene(**pasillaExons, "FBgn0000256") >>> # vs >>> modelFrameForTRT( pasillaExons ) >>> >>> Using the same example, in the last model frame, "this" would >>> be the "E001" and "others" would be the sum of E002 + E003 + >>> ... + E005. >>> >>> This would be the "normal" DEXSeq analysis: >>> >>> pasillaExons <- estimateSizeFactors( pasillaExons ) >>> pasillaExons <- estimateDispersions( pasillaExons ) >>> pasillaExons <- fitDispersionFunction( pasillaExons ) >>> pasillaExons <- testForDEU( pasillaExons ) >>> >>> This would be the "TRT", >>> >>> pasillaExonsTRT <- estimateSizeFactors( pasillaExons ) >>> pasillaExonsTRT <- estimateDispersionsTRT( pasillaExons ) >>> pasillaExonsTRT <- fitDispersionFunction( pasillaExons ) >>> pasillaExonsTRT <- testForDEUTRT( pasillaExons ) >>> >>> And you can see that you get the same results: >>> >>> plot(fData(pasillaExons)$**pvalue, >>> fData(pasillaExonsTRT)$pvalue, log="xy") >>> >>> I have the "TRT" tried this for large cohorts with complex >>> models and it works nicely and in reasonable computing times. >>> >>> Best regards, >>> Alejandro Reyes >>> >>> ps. this changes need to be added to the vignette. >>> >>> >>> Hi all, >>> >>> I've been trying to get DEXSeq to run on a fairly large >>> RNA-seq cohort that >>> I have. To be specific, I have 89 samples and I am attempt >>> to generate DE >>> exon usage results on > 500,000 exons. >>> >>> I've followed the latest tutorial (1.5.6) on Bioconductor >>> and it so far >>> I've had relatively no problems. It just the two steps >>> that are mentioned, >>> estimateDispersions and testForDEU, are taking a fairly >>> long time. I've >>> already attempted to parallelize this on a 48-core 256GB >>> machine, but I get >>> very little progress on the run-time of these functions. >>> >>> I was just wondering if anyone has a good way of running >>> DEXSeq on such a >>> large cohort. Tips on how to reduce run time? Are there >>> way to parallelize >>> these jobs across a cluster rather than rely on a single >>> machine with >>> multi-cores? Any help would be greatly appreciated. >>> >>> Thanks, >>> >>> Fong >>> >>> [[alternative HTML version deleted]] >>> >>> ______________________________**_________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org <mailto:bioconductor@r-**>>> project.org <bioconductor@r-project.org>> >>> >>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<h ttps:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: >>> http://news.gmane.org/gmane.**science.biology.informatics.** >>> conductor<http: news.gmane.org="" gmane.science.biology.informatics.="" conductor=""> >>> >>> >>> ______________________________**_________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >>> > >>> >>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https :="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: >>> http://news.gmane.org/gmane.**science.biology.informatics.** >>> conductor<http: news.gmane.org="" gmane.science.biology.informatics.="" conductor=""> >>> >>> >>> >>> >> > [[alternative HTML version deleted]]
ADD REPLYlink written 6.9 years ago by Fong Chun Chan320
Dear Fong Chun Chan, I would definitely recommend to install the newest versions rather than sourcing the files! However I think this is not the cause of the errors. These warnings come from exons where the glm function fails, e.g. when an exon has very high variance and an NAs are generated somewhere inside the function. If I am correct, these errors do not stop the function from estimating the rest of the exon dispersions? If so it should not stop the overall workflow, but you would loose that exon from the analysis. I think my intention was to print warnings instead. I will do it! Best wishes Alejandro > Hi Alejandro, > > I am getting these errors when running the TRT functions: > > [1] "Using the TRT functions ..." > .................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... Error > in qr.default(mm * sqrt(w)) : > NA/NaN/Inf in foreign function call (arg 1) > In addition: There were 50 or more warnings (use warnings() to see the > first 50) > .Error in qr.default(mm * sqrt(w)) : > NA/NaN/Inf in foreign function call (arg 1) > In addition: There were 50 or more warnings (use warnings() to see the > first 50) > .................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ................... > > Perhaps I am not using the svn version fo DEXseq properly. I am > actually still calling library(DEXSeq) and then sourcing the following > files from the svn repository: > > library('statmod') > source('scripts/TRT.R') > source('~/bioconductor.svn/DEXSeq/R/core_functions.R') > source('~/bioconductor.svn/DEXSeq/R/coef_handling.R') > source('~/bioconductor.svn/DEXSeq/R/visual_funct.R') > > The reason why I was doing this was because it seemed that the order > of which files to source mattered as it was complaining. So I worked > out the most parsimonious set of files to source that still work with > the current DEXSeq library. Should I just be not bothering to use the > current DEXSeq library and depend on sourcing all the svn .R files? > > Thanks, > > Fong > > > On Tue, Jan 15, 2013 at 4:05 PM, Fong Chun Chan > <fongchun@alumni.ubc.ca <mailto:fongchun@alumni.ubc.ca="">> wrote: > > Great. Seems to work now. Thanks, > > Fong > > > On Tue, Jan 15, 2013 at 2:25 PM, Alejandro Reyes > <alejandro.reyes@embl.de <mailto:alejandro.reyes@embl.de="">> wrote: > > Dear Fong Chun Chan, > > If you want to source the code, you need to attach the > libraries required directly to your environment. Before > calling the TRT functions, just do: > > library(statmod) > > I think that should work, > > Best regards, > Alejandro > > > > Hi Alejandro, > > Thanks for the reply. I am trying to use the TRT functions > from the DEXSeq svn repository in Bioconductor. When I > tried to run: > > source('/bioconductor.svn/DEXSeq/R/TRT.R') > estimateDispersionsTRT( exonCounSet, nCores = 48 ) > > I got a series of errors complaining about "glmnb.fit": > > In addition: Warning message: > In FUN(c(10L, 20L, 30L, 40L, 50L, 60L, 70L, 80L, 90L, > 100L, 110L, : > Unable to estimate dispersions for > ENSG00000187634:ENSE00001631320 > Error : could not find function "glmnb.fit" > In addition: Warning message: > In FUN(c(7L, 17L, 27L, 37L, 47L, 57L, 67L, 77L, 87L, 97L, > 107L, : > Unable to estimate dispersions for > ENSG00000187634:ENSE00001884257 > Error : could not find function "glmnb.fit" > .... > > According to this post, > http://seqanswers.com/forums/showthread.php?t=16040, > glmnb.fit is supposed to come with statmod. According to > my sessionInfo(), I should have the latest versions: > > > sessionInfo() > R version 2.15.2 (2012-10-26) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets > methods > [8] base > > other attached packages: > [1] optparse_1.0.0 DEXSeq_1.4.0 Biobase_2.18.0 > BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.14.0 getopt_1.19.0 hwriter_1.3 > RCurl_1.95-3 statmod_1.4.16 > [6] stringr_0.6.2 tools_2.15.2 XML_3.95-0.1 > > > > Am I doing something wrong here? Thanks for your help, > > Fong > > > On Mon, Jan 14, 2013 at 1:44 AM, Alejandro Reyes > <alejandro.reyes@embl.de <mailto:alejandro.reyes@embl.de=""> > <mailto:alejandro.reyes@embl.de> <mailto:alejandro.reyes@embl.de>>> wrote: > > Sorry, there was an error in part of my code: > > The "TRT" would be like this, > > data(pasillaExons, package="pasilla") > > pasillaExonsTRT <- estimateSizeFactors( pasillaExons ) > pasillaExonsTRT <- estimateDispersionsTRT( > pasillaExonsTRT ) > pasillaExonsTRT <- fitDispersionFunction( > pasillaExonsTRT ) > pasillaExonsTRT <- testForDEUTRT( pasillaExonsTRT ) > > Bests, > Alejandro > > > Dear Fong Chun Chan, > > Thank you for your interest in DEXSeq and sorry in > advance for > the long e-mail. We have also noticed that the > computing time > increases considerably when you have a large number of > samples, conditions or number of exons of a gene. > For users in > these situations, we have implemented a variant of > this > functions (estimateDispersionsTRT and > testForDEUTRT) in the > most recent versions of DEXSeq in the svn. > > The difference relies on how the model matrix is > prepared, in > the "normal" functions, the model matrices used to > fit the > glms are prepared for each exon, such that each > exon bin is > treated individually, independently of which exon > you are > testing. For example, if you have a gene with 5 > exons, when > testing for exon E001, you would consider > independently E002, > E003, ... , E005 in the model. > > In the "TRT" implementation the same model matrix > is used for > all the exons. In the same example as before, you > would > consider E001 and the sum of all the rest exons of > the same > gene. This reduces the model and allows to use > DEXSeq with a > large number of samples. For more clarity, you > could try to > compare the normal model frame of a gene with the > TRT model frame: > > data(pasillaExons, package="pasilla") > modelFrameForGene(pasillaExons, "FBgn0000256") > # vs > modelFrameForTRT( pasillaExons ) > > Using the same example, in the last model frame, > "this" would > be the "E001" and "others" would be the sum of > E002 + E003 + > ... + E005. > > This would be the "normal" DEXSeq analysis: > > pasillaExons <- estimateSizeFactors( pasillaExons ) > pasillaExons <- estimateDispersions( pasillaExons ) > pasillaExons <- fitDispersionFunction( pasillaExons ) > pasillaExons <- testForDEU( pasillaExons ) > > This would be the "TRT", > > pasillaExonsTRT <- estimateSizeFactors( pasillaExons ) > pasillaExonsTRT <- estimateDispersionsTRT( > pasillaExons ) > pasillaExonsTRT <- fitDispersionFunction( > pasillaExons ) > pasillaExonsTRT <- testForDEUTRT( pasillaExons ) > > And you can see that you get the same results: > > plot(fData(pasillaExons)$pvalue, > fData(pasillaExonsTRT)$pvalue, log="xy") > > I have the "TRT" tried this for large cohorts with > complex > models and it works nicely and in reasonable > computing times. > > Best regards, > Alejandro Reyes > > ps. this changes need to be added to the vignette. > > > Hi all, > > I've been trying to get DEXSeq to run on a > fairly large > RNA-seq cohort that > I have. To be specific, I have 89 samples and > I am attempt > to generate DE > exon usage results on > 500,000 exons. > > I've followed the latest tutorial (1.5.6) on > Bioconductor > and it so far > I've had relatively no problems. It just the > two steps > that are mentioned, > estimateDispersions and testForDEU, are taking > a fairly > long time. I've > already attempted to parallelize this on a > 48-core 256GB > machine, but I get > very little progress on the run-time of these > functions. > > I was just wondering if anyone has a good way > of running > DEXSeq on such a > large cohort. Tips on how to reduce run time? > Are there > way to parallelize > these jobs across a cluster rather than rely > on a single > machine with > multi-cores? Any help would be greatly > appreciated. > > Thanks, > > Fong > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > <mailto:bioconductor@r-project.org> > <mailto:bioconductor@r-project.org> <mailto:bioconductor@r-project.org>> > > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > <mailto:bioconductor@r-project.org> > <mailto:bioconductor@r-project.org> <mailto:bioconductor@r-project.org>> > > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > [[alternative HTML version deleted]]
Hi Alejandro, Thanks for the reply. You are right in that the warning messages do not stop the execution of the program. What kind of speed-up should I be expecting by using the variant functions? Is the progress report of one '.' representing 100 processed genes still accurate in this case? Just observing the run-time, it is still taking an incredibly long time. Thanks, Fong On Thu, Jan 17, 2013 at 5:11 AM, Alejandro Reyes <alejandro.reyes@embl.de>wrote: > Dear Fong Chun Chan, > > I would definitely recommend to install the newest versions rather than > sourcing the files! > However I think this is not the cause of the errors. These warnings come > from exons where the glm function fails, e.g. when an exon has very high > variance and an NAs are generated somewhere inside the function. If I am > correct, these errors do not stop the function from estimating the rest of > the exon dispersions? If so it should not stop the overall workflow, but > you would loose that exon from the analysis. > > I think my intention was to print warnings instead. I will do it! > > Best wishes > Alejandro > > > Hi Alejandro, > > I am getting these errors when running the TRT functions: > > [1] "Using the TRT functions ..." > .................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... Error > in qr.default(mm * sqrt(w)) : > NA/NaN/Inf in foreign function call (arg 1) > In addition: There were 50 or more warnings (use warnings() to see the > first 50) > .Error in qr.default(mm * sqrt(w)) : > NA/NaN/Inf in foreign function call (arg 1) > In addition: There were 50 or more warnings (use warnings() to see the > first 50) > > .................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ................... > > Perhaps I am not using the svn version fo DEXseq properly. I am actually > still calling library(DEXSeq) and then sourcing the following files from > the svn repository: > > library('statmod') > source('scripts/TRT.R') > source('~/bioconductor.svn/DEXSeq/R/core_functions.R') > source('~/bioconductor.svn/DEXSeq/R/coef_handling.R') > source('~/bioconductor.svn/DEXSeq/R/visual_funct.R') > > The reason why I was doing this was because it seemed that the order of > which files to source mattered as it was complaining. So I worked out the > most parsimonious set of files to source that still work with the current > DEXSeq library. Should I just be not bothering to use the current DEXSeq > library and depend on sourcing all the svn .R files? > > Thanks, > > Fong > > > On Tue, Jan 15, 2013 at 4:05 PM, Fong Chun Chan <fongchun@alumni.ubc.ca>wrote: > >> Great. Seems to work now. Thanks, >> >> Fong >> >> >> On Tue, Jan 15, 2013 at 2:25 PM, Alejandro Reyes <alejandro.reyes@embl.de>> > wrote: >> >>> Dear Fong Chun Chan, >>> >>> If you want to source the code, you need to attach the libraries >>> required directly to your environment. Before calling the TRT functions, >>> just do: >>> >>> library(statmod) >>> >>> I think that should work, >>> >>> Best regards, >>> Alejandro >>> >>> >>> >>> Hi Alejandro, >>>> >>>> Thanks for the reply. I am trying to use the TRT functions from the >>>> DEXSeq svn repository in Bioconductor. When I tried to run: >>>> >>>> source('/bioconductor.svn/DEXSeq/R/TRT.R') >>>> estimateDispersionsTRT( exonCounSet, nCores = 48 ) >>>> >>>> I got a series of errors complaining about "glmnb.fit": >>>> >>>> In addition: Warning message: >>>> In FUN(c(10L, 20L, 30L, 40L, 50L, 60L, 70L, 80L, 90L, 100L, 110L, : >>>> Unable to estimate dispersions for ENSG00000187634:ENSE00001631320 >>>> Error : could not find function "glmnb.fit" >>>> In addition: Warning message: >>>> In FUN(c(7L, 17L, 27L, 37L, 47L, 57L, 67L, 77L, 87L, 97L, 107L, : >>>> Unable to estimate dispersions for ENSG00000187634:ENSE00001884257 >>>> Error : could not find function "glmnb.fit" >>>> .... >>>> >>>> According to this post, >>>> http://seqanswers.com/forums/showthread.php?t=16040, glmnb.fit is >>>> supposed to come with statmod. According to my sessionInfo(), I should have >>>> the latest versions: >>>> >>>> > sessionInfo() >>>> R version 2.15.2 (2012-10-26) >>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>> [7] LC_PAPER=C LC_NAME=C >>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] parallel stats graphics grDevices utils datasets methods >>>> [8] base >>>> >>>> other attached packages: >>>> [1] optparse_1.0.0 DEXSeq_1.4.0 Biobase_2.18.0 >>>> BiocGenerics_0.4.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] biomaRt_2.14.0 getopt_1.19.0 hwriter_1.3 RCurl_1.95-3 >>>> statmod_1.4.16 >>>> [6] stringr_0.6.2 tools_2.15.2 XML_3.95-0.1 >>>> > >>>> >>>> Am I doing something wrong here? Thanks for your help, >>>> >>>> Fong >>>> >>>> >>>> On Mon, Jan 14, 2013 at 1:44 AM, Alejandro Reyes < >>>> alejandro.reyes@embl.de <mailto:alejandro.reyes@embl.de>> wrote: >>>> >>>> Sorry, there was an error in part of my code: >>>> >>>> The "TRT" would be like this, >>>> >>>> data(pasillaExons, package="pasilla") >>>> >>>> pasillaExonsTRT <- estimateSizeFactors( pasillaExons ) >>>> pasillaExonsTRT <- estimateDispersionsTRT( pasillaExonsTRT ) >>>> pasillaExonsTRT <- fitDispersionFunction( pasillaExonsTRT ) >>>> pasillaExonsTRT <- testForDEUTRT( pasillaExonsTRT ) >>>> >>>> Bests, >>>> Alejandro >>>> >>>> >>>> Dear Fong Chun Chan, >>>> >>>> Thank you for your interest in DEXSeq and sorry in advance for >>>> the long e-mail. We have also noticed that the computing time >>>> increases considerably when you have a large number of >>>> samples, conditions or number of exons of a gene. For users in >>>> these situations, we have implemented a variant of this >>>> functions (estimateDispersionsTRT and testForDEUTRT) in the >>>> most recent versions of DEXSeq in the svn. >>>> >>>> The difference relies on how the model matrix is prepared, in >>>> the "normal" functions, the model matrices used to fit the >>>> glms are prepared for each exon, such that each exon bin is >>>> treated individually, independently of which exon you are >>>> testing. For example, if you have a gene with 5 exons, when >>>> testing for exon E001, you would consider independently E002, >>>> E003, ... , E005 in the model. >>>> >>>> In the "TRT" implementation the same model matrix is used for >>>> all the exons. In the same example as before, you would >>>> consider E001 and the sum of all the rest exons of the same >>>> gene. This reduces the model and allows to use DEXSeq with a >>>> large number of samples. For more clarity, you could try to >>>> compare the normal model frame of a gene with the TRT model >>>> frame: >>>> >>>> data(pasillaExons, package="pasilla") >>>> modelFrameForGene(pasillaExons, "FBgn0000256") >>>> # vs >>>> modelFrameForTRT( pasillaExons ) >>>> >>>> Using the same example, in the last model frame, "this" would >>>> be the "E001" and "others" would be the sum of E002 + E003 + >>>> ... + E005. >>>> >>>> This would be the "normal" DEXSeq analysis: >>>> >>>> pasillaExons <- estimateSizeFactors( pasillaExons ) >>>> pasillaExons <- estimateDispersions( pasillaExons ) >>>> pasillaExons <- fitDispersionFunction( pasillaExons ) >>>> pasillaExons <- testForDEU( pasillaExons ) >>>> >>>> This would be the "TRT", >>>> >>>> pasillaExonsTRT <- estimateSizeFactors( pasillaExons ) >>>> pasillaExonsTRT <- estimateDispersionsTRT( pasillaExons ) >>>> pasillaExonsTRT <- fitDispersionFunction( pasillaExons ) >>>> pasillaExonsTRT <- testForDEUTRT( pasillaExons ) >>>> >>>> And you can see that you get the same results: >>>> >>>> plot(fData(pasillaExons)$pvalue, >>>> fData(pasillaExonsTRT)$pvalue, log="xy") >>>> >>>> I have the "TRT" tried this for large cohorts with complex >>>> models and it works nicely and in reasonable computing times. >>>> >>>> Best regards, >>>> Alejandro Reyes >>>> >>>> ps. this changes need to be added to the vignette. >>>> >>>> >>>> Hi all, >>>> >>>> I've been trying to get DEXSeq to run on a fairly large >>>> RNA-seq cohort that >>>> I have. To be specific, I have 89 samples and I am attempt >>>> to generate DE >>>> exon usage results on > 500,000 exons. >>>> >>>> I've followed the latest tutorial (1.5.6) on Bioconductor >>>> and it so far >>>> I've had relatively no problems. It just the two steps >>>> that are mentioned, >>>> estimateDispersions and testForDEU, are taking a fairly >>>> long time. I've >>>> already attempted to parallelize this on a 48-core 256GB >>>> machine, but I get >>>> very little progress on the run-time of these functions. >>>> >>>> I was just wondering if anyone has a good way of running >>>> DEXSeq on such a >>>> large cohort. Tips on how to reduce run time? Are there >>>> way to parallelize >>>> these jobs across a cluster rather than rely on a single >>>> machine with >>>> multi-cores? Any help would be greatly appreciated. >>>> >>>> Thanks, >>>> >>>> Fong >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor@r-project.org <mailto:>>>> Bioconductor@r-project.org> >>>> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor@r-project.org <mailto:bioconductor@r-project.org> >>>> >>>> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>>> >>>> >>> >> > > [[alternative HTML version deleted]]
Dear Fong, I contacted another user with a large dataset, similar to yours, this was the summary of her complete analysis with DEXSeq: Total number of samples analized: 58 pairs CPU time :2624604.00 sec. Max Memory : 51078 MB Max Swap : 55226 MB Max Processes : 23 Max Threads : 24 So using a 24 cores machine with 58 * 2 samples, her analysis was done in ~1.3 days. One "." should correspond to 100 processed genes. Best wishes, Alejandro > Hi Alejandro, > > Thanks for the reply. You are right in that the warning messages do > not stop the execution of the program. > > What kind of speed-up should I be expecting by using the variant > functions? Is the progress report of one '.' representing 100 > processed genes still accurate in this case? Just observing the > run-time, it is still taking an incredibly long time. > > Thanks, > > Fong > > > On Thu, Jan 17, 2013 at 5:11 AM, Alejandro Reyes > <alejandro.reyes@embl.de <mailto:alejandro.reyes@embl.de="">> wrote: > > Dear Fong Chun Chan, > > I would definitely recommend to install the newest versions rather > than sourcing the files! > However I think this is not the cause of the errors. These > warnings come from exons where the glm function fails, e.g. when > an exon has very high variance and an NAs are generated somewhere > inside the function. If I am correct, these errors do not stop the > function from estimating the rest of the exon dispersions? If so > it should not stop the overall workflow, but you would loose that > exon from the analysis. > > I think my intention was to print warnings instead. I will do it! > > Best wishes > Alejandro > > >> Hi Alejandro, >> >> I am getting these errors when running the TRT functions: >> >> [1] "Using the TRT functions ..." >> ............................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... .....Error >> in qr.default(mm * sqrt(w)) : >> NA/NaN/Inf in foreign function call (arg 1) >> In addition: There were 50 or more warnings (use warnings() to >> see the first 50) >> .Error in qr.default(mm * sqrt(w)) : >> NA/NaN/Inf in foreign function call (arg 1) >> In addition: There were 50 or more warnings (use warnings() to >> see the first 50) >> ............................................................... ...................................................................... ...................................................................... ...................................................................... ...................................................................... ........................ >> >> Perhaps I am not using the svn version fo DEXseq properly. I am >> actually still calling library(DEXSeq) and then sourcing the >> following files from the svn repository: >> >> library('statmod') >> source('scripts/TRT.R') >> source('~/bioconductor.svn/DEXSeq/R/core_functions.R') >> source('~/bioconductor.svn/DEXSeq/R/coef_handling.R') >> source('~/bioconductor.svn/DEXSeq/R/visual_funct.R') >> >> The reason why I was doing this was because it seemed that the >> order of which files to source mattered as it was complaining. So >> I worked out the most parsimonious set of files to source that >> still work with the current DEXSeq library. Should I just be not >> bothering to use the current DEXSeq library and depend on >> sourcing all the svn .R files? >> >> Thanks, >> >> Fong >> >> >> On Tue, Jan 15, 2013 at 4:05 PM, Fong Chun Chan >> <fongchun@alumni.ubc.ca <mailto:fongchun@alumni.ubc.ca="">> wrote: >> >> Great. Seems to work now. Thanks, >> >> Fong >> >> >> On Tue, Jan 15, 2013 at 2:25 PM, Alejandro Reyes >> <alejandro.reyes@embl.de <mailto:alejandro.reyes@embl.de="">> wrote: >> >> Dear Fong Chun Chan, >> >> If you want to source the code, you need to attach the >> libraries required directly to your environment. Before >> calling the TRT functions, just do: >> >> library(statmod) >> >> I think that should work, >> >> Best regards, >> Alejandro >> >> >> >> Hi Alejandro, >> >> Thanks for the reply. I am trying to use the TRT >> functions from the DEXSeq svn repository in >> Bioconductor. When I tried to run: >> >> source('/bioconductor.svn/DEXSeq/R/TRT.R') >> estimateDispersionsTRT( exonCounSet, nCores = 48 ) >> >> I got a series of errors complaining about "glmnb.fit": >> >> In addition: Warning message: >> In FUN(c(10L, 20L, 30L, 40L, 50L, 60L, 70L, 80L, 90L, >> 100L, 110L, : >> Unable to estimate dispersions for >> ENSG00000187634:ENSE00001631320 >> Error : could not find function "glmnb.fit" >> In addition: Warning message: >> In FUN(c(7L, 17L, 27L, 37L, 47L, 57L, 67L, 77L, 87L, >> 97L, 107L, : >> Unable to estimate dispersions for >> ENSG00000187634:ENSE00001884257 >> Error : could not find function "glmnb.fit" >> .... >> >> According to this post, >> http://seqanswers.com/forums/showthread.php?t=16040, >> glmnb.fit is supposed to come with statmod. According >> to my sessionInfo(), I should have the latest versions: >> >> > sessionInfo() >> R version 2.15.2 (2012-10-26) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils >> datasets methods >> [8] base >> >> other attached packages: >> [1] optparse_1.0.0 DEXSeq_1.4.0 Biobase_2.18.0 >> BiocGenerics_0.4.0 >> >> loaded via a namespace (and not attached): >> [1] biomaRt_2.14.0 getopt_1.19.0 hwriter_1.3 >> RCurl_1.95-3 statmod_1.4.16 >> [6] stringr_0.6.2 tools_2.15.2 XML_3.95-0.1 >> > >> >> Am I doing something wrong here? Thanks for your help, >> >> Fong >> >> >> On Mon, Jan 14, 2013 at 1:44 AM, Alejandro Reyes >> <alejandro.reyes@embl.de>> <mailto:alejandro.reyes@embl.de> >> <mailto:alejandro.reyes@embl.de>> <mailto:alejandro.reyes@embl.de>>> wrote: >> >> Sorry, there was an error in part of my code: >> >> The "TRT" would be like this, >> >> data(pasillaExons, package="pasilla") >> >> pasillaExonsTRT <- estimateSizeFactors( >> pasillaExons ) >> pasillaExonsTRT <- estimateDispersionsTRT( >> pasillaExonsTRT ) >> pasillaExonsTRT <- fitDispersionFunction( >> pasillaExonsTRT ) >> pasillaExonsTRT <- testForDEUTRT( pasillaExonsTRT ) >> >> Bests, >> Alejandro >> >> >> Dear Fong Chun Chan, >> >> Thank you for your interest in DEXSeq and >> sorry in advance for >> the long e-mail. We have also noticed that >> the computing time >> increases considerably when you have a large >> number of >> samples, conditions or number of exons of a >> gene. For users in >> these situations, we have implemented a >> variant of this >> functions (estimateDispersionsTRT and >> testForDEUTRT) in the >> most recent versions of DEXSeq in the svn. >> >> The difference relies on how the model matrix >> is prepared, in >> the "normal" functions, the model matrices >> used to fit the >> glms are prepared for each exon, such that >> each exon bin is >> treated individually, independently of which >> exon you are >> testing. For example, if you have a gene with >> 5 exons, when >> testing for exon E001, you would consider >> independently E002, >> E003, ... , E005 in the model. >> >> In the "TRT" implementation the same model >> matrix is used for >> all the exons. In the same example as before, >> you would >> consider E001 and the sum of all the rest >> exons of the same >> gene. This reduces the model and allows to >> use DEXSeq with a >> large number of samples. For more clarity, >> you could try to >> compare the normal model frame of a gene with >> the TRT model frame: >> >> data(pasillaExons, package="pasilla") >> modelFrameForGene(pasillaExons, "FBgn0000256") >> # vs >> modelFrameForTRT( pasillaExons ) >> >> Using the same example, in the last model >> frame, "this" would >> be the "E001" and "others" would be the sum >> of E002 + E003 + >> ... + E005. >> >> This would be the "normal" DEXSeq analysis: >> >> pasillaExons <- estimateSizeFactors( >> pasillaExons ) >> pasillaExons <- estimateDispersions( >> pasillaExons ) >> pasillaExons <- fitDispersionFunction( >> pasillaExons ) >> pasillaExons <- testForDEU( pasillaExons ) >> >> This would be the "TRT", >> >> pasillaExonsTRT <- estimateSizeFactors( >> pasillaExons ) >> pasillaExonsTRT <- estimateDispersionsTRT( >> pasillaExons ) >> pasillaExonsTRT <- fitDispersionFunction( >> pasillaExons ) >> pasillaExonsTRT <- testForDEUTRT( pasillaExons ) >> >> And you can see that you get the same results: >> >> plot(fData(pasillaExons)$pvalue, >> fData(pasillaExonsTRT)$pvalue, log="xy") >> >> I have the "TRT" tried this for large cohorts >> with complex >> models and it works nicely and in reasonable >> computing times. >> >> Best regards, >> Alejandro Reyes >> >> ps. this changes need to be added to the >> vignette. >> >> >> Hi all, >> >> I've been trying to get DEXSeq to run on >> a fairly large >> RNA-seq cohort that >> I have. To be specific, I have 89 samples >> and I am attempt >> to generate DE >> exon usage results on > 500,000 exons. >> >> I've followed the latest tutorial (1.5.6) >> on Bioconductor >> and it so far >> I've had relatively no problems. It just >> the two steps >> that are mentioned, >> estimateDispersions and testForDEU, are taking a fairly >> long time. I've >> already attempted to parallelize this on >> a 48-core 256GB >> machine, but I get >> very little progress on the run-time of >> these functions. >> >> I was just wondering if anyone has a good >> way of running >> DEXSeq on such a >> large cohort. Tips on how to reduce run >> time? Are there >> way to parallelize >> these jobs across a cluster rather than >> rely on a single >> machine with >> multi-cores? Any help would be greatly >> appreciated. >> >> Thanks, >> >> Fong >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> <mailto:bioconductor@r-project.org> >> <mailto:bioconductor@r-project.org>> <mailto:bioconductor@r-project.org>> >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> <mailto:bioconductor@r-project.org> >> <mailto:bioconductor@r-project.org>> <mailto:bioconductor@r-project.org>> >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> >> >> >> > > [[alternative HTML version deleted]]