Question: Suitable learning sets, gene selection methods and classification methods for low replicated microarray samples
4.8 years ago, Guest User (12k) wrote:
Dear grateful R helpers,

I'm a biologist learning gene expression profiling, and I have to deal with a low number of replicates (2-3 biological replicates per group). Because I lack a background in bioinformatics, I find CMA a very user-friendly package for supervised classification tasks.

However, I really have no idea which options are suitable for classifying my low-replicate samples. The choices I need to make are:

1. Select a method to generate learning datasets
2. Select the gene selection methods
3. Select the classification methods
4. Extract the generated learning datasets so they can be used with gene selection methods not available in the CMA package (for example, Rank Product and LPE)

Any suggestions would be more than appreciated.

With Respects,
Kaj Chokeshaiusaha

-- output of sessionInfo():

R version 3.1.0 (2014-04-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] BiocInstaller_1.14.2 CMA_1.22.0           Biobase_2.24.0
[4] BiocGenerics_0.10.0  e1071_1.6-3

loaded via a namespace (and not attached):
[1] class_7.3-10 tools_3.1.0

--
Sent via the guest posting facility at bioconductor.org.
classification cma • 626 views
Modified 4.8 years ago by Sean Davis (21k) • written 4.8 years ago by Guest User (12k)
Answer: Suitable learning sets, gene selection methods and classification methods for lo
4.8 years ago, Sean Davis (21k, United States) wrote:
Hi, Kaj.

On Thu, Jul 24, 2014 at 8:07 AM, Kaj Chokeshaiusaha [guest] <guest@bioconductor.org> wrote:

> I'm a biologist who is learning gene expression profile study, and have to
> deal with low replicated sample number (2-3 biological replicates per
> group). Due to my lack of background in bioinformatics, I find CMA as a
> very user-friendly package for supervised classification task.

I suspect that you will find that this number of samples is difficult to use for machine learning, but the results will really depend on the strength and stability of the "biological signal".

> 1. Select method to generate learning datasets
> 2. Select the gene selection methods
> 3. Select classification methods

Depending on your machine learning method, these three steps may be included *together* in the training process. In general, though, using a subset of the data for training and the rest for testing is a common approach. The caret package provides many approaches for doing just such analyses, so you might look at that package as well.

> 4. Acquire generated learning datasets to be applied with other gene
> selection methods not available in CMA package (for example, Rank
> production and LPE)

Not sure what you mean here, but in general, applying your machine learning algorithm to *new* data will require you to use the same features as the training data; in other words, feature selection is not rerun on the new data.

Hope that helps,
Sean
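As an illustration of the last point, here is a minimal, language-agnostic sketch written in plain Python rather than with CMA or caret. Everything in it is an assumption made for illustration: the toy expression values, the helper names (`select_genes`, `fit_centroids`, `predict`), the mean-difference ranking, and the nearest-centroid classifier. It only shows that gene selection sees the training samples alone, and that the held-out samples are then classified with the same selected genes.

```python
from statistics import mean

def select_genes(X, y, k):
    """Rank genes by absolute difference of the two class means,
    using ONLY the samples passed in (i.e. the training set)."""
    classes = sorted(set(y))
    assert len(classes) == 2, "this sketch handles two classes only"
    scores = []
    for g in range(len(X[0])):
        a = mean(row[g] for row, lab in zip(X, y) if lab == classes[0])
        b = mean(row[g] for row, lab in zip(X, y) if lab == classes[1])
        scores.append((abs(a - b), g))
    # Largest difference first; ties broken by lower gene index
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [g for _, g in scores[:k]]

def fit_centroids(X, y, genes):
    """Per-class mean expression, restricted to the selected genes."""
    centroids = {}
    for lab in sorted(set(y)):
        rows = [row for row, l in zip(X, y) if l == lab]
        centroids[lab] = [mean(r[g] for r in rows) for g in genes]
    return centroids

def predict(x, centroids, genes):
    """Assign the class whose centroid is closest (squared Euclidean)."""
    def dist(c):
        return sum((x[g] - ci) ** 2 for g, ci in zip(genes, c))
    return min(centroids, key=lambda lab: dist(centroids[lab]))

# Hypothetical toy data: rows are samples, columns are 4 genes
train_X = [
    [1.0, 5.0, 2.0, 7.1],
    [1.2, 5.1, 2.1, 6.9],
    [4.0, 5.0, 6.0, 7.0],
    [4.1, 4.9, 6.2, 7.2],
]
train_y = ["A", "A", "B", "B"]
test_X = [[1.1, 5.0, 2.2, 7.0], [3.9, 5.2, 5.9, 7.1]]

genes = select_genes(train_X, train_y, k=2)         # chosen on training data only
centroids = fit_centroids(train_X, train_y, genes)
predictions = [predict(x, centroids, genes) for x in test_X]  # same genes reused
print(genes, predictions)
```

Running the selection on all samples before splitting would let the test samples influence which genes are chosen, which tends to make the estimated accuracy look better than it is.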
Written 4.8 years ago by Sean Davis (21k)
Dear Prof. Davis,

Thank you very much for your suggestion. I will definitely look into the package you suggest. At this point, I may decide to use Monte Carlo cross-validation to generate my learning sets. Any further suggestions would be more than appreciated.

With Respects,
Kaj
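For reference, Monte Carlo cross-validation draws repeated random train/test splits. The sketch below is a plain-Python illustration, not CMA's implementation; the function name `mccv_splits`, the per-class (stratified) sampling, and the design of 3 replicates per group with 2 used for training are all assumptions. It shows how few distinct splits exist at this level of replication, which is one way to see why so few samples are hard to use for machine learning.

```python
import random
from math import comb

def mccv_splits(labels, n_train_per_class, n_iter, seed=0):
    """Draw n_iter random splits, sampling training indices within each class."""
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    splits = []
    for _ in range(n_iter):
        train = []
        for idx in by_class.values():
            train.extend(rng.sample(idx, n_train_per_class))
        train = sorted(train)
        test = [i for i in range(len(labels)) if i not in train]
        splits.append((train, test))
    return splits

# Assumed design: two groups, three biological replicates each
labels = ["A"] * 3 + ["B"] * 3
splits = mccv_splits(labels, n_train_per_class=2, n_iter=100)

# However many iterations are requested, only C(3,2) * C(3,2) training sets exist
distinct = {tuple(train) for train, _ in splits}
max_distinct = comb(3, 2) ** 2
print(len(distinct), "distinct training sets out of at most", max_distinct)
```

Each split here tests on a single sample per group, so every accuracy estimate rests on two predictions, and the 100 iterations keep revisiting the same handful of splits.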
Reply written 4.8 years ago by Kaj Chokeshaiusaha (70)
Dear Prof. Davis,

I've read more, following your suggestion, and found that what I was trying to do is not valid at all. Thank you very much for your suggestions.

With Respects,
Kaj
Reply written 4.8 years ago by Kaj Chokeshaiusaha (70)