Suitable learning sets, gene selection methods and classification methods for low replicated microarray samples
Guest User ★ 13k
@guest-user-4897
Dear grateful R helpers,

I'm a biologist who is learning gene expression profiling, and I have to deal with a low number of replicates (2-3 biological replicates per group). Because I lack a background in bioinformatics, I find CMA a very user-friendly package for supervised classification.

However, I really have no clue which choices are suitable for classifying samples with so few replicates. These are the choices I need to make:

1. Select a method to generate learning datasets
2. Select the gene selection methods
3. Select the classification methods
4. Acquire the generated learning datasets so that they can be used with other gene selection methods not available in the CMA package (for example, Rank Product and LPE)

Any suggestions would be more than appreciated.

With Respects,
Kaj Chokeshaiusaha

-- output of sessionInfo():

R version 3.1.0 (2014-04-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] BiocInstaller_1.14.2 CMA_1.22.0 Biobase_2.24.0
[4] BiocGenerics_0.10.0 e1071_1.6-3

loaded via a namespace (and not attached):
[1] class_7.3-10 tools_3.1.0

-- Sent via the guest posting facility at bioconductor.org.
Classification CMA • 1.5k views
@sean-davis-490
United States
Hi, Kaj.

On Thu, Jul 24, 2014 at 8:07 AM, Kaj Chokeshaiusaha [guest] <guest@bioconductor.org> wrote:

> I'm a biologist who is learning gene expression profiling, and I have to
> deal with a low number of replicates (2-3 biological replicates per
> group). Because I lack a background in bioinformatics, I find CMA a
> very user-friendly package for supervised classification.

I suspect that you will find that this number of samples is difficult to use for machine learning, but the results will really depend on the strength and stability of the "biological signal".

> However, I really have no clue which choices are suitable for classifying
> samples with so few replicates. These are the choices I need to make:
>
> 1. Select a method to generate learning datasets
> 2. Select the gene selection methods
> 3. Select the classification methods

Depending on your machine learning method, these three steps may be included *together* in the training process. In general, though, using a subset of the data for training and the rest for testing is a common approach. The caret package provides many approaches for doing just such analyses, so you might look at that package as well.

> 4. Acquire the generated learning datasets so that they can be used with
> other gene selection methods not available in the CMA package (for
> example, Rank Product and LPE)

Not sure what you mean here, but in general, applying your machine learning algorithm to *new* data will require you to use the same features as the training data; in other words, feature selection is not repeated on the new data.

Hope that helps,
Sean
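To make the point about keeping gene selection inside the training step concrete, here is a minimal sketch in plain Python. It is not the CMA or caret API: the function names (`select_features`, `fit_centroids`, `predict`, `evaluate_split`), the difference-of-class-means score, the nearest-centroid classifier, and the toy expression values are all assumptions chosen for illustration. The only thing it is meant to show is that genes are ranked on the training samples alone, and the held-out samples are then classified with exactly those genes.

```python
import statistics


def select_features(X_train, y_train, k):
    """Rank genes by |mean(class 1) - mean(class 2)| using training samples only."""
    classes = sorted(set(y_train))
    assert len(classes) == 2, "this sketch handles two groups only"
    scores = []
    for g in range(len(X_train[0])):
        a = [row[g] for row, lab in zip(X_train, y_train) if lab == classes[0]]
        b = [row[g] for row, lab in zip(X_train, y_train) if lab == classes[1]]
        scores.append((abs(statistics.mean(a) - statistics.mean(b)), g))
    # Highest score first; ties broken by gene index for reproducibility.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return sorted(g for _, g in scores[:k])


def fit_centroids(X_train, y_train, genes):
    """Per-class mean expression over the selected genes."""
    centroids = {}
    for lab in sorted(set(y_train)):
        rows = [row for row, l in zip(X_train, y_train) if l == lab]
        centroids[lab] = [statistics.mean(r[g] for r in rows) for g in genes]
    return centroids


def predict(centroids, genes, sample):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist(lab):
        return sum((sample[g] - c) ** 2 for g, c in zip(genes, centroids[lab]))
    return min(sorted(centroids), key=dist)


def evaluate_split(X, y, train_idx, test_idx, k):
    """Select genes and fit on the training part, then classify the test part."""
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    genes = select_features(X_train, y_train, k)  # test samples are never seen here
    centroids = fit_centroids(X_train, y_train, genes)
    predictions = [predict(centroids, genes, X[i]) for i in test_idx]
    return genes, predictions


# Toy data (made up): 6 samples x 4 genes, 3 replicates per group.
X = [
    [5.0, 1.0, 2.0, 3.0],
    [5.2, 1.1, 2.1, 2.9],
    [4.9, 0.9, 1.9, 3.1],
    [1.0, 1.0, 2.0, 3.0],
    [1.1, 1.2, 2.2, 3.0],
    [0.9, 0.8, 1.8, 3.1],
]
y = ["A", "A", "A", "B", "B", "B"]

genes, predictions = evaluate_split(X, y, train_idx=[0, 1, 3, 4], test_idx=[2, 5], k=1)
print(genes, predictions)
```

With only two training samples per group, the ranking rests on very few values, which is one way to see why such small designs are hard to use for classification.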
Dear Prof. Davis,

Thank you very much for your suggestion. I will definitely look into the package you suggested. At this point, I may decide to use Monte Carlo cross-validation to generate my learning sets. Any further suggestions would be more than appreciated.

With Respects,
Kaj

2014-07-24 20:45 GMT+07:00 Sean Davis <sdavis2@mail.nih.gov>:

> In general, though, using a subset of the data for training and the rest
> for testing is a common approach. The caret package provides many
> approaches for doing just such analyses, so you might look at that
> package as well.
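For readers wondering what Monte Carlo cross-validation looks like with this few replicates, the following is a rough sketch in plain Python, not CMA's own learning-set generator. The function names (`mccv_learning_sets`, `n_distinct_splits`), the stratified "hold out a fixed number of samples per group" scheme, and the seed are assumptions made for illustration. It draws repeated random train/test splits and also counts how many distinct stratified splits exist at all.

```python
import math
import random


def mccv_learning_sets(labels, n_test_per_class, n_iter, seed=0):
    """Draw n_iter random splits, holding out n_test_per_class samples from each group."""
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    splits = []
    for _ in range(n_iter):
        test = []
        for lab in sorted(by_class):
            test.extend(rng.sample(by_class[lab], n_test_per_class))
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        splits.append((train, sorted(test)))
    return splits


def n_distinct_splits(labels, n_test_per_class):
    """Number of different stratified splits that can ever be drawn."""
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return math.prod(math.comb(c, n_test_per_class) for c in counts.values())


# Two groups with 3 biological replicates each, as in the question.
labels = ["A", "A", "A", "B", "B", "B"]
splits = mccv_learning_sets(labels, n_test_per_class=1, n_iter=50)
unique = {(tuple(train), tuple(test)) for train, test in splits}

print("iterations drawn:", len(splits))
print("distinct splits possible:", n_distinct_splits(labels, 1))
print("distinct splits actually drawn:", len(unique))
```

With 3 replicates in each of two groups and one sample per group held out, only 3 x 3 = 9 distinct splits exist, so drawing more iterations just repeats the same learning sets, and each one trains on two samples per group.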
Dear Prof. Davis,

I've read further, following your suggestion, and found that what I was trying to do is not valid at all. Thank you very much for your suggestions.

With Respects,
Kaj

2014-07-25 19:22 GMT+07:00 Kaj Chokeshaiusaha <kaj.chk@gmail.com>:

> At this point, I may decide to use Monte Carlo cross-validation to
> generate my learning sets. Any further suggestions would be more than
> appreciated.