Suitable learning sets, gene selection methods and classification methods for low replicated microarray samples

Guest User ★ 13k

@guest-user-4897

Last seen 11.4 years ago

Dear grateful R helpers,

I'm a biologist learning gene expression profiling, and I have to deal with a low number of replicates (2-3 biological replicates per group). Because I lack a background in bioinformatics, I find CMA a very user-friendly package for supervised classification. However, I really have no clue which choices are suitable for classifying samples with so few replicates. I need to:

1. Select a method to generate learning datasets
2. Select the gene selection methods
3. Select classification methods
4. Extract the generated learning datasets so that they can be used with gene selection methods not available in the CMA package (for example, Rank Product and LPE)

Any suggestions would be more than appreciated.

With Respects,
Kaj Chokeshaiusaha

-- output of sessionInfo():

R version 3.1.0 (2014-04-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats  graphics  grDevices  utils  datasets  methods
[8] base

other attached packages:
[1] BiocInstaller_1.14.2  CMA_1.22.0  Biobase_2.24.0
[4] BiocGenerics_0.10.0   e1071_1.6-3

loaded via a namespace (and not attached):
[1] class_7.3-10  tools_3.1.0

--
Sent via the guest posting facility at bioconductor.org.

Classification CMA • 2.1k views

updated 11.6 years ago by Sean Davis 21k • written 11.6 years ago by Guest User ★ 13k

Sean Davis 21k

@sean-davis-490

Last seen 12 hours ago

United States

Hi, Kaj.

On Thu, Jul 24, 2014 at 8:07 AM, Kaj Chokeshaiusaha [guest] <guest@bioconductor.org> wrote:

> I'm a biologist who is learning gene expression profile study, and have to
> deal with low replicated sample number (2-3 biological replicates per
> group). Due to my lack of background in bioinformatics, I find CMA as a
> very user-friendly package for supervised classification task.

I suspect that you will find that this number of samples is difficult to use for machine learning, but the results will really depend on the strength and stability of the "biological signal".

> However, I'm suffering with the truth that I really have no clue what
> suitable choices to choose for my low replicated sample classification.
> These are the choices to:
>
> 1. Select method to generate learning datasets
> 2. Select the gene selection methods
> 3. Select classification methods

Depending on your machine learning method, these three steps may be included *together* in the training process. In general, though, using a subset of the data for training and the rest for testing is a common approach. The caret package provides many approaches for doing just such analyses, so you might look at that package as well.

> 4. Acquire generated learning datasets to be applied with other gene
> selection methods not available in CMA package (for example, Rank
> production and LPE)

Not sure what you mean here, but in general, applying your machine learning algorithm to *new* data will require you to use the same features as the training data; in other words, feature selection is not rerun on the new data.

Hope that helps,
Sean
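The train/test workflow described in this answer can be illustrated with a small, language-neutral sketch. It is written in plain Python rather than R, and it does not use the CMA or caret APIs. The toy expression matrix, the mean-difference gene score, and the nearest-centroid classifier are all assumptions chosen for brevity. The point it shows is only this: genes are selected from the training samples alone, and the held-out samples are then classified using those same genes.

```python
from statistics import mean

# Hypothetical toy data: 6 samples (rows) x 4 genes (columns), two groups.
X = [
    [5.0, 5.1, 1.0, 7.0],  # group A
    [5.2, 4.9, 1.2, 7.1],  # group A
    [4.9, 5.0, 0.9, 6.9],  # group A
    [5.1, 5.0, 3.0, 7.0],  # group B
    [5.0, 5.2, 3.1, 7.2],  # group B
    [4.8, 4.9, 2.9, 6.8],  # group B
]
y = ["A", "A", "A", "B", "B", "B"]


def select_features(X_train, y_train, k):
    """Rank genes by absolute difference in group means (training data only)."""
    labels = sorted(set(y_train))
    scores = []
    for g in range(len(X_train[0])):
        means = [
            mean(x[g] for x, lab in zip(X_train, y_train) if lab == c)
            for c in labels
        ]
        scores.append((abs(means[0] - means[1]), g))
    scores.sort(key=lambda s: (-s[0], s[1]))  # largest difference first
    return [g for _, g in scores[:k]]


def fit_centroids(X_train, y_train, features):
    """Compute one centroid per group, restricted to the selected genes."""
    return {
        c: [mean(x[g] for x, lab in zip(X_train, y_train) if lab == c)
            for g in features]
        for c in sorted(set(y_train))
    }


def predict(centroids, features, x):
    """Assign a sample to the group with the closest centroid."""
    def dist(c):
        return sum((x[g] - m) ** 2 for g, m in zip(features, centroids[c]))
    return min(centroids, key=dist)


# Hold out one sample per group; everything else is the learning set.
train_idx, test_idx = [0, 1, 3, 4], [2, 5]
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]

features = select_features(X_train, y_train, k=2)    # uses training data only
centroids = fit_centroids(X_train, y_train, features)
predictions = [predict(centroids, features, X[i]) for i in test_idx]
```

With only two training samples per group, the selected genes and the centroids rest on very little data, which is the practical difficulty raised at the start of this answer.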

11.6 years ago • Sean Davis 21k

Dear Prof. Davis,

Thank you very much for your suggestion. I will definitely look into the package you suggest. At this point, I may use Monte Carlo cross-validation to generate my learning sets. Any further suggestions would be more than appreciated.

With Respects,
Kaj
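As a rough illustration of what Monte Carlo cross-validation can offer with so few replicates, the sketch below counts how many different learning sets exist for a design like the one in this thread. It is plain Python, not CMA code. The sample names, the two-group layout with three replicates each, and the choice to hold out exactly one sample per group (a stratified split) are assumptions made for the example.

```python
import random
from itertools import combinations, product

# Hypothetical design: two groups with three biological replicates each.
groups = {
    "control": ["c1", "c2", "c3"],
    "treated": ["t1", "t2", "t3"],
}


def stratified_splits(groups, n_test_per_group=1):
    """Enumerate every test set that holds out n_test_per_group samples per group."""
    per_group = [
        list(combinations(members, n_test_per_group))
        for members in groups.values()
    ]
    return [
        tuple(sorted(s for held in combo for s in held))
        for combo in product(*per_group)
    ]


def mccv_test_sets(groups, n_iter, n_test_per_group=1, seed=0):
    """Draw n_iter random stratified test sets, as Monte Carlo CV would."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_iter):
        held = []
        for members in groups.values():
            held.extend(rng.sample(members, n_test_per_group))
        draws.append(tuple(sorted(held)))
    return draws


all_splits = stratified_splits(groups)
draws = mccv_test_sets(groups, n_iter=1000)

n_possible = len(all_splits)         # 3 choices x 3 choices = 9
n_distinct_drawn = len(set(draws))   # can never exceed n_possible
print(n_possible, n_distinct_drawn)
```

Under these assumptions there are only nine distinct splits, so running many random iterations revisits the same few learning sets rather than adding new information.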

11.6 years ago • Kaj Chokeshaiusaha ▴ 70

Dear Prof. Davis,

Following your suggestion, I have read further and found that what I was trying to do is not valid at all. Thank you very much for your suggestions.

With Respects,
Kaj

11.6 years ago • Kaj Chokeshaiusaha ▴ 70
