Please comment on my approach to finding differentially expressed genes


Guest User ★ 13k


Dear R helpers,

I'm a beginner in gene expression analysis, and I apologize in advance if my post is irritating. I'm trying to work out an alternative way to find differentially expressed genes in datasets with few replicates.

In my case, I have very few replicates per group (2-3). I'm wondering whether I can generate several datasets from my original data (using methods like the bootstrap) and then run the test on each generated dataset to obtain a list of differentially expressed genes. After that, I would count how often each gene recurs across all the lists and pick the most frequent ones as differentially expressed.

Please comment on the idea; I don't want to go too far down the wrong path.

With respect,
Kaj

-- output of sessionInfo():

R version 3.1.0 (2014-04-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] CMA_1.22.0 Biobase_2.24.0 BiocGenerics_0.10.0
[4] e1071_1.6-3

loaded via a namespace (and not attached):
[1] class_7.3-10 tools_3.1.0

--
Sent via the guest posting facility at bioconductor.org.


updated 9.8 years ago by Sean Davis 21k • written 9.8 years ago by Guest User ★ 13k


James W. MacDonald 65k


Hi Kaj,

I don't see how resampling is going to help you at all with just 2-3 samples per group. In any case, the bootstrap is generally used to obtain improved estimates of the variance, not to generate 'new' data sets.

Figuring out ways to improve variance estimates was a fairly hot area of research about 10 years ago, and people have in general settled on empirical Bayes estimates like those you get with limma. As a self-professed 'starter' in gene expression analysis, are you sure you are best equipped to improve on the accepted methods that were developed over several years by PhD statisticians?

If not, I would just stick with limma, especially if you want to publish your results. It's much easier to say 'I used the Bioconductor limma package' than to explain your newfangled, unpublished method, especially if you are not a PhD statistician yourself.

Best,
Jim

On 7/25/2014 11:20 AM, Kaj Chokeshaiusaha [guest] wrote:
> In case that, I have very few number of replicated datasets per group (2-3 replicates per group). I'm wondering whether I can generate several datasets from my original datasets I have (using methods like Bootstrap) and then perform the test to find out the lists of differentially expressed genes from my created datasets. After that I count the repeated genes from all lists and pick the top ones as differentially expressed genes.

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

9.8 years ago James W. MacDonald 65k
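One way to see why resampling adds little with so few replicates is to enumerate the bootstrap resamples of a single group of three. The sketch below is an illustration only and is not part of the reply above; the expression values in `x` are made up. It counts how many distinct resamples exist and how many of them consist of one value repeated three times.

```r
# Enumerate every bootstrap resample of a group with only 3 replicates.
x <- c(7.1, 7.4, 8.0)   # hypothetical log-expression values for one gene
n <- length(x)

# All n^n ordered index draws with replacement (27 rows for n = 3).
idx <- expand.grid(rep(list(seq_len(n)), n))

# Treat draws with the same values in a different order as one resample.
resamples <- unique(t(apply(idx, 1, sort)))
n_distinct <- nrow(resamples)

# Resamples made of a single repeated value have zero sample variance.
n_degenerate <- sum(apply(resamples, 1, function(i) length(unique(i)) == 1))

n_distinct     # equals choose(2 * n - 1, n)
n_degenerate
```

With three replicates there are only choose(5, 3) = 10 distinct resamples, 3 of which are constant. Every resampled data set is just a reshuffling of the same three values, so it carries no information beyond the original measurements.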


Sean Davis 21k


Hi, Kaj.

You may be overthinking things a bit. Differential gene expression analysis has a long history and has developed around the constraints imposed by small sample sizes, so most modern tools for differential expression analysis will handle your data in a rational and statistically sound way. I would consider starting with limma; the user guide is excellent, and the package is very widely used for experiments presumably just like yours.

I don't want to discourage experimentation, but it is often best to start with a known analysis, if only for comparison, should you try something more exotic.

Sean

9.8 years ago Sean Davis 21k
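For readers who have not used limma, a minimal sketch of a two-group comparison with three replicates per group might look like the following. The data are simulated, and the object names (`expr`, `group`, `res`) and the spiked-in effect are assumptions made for illustration; the limma user guide remains the authoritative reference for a real analysis.

```r
library(limma)

set.seed(1)

# Hypothetical data: 1000 genes, 3 replicates per group, log2 scale.
n_genes <- 1000
expr <- matrix(rnorm(n_genes * 6, mean = 8, sd = 0.5), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("s", 1:6)))
group <- factor(rep(c("control", "treated"), each = 3))

# Spike in a true difference for the first 50 genes (illustration only).
expr[1:50, group == "treated"] <- expr[1:50, group == "treated"] + 2

design <- model.matrix(~ group)   # columns: (Intercept), grouptreated
fit <- lmFit(expr, design)        # one linear model per gene
fit <- eBayes(fit)                # empirical Bayes moderation of the variances

# Full ranked table for the treated-vs-control coefficient, BH-adjusted.
res <- topTable(fit, coef = "grouptreated", number = Inf, adjust.method = "BH")
head(res)
```

The `eBayes` step is where information is borrowed across genes to stabilize each gene's variance estimate, which is what makes the analysis workable with only a few replicates.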


Dear all,

Thank you very much for your comments. I now feel confident sticking with the usual approach.

There is one thing that keeps nagging at me, probably because I lack some basic knowledge. I'm wondering about people who generate sets of data from their original data using methods like leave-one-out, then apply a test (like limma) to each set, and finally look for the genes that recur most often across the differentially expressed gene lists produced by all the sets (for example, in 4 out of 6). Is this kind of approach better than sticking with the list of differentially expressed genes produced from the original data?

Thank you very much in advance for your patience with me.

With respect,
Kaj

2557-07-25 22:53 GMT+07:00, Sean Davis <sdavis2 at mail.nih.gov>:
> I don't want to discourage experimentation, but it is often best to start with a known analysis if only for comparison if you do try something more exotic.

9.8 years ago Kaj Chokeshaiusaha ▴ 70
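The leave-one-out procedure described in the question can be sketched as follows, so that its output can be compared with the full-data list on data where the truth is known. This is an illustration on simulated data, not a recommendation; the helper `top_genes`, the list size `k = 25`, and the simulated effects are all hypothetical choices.

```r
library(limma)

set.seed(2)

# Hypothetical data: 500 genes, 3 replicates in each of groups A and B.
n_genes <- 500
group <- factor(rep(c("A", "B"), each = 3))
expr <- matrix(rnorm(n_genes * 6), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
expr[1:25, group == "B"] <- expr[1:25, group == "B"] + 2  # simulated true effects

# Names of the top-k genes from a limma fit restricted to the given samples.
top_genes <- function(cols, k = 25) {
  g <- droplevels(group[cols])
  fit <- eBayes(lmFit(expr[, cols], model.matrix(~ g)))
  rownames(topTable(fit, coef = 2, number = k))
}

full_list <- top_genes(1:6)

# Six leave-one-out runs: drop each sample in turn.
loo_lists <- lapply(1:6, function(i) top_genes(setdiff(1:6, i)))

# Number of leave-one-out lists in which each gene appears.
recurrence <- table(unlist(loo_lists))
stable <- names(recurrence)[recurrence >= 4]   # the "4 out of 6" rule

# Overlap between the recurrence-based list and the full-data list.
length(intersect(stable, full_list))
```

Any two leave-one-out subsets here share four of their five samples, so the six lists are far from independent, and a gene recurring across them is not separate confirmation of the full-data result.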


Hello,

Cross-validation is usable when you have a large study, typically one done with tissue samples. By large, I mean at least twenty replicates for each condition. limma is appropriate for your situation.

--------------------------------------
Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia

9.8 years ago Dario Strbenac ★ 1.5k
