how to test for genes of interest?

Jenny Drnevich ★ 2.0k

United States

Hi everyone,

I've always heard that one of the ways "around" the multiple testing problem of microarrays is for you to a priori identify a particular list of genes you're interested in, and then you only have to do the multiple test correction for this smaller list. I've never done this in practice, and I'm not sure at what point in the analysis it's proper to pull out just the smaller list. Obviously, all the data preprocessing and normalization will be done with all the genes, but should I pull out the genes before fitting the model, or after fitting the model right before the multiple test adjustment? I'm using the eBayes() shrinkage in limma, so which genes are in the model will make a big difference in the outcome.

I'm thinking it would be best to keep all the genes in the model, and then split them out into two groups (genes of interest and all the rest) and do an FDR correction separately for each group. What do you think?

Thanks,
Jenny

Jenny Drnevich, Ph.D.

Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign

330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801
USA

ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at illinois.edu

Normalization limma • 1.7k views

ADD COMMENT • link updated 10.9 years ago by Gordon Smyth 53k • written 17.6 years ago by Jenny Drnevich ★ 2.0k

Jenny Drnevich ★ 2.0k

Hi Glyn,

"...mine it for biological significance" is a very vague and, in my experience, very subjective sort of analysis. I do agree that with a particular list, in this case immune genes, doing something like GSEA could be appropriate. However, GSEA gives an answer along the lines of "yes, immune genes appear to be important" and not "which immune genes are changing, and which are not?" Besides, data mining is not included in my basic statistical analysis service. :) I was just wondering: if one were going to do the analysis I described, what is the proper way to do it?

Thanks,
Jenny

ADD COMMENT • link 17.6 years ago Jenny Drnevich ★ 2.0k

'mine it for biological significance' used to be a very vague idea...until the advent of Ingenuity Pathway Analysis!! :)

If it were me and I was going to do FDR, I'd do it on the full list.

Glyn

ADD REPLY • link 17.6 years ago Glyn Bradley ▴ 30

Glyn Bradley ▴ 30

Hi Jenny,

I may get shot down horribly for saying this on this list, but isn't there a large school of thought which says don't do FDR at all, just take the large list of genes out and mine it for biological significance? Certainly a large pharma I have a little experience of takes that approach. Stats are just stats, after all. (And I'm sure you're going to validate the results with some other wet-lab technique anyway.)

Glyn PhD
Bioinf and systems modelling
mycib.ac.uk

ADD COMMENT • link 17.6 years ago Glyn Bradley ▴ 30

Hi Glyn, Jenny,

I think it is also a valid point of view to use test p-values to rank the genes by priority without believing the p-values literally, and then to use them together with further information (e.g., other data, known controls, gene set enrichment analysis or its various commercializations such as Ingenuity) to decide where to cut off. Probabilistic computations, such as FDR, can be highly instructive and useful; but one need not get carried away and confuse the probability model that produced the p-values you're looking at with an actual ensemble of experiments. (By doing so, one might in fact do a disservice to biologists, who often view the microarray as a discovery tool in a context, not as a standalone confirmation.)

Also note that care is needed if the criterion that selects your gene subset is data-driven. There are a number of papers about this (and I guess there will be more), but the bottom line is that, if you do care about the multiple testing aspect, you're not "cheating" in the multiple testing correction as long as the criterion is unaware of the contrast of interest that is subsequently tested.

Best wishes
Wolfgang

Wolfgang Huber
EBI/EMBL
Cambridge UK
http://www.ebi.ac.uk/huber

ADD REPLY • link 17.6 years ago Wolfgang Huber ★ 13k

Gordon Smyth 53k

WEHI, Melbourne, Australia

Hi Jenny,

I basically agree with your suggested solution, and we do this ourselves all the time. To test your genes of special interest, simply subset the fit object that you give to topTable() or decideTests(), e.g.:

   topTable(fit[indicesofinterest,])

It is not usually necessary or desirable to subset before the eBayes() step, because that is intended to be a whole-genome calculation.

In our work, we usually present a topTable for all genes, and a separate table for special genes of a priori interest.
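For readers who want to try this, here is a minimal sketch of the workflow on simulated data. It is only an illustration under stated assumptions: the expression matrix, the two-group design and the list `genes_of_interest` are all made up for the example, and only `lmFit()`, `eBayes()` and `topTable()` come from limma.

```r
library(limma)

set.seed(1)

# Simulated log-expression matrix: 1000 genes x 6 samples (hypothetical data)
n_genes <- 1000
expr <- matrix(rnorm(n_genes * 6), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("s", 1:6)))

# Two groups of three samples each
group  <- factor(rep(c("ctrl", "trt"), each = 3))
design <- model.matrix(~ group)

# Fit the linear model and run eBayes() on ALL genes,
# so the variance shrinkage uses the whole data set
fit <- eBayes(lmFit(expr, design))

# Hypothetical a priori list, chosen without looking at the data
genes_of_interest <- paste0("gene", 1:50)
idx <- which(rownames(fit$coefficients) %in% genes_of_interest)

# Table for all genes: BH adjustment over 1000 tests
tab_all <- topTable(fit, coef = 2, number = Inf)

# Table for the genes of interest: subset AFTER eBayes(),
# so the BH adjustment is over 50 tests only
tab_interest <- topTable(fit[idx, ], coef = 2, number = Inf)
```

In this sketch the moderated t-statistics and raw p-values for a given gene are identical in `tab_all` and `tab_interest`, because the shrinkage was computed once on the full fit. Only the `adj.P.Val` column differs, since the adjustment in `tab_interest` is made over the smaller list.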

Best wishes
Gordon

ADD COMMENT • link 10.9 years ago Gordon Smyth 53k
