Statistics for next-generation sequencing transcriptomics

0

Entering edit mode

michael watson IAH-C ★ 3.4k

@michael-watson-iah-c-378

Last seen 9.6 years ago

Hi I'd like to have a discussion about statistics for transcriptomics using next-generation sequencing (if there hasn't already been one - if there has, then please someone point me to it!) What we're seeing in the literature, and here at IAH, are datasets where someone has sequenced the transcriptome of two samples using something like Illumina. These have been mapped to known sequences and counts produced. So what we have is something like this: geneA: 22000 sequences from 260000 match in sample 1, 43000 sequences from 507000 in sample 2. It's been suggested that one possible approach would be to construct 2x2 contingency tables and perform Fisher's exact test or the Chi- squared test, as has been applied to SAGE data. However, I've found that when I do that, the p-values for this type of data are incredibly, incredibly small, such that over 90% of my data points are significant, even after adjusting for multiple testing. I assume/hope that this is because these tests were not designed to cope with this type of data. For instance, applying Fisher's test to the example above yields a p-value of 3.798644e-23. As I see it there are three possibilities: 1) I'm doing something wrong 2) These tests are totally inappropriate for this type of data 3) All of my data points are highly significantly different I'm thinking that 2 is probably true, though I wouldn't rule out 1. Any thoughts and comments are very welcome, Mick [[alternative HTML version deleted]]

Sequencing SAGE Sequencing SAGE • 1.9k views

ADD COMMENT • link updated 14.7 years ago by Margaret Taub ▴ 10 • written 14.7 years ago by michael watson IAH-C ★ 3.4k

0

Entering edit mode

Naomi Altman ★ 6.0k

@naomi-altman-380

Last seen 3.0 years ago

United States

The problem is the FDR vs FNR problem (which we used to call power vs size). As the total sample size increases, we have the power to detect tiny, biologically negligible, differences. All frequentist tests will suffer from this, not just Fisher's exact test. --Naomi At 07:22 AM 7/24/2009, michael watson (IAH-C) wrote: >Hi > >I'd like to have a discussion about statistics for transcriptomics >using next-generation sequencing (if there hasn't already been one - >if there has, then please someone point me to it!) > >What we're seeing in the literature, and here at IAH, are datasets >where someone has sequenced the transcriptome of two samples using >something like Illumina. These have been mapped to known sequences >and counts produced. > >So what we have is something like this: > >geneA: 22000 sequences from 260000 match in sample 1, 43000 >sequences from 507000 in sample 2. > >It's been suggested that one possible approach would be to construct >2x2 contingency tables and perform Fisher's exact test or the >Chi-squared test, as has been applied to SAGE data. >However, I've found that when I do that, the p-values for this type >of data are incredibly, incredibly small, such that over 90% of my >data points are significant, even after adjusting for multiple >testing. I assume/hope that this is because these tests were not >designed to cope with this type of data. > >For instance, applying Fisher's test to the example above yields a >p-value of 3.798644e-23. > >As I see it there are three possibilities: >1) I'm doing something wrong >2) These tests are totally inappropriate for this type of data >3) All of my data points are highly significantly different > >I'm thinking that 2 is probably true, though I wouldn't rule out 1. > >Any thoughts and comments are very welcome, > >Mick > > > [[alternative HTML version deleted]] > >_______________________________________________ >Bioconductor mailing list >Bioconductor at stat.math.ethz.ch >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor Naomi S. Altman 814-865-3791 (voice) Associate Professor Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111

ADD COMMENT • link 14.7 years ago Naomi Altman ★ 6.0k

0

Entering edit mode

John Herbert ▴ 90

@john-herbert-3373

Last seen 9.6 years ago

Hello Michael. Thanks for giving me the chance, I hope, to say something remotely useful here for onece. Firstly, we have recently developed a likelihood ratio statistics for looking at differential expression between two grops cDNA libraries which are effectively the same as comparing Illumina transcriptome libraries. It employs a FDR procedure q-value (Storey, Benjamini and Hochberg) for multiple testing. I don't know if 100% it will solve your problem but it accounts for different cDNA library sizes and is worth a go. The paper is here: http://www.biomedcentral.com/1471-2164/9/153 A link to a website that lets you download a Perl script to run these analyses is here: http://sara.molbiol.ox.ac.uk/userweb/jherbert/ Choose the StandAlone_MultiDiff and run the Perl script locally, preferably on some flavour of unix/linux. If you have any problems with it, I am happy to help. I think one other thing to remember is, that gene transcripts are different sizes and the sequencing from Illumina, if I am talking about the same thing, are not generated from one part of the transcript. So a shorter transcript will appear to be expressed more in your library as the RNA is fragmented, and as such, it will look like you are getting more copies expressed. There is a publication where they divided each counts for a transcript by the length of a transcript but I don't know if this is the most valid way to deal with it. If your Illumina libraries are just NGS SAGE libraries, then you don't have to worry about this and you will be fine with the above statistics. Sorry if this is meant to be an R/Bioconductor area only but I thought it may help Michael. Kind regards, John. ================================= Bioinformatics Officer Molecular Angiogenesis group First Floor, Institute of Biomedical Research The University of Birmingham Medical School Edgbaston Birmingham B15 2TT (UK) j.m.herbert at bham.ac.uk Tel: +44 121 414 3733 Fax: +44 121 415 8677 ________________________________ From: bioconductor-bounces@stat.math.ethz.ch on behalf of michael watson (IAH-C) Sent: Fri 24/07/2009 12:22 To: bioconductor at stat.math.ethz.ch Subject: [BioC] Statistics for next-generation sequencing transcriptomics Hi I'd like to have a discussion about statistics for transcriptomics using next-generation sequencing (if there hasn't already been one - if there has, then please someone point me to it!) What we're seeing in the literature, and here at IAH, are datasets where someone has sequenced the transcriptome of two samples using something like Illumina. These have been mapped to known sequences and counts produced. So what we have is something like this: geneA: 22000 sequences from 260000 match in sample 1, 43000 sequences from 507000 in sample 2. It's been suggested that one possible approach would be to construct 2x2 contingency tables and perform Fisher's exact test or the Chi- squared test, as has been applied to SAGE data. However, I've found that when I do that, the p-values for this type of data are incredibly, incredibly small, such that over 90% of my data points are significant, even after adjusting for multiple testing. I assume/hope that this is because these tests were not designed to cope with this type of data. For instance, applying Fisher's test to the example above yields a p-value of 3.798644e-23. As I see it there are three possibilities: 1) I'm doing something wrong 2) These tests are totally inappropriate for this type of data 3) All of my data points are highly significantly different I'm thinking that 2 is probably true, though I wouldn't rule out 1. Any thoughts and comments are very welcome, Mick [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor at stat.math.ethz.ch https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 14.7 years ago John Herbert ▴ 90

0

Entering edit mode

Michael Dondrup ▴ 140

@michael-dondrup-3591

Last seen 9.6 years ago

Hi Michael, I am having a very similar problem using 454 seq data, so I am very much interested in this discussion. However, I do not quite understand how to for the contigency table and to achieve such small p-value here. My naive approach would be to count hits to GeneA and to count hits to the rest of the genome (all - #hits to gene A), giving a pretty much unbalanced 2x2 table like this: > mat Sample.1 Sample.2 Gene.A 22000 43000 The.rest 238000 464000 but then I do not see the point here, because there is a large p value, as I would expect: > fisher.test(mat) Fisher's Exact Test for Count Data data: mat p-value = 0.7717 alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 0.9805937 1.0145920 sample estimates: odds ratio 0.9974594 Am I missing something? Best Michael Am 24.07.2009 um 13:22 schrieb michael watson (IAH-C): > Hi > > I'd like to have a discussion about statistics for transcriptomics > using next-generation sequencing (if there hasn't already been one - > if there has, then please someone point me to it!) > > What we're seeing in the literature, and here at IAH, are datasets > where someone has sequenced the transcriptome of two samples using > something like Illumina. These have been mapped to known sequences > and counts produced. > > So what we have is something like this: > > geneA: 22000 sequences from 260000 match in sample 1, 43000 > sequences from 507000 in sample 2. > > It's been suggested that one possible approach would be to construct > 2x2 contingency tables and perform Fisher's exact test or the Chi- > squared test, as has been applied to SAGE data. > However, I've found that when I do that, the p-values for this type > of data are incredibly, incredibly small, such that over 90% of my > data points are significant, even after adjusting for multiple > testing. I assume/hope that this is because these tests were not > designed to cope with this type of data. > > For instance, applying Fisher's test to the example above yields a p- > value of 3.798644e-23. > > As I see it there are three possibilities: > 1) I'm doing something wrong > 2) These tests are totally inappropriate for this type of data > 3) All of my data points are highly significantly different > > I'm thinking that 2 is probably true, though I wouldn't rule out 1. > > Any thoughts and comments are very welcome, > > Mick > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 14.7 years ago Michael Dondrup ▴ 140

0

Entering edit mode

Hi Michael No, you're not missing anything, I wrote down my example incorrectly. I wrote down the elements of the contingency table rather than the totals, so it should have been: > mat <- matrix(c(22000,260000,48000,507000), nrow=2) > mat [,1] [,2] [1,] 22000 48000 [2,] 260000 507000 > fisher.test(mat) Fisher's Exact Test for Count Data data: mat p-value < 2.2e-16 alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 0.8789286 0.9087655 sample estimates: odds ratio 0.8937356 Sorry about that! This is a case where I suspect there is a real difference, as the relative frequency rises from 0.084 to 0.094. However, as I mentioned, this result is masked by all the other "significant" results. As Naomi says, it is because as the sample size gets larger, we have the power to detect tiny changes as significant. So what is the solution? John Herbert has suggested something, and I will try that. Thanks Michael -----Original Message----- From: Michael Dondrup [mailto:Michael.Dondrup@bccs.uib.no] Sent: 24 July 2009 15:00 To: michael watson (IAH-C) Cc: bioconductor at stat.math.ethz.ch Subject: Re: [BioC] Statistics for next-generation sequencing transcriptomics Hi Michael, I am having a very similar problem using 454 seq data, so I am very much interested in this discussion. However, I do not quite understand how to for the contigency table and to achieve such small p-value here. My naive approach would be to count hits to GeneA and to count hits to the rest of the genome (all - #hits to gene A), giving a pretty much unbalanced 2x2 table like this: > mat Sample.1 Sample.2 Gene.A 22000 43000 The.rest 238000 464000 but then I do not see the point here, because there is a large p value, as I would expect: > fisher.test(mat) Fisher's Exact Test for Count Data data: mat p-value = 0.7717 alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 0.9805937 1.0145920 sample estimates: odds ratio 0.9974594 Am I missing something? Best Michael Am 24.07.2009 um 13:22 schrieb michael watson (IAH-C): > Hi > > I'd like to have a discussion about statistics for transcriptomics > using next-generation sequencing (if there hasn't already been one - > if there has, then please someone point me to it!) > > What we're seeing in the literature, and here at IAH, are datasets > where someone has sequenced the transcriptome of two samples using > something like Illumina. These have been mapped to known sequences > and counts produced. > > So what we have is something like this: > > geneA: 22000 sequences from 260000 match in sample 1, 43000 > sequences from 507000 in sample 2. > > It's been suggested that one possible approach would be to construct > 2x2 contingency tables and perform Fisher's exact test or the Chi- > squared test, as has been applied to SAGE data. > However, I've found that when I do that, the p-values for this type > of data are incredibly, incredibly small, such that over 90% of my > data points are significant, even after adjusting for multiple > testing. I assume/hope that this is because these tests were not > designed to cope with this type of data. > > For instance, applying Fisher's test to the example above yields a p- > value of 3.798644e-23. > > As I see it there are three possibilities: > 1) I'm doing something wrong > 2) These tests are totally inappropriate for this type of data > 3) All of my data points are highly significantly different > > I'm thinking that 2 is probably true, though I wouldn't rule out 1. > > Any thoughts and comments are very welcome, > > Mick > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 14.7 years ago michael watson IAH-C ★ 3.4k

0

Entering edit mode

I don't know if this helps at all, but if the statistic is testing the null hypothesis that the counts are equal, would it not help to test for a difference > x? Of course we would then have to arbitrarily define "x", but perhaps we could get more biologically meaningful results? Cei michael watson (IAH-C) wrote: > Hi Michael > > No, you're not missing anything, I wrote down my example incorrectly. I wrote down the elements of the contingency table rather than the totals, so it should have been: > >> mat <- matrix(c(22000,260000,48000,507000), nrow=2) >> mat > [,1] [,2] > [1,] 22000 48000 > [2,] 260000 507000 >> fisher.test(mat) > > Fisher's Exact Test for Count Data > > data: mat > p-value < 2.2e-16 > alternative hypothesis: true odds ratio is not equal to 1 > 95 percent confidence interval: > 0.8789286 0.9087655 > sample estimates: > odds ratio > 0.8937356 > > Sorry about that! > > This is a case where I suspect there is a real difference, as the relative frequency rises from 0.084 to 0.094. However, as I mentioned, this result is masked by all the other "significant" results. As Naomi says, it is because as the sample size gets larger, we have the power to detect tiny changes as significant. > > So what is the solution? > > John Herbert has suggested something, and I will try that. > > Thanks > Michael > > -----Original Message----- > From: Michael Dondrup [mailto:Michael.Dondrup at bccs.uib.no] > Sent: 24 July 2009 15:00 > To: michael watson (IAH-C) > Cc: bioconductor at stat.math.ethz.ch > Subject: Re: [BioC] Statistics for next-generation sequencing transcriptomics > > Hi Michael, > I am having a very similar problem using 454 seq data, so I am very > much interested in this discussion. However, I do not quite > understand how to > for the contigency table and to achieve such small p-value here. My > naive approach would be to count hits to GeneA and to count hits to > the rest of the genome (all - #hits to gene A), giving a pretty much > unbalanced 2x2 table like this: > > mat > Sample.1 Sample.2 > Gene.A 22000 43000 > The.rest 238000 464000 > but then I do not see the point here, because there is a large p > value, as I would expect: > > > fisher.test(mat) > > Fisher's Exact Test for Count Data > > data: mat > p-value = 0.7717 > alternative hypothesis: true odds ratio is not equal to 1 > 95 percent confidence interval: > 0.9805937 1.0145920 > sample estimates: > odds ratio > 0.9974594 > > Am I missing something? > > Best > Michael > > > Am 24.07.2009 um 13:22 schrieb michael watson (IAH-C): > >> Hi >> >> I'd like to have a discussion about statistics for transcriptomics >> using next-generation sequencing (if there hasn't already been one - >> if there has, then please someone point me to it!) >> >> What we're seeing in the literature, and here at IAH, are datasets >> where someone has sequenced the transcriptome of two samples using >> something like Illumina. These have been mapped to known sequences >> and counts produced. >> >> So what we have is something like this: >> >> geneA: 22000 sequences from 260000 match in sample 1, 43000 >> sequences from 507000 in sample 2. >> >> It's been suggested that one possible approach would be to construct >> 2x2 contingency tables and perform Fisher's exact test or the Chi- >> squared test, as has been applied to SAGE data. >> However, I've found that when I do that, the p-values for this type >> of data are incredibly, incredibly small, such that over 90% of my >> data points are significant, even after adjusting for multiple >> testing. I assume/hope that this is because these tests were not >> designed to cope with this type of data. >> >> For instance, applying Fisher's test to the example above yields a p- >> value of 3.798644e-23. >> >> As I see it there are three possibilities: >> 1) I'm doing something wrong >> 2) These tests are totally inappropriate for this type of data >> 3) All of my data points are highly significantly different >> >> I'm thinking that 2 is probably true, though I wouldn't rule out 1. >> >> Any thoughts and comments are very welcome, >> >> Mick >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 14.7 years ago Cei Abreu-Goodger ▴ 830

0

Entering edit mode

Hi again, I see the point now :) Just another idea, did you take the number of all reads sequenced as a total, or the number of all reads successfully mapped to the genome/ reference sequence? What happens, when only the mapped reads are counted? I suspect that the choice of this could influence whatever method is applied, because large fractions of unmapped reads (given there are such) could -artificially?- increase the confidence in the result. Michael Am 24.07.2009 um 17:25 schrieb michael watson (IAH-C): > Hi Michael > > No, you're not missing anything, I wrote down my example > incorrectly. I wrote down the elements of the contingency table > rather than the totals, so it should have been: > >> mat <- matrix(c(22000,260000,48000,507000), nrow=2) >> mat > [,1] [,2] > [1,] 22000 48000 > [2,] 260000 507000 >> fisher.test(mat) > > Fisher's Exact Test for Count Data > > data: mat > p-value < 2.2e-16 > alternative hypothesis: true odds ratio is not equal to 1 > 95 percent confidence interval: > 0.8789286 0.9087655 > sample estimates: > odds ratio > 0.8937356 > > Sorry about that! > > This is a case where I suspect there is a real difference, as the > relative frequency rises from 0.084 to 0.094. However, as I > mentioned, this result is masked by all the other "significant" > results. As Naomi says, it is because as the sample size gets > larger, we have the power to detect tiny changes as significant. > > So what is the solution? > > John Herbert has suggested something, and I will try that. > > Thanks > Michael > > -----Original Message----- > From: Michael Dondrup [mailto:Michael.Dondrup at bccs.uib.no] > Sent: 24 July 2009 15:00 > To: michael watson (IAH-C) > Cc: bioconductor at stat.math.ethz.ch > Subject: Re: [BioC] Statistics for next-generation sequencing > transcriptomics > > Hi Michael, > I am having a very similar problem using 454 seq data, so I am very > much interested in this discussion. However, I do not quite > understand how to > for the contigency table and to achieve such small p-value here. My > naive approach would be to count hits to GeneA and to count hits to > the rest of the genome (all - #hits to gene A), giving a pretty much > unbalanced 2x2 table like this: >> mat > Sample.1 Sample.2 > Gene.A 22000 43000 > The.rest 238000 464000 > but then I do not see the point here, because there is a large p > value, as I would expect: > >> fisher.test(mat) > > Fisher's Exact Test for Count Data > > data: mat > p-value = 0.7717 > alternative hypothesis: true odds ratio is not equal to 1 > 95 percent confidence interval: > 0.9805937 1.0145920 > sample estimates: > odds ratio > 0.9974594 > > Am I missing something? > > Best > Michael > > > Am 24.07.2009 um 13:22 schrieb michael watson (IAH-C): > >> Hi >> >> I'd like to have a discussion about statistics for transcriptomics >> using next-generation sequencing (if there hasn't already been one - >> if there has, then please someone point me to it!) >> >> What we're seeing in the literature, and here at IAH, are datasets >> where someone has sequenced the transcriptome of two samples using >> something like Illumina. These have been mapped to known sequences >> and counts produced. >> >> So what we have is something like this: >> >> geneA: 22000 sequences from 260000 match in sample 1, 43000 >> sequences from 507000 in sample 2. >> >> It's been suggested that one possible approach would be to construct >> 2x2 contingency tables and perform Fisher's exact test or the Chi- >> squared test, as has been applied to SAGE data. >> However, I've found that when I do that, the p-values for this type >> of data are incredibly, incredibly small, such that over 90% of my >> data points are significant, even after adjusting for multiple >> testing. I assume/hope that this is because these tests were not >> designed to cope with this type of data. >> >> For instance, applying Fisher's test to the example above yields a p- >> value of 3.798644e-23. >> >> As I see it there are three possibilities: >> 1) I'm doing something wrong >> 2) These tests are totally inappropriate for this type of data >> 3) All of my data points are highly significantly different >> >> I'm thinking that 2 is probably true, though I wouldn't rule out 1. >> >> Any thoughts and comments are very welcome, >> >> Mick >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > Michael Dondrup, Ph.D. Bergen Center for Computational Science Computational Biology Unit Unifob AS - Thorm?hlensgate 55, N-5008 Bergen, Norway

ADD REPLY • link 14.7 years ago Michael Dondrup ▴ 140

0

Entering edit mode

Dear all, according to the described problem maybe binomial test could be designed for this purpose? Just a thought. Zeljko Debeljak, PhD Medical Biochemistry Specialist Osijek Clinical Hospital CROATIA 2009/7/24, Michael Dondrup <michael.dondrup at="" bccs.uib.no="">: > Hi again, > > I see the point now :) > Just another idea, did you take the number of all reads sequenced as > a total, or the number of all reads successfully mapped to the genome/ > reference sequence? What happens, when only the mapped reads are > counted? > I suspect that the choice of this could influence whatever method is > applied, because large fractions of unmapped reads (given there are > such) could -artificially?- increase the confidence in the result. > > Michael > > > Am 24.07.2009 um 17:25 schrieb michael watson (IAH-C): > >> Hi Michael >> >> No, you're not missing anything, I wrote down my example >> incorrectly. I wrote down the elements of the contingency table >> rather than the totals, so it should have been: >> >>> mat <- matrix(c(22000,260000,48000,507000), nrow=2) >>> mat >> [,1] [,2] >> [1,] 22000 48000 >> [2,] 260000 507000 >>> fisher.test(mat) >> >> Fisher's Exact Test for Count Data >> >> data: mat >> p-value < 2.2e-16 >> alternative hypothesis: true odds ratio is not equal to 1 >> 95 percent confidence interval: >> 0.8789286 0.9087655 >> sample estimates: >> odds ratio >> 0.8937356 >> >> Sorry about that! >> >> This is a case where I suspect there is a real difference, as the >> relative frequency rises from 0.084 to 0.094. However, as I >> mentioned, this result is masked by all the other "significant" >> results. As Naomi says, it is because as the sample size gets >> larger, we have the power to detect tiny changes as significant. >> >> So what is the solution? >> >> John Herbert has suggested something, and I will try that. >> >> Thanks >> Michael >> >> -----Original Message----- >> From: Michael Dondrup [mailto:Michael.Dondrup at bccs.uib.no] >> Sent: 24 July 2009 15:00 >> To: michael watson (IAH-C) >> Cc: bioconductor at stat.math.ethz.ch >> Subject: Re: [BioC] Statistics for next-generation sequencing >> transcriptomics >> >> Hi Michael, >> I am having a very similar problem using 454 seq data, so I am very >> much interested in this discussion. However, I do not quite >> understand how to >> for the contigency table and to achieve such small p-value here. My >> naive approach would be to count hits to GeneA and to count hits to >> the rest of the genome (all - #hits to gene A), giving a pretty much >> unbalanced 2x2 table like this: >>> mat >> Sample.1 Sample.2 >> Gene.A 22000 43000 >> The.rest 238000 464000 >> but then I do not see the point here, because there is a large p >> value, as I would expect: >> >>> fisher.test(mat) >> >> Fisher's Exact Test for Count Data >> >> data: mat >> p-value = 0.7717 >> alternative hypothesis: true odds ratio is not equal to 1 >> 95 percent confidence interval: >> 0.9805937 1.0145920 >> sample estimates: >> odds ratio >> 0.9974594 >> >> Am I missing something? >> >> Best >> Michael >> >> >> Am 24.07.2009 um 13:22 schrieb michael watson (IAH-C): >> >>> Hi >>> >>> I'd like to have a discussion about statistics for transcriptomics >>> using next-generation sequencing (if there hasn't already been one - >>> if there has, then please someone point me to it!) >>> >>> What we're seeing in the literature, and here at IAH, are datasets >>> where someone has sequenced the transcriptome of two samples using >>> something like Illumina. These have been mapped to known sequences >>> and counts produced. >>> >>> So what we have is something like this: >>> >>> geneA: 22000 sequences from 260000 match in sample 1, 43000 >>> sequences from 507000 in sample 2. >>> >>> It's been suggested that one possible approach would be to construct >>> 2x2 contingency tables and perform Fisher's exact test or the Chi- >>> squared test, as has been applied to SAGE data. >>> However, I've found that when I do that, the p-values for this type >>> of data are incredibly, incredibly small, such that over 90% of my >>> data points are significant, even after adjusting for multiple >>> testing. I assume/hope that this is because these tests were not >>> designed to cope with this type of data. >>> >>> For instance, applying Fisher's test to the example above yields a p- >>> value of 3.798644e-23. >>> >>> As I see it there are three possibilities: >>> 1) I'm doing something wrong >>> 2) These tests are totally inappropriate for this type of data >>> 3) All of my data points are highly significantly different >>> >>> I'm thinking that 2 is probably true, though I wouldn't rule out 1. >>> >>> Any thoughts and comments are very welcome, >>> >>> Mick >>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> > > Michael Dondrup, Ph.D. > Bergen Center for Computational Science > Computational Biology Unit > Unifob AS - Thorm?hlensgate 55, N-5008 Bergen, Norway > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 14.7 years ago Zeljko Debeljak ▴ 50

0

Entering edit mode

Margaret Taub ▴ 10

@margaret-taub-3592

Last seen 9.6 years ago

I've recently spent some time investigating this topic and on the data set I have examined, where a Poisson model seems to hold very well across technical replicate lanes (based on chi-squared goodness-of-fit statistics), Fisher's exact test, the likelihood ratio test, and the binomial test all perform equivalently in terms of identifying differentially expressed genes and in terms of producing very low p- values. I also compared against a recently developed method for SAGE analysis, the edgeR package of Robinson and Smyth, but perhaps due to the almost exact Poisson variability in my data, their method did not perform better than the others. (The method is designed to work particularly well on overdispersed Poisson data.) And using just mapped reads for the lane totals (suggested in this thread by Michael Dondrup) did not have much effect - the counts were still very high. I think overall that Naomi is right - in this Poisson context, frequentist methods will produce very small p-values. I do think that part of the problem is that most RNASeq data sets produced today are not capturing biological variability, so we are really just doing a one-versus-one comparison even when we have "replicate" lanes (which are almost exclusively technical replicates, or at best replicates on cell lines) and hopefully once this is taken into account more reasonable test statistics will be seen. I also agree with John Herbert's point about transcript length having an effect here, and I'm not sure that a good solution has been developed for this yet either, see Oshlack and Wakefield's recent paper for a discussion: http://www.pubmedcentral.nih.gov/articlerender .fcgi?artid=2678084&tool=pmcentrez . They point out that for Poisson data, dividing by transcript length does not fix the length-bias problem. Cheers, Margaret Margaret Taub PhD Candidate UC Berkeley Dept of Statistics >> From: Zeljko Debeljak <zeljko.debeljak at="" gmail.com=""> >> Date: July 24, 2009 9:21:40 AM PDT >> To: Michael Dondrup <michael.dondrup at="" bccs.uib.no=""> >> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at="" stat.math.ethz.ch=""> >> Subject: Re: [BioC] Statistics for next-generation sequencing >> transcriptomics >> >> Dear all, >> >> according to the described problem maybe binomial test could be >> designed for this purpose? Just a thought. >> >> Zeljko Debeljak, PhD >> Medical Biochemistry Specialist >> Osijek Clinical Hospital >> CROATIA >> >> 2009/7/24, Michael Dondrup <michael.dondrup at="" bccs.uib.no="">: >>> Hi again, >>> >>> I see the point now :) >>> Just another idea, did you take the number of all reads sequenced as >>> a total, or the number of all reads successfully mapped to the >>> genome/ >>> reference sequence? What happens, when only the mapped reads are >>> counted? >>> I suspect that the choice of this could influence whatever method is >>> applied, because large fractions of unmapped reads (given there are >>> such) could -artificially?- increase the confidence in the result. >>> >>> Michael >>> >>> >>> Am 24.07.2009 um 17:25 schrieb michael watson (IAH-C): >>> >>>> Hi Michael >>>> >>>> No, you're not missing anything, I wrote down my example >>>> incorrectly. I wrote down the elements of the contingency table >>>> rather than the totals, so it should have been: >>>> >>>>> mat <- matrix(c(22000,260000,48000,507000), nrow=2) >>>>> mat >>>> [,1] [,2] >>>> [1,] 22000 48000 >>>> [2,] 260000 507000 >>>>> fisher.test(mat) >>>> >>>> Fisher's Exact Test for Count Data >>>> >>>> data: mat >>>> p-value < 2.2e-16 >>>> alternative hypothesis: true odds ratio is not equal to 1 >>>> 95 percent confidence interval: >>>> 0.8789286 0.9087655 >>>> sample estimates: >>>> odds ratio >>>> 0.8937356 >>>> >>>> Sorry about that! >>>> >>>> This is a case where I suspect there is a real difference, as the >>>> relative frequency rises from 0.084 to 0.094. However, as I >>>> mentioned, this result is masked by all the other "significant" >>>> results. As Naomi says, it is because as the sample size gets >>>> larger, we have the power to detect tiny changes as significant. >>>> >>>> So what is the solution? >>>> >>>> John Herbert has suggested something, and I will try that. >>>> >>>> Thanks >>>> Michael >>>> >>>> -----Original Message----- >>>> From: Michael Dondrup [mailto:Michael.Dondrup at bccs.uib.no] >>>> Sent: 24 July 2009 15:00 >>>> To: michael watson (IAH-C) >>>> Cc: bioconductor at stat.math.ethz.ch >>>> Subject: Re: [BioC] Statistics for next-generation sequencing >>>> transcriptomics >>>> >>>> Hi Michael, >>>> I am having a very similar problem using 454 seq data, so I am very >>>> much interested in this discussion. However, I do not quite >>>> understand how to >>>> for the contigency table and to achieve such small p-value here. My >>>> naive approach would be to count hits to GeneA and to count hits to >>>> the rest of the genome (all - #hits to gene A), giving a pretty >>>> much >>>> unbalanced 2x2 table like this: >>>>> mat >>>> Sample.1 Sample.2 >>>> Gene.A 22000 43000 >>>> The.rest 238000 464000 >>>> but then I do not see the point here, because there is a large p >>>> value, as I would expect: >>>> >>>>> fisher.test(mat) >>>> >>>> Fisher's Exact Test for Count Data >>>> >>>> data: mat >>>> p-value = 0.7717 >>>> alternative hypothesis: true odds ratio is not equal to 1 >>>> 95 percent confidence interval: >>>> 0.9805937 1.0145920 >>>> sample estimates: >>>> odds ratio >>>> 0.9974594 >>>> >>>> Am I missing something? >>>> >>>> Best >>>> Michael >>>> >>>> >>>> Am 24.07.2009 um 13:22 schrieb michael watson (IAH-C): >>>> >>>>> Hi >>>>> >>>>> I'd like to have a discussion about statistics for transcriptomics >>>>> using next-generation sequencing (if there hasn't already been >>>>> one - >>>>> if there has, then please someone point me to it!) >>>>> >>>>> What we're seeing in the literature, and here at IAH, are datasets >>>>> where someone has sequenced the transcriptome of two samples using >>>>> something like Illumina. These have been mapped to known >>>>> sequences >>>>> and counts produced. >>>>> >>>>> So what we have is something like this: >>>>> >>>>> geneA: 22000 sequences from 260000 match in sample 1, 43000 >>>>> sequences from 507000 in sample 2. >>>>> >>>>> It's been suggested that one possible approach would be to >>>>> construct >>>>> 2x2 contingency tables and perform Fisher's exact test or the Chi- >>>>> squared test, as has been applied to SAGE data. >>>>> However, I've found that when I do that, the p-values for this >>>>> type >>>>> of data are incredibly, incredibly small, such that over 90% of my >>>>> data points are significant, even after adjusting for multiple >>>>> testing. I assume/hope that this is because these tests were not >>>>> designed to cope with this type of data. >>>>> >>>>> For instance, applying Fisher's test to the example above yields >>>>> a p- >>>>> value of 3.798644e-23. >>>>> >>>>> As I see it there are three possibilities: >>>>> 1) I'm doing something wrong >>>>> 2) These tests are totally inappropriate for this type of data >>>>> 3) All of my data points are highly significantly different >>>>> >>>>> I'm thinking that 2 is probably true, though I wouldn't rule out >>>>> 1. >>>>> >>>>> Any thoughts and comments are very welcome, >>>>> >>>>> Mick >>>>> >>>>> >>>>> [[alternative HTML version deleted]] >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at stat.math.ethz.ch >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>>> >>> >>> Michael Dondrup, Ph.D. >>> Bergen Center for Computational Science >>> Computational Biology Unit >>> Unifob AS - Thorm?hlensgate 55, N-5008 Bergen, Norway >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 14.7 years ago Margaret Taub ▴ 10

0

Entering edit mode

Has anyone considered the possibility to use some resampling statistics concepts for these purposes? I have found some interesting alternatives to Chi2 and similar tests based on the resampling approach in the book "Data Analysis by Resampling: Concepts and Applications" by C.E. Lunneborg. It seems that such approaches are "resilient" to the type of the underlying distribution. I must say I didn't use these methods for this particular type of problem but I do have some positive experience with other applications of resampling in the statistical hypothesis testing field. Zeljko Debeljak, PhD Medical Biochemistry Specialist Osijek Clinical Hospital CROATIA 2009/7/25, Margaret Taub <mtaub at="" stat.berkeley.edu="">: > > I've recently spent some time investigating this topic and on the data > set I have examined, where a Poisson model seems to hold very well > across technical replicate lanes (based on chi-squared goodness-of- fit > statistics), Fisher's exact test, the likelihood ratio test, and the > binomial test all perform equivalently in terms of identifying > differentially expressed genes and in terms of producing very low p- > values. I also compared against a recently developed method for SAGE > analysis, the edgeR package of Robinson and Smyth, but perhaps due to > the almost exact Poisson variability in my data, their method did not > perform better than the others. (The method is designed to work > particularly well on overdispersed Poisson data.) And using just > mapped reads for the lane totals (suggested in this thread by Michael > Dondrup) did not have much effect - the counts were still very high. > > I think overall that Naomi is right - in this Poisson context, > frequentist methods will produce very small p-values. I do think that > part of the problem is that most RNASeq data sets produced today are > not capturing biological variability, so we are really just doing a > one-versus-one comparison even when we have "replicate" lanes (which > are almost exclusively technical replicates, or at best replicates on > cell lines) and hopefully once this is taken into account more > reasonable test statistics will be seen. > > I also agree with John Herbert's point about transcript length having > an effect here, and I'm not sure that a good solution has been > developed for this yet either, see Oshlack and Wakefield's recent > paper for a discussion: > http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2678084&to ol=pmcentrez > . They point out that for Poisson data, dividing by transcript length > does not fix the length-bias problem. > > Cheers, > Margaret > > > Margaret Taub > PhD Candidate > UC Berkeley Dept of Statistics > > > >>> From: Zeljko Debeljak <zeljko.debeljak at="" gmail.com=""> >>> Date: July 24, 2009 9:21:40 AM PDT >>> To: Michael Dondrup <michael.dondrup at="" bccs.uib.no=""> >>> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at="" stat.math.ethz.ch=""> >>> Subject: Re: [BioC] Statistics for next-generation sequencing >>> transcriptomics >>> >>> Dear all, >>> >>> according to the described problem maybe binomial test could be >>> designed for this purpose? Just a thought. >>> >>> Zeljko Debeljak, PhD >>> Medical Biochemistry Specialist >>> Osijek Clinical Hospital >>> CROATIA >>> >>> 2009/7/24, Michael Dondrup <michael.dondrup at="" bccs.uib.no="">: >>>> Hi again, >>>> >>>> I see the point now :) >>>> Just another idea, did you take the number of all reads sequenced as >>>> a total, or the number of all reads successfully mapped to the >>>> genome/ >>>> reference sequence? What happens, when only the mapped reads are >>>> counted? >>>> I suspect that the choice of this could influence whatever method is >>>> applied, because large fractions of unmapped reads (given there are >>>> such) could -artificially?- increase the confidence in the result. >>>> >>>> Michael >>>> >>>> >>>> Am 24.07.2009 um 17:25 schrieb michael watson (IAH-C): >>>> >>>>> Hi Michael >>>>> >>>>> No, you're not missing anything, I wrote down my example >>>>> incorrectly. I wrote down the elements of the contingency table >>>>> rather than the totals, so it should have been: >>>>> >>>>>> mat <- matrix(c(22000,260000,48000,507000), nrow=2) >>>>>> mat >>>>> [,1] [,2] >>>>> [1,] 22000 48000 >>>>> [2,] 260000 507000 >>>>>> fisher.test(mat) >>>>> >>>>> Fisher's Exact Test for Count Data >>>>> >>>>> data: mat >>>>> p-value < 2.2e-16 >>>>> alternative hypothesis: true odds ratio is not equal to 1 >>>>> 95 percent confidence interval: >>>>> 0.8789286 0.9087655 >>>>> sample estimates: >>>>> odds ratio >>>>> 0.8937356 >>>>> >>>>> Sorry about that! >>>>> >>>>> This is a case where I suspect there is a real difference, as the >>>>> relative frequency rises from 0.084 to 0.094. However, as I >>>>> mentioned, this result is masked by all the other "significant" >>>>> results. As Naomi says, it is because as the sample size gets >>>>> larger, we have the power to detect tiny changes as significant. >>>>> >>>>> So what is the solution? >>>>> >>>>> John Herbert has suggested something, and I will try that. >>>>> >>>>> Thanks >>>>> Michael >>>>> >>>>> -----Original Message----- >>>>> From: Michael Dondrup [mailto:Michael.Dondrup at bccs.uib.no] >>>>> Sent: 24 July 2009 15:00 >>>>> To: michael watson (IAH-C) >>>>> Cc: bioconductor at stat.math.ethz.ch >>>>> Subject: Re: [BioC] Statistics for next-generation sequencing >>>>> transcriptomics >>>>> >>>>> Hi Michael, >>>>> I am having a very similar problem using 454 seq data, so I am very >>>>> much interested in this discussion. However, I do not quite >>>>> understand how to >>>>> for the contigency table and to achieve such small p-value here. My >>>>> naive approach would be to count hits to GeneA and to count hits to >>>>> the rest of the genome (all - #hits to gene A), giving a pretty >>>>> much >>>>> unbalanced 2x2 table like this: >>>>>> mat >>>>> Sample.1 Sample.2 >>>>> Gene.A 22000 43000 >>>>> The.rest 238000 464000 >>>>> but then I do not see the point here, because there is a large p >>>>> value, as I would expect: >>>>> >>>>>> fisher.test(mat) >>>>> >>>>> Fisher's Exact Test for Count Data >>>>> >>>>> data: mat >>>>> p-value = 0.7717 >>>>> alternative hypothesis: true odds ratio is not equal to 1 >>>>> 95 percent confidence interval: >>>>> 0.9805937 1.0145920 >>>>> sample estimates: >>>>> odds ratio >>>>> 0.9974594 >>>>> >>>>> Am I missing something? >>>>> >>>>> Best >>>>> Michael >>>>> >>>>> >>>>> Am 24.07.2009 um 13:22 schrieb michael watson (IAH-C): >>>>> >>>>>> Hi >>>>>> >>>>>> I'd like to have a discussion about statistics for transcriptomics >>>>>> using next-generation sequencing (if there hasn't already been >>>>>> one - >>>>>> if there has, then please someone point me to it!) >>>>>> >>>>>> What we're seeing in the literature, and here at IAH, are datasets >>>>>> where someone has sequenced the transcriptome of two samples using >>>>>> something like Illumina. These have been mapped to known >>>>>> sequences >>>>>> and counts produced. >>>>>> >>>>>> So what we have is something like this: >>>>>> >>>>>> geneA: 22000 sequences from 260000 match in sample 1, 43000 >>>>>> sequences from 507000 in sample 2. >>>>>> >>>>>> It's been suggested that one possible approach would be to >>>>>> construct >>>>>> 2x2 contingency tables and perform Fisher's exact test or the Chi- >>>>>> squared test, as has been applied to SAGE data. >>>>>> However, I've found that when I do that, the p-values for this >>>>>> type >>>>>> of data are incredibly, incredibly small, such that over 90% of my >>>>>> data points are significant, even after adjusting for multiple >>>>>> testing. I assume/hope that this is because these tests were not >>>>>> designed to cope with this type of data. >>>>>> >>>>>> For instance, applying Fisher's test to the example above yields >>>>>> a p- >>>>>> value of 3.798644e-23. >>>>>> >>>>>> As I see it there are three possibilities: >>>>>> 1) I'm doing something wrong >>>>>> 2) These tests are totally inappropriate for this type of data >>>>>> 3) All of my data points are highly significantly different >>>>>> >>>>>> I'm thinking that 2 is probably true, though I wouldn't rule out >>>>>> 1. >>>>>> >>>>>> Any thoughts and comments are very welcome, >>>>>> >>>>>> Mick >>>>>> >>>>>> >>>>>> [[alternative HTML version deleted]] >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at stat.math.ethz.ch >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> >>>>> >>>> >>>> Michael Dondrup, Ph.D. >>>> Bergen Center for Computational Science >>>> Computational Biology Unit >>>> Unifob AS - Thorm?hlensgate 55, N-5008 Bergen, Norway >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at stat.math.ethz.ch >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 14.7 years ago Zeljko Debeljak ▴ 50

0

Entering edit mode

So, if we resample many times with replacement from sample 2 with sample size n1 (where n1 is the size of sample 1), we will build up a distribution of counts for each gene, and we can then compare observed counts from sample 1 to the resampled distributions which are based on the count structure of sample 2. ________________________________________ From: bioconductor-bounces@stat.math.ethz.ch [bioconductor- bounces@stat.math.ethz.ch] On Behalf Of Zeljko Debeljak [zeljko.debeljak@gmail.com] Sent: 25 July 2009 09:11 To: Margaret Taub Cc: bioconductor at stat.math.ethz.ch Subject: Re: [BioC] Statistics for next-generation sequencing transcriptomics Has anyone considered the possibility to use some resampling statistics concepts for these purposes? I have found some interesting alternatives to Chi2 and similar tests based on the resampling approach in the book "Data Analysis by Resampling: Concepts and Applications" by C.E. Lunneborg. It seems that such approaches are "resilient" to the type of the underlying distribution. I must say I didn't use these methods for this particular type of problem but I do have some positive experience with other applications of resampling in the statistical hypothesis testing field. Zeljko Debeljak, PhD Medical Biochemistry Specialist Osijek Clinical Hospital CROATIA 2009/7/25, Margaret Taub <mtaub at="" stat.berkeley.edu="">: > > I've recently spent some time investigating this topic and on the data > set I have examined, where a Poisson model seems to hold very well > across technical replicate lanes (based on chi-squared goodness-of- fit > statistics), Fisher's exact test, the likelihood ratio test, and the > binomial test all perform equivalently in terms of identifying > differentially expressed genes and in terms of producing very low p- > values. I also compared against a recently developed method for SAGE > analysis, the edgeR package of Robinson and Smyth, but perhaps due to > the almost exact Poisson variability in my data, their method did not > perform better than the others. (The method is designed to work > particularly well on overdispersed Poisson data.) And using just > mapped reads for the lane totals (suggested in this thread by Michael > Dondrup) did not have much effect - the counts were still very high. > > I think overall that Naomi is right - in this Poisson context, > frequentist methods will produce very small p-values. I do think that > part of the problem is that most RNASeq data sets produced today are > not capturing biological variability, so we are really just doing a > one-versus-one comparison even when we have "replicate" lanes (which > are almost exclusively technical replicates, or at best replicates on > cell lines) and hopefully once this is taken into account more > reasonable test statistics will be seen. > > I also agree with John Herbert's point about transcript length having > an effect here, and I'm not sure that a good solution has been > developed for this yet either, see Oshlack and Wakefield's recent > paper for a discussion: > http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2678084&to ol=pmcentrez > . They point out that for Poisson data, dividing by transcript length > does not fix the length-bias problem. > > Cheers, > Margaret > > > Margaret Taub > PhD Candidate > UC Berkeley Dept of Statistics > > > >>> From: Zeljko Debeljak <zeljko.debeljak at="" gmail.com=""> >>> Date: July 24, 2009 9:21:40 AM PDT >>> To: Michael Dondrup <michael.dondrup at="" bccs.uib.no=""> >>> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at="" stat.math.ethz.ch=""> >>> Subject: Re: [BioC] Statistics for next-generation sequencing >>> transcriptomics >>> >>> Dear all, >>> >>> according to the described problem maybe binomial test could be >>> designed for this purpose? Just a thought. >>> >>> Zeljko Debeljak, PhD >>> Medical Biochemistry Specialist >>> Osijek Clinical Hospital >>> CROATIA >>> >>> 2009/7/24, Michael Dondrup <michael.dondrup at="" bccs.uib.no="">: >>>> Hi again, >>>> >>>> I see the point now :) >>>> Just another idea, did you take the number of all reads sequenced as >>>> a total, or the number of all reads successfully mapped to the >>>> genome/ >>>> reference sequence? What happens, when only the mapped reads are >>>> counted? >>>> I suspect that the choice of this could influence whatever method is >>>> applied, because large fractions of unmapped reads (given there are >>>> such) could -artificially?- increase the confidence in the result. >>>> >>>> Michael >>>> >>>> >>>> Am 24.07.2009 um 17:25 schrieb michael watson (IAH-C): >>>> >>>>> Hi Michael >>>>> >>>>> No, you're not missing anything, I wrote down my example >>>>> incorrectly. I wrote down the elements of the contingency table >>>>> rather than the totals, so it should have been: >>>>> >>>>>> mat <- matrix(c(22000,260000,48000,507000), nrow=2) >>>>>> mat >>>>> [,1] [,2] >>>>> [1,] 22000 48000 >>>>> [2,] 260000 507000 >>>>>> fisher.test(mat) >>>>> >>>>> Fisher's Exact Test for Count Data >>>>> >>>>> data: mat >>>>> p-value < 2.2e-16 >>>>> alternative hypothesis: true odds ratio is not equal to 1 >>>>> 95 percent confidence interval: >>>>> 0.8789286 0.9087655 >>>>> sample estimates: >>>>> odds ratio >>>>> 0.8937356 >>>>> >>>>> Sorry about that! >>>>> >>>>> This is a case where I suspect there is a real difference, as the >>>>> relative frequency rises from 0.084 to 0.094. However, as I >>>>> mentioned, this result is masked by all the other "significant" >>>>> results. As Naomi says, it is because as the sample size gets >>>>> larger, we have the power to detect tiny changes as significant. >>>>> >>>>> So what is the solution? >>>>> >>>>> John Herbert has suggested something, and I will try that. >>>>> >>>>> Thanks >>>>> Michael >>>>> >>>>> -----Original Message----- >>>>> From: Michael Dondrup [mailto:Michael.Dondrup at bccs.uib.no] >>>>> Sent: 24 July 2009 15:00 >>>>> To: michael watson (IAH-C) >>>>> Cc: bioconductor at stat.math.ethz.ch >>>>> Subject: Re: [BioC] Statistics for next-generation sequencing >>>>> transcriptomics >>>>> >>>>> Hi Michael, >>>>> I am having a very similar problem using 454 seq data, so I am very >>>>> much interested in this discussion. However, I do not quite >>>>> understand how to >>>>> for the contigency table and to achieve such small p-value here. My >>>>> naive approach would be to count hits to GeneA and to count hits to >>>>> the rest of the genome (all - #hits to gene A), giving a pretty >>>>> much >>>>> unbalanced 2x2 table like this: >>>>>> mat >>>>> Sample.1 Sample.2 >>>>> Gene.A 22000 43000 >>>>> The.rest 238000 464000 >>>>> but then I do not see the point here, because there is a large p >>>>> value, as I would expect: >>>>> >>>>>> fisher.test(mat) >>>>> >>>>> Fisher's Exact Test for Count Data >>>>> >>>>> data: mat >>>>> p-value = 0.7717 >>>>> alternative hypothesis: true odds ratio is not equal to 1 >>>>> 95 percent confidence interval: >>>>> 0.9805937 1.0145920 >>>>> sample estimates: >>>>> odds ratio >>>>> 0.9974594 >>>>> >>>>> Am I missing something? >>>>> >>>>> Best >>>>> Michael >>>>> >>>>> >>>>> Am 24.07.2009 um 13:22 schrieb michael watson (IAH-C): >>>>> >>>>>> Hi >>>>>> >>>>>> I'd like to have a discussion about statistics for transcriptomics >>>>>> using next-generation sequencing (if there hasn't already been >>>>>> one - >>>>>> if there has, then please someone point me to it!) >>>>>> >>>>>> What we're seeing in the literature, and here at IAH, are datasets >>>>>> where someone has sequenced the transcriptome of two samples using >>>>>> something like Illumina. These have been mapped to known >>>>>> sequences >>>>>> and counts produced. >>>>>> >>>>>> So what we have is something like this: >>>>>> >>>>>> geneA: 22000 sequences from 260000 match in sample 1, 43000 >>>>>> sequences from 507000 in sample 2. >>>>>> >>>>>> It's been suggested that one possible approach would be to >>>>>> construct >>>>>> 2x2 contingency tables and perform Fisher's exact test or the Chi- >>>>>> squared test, as has been applied to SAGE data. >>>>>> However, I've found that when I do that, the p-values for this >>>>>> type >>>>>> of data are incredibly, incredibly small, such that over 90% of my >>>>>> data points are significant, even after adjusting for multiple >>>>>> testing. I assume/hope that this is because these tests were not >>>>>> designed to cope with this type of data. >>>>>> >>>>>> For instance, applying Fisher's test to the example above yields >>>>>> a p- >>>>>> value of 3.798644e-23. >>>>>> >>>>>> As I see it there are three possibilities: >>>>>> 1) I'm doing something wrong >>>>>> 2) These tests are totally inappropriate for this type of data >>>>>> 3) All of my data points are highly significantly different >>>>>> >>>>>> I'm thinking that 2 is probably true, though I wouldn't rule out >>>>>> 1. >>>>>> >>>>>> Any thoughts and comments are very welcome, >>>>>> >>>>>> Mick >>>>>> >>>>>> >>>>>> [[alternative HTML version deleted]] >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at stat.math.ethz.ch >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> >>>>> >>>> >>>> Michael Dondrup, Ph.D. >>>> Bergen Center for Computational Science >>>> Computational Biology Unit >>>> Unifob AS - Thorm?hlensgate 55, N-5008 Bergen, Norway >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at stat.math.ethz.ch >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > _______________________________________________ Bioconductor mailing list Bioconductor at stat.math.ethz.ch https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 14.7 years ago michael watson IAH-C ★ 3.4k

Login before adding your answer.