Up & Downregulated genes using DESeq

0

Entering edit mode

OD ▴ 40

@od-5180

Last seen 8.8 years ago

United States

Hello All, I am using DESeq for identify differentially expressed genes among a couple of samples. At the end of the calculations i write the results to 3 files as follow: resSig <- res[ res$padj < .01, ] write.table(resSig,"~DESeq\\All.txt") write.table(resSig[ order( resSig$foldChange, -resSig$baseMean ), ] ,"~\\DESeq\\DownRegulated.txt") write.table(resSig[ order( -resSig$foldChange, -resSig$baseMean ), ],"~\\DESeq\\UpRegulated.txt") My question here is, - In the DESeq paper it is mentioned that the last two lines of code extracts the up and downregulated genes, So I'm thinking the sum of the number of genes in (DownRegulated.txt + UpRegulated.txt) should equals the number of genes with padj value < .01 in the file All.txt. But, that does not apply in my case as the numbers not even close. Any explanation or help is appreciated. -- Thank you, [[alternative HTML version deleted]]

DESeq DESeq • 6.6k views

ADD COMMENT • link updated 12.1 years ago by Sunny Yu Liu ▴ 80 • written 12.1 years ago by OD ▴ 40

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 13 months ago

United States

Hi, On Mon, Mar 26, 2012 at 10:59 PM, Omar Darwish <odarwish83 at="" gmail.com=""> wrote: > Hello All, > > I am using DESeq for identify differentially expressed genes among a couple > of samples. At the end of the calculations i write the results to 3 files > as follow: > > resSig <- res[ res$padj < .01, ] > write.table(resSig,"~DESeq\\All.txt") > write.table(resSig[ order( resSig$foldChange, -resSig$baseMean ), ] > ,"~\\DESeq\\DownRegulated.txt") > write.table(resSig[ order( -resSig$foldChange, -resSig$baseMean ), > ],"~\\DESeq\\UpRegulated.txt") > > My question here is, > > ? - In the DESeq paper it is mentioned that the last two lines of code > ? extracts the up and downregulated genes, So I'm thinking the sum of the > ? number of genes in (DownRegulated.txt + UpRegulated.txt) should equals the > ? number of genes with padj value < .01 in the file All.txt. But, that does > ? not apply in my case as the numbers not even close. Any explanation or help > ? is appreciated. I guess I must be missing something, but aren't the number of genes in your DownRegualted.txt file equal to the number of genes in your UpRegulated.txt file, which is also equal to the number of genes in your All.txt file? So, using your lingo: DownRegulated.txt + UpRegulated.txt == 2 * All.txt Right? You calls to `order(resSig$foldChange ...)` are just shuffling your rows around ... you're not removing any of them. If you want to split up and down regulated genes, you can perhaps subset rows that have foldChanges above (or below) zero, ie: up <- subset(resSig, foldChange > 0) write.table(up[order(up$foldChange, decreasing=TRUE),], "UpRegulated.txt") etc. HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD COMMENT • link 12.1 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Thanks Steve, you are right (DownRegulated.txt + UpRegulated.txt == 2 * All.txt), So, you are saying no need for the two lines of code to write (DownRegulated.txt and UpRegulated.txt) and I can extract the Upregulated using (up <- subset(resSig, foldChange > 0)) and the downregulated (down <- subset(resSig, foldChange < 0)), right? On Mon, Mar 26, 2012 at 11:09 PM, Steve Lianoglou < mailinglist.honeypot@gmail.com> wrote: > Hi, > > On Mon, Mar 26, 2012 at 10:59 PM, Omar Darwish <odarwish83@gmail.com> > wrote: > > Hello All, > > > > I am using DESeq for identify differentially expressed genes among a > couple > > of samples. At the end of the calculations i write the results to 3 files > > as follow: > > > > resSig <- res[ res$padj < .01, ] > > write.table(resSig,"~DESeq\\All.txt") > > write.table(resSig[ order( resSig$foldChange, -resSig$baseMean ), ] > > ,"~\\DESeq\\DownRegulated.txt") > > write.table(resSig[ order( -resSig$foldChange, -resSig$baseMean ), > > ],"~\\DESeq\\UpRegulated.txt") > > > > My question here is, > > > > - In the DESeq paper it is mentioned that the last two lines of code > > extracts the up and downregulated genes, So I'm thinking the sum of the > > number of genes in (DownRegulated.txt + UpRegulated.txt) should equals > the > > number of genes with padj value < .01 in the file All.txt. But, that > does > > not apply in my case as the numbers not even close. Any explanation or > help > > is appreciated. > > I guess I must be missing something, but aren't the number of genes in > your DownRegualted.txt file equal to the number of genes in your > UpRegulated.txt file, which is also equal to the number of genes in > your All.txt file? So, using your lingo: > > DownRegulated.txt + UpRegulated.txt == 2 * All.txt > > Right? > > You calls to `order(resSig$foldChange ...)` are just shuffling your > rows around ... you're not removing any of them. If you want to split > up and down regulated genes, you can perhaps subset rows that have > foldChanges above (or below) zero, ie: > > up <- subset(resSig, foldChange > 0) > write.table(up[order(up$foldChange, decreasing=TRUE),], "UpRegulated.txt") > > etc. > > HTH, > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -- Regards, Omar [[alternative HTML version deleted]]

ADD REPLY • link 12.1 years ago OD ▴ 40

0

Entering edit mode

Hi, As far as i gather, the threshold for upregulated or downregulated genes should be >2FoldChange and >-2 FoldChange respectively. Not >0 or <0. Best, Ekta -----Original Message----- From: bioconductor-bounces@r-project.org [mailto:bioconductor- bounces@r-project.org] On Behalf Of Omar Darwish Sent: 27 March 2012 09:53 To: Steve Lianoglou Cc: bioconductor at r-project.org Subject: Re: [BioC] Up & Downregulated genes using DESeq Thanks Steve, you are right (DownRegulated.txt + UpRegulated.txt == 2 * All.txt), So, you are saying no need for the two lines of code to write (DownRegulated.txt and UpRegulated.txt) and I can extract the Upregulated using (up <- subset(resSig, foldChange > 0)) and the downregulated (down <- subset(resSig, foldChange < 0)), right? On Mon, Mar 26, 2012 at 11:09 PM, Steve Lianoglou < mailinglist.honeypot at gmail.com> wrote: > Hi, > > On Mon, Mar 26, 2012 at 10:59 PM, Omar Darwish <odarwish83 at="" gmail.com=""> > wrote: > > Hello All, > > > > I am using DESeq for identify differentially expressed genes among a > couple > > of samples. At the end of the calculations i write the results to 3 files > > as follow: > > > > resSig <- res[ res$padj < .01, ] > > write.table(resSig,"~DESeq\\All.txt") > > write.table(resSig[ order( resSig$foldChange, -resSig$baseMean ), ] > > ,"~\\DESeq\\DownRegulated.txt") > > write.table(resSig[ order( -resSig$foldChange, -resSig$baseMean ), > > ],"~\\DESeq\\UpRegulated.txt") > > > > My question here is, > > > > - In the DESeq paper it is mentioned that the last two lines of code > > extracts the up and downregulated genes, So I'm thinking the sum of the > > number of genes in (DownRegulated.txt + UpRegulated.txt) should equals > the > > number of genes with padj value < .01 in the file All.txt. But, that > does > > not apply in my case as the numbers not even close. Any explanation or > help > > is appreciated. > > I guess I must be missing something, but aren't the number of genes in > your DownRegualted.txt file equal to the number of genes in your > UpRegulated.txt file, which is also equal to the number of genes in > your All.txt file? So, using your lingo: > > DownRegulated.txt + UpRegulated.txt == 2 * All.txt > > Right? > > You calls to `order(resSig$foldChange ...)` are just shuffling your > rows around ... you're not removing any of them. If you want to split > up and down regulated genes, you can perhaps subset rows that have > foldChanges above (or below) zero, ie: > > up <- subset(resSig, foldChange > 0) > write.table(up[order(up$foldChange, decreasing=TRUE),], "UpRegulated.txt") > > etc. > > HTH, > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -- Regards, Omar [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor The information contained in this electronic message and in any attachments to this message is confidential, legally privileged and intended only for use by the person or entity to which this electronic message is addressed. If you are not the intended recipient, and have received this message in error, please notify the sender and system manager by return email and delete the message and its attachments and also you are hereby notified that any distribution, copying, review, retransmission, dissemination or other use of this electronic transmission or the information contained in it is strictly prohibited. Please note that any views or opinions presented in this email are solely those of the author and may not represent those of the Company or bind the Company. Any commitments made over e-mail are not financially binding on the company unless accompanied or followed by a valid purchase order. This message has been scanned for viruses and dangerous content by Mail Scanner, and is believed to be clean. The Company accepts no liability for any damage caused by any virus transmitted by this email. www.jubl.com

ADD REPLY • link 12.1 years ago Ekta Jain ▴ 370

0

Entering edit mode

On 03/27/2012 06:34 AM, Ekta Jain wrote: > As far as i gather, the threshold for upregulated or downregulated genes should be>2FoldChange and>-2 FoldChange respectively. Not>0 or<0. Why 2? Simon

ADD REPLY • link 12.1 years ago Simon Anders ★ 3.7k

0

Entering edit mode

Would be very interested to know how to define a fold change cutoff for RNAseq data. Doesn't seem to be any power analysis that might be of use. I use 2 (arbitrarily) also. Bruce. -----Original Message----- From: bioconductor-bounces@r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Simon Anders Sent: 27 March 2012 10:15 To: bioconductor at r-project.org Subject: Re: [BioC] Up & Downregulated genes using DESeq On 03/27/2012 06:34 AM, Ekta Jain wrote: > As far as i gather, the threshold for upregulated or downregulated genes should be>2FoldChange and>-2 FoldChange respectively. Not>0 or<0. Why 2? Simon _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ---------------------------------------------------------------------- -------- Attention: This e-mail is privileged and confidential. If you are not the intended recipient please delete the message and notify the sender. Any views or opinions presented are solely those of the author. This email was scanned by Teagasc and has been certified virus free with the pattern file currently in use. This however cannot guarantee that it does not contain malicious content. Tabhair aire: Ta an r-phost seo faoi phribhleid agus faoi run. Mura tusa an duine a bhi beartaithe leis an teachtaireacht seo a fhail, scrios e le do thoil agus cuir an seoltoir ar an eolas. Is leis an udar amhain aon dearcai no tuairimi a leiritear. Scanadh an r-phost seo le Teagasc agus deimhniodh go raibh se saor o vioras leis an bpatrunchomhad ata in usaid faoi lathair. Ni feidir a rathu leis seo afach nach bhfuil abhar mailiseach ann.

ADD REPLY • link 12.1 years ago Bruce MoranExternal ▴ 50

0

Entering edit mode

I am not a statistician but from whatever experience i have had to find differentially expressed genes, the logFC cut off used has been of a two-fold change (increase or decrease). Of course, if in any analysis your fold change gene list needs be reduced or increased for number of genes you could change the two-fold threshold to zero or three. For a first off step, arbitrarily (as bruce also put it), two fold change seems ideal to use. Best, --E -----Original Message----- From: bioconductor-bounces@r-project.org [mailto:bioconductor- bounces@r-project.org] On Behalf Of Bruce Moran(External) Sent: 27 March 2012 15:07 To: Simon Anders; bioconductor at r-project.org Subject: Re: [BioC] Up & Downregulated genes using DESeq Would be very interested to know how to define a fold change cutoff for RNAseq data. Doesn't seem to be any power analysis that might be of use. I use 2 (arbitrarily) also. Bruce. -----Original Message----- From: bioconductor-bounces@r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Simon Anders Sent: 27 March 2012 10:15 To: bioconductor at r-project.org Subject: Re: [BioC] Up & Downregulated genes using DESeq On 03/27/2012 06:34 AM, Ekta Jain wrote: > As far as i gather, the threshold for upregulated or downregulated genes should be>2FoldChange and>-2 FoldChange respectively. Not>0 or<0. Why 2? Simon _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ---------------------------------------------------------------------- -------- Attention: This e-mail is privileged and confidential. If you are not the intended recipient please delete the message and notify the sender. Any views or opinions presented are solely those of the author. This email was scanned by Teagasc and has been certified virus free with the pattern file currently in use. This however cannot guarantee that it does not contain malicious content. Tabhair aire: Ta an r-phost seo faoi phribhleid agus faoi run. Mura tusa an duine a bhi beartaithe leis an teachtaireacht seo a fhail, scrios e le do thoil agus cuir an seoltoir ar an eolas. Is leis an udar amhain aon dearcai no tuairimi a leiritear. Scanadh an r-phost seo le Teagasc agus deimhniodh go raibh se saor o vioras leis an bpatrunchomhad ata in usaid faoi lathair. Ni feidir a rathu leis seo afach nach bhfuil abhar mailiseach ann. _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor The information contained in this electronic message and in any attachments to this message is confidential, legally privileged and intended only for use by the person or entity to which this electronic message is addressed. If you are not the intended recipient, and have received this message in error, please notify the sender and system manager by return email and delete the message and its attachments and also you are hereby notified that any distribution, copying, review, retransmission, dissemination or other use of this electronic transmission or the information contained in it is strictly prohibited. Please note that any views or opinions presented in this email are solely those of the author and may not represent those of the Company or bind the Company. Any commitments made over e-mail are not financially binding on the company unless accompanied or followed by a valid purchase order. This message has been scanned for viruses and dangerous content by Mail Scanner, and is believed to be clean. The Company accepts no liability for any damage caused by any virus transmitted by this email. www.jubl.com

ADD REPLY • link 12.1 years ago Ekta Jain ▴ 370

0

Entering edit mode

Dear Ekta and Bruce just for the record: Omar's initial question was how to export a result table from DESeq into text files for subsequent analysis e.g. by spreadsheets. That was answered by Steve. Then a discussion started on fold change cutoffs that might be applied in addition to the criterion of statistical significance (p-value). People might enjoy various arbitrary choices here, including "2", but this has nothing to do with DESeq or with Omar's question. best wishes Wolfgang Ekta Jain scripsit 03/27/2012 12:46 PM: > I am not a statistician but from whatever experience i have had to find differentially expressed genes, the logFC cut off used has been of a two-fold change (increase or decrease). Of course, if in any analysis your fold change gene list needs be reduced or increased for number of genes you could change the two-fold threshold to zero or three. > > For a first off step, arbitrarily (as bruce also put it), two fold change seems ideal to use. > > Best, > --E > > > > -----Original Message----- > From: bioconductor-bounces at r-project.org [mailto:bioconductor- bounces at r-project.org] On Behalf Of Bruce Moran(External) > Sent: 27 March 2012 15:07 > To: Simon Anders; bioconductor at r-project.org > Subject: Re: [BioC] Up& Downregulated genes using DESeq > > Would be very interested to know how to define a fold change cutoff for > RNAseq data. Doesn't seem to be any power analysis that might be of use. > > I use 2 (arbitrarily) also. > > Bruce. > > -----Original Message----- > From: bioconductor-bounces at r-project.org > [mailto:bioconductor-bounces at r-project.org] On Behalf Of Simon Anders > Sent: 27 March 2012 10:15 > To: bioconductor at r-project.org > Subject: Re: [BioC] Up& Downregulated genes using DESeq > > On 03/27/2012 06:34 AM, Ekta Jain wrote: >> As far as i gather, the threshold for upregulated or downregulated > genes should be>2FoldChange and>-2 FoldChange respectively. Not>0 or<0. > > Why 2? > > Simon > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -------------------------------------------------------------------- ---------- > Attention: > This e-mail is privileged and confidential. If you are not the intended recipient please delete the message and notify the sender. Any views or opinions presented are solely those of the author. > This email was scanned by Teagasc and has been certified virus free with the pattern file currently in use. This however cannot guarantee that it does not contain malicious content. > > Tabhair aire: > Ta an r-phost seo faoi phribhleid agus faoi run. Mura tusa an duine a bhi beartaithe leis an teachtaireacht seo a fhail, scrios e le do thoil agus cuir an seoltoir ar an eolas. Is leis an udar amhain aon dearcai no tuairimi a leiritear. > Scanadh an r-phost seo le Teagasc agus deimhniodh go raibh se saor o vioras leis an bpatrunchomhad ata in usaid faoi lathair. Ni feidir a rathu leis seo afach nach bhfuil abhar mailiseach ann. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > The information contained in this electronic message and in any attachments to this message is confidential, legally privileged and intended only for use by the person or entity to which this electronic message is addressed. If you are not the intended recipient, and have received this message in error, please notify the sender and system manager by return email and delete the message and its attachments and also you are hereby notified that any distribution, copying, review, retransmission, dissemination or other use of this electronic transmission or the information contained in it is strictly prohibited. Please note that any views or opinions presented in this email are solely those of the author and may not represent those of the Company or bind the Company. Any commitments made over e-mail are not financially binding on the company unless accompanied or followed by a valid purchase order. This message has been scanned for viruses and dangerous content by Mail Scanner, a! > nd is believed to be clean. The Company accepts no liability for any damage caused by any virus transmitted by this email. > www.jubl.com > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Best wishes Wolfgang Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber

ADD REPLY • link 12.1 years ago Wolfgang Huber ★ 13k

0

Entering edit mode

Sunny Yu Liu ▴ 80

@sunny-yu-liu-4982

Last seen 9.6 years ago

Because no matter how fancy the statistics is, we have to find something having biological significance. For example, in most time, even have p<0.000000000001, still get no biological effects. For most gene expression change, people always use fold change 2 as a cutoff for microarray or qPCR. As for RNAseq, since the method is much more sensitive, I guess it must lose some specificity, so I think it may need a higher cutoff number than 2. -----Original Message----- From: bioconductor-bounces@r-project.org on behalf of Simon Anders Sent: Tue 3/27/2012 5:14 AM To: bioconductor@r-project.org Subject: Re: [BioC] Up & Downregulated genes using DESeq On 03/27/2012 06:34 AM, Ekta Jain wrote: > As far as i gather, the threshold for upregulated or downregulated genes should be>2FoldChange and>-2 FoldChange respectively. Not>0 or<0. Why 2? Simon _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD COMMENT • link 12.1 years ago Sunny Yu Liu ▴ 80

0

Entering edit mode

Hi On 03/27/2012 03:43 PM, Sunny Yu Liu wrote: > Because no matter how fancy the statistics is, we have to find something > having biological significance. For example, in most time, even have > p<0.000000000001, still get no biological effects. > For most gene expression change, people always use fold change 2 as a > cutoff for microarray or qPCR. As for RNAseq, since the method is much > more sensitive, I guess it must lose some specificity, so I think it may > need a higher cutoff number than 2. I guess, this needs some clarification, before we confuse newcomers to RNA-Seq too much, A cut-off of 2 on a log2 scale means a fold change of at least four-fold. This is a lot, and there is plenty of cases where a weaker signal is biological meaningful. Depending on the experimental setup, fold changes of +/-20% (.26 on a log2 scale) or even much less can be informative. In most RNA-Seq experiments, a p value cut-off to control false discover rate at some sensible value, say 5% or 10%, will not allow any genes with a log2 fold change below a certain value to be called significant, and in my experience, this value nearly always make further fold- change cut-offs unnecessary. This belief that fold-change cut-offs are important may stem from the proliferation of incorrect analysis methods in the RNA-Seq literature. In very many papers, analysis methods based on Fisher's test or on a likelihood ratio test based on a Poisson distribution are used. In fact, there are even several reviews which suggest such approaches. Apart from the fact that these tests simply inadmissible (see e.g. Baggerly et al., Bioinformatics 19 (2003) 1477), they cause a peculiar pattern: They give all genes with expression strength above a few tens of thousand reads absurdly low p values even in case of very weak fold changes, so that nearly all strongly expressed genes are significant. I've seen in several posts on this and other mailing lists the advice to use a very small p value threshold (adjusted p value < .001 or the like), combined with a fold change cut-off, to rectify the situation. With a correct test, it is rare that you have significant genes with so weak fold change that you have to doubt their biological relevance. I see, however, at least one case where a fold-change cut-off is useful: It is unavoidable that the decision boundary that separated significant from non-significant fold-changes goes down with increasing count values. Hence, any list of "hits" will be enriched for strong genes. If this is problematic for downstream analysis, one may opt, as a crude but workable remedy, to decide on a fold change cut-off and consider all genes below this as not significant _and_ omit from the universe of all enrichment tests all genes with a count value below the count required for this fold change to becone significant (i.e., considering these weak genes as essentially "not testable" rather than "not significant"). Simon

ADD REPLY • link 12.1 years ago Simon Anders ★ 3.7k

0

Entering edit mode

Sunny Yu Liu ▴ 80

@sunny-yu-liu-4982

Last seen 9.6 years ago

Great explanation! Thanks! Have to clarify that when I mention foldchange 2, it is not log2. For differentiated expression genes, if taking high confident hits for downstream validation and analysis, it should be more consistent and reliable. However, I think false positive rate should increase for hits close to the cutoff line, and I am wondering how much this is caused/affected by the noise level of the method/data. Is any literature discussing about this? Say how the noise of data affects cutoff and false positive rate. If so, can cite them here? Thanks! -----Original Message----- From: bioconductor-bounces@r-project.org on behalf of Simon Anders Sent: Tue 3/27/2012 10:06 AM Cc: bioconductor@r-project.org Subject: Re: [BioC] Up & Downregulated genes using DESeq Hi On 03/27/2012 03:43 PM, Sunny Yu Liu wrote: > Because no matter how fancy the statistics is, we have to find something > having biological significance. For example, in most time, even have > p<0.000000000001, still get no biological effects. > For most gene expression change, people always use fold change 2 as a > cutoff for microarray or qPCR. As for RNAseq, since the method is much > more sensitive, I guess it must lose some specificity, so I think it may > need a higher cutoff number than 2. I guess, this needs some clarification, before we confuse newcomers to RNA-Seq too much, A cut-off of 2 on a log2 scale means a fold change of at least four-fold. This is a lot, and there is plenty of cases where a weaker signal is biological meaningful. Depending on the experimental setup, fold changes of +/-20% (.26 on a log2 scale) or even much less can be informative. In most RNA-Seq experiments, a p value cut-off to control false discover rate at some sensible value, say 5% or 10%, will not allow any genes with a log2 fold change below a certain value to be called significant, and in my experience, this value nearly always make further fold- change cut-offs unnecessary. This belief that fold-change cut-offs are important may stem from the proliferation of incorrect analysis methods in the RNA-Seq literature. In very many papers, analysis methods based on Fisher's test or on a likelihood ratio test based on a Poisson distribution are used. In fact, there are even several reviews which suggest such approaches. Apart from the fact that these tests simply inadmissible (see e.g. Baggerly et al., Bioinformatics 19 (2003) 1477), they cause a peculiar pattern: They give all genes with expression strength above a few tens of thousand reads absurdly low p values even in case of very weak fold changes, so that nearly all strongly expressed genes are significant. I've seen in several posts on this and other mailing lists the advice to use a very small p value threshold (adjusted p value < .001 or the like), combined with a fold change cut-off, to rectify the situation. With a correct test, it is rare that you have significant genes with so weak fold change that you have to doubt their biological relevance. I see, however, at least one case where a fold-change cut-off is useful: It is unavoidable that the decision boundary that separated significant from non-significant fold-changes goes down with increasing count values. Hence, any list of "hits" will be enriched for strong genes. If this is problematic for downstream analysis, one may opt, as a crude but workable remedy, to decide on a fold change cut-off and consider all genes below this as not significant _and_ omit from the universe of all enrichment tests all genes with a count value below the count required for this fold change to becone significant (i.e., considering these weak genes as essentially "not testable" rather than "not significant"). Simon _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD COMMENT • link 12.1 years ago Sunny Yu Liu ▴ 80

0

Entering edit mode

HU Sunny On 2012-03-27 21:24, Sunny Yu Liu wrote: > Great explanation! Thanks! > Have to clarify that when I mention foldchange 2, it is not log2. > For differentiated expression genes, if taking high confident hits for > downstream validation and analysis, it should be more consistent and > reliable. However, I think false positive rate should increase for hits > close to the cutoff line, and I am wondering how much this is > caused/affected by the noise level of the method/data. Is any literature > discussing about this? Say how the noise of data affects cutoff and > false positive rate. If so, can cite them here? Thanks! The whole point of statistical hypothesis testing is to give your guarantees about the false positive rate. If this should be unclear, remind yourself of the exact meaning of the term 'p value'. In genomics, we usually work with false discovery rate (FDR) control: If you adjust your p values with Benjamini-Hochberg (BH) and then cut at, say, .1, this means that the FDR is <= 10%, i.e., the list of all genes with padj < .1 should contain less than 10% false positives. By "should" I mean that a sound hypothesis testing method must provide this if its assumptions are fulfilled. Benjamini and Hochberg's original false discovery rate (FDR) is a property of the whole list of genes with adjusted p value below the chosen threshold and hence some kind of average probability for a gene to be a false positives. More recently, other authors have introduced the concept of "local FDR", which aims to capture that genes in the list have different probability of being a false positives depending on the individual gene's signal-to-noise ratio. I was recently searching for a good review on this that can be recommended to non-statistician readers, but did not find anything suitable. Maybe somebody else on the list has a recommendation? About verification: Be careful to distinguish two situation: (i) In a paper you present a list of differentially expressed genes and wish to convince the reader that this list does not contain more than 10% false positives. To this end you perform verification experiments on a selection of genes. Here, it would be cheating to only attempt to verify high-confidence (low p value) hits, because it does not help defending the claim that your significance cut-off is at the right place if you do not try to genes close to the cut-off. (Once you pay attention to it you will notice that this "cheat" is used quite commonly in genomics papers.) (ii) The purpose of your experiment is a screen to select promising genes for detailed follow-up study. Then, you should, of course, try your luck with the hits with highest confidence. I hope these explanations helped. Simon

ADD REPLY • link 12.1 years ago Simon Anders ★ 3.7k

0

Entering edit mode

Hi Simon, On Tue, Mar 27, 2012 at 1:05 PM, Simon Anders <anders@embl.de> wrote: > > I was recently searching for a good review on this that can be recommended > to non-statistician readers, but did not find anything suitable. Maybe > somebody else on the list has a recommendation? > > Stefanie Scheid's paper is admirably direct and to-the-point: http://bioinformatics.oxfordjournals.org/content/21/12/2921.full.pdf William Noble's paper in Nature Biotech touches on local FDR: http://www.nature.com/nbt/journal/v27/n12/full/nbt1209-1135.html Brad Efron's course at Stanford goes into much much much more depth: http://www-stat.stanford.edu/~omkar/329/ (PDF of book is online) Maybe some of these are somewhat useful? I think that Noble's paper is perhaps the most introductory, then Scheid's, and then Efron's course probably has new insights to offer anyone who hasn't spent a lifetime studying empirical Bayes methods. So, um, the opposite of introductory. Thanks for assembling a great package in DEXseq and also for your very helpful remarks on the list, --t -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD REPLY • link 12.1 years ago Tim Triche ★ 4.2k

0

Entering edit mode

Sunny Yu Liu ▴ 80

@sunny-yu-liu-4982

Last seen 9.6 years ago

Clear enough! Thanks a lot! -----Original Message----- From: Simon Anders [mailto:anders@embl.de] Sent: Tue 3/27/2012 4:05 PM To: Sunny Yu Liu Cc: bioconductor@r-project.org Subject: Re: [BioC] Up & Downregulated genes using DESeq HU Sunny On 2012-03-27 21:24, Sunny Yu Liu wrote: > Great explanation! Thanks! > Have to clarify that when I mention foldchange 2, it is not log2. > For differentiated expression genes, if taking high confident hits for > downstream validation and analysis, it should be more consistent and > reliable. However, I think false positive rate should increase for hits > close to the cutoff line, and I am wondering how much this is > caused/affected by the noise level of the method/data. Is any literature > discussing about this? Say how the noise of data affects cutoff and > false positive rate. If so, can cite them here? Thanks! The whole point of statistical hypothesis testing is to give your guarantees about the false positive rate. If this should be unclear, remind yourself of the exact meaning of the term 'p value'. In genomics, we usually work with false discovery rate (FDR) control: If you adjust your p values with Benjamini-Hochberg (BH) and then cut at, say, .1, this means that the FDR is <= 10%, i.e., the list of all genes with padj < .1 should contain less than 10% false positives. By "should" I mean that a sound hypothesis testing method must provide this if its assumptions are fulfilled. Benjamini and Hochberg's original false discovery rate (FDR) is a property of the whole list of genes with adjusted p value below the chosen threshold and hence some kind of average probability for a gene to be a false positives. More recently, other authors have introduced the concept of "local FDR", which aims to capture that genes in the list have different probability of being a false positives depending on the individual gene's signal-to-noise ratio. I was recently searching for a good review on this that can be recommended to non-statistician readers, but did not find anything suitable. Maybe somebody else on the list has a recommendation? About verification: Be careful to distinguish two situation: (i) In a paper you present a list of differentially expressed genes and wish to convince the reader that this list does not contain more than 10% false positives. To this end you perform verification experiments on a selection of genes. Here, it would be cheating to only attempt to verify high-confidence (low p value) hits, because it does not help defending the claim that your significance cut-off is at the right place if you do not try to genes close to the cut-off. (Once you pay attention to it you will notice that this "cheat" is used quite commonly in genomics papers.) (ii) The purpose of your experiment is a screen to select promising genes for detailed follow-up study. Then, you should, of course, try your luck with the hits with highest confidence. I hope these explanations helped. Simon [[alternative HTML version deleted]]

ADD COMMENT • link 12.1 years ago Sunny Yu Liu ▴ 80

Login before adding your answer.