limma
@gordon-smyth
Last seen 4 hours ago
WEHI, Melbourne, Australia
Dear Seraya,

Wei Shi has put his finger on the issue in a recent reply to this thread. Let me elaborate.

Firstly, the true fold change for a gene expressed in one condition but not the other is infinity. While true, this is a bit unhelpful, because an infinite fold change doesn't tell you whether the gene is highly expressed or just barely expressed in the condition in which it is expressed.

limma gets around this problem in the following way (assuming that you're using limma preprocessing as well as linear model fitting). limma offsets all the expression values away from zero, so that all genes get a minimum expression level. If you use the limma neqc() function to normalize Illumina data, the default offset is 16, translating to 4 on the log2 scale. This is why your AveExpr values will never be less than 4. So, when a gene is absent in one condition and has average expression value x in the other, the fold change is computed something like:

logFC = log2( (x + 16) / (0 + 16) )

This means that limma never returns an infinite fold change. Note also that the denominator is not noise; rather, it is the offset plus a very small amount of noise. This means that the estimated fold change is not unstable or highly variable. It is quite stable, but biased. The act of offsetting the expression values away from zero means that the fold changes tend to be underestimated, although the bias is negligible for highly expressed genes. Generally speaking, the gain in noise reduction and statistical power that arises from using a small offset far outweighs the disadvantage of biasing the fold changes. This has been extensively discussed in the recent paper:

Shi, W, Oshlack, A, and Smyth, GK (2010). Optimizing the noise versus bias trade-off for Illumina Whole Genome Expression BeadChips. Nucleic Acids Research 38, e204.

By the way, you might also like to try out the propexpr() function in limma; see:

Shi, W, de Graaf, C, Kinkel, S, Achtman, A, Baldwin, T, Schofield, L, Scott, H, Hilton, D, Smyth, GK (2010). Estimating the proportion of microarray probes expressed in an RNA sample. Nucleic Acids Research 38, 2168-2176.

You could say to the reviewer: "limma ensures that all probes are assigned at least a minimum non-zero expression level on all arrays, in order to minimize the variability of log-intensities for lowly expressed probes. Probes that are expressed in one condition but not the other will be assigned a large fold change for which the denominator is the minimum expression level. This approach has the advantage that genes can be ranked by fold change in a meaningful way, because genes with larger expression changes will always be assigned a larger fold change."

Best wishes
Gordon
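[A minimal R sketch of the arithmetic above; the intensity x is an illustrative value, and 16 is the default offset used by neqc().]

```r
## Illustrative values only: x is a hypothetical average intensity in the
## condition where the gene is expressed; 16 is neqc()'s default offset.
offset <- 16
x <- 1000
logFC <- log2((x + offset) / (0 + offset))
logFC   # about 5.99: a large but finite and stable log2 fold change
```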
> Date: Tue, 05 Apr 2011 18:05:08 +0200
> From: "Seraya Maouche" (seraya.maouche at uk-sh.de)
> To: bioconductor at r-project.org
> Subject: [BioC] Limma
>
> Dear Prof Gordon, dear Bioconductor members,
>
> I have performed gene expression analysis using limma (Illumina HumanRef-8) comparing two types of cells (referred to below as cond1 and cond2). Based on detection calls, I filtered out transcripts that are absent in both types of cells. Transcripts that were expressed in only one cell type were included in the analysis.
>
> I have received the comment below from a reviewer, who does not seem to agree with calculating fold changes for genes expressed in only one condition. Would it be possible to have your opinion about this?
>
> Thank you in advance for your time,
> S Maouche
>
> "There is a little conceptual difficulty related to the cond1/cond2 comparisons for genes that are considered not detected. If a gene product is absent (0) in one cell then no fold change can be computed (table 2). I don't know how to circumvent this difficulty except by saying that the 'noise' is considered to reflect low expression. The terms 'not detected' and 'not expressed' are often used interchangeably, although they are not the same. Detection is based on the definition adopted, and in many places in the manuscript it should be used in place of expression."
Preprocessing limma • 1.6k views
@seraya-maouche-4583
Last seen 9.7 years ago
Dear Prof Gordon,

Many thanks for your time and for the very clear explanation, which is very helpful for me and for all scientists involved in microarray analysis. Several bioinformaticians and statisticians I had contacted to discuss this point told me that they would be very interested in your opinion on this question, so I will send them the link to this very useful discussion. Thank you for the two papers; I have read the first one and started the second yesterday.

Best wishes,
Seraya
@wolfgang-huber-3550
Last seen 25 days ago
EMBL European Molecular Biology Laboratory
Hi Gordon

> .... "limma ensures that all probes are assigned at least a minimum non-zero expression level on all arrays, in order to minimize the variability of log-intensities for lowly expressed probes. Probes that are expressed in one condition but not the other will be assigned a large fold change for which the denominator is the minimum expression level. This approach has the advantage that genes can be ranked by fold change in a meaningful way, because genes with larger expression changes will always be assigned a larger fold change."

I am not sure I follow:

(i) (20 + 16) / (10 + 16) < (15000 + 16) / (10000 + 16)

but

(ii) 20 / 10 > 15000 / 10000

You assume that measurements of 20 and 10 are less reliable (or perhaps biologically less important?) than measurements of 15000 and 10000, and thus that ranking (i) should be used - but that depends on an error model (which you encode in the pseudocount parameter '16') and a subjective trade-off between precision and effect size.

I agree with you that the approach is useful, and also that it is good to provide a very simple recipe for people who either cannot deal with or do not care about the quantitative details. Still, this post is for the people who do :)

Cheers
Wolfgang
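[A quick numeric check of Wolfgang's two orderings, using the values from his post:]

```r
## Ranking (i): offset-adjusted ratios
offset <- 16
(20 + offset) / (10 + offset)         # 1.38
(15000 + offset) / (10000 + offset)   # 1.50, so (i) ranks the bright pair higher
## Ranking (ii): raw ratios reverse the order
20 / 10                               # 2.0
15000 / 10000                         # 1.5
```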
@gordon-smyth
Last seen 4 hours ago
WEHI, Melbourne, Australia
Hi Wolfgang,

> Date: Thu, 07 Apr 2011 12:07:05 +0200
> From: Wolfgang Huber (whuber at embl.de)
> To: bioconductor at r-project.org
> Subject: Re: [BioC] limma
>
>> .... "limma ensures that all probes are assigned at least a minimum non-zero expression level on all arrays, in order to minimize the variability of log-intensities for lowly expressed probes. Probes that are expressed in one condition but not the other will be assigned a large fold change for which the denominator is the minimum expression level. This approach has the advantage that genes can be ranked by fold change in a meaningful way, because genes with larger expression changes will always be assigned a larger fold change."

This comment was in the context of genes expressed in one condition and not the other (and was part of a longer post). In this context the estimated fold change is essentially monotonic in the higher expression level, provided the zero value is offset away from zero, so larger expression changes do translate into larger fold changes. In other contexts, it is a question of importance ranking, which I guess is the issue that you're raising below.

> I am not sure I follow:
>
> (i) (20 + 16) / (10 + 16) < (15000 + 16) / (10000 + 16)
>
> but
>
> (ii) 20 / 10 > 15000 / 10000
>
> You assume that measurements of 20 and 10 are less reliable (or perhaps biologically less important?) than measurements of 15000 and 10000, thus that ranking (i) should be used

Generally I rank probes by a combination of statistical significance and fold change, not by fold change alone. However, the discussion is in the context of Illumina expression data, and Illumina intensities of 10 and 20 are almost certain to be from non-expressed probes, hence contain no biological signal. So, yes, I would generally view measurements of 15000 and 10000 as both statistically more precise and biologically more important than 20 and 10, and I would therefore want to rank as (i) rather than (ii). I'm pretty sure that you would too.

> - but that depends on an error model (which you encode in the pseudocount parameter '16')

I put more faith in experimental evidence than I do in statistical error models. The fact that offsetting the intensities away from zero reduces the FDR is an observation from considerable testing with calibration data sets. The evidence doesn't rely on an error model. Much of the evidence is laid out in the paper that I cited in my earlier email:

Shi, W, Oshlack, A, and Smyth, GK (2010). Optimizing the noise versus bias trade-off for Illumina Whole Genome Expression BeadChips. Nucleic Acids Research 38, e204.

> and a subjective trade-off between precision and effect size.

The fact that the value is chosen from experience with data, rather than as a parameter estimated from a mathematical model, doesn't make it subjective. As I've said, I take mathematical models with a grain of salt.

It's easy to verify experimentally that well-known preprocessing algorithms, like RMA for Affy data or vst for Illumina data (you're an author!), also have the effect of offsetting intensities away from zero before logging them. I think it is a useful insight to observe that this offsetting is a good part of why those algorithms have good statistical properties. vst has an effective offset of around 200 (Shi et al, Tables 2 and 3). As far as I know, the offset was not designed into either of the above algorithms. I suspect it was rather a fortuitous but unexpected outcome. The offset that vst seems to have isn't a natural outcome of the variance stabilization model, because it generally turns out to be much larger than the offset that would best stabilize the variance. Anyway, we find that by using more modest offsets in the range 16-50 for Illumina data, we can achieve an FDR as good as vst but with less bias, that is, much less contraction of the fold changes. Again, this is a conclusion from testing rather than from modelling.

I prefer to make the offset explicit, clearly visible to users, rather than leaving it implicit or unexpected. This approach (neqc etc.) isn't the only good way to address noise, bias and variance stabilization issues, but it's the one that seems to work best for me at the moment.

Cheers
Gordon
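[A rough sketch of the kind of experimental check Gordon alludes to; this code is not from the thread, and 'eset' is a hypothetical ExpressionSet produced by a preprocessing routine such as affy::rma().]

```r
## Back-transform the preprocessed log2 values and look at where the
## intensities bottom out; per-array minima far above zero suggest that an
## offset was effectively applied before logging.
library(Biobase)
raw_scale <- 2^exprs(eset)   # 'eset' is hypothetical, e.g. from affy::rma()
apply(raw_scale, 2, min)     # per-array minimum intensities
```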
Hi Gordon

I think we agree; my only reason for making this post is to point out that while the choice of '16' might be a good compromise overall, better choices can exist depending on the dataset. E.g. if there are many replicates, then these can compensate for the imprecision of using a smaller offset, and reduced bias may be had.

Variance-stabilising transformations are of course very related to the offsetting that you propose. The "glog" in vsn is

glog(x) = log( x + sqrt(x^2 + c) )

which is in practice similar to the shifted log, i.e. log(x + a). (It even avoids the problems that the shifted log has when x approaches -a. See also e.g. PMID 12761059.) Such transformations provide a good way of looking at microarray data, and indeed at high-throughput sequencing data, for many though not all uses.

As you say in your NAR article, there is a continuum of choices along the precision-bias trade-off axis, which can be parameterized by 'a' or 'c'.

Btw, a subtlety of vst that many users seem not to be aware of is that it uses the between-beads variance (for the same gene, in the same sample) rather than the more relevant (and larger) between-replicates variance for finding its 'c'.

Best wishes
Wolfgang

PS A pet peeve of mine: how come people are OK with stating numbers such as '16' without reference to units? Ideally, 'official' SI units, but even an ad hoc definition would do, based on a common agreement on reagent quantities, scanner settings and software settings, to make the oxymoron 'arbitrary fluorescence units' just a little bit less so.
So, yes, I would generally view measurements of 20000 > and 10000 as both statistically more precise and biologically more > important than 20 and 10, and I would therefore want to rank as (i) > rather than (ii). I'm pretty sure that you would too. > >> - but that depends on an error model (which you encode in the >> pseudocount parameter '16') > > I put more faith in experimental evidence than I do in statistical error > models. The fact that offsetting the intensities away from zero reduces > the FDR is an observation from considerable testing with calibration > data sets. The evidence doesn't rely on an error model. Much of the > evidence is laid out in the paper that I cited in my earlier email: > > Shi, W, Oshlack, A, and Smyth, GK (2010). Optimizing the noise versus > bias trade-off for Illumina Whole Genome Expression BeadChips. Nucleic > Acids Research 38, e204. > >> and a subjective trade-off between precision and effect size. > > The fact that the value is chosen from experience with data, rather than > as as a parameter estimated from a mathematical model, doesn't make it > subjective. As I've said, I take mathematical models with a grain of salt. > > It's easy to verify experimentally that well known preprocessing > algorithms, like RMA for Affy data or vst for Illumina data (you're an > author!), also have the effect of offsetting intensities away from zero > before logging them. I think it is a useful insight to observe that this > offsetting is a good part of why those algorithms have good statistical > properties. vst has an effective offset of around 200 (Wei et al, Tables > 2 and 3). As far as I know, the offset was not designed into either of > the above algorithms. I suspect it was rather a fortuitious but > unexpected outcome. The offset that vst seems to have isn't a natural > outcome of the variance stabilization model, because it generally turns > out to be much larger than the offset that would best stabilize the > variance. Anyway, we find that by using more modest offsets in the range > 16-50 for Illumina data, we can achieve FDR as good as vst but with less > bias, much less contraction of the fold changes. Again, this is a > conclusion from testing rather than from modelling. > > I prefer to make the offset explicit, clearly visible to users, rather > than leaving it implicit or unexpected. This approach (neqc etc) isn't > the only good way to address noise, bias and variance stabilization > issues, but it's the one that seems to work best for me at the moment. > > Cheers > Gordon > >> I agree with you that the approach is useful, and also that it is good >> to provide a very simple recipe for people that either cannot deal with >> or do not care about the quantitative details. Still, this post is for >> the people that do :) >> >> Cheers >> Wolfgang > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:14}}
Dear Wolfgang and Gordon,

Thank you both for having this discussion publicly. I have found it very informative about an aspect of the analysis that I had not previously thought much about.

--Naomi
Wolfgang,

Another connection is that I apply the offset after normexp background correction (using the negative control probes), which is itself related to generalized log transformations. This means my x is always positive and strictly floored above zero before logging.

Gordon
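[A sketch of the pipeline Gordon describes; read.ilmn() and neqc() are real limma functions, but the file names are placeholders for Illumina BeadStudio/GenomeStudio exports.]

```r
## Read Illumina probe and control-probe profiles, then apply normexp
## background correction using the negative control probes, quantile
## normalisation, and a log2 transform with the offset added.
library(limma)
x <- read.ilmn(files = "probe_profile.txt",          # placeholder file name
               ctrlfiles = "control_probe_profile.txt")
y <- neqc(x, offset = 16)   # 16 is the default offset
```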